Advanced Machine Learning - CPSC 440 (UBC)
Author: Christina Yang
Some of my CPSC 440 (2023W) final notes. Information source: The University of British Columbia
Probability review
Binary density estimation
For binary data, e.g. X = [1 0 0 0 1 0].T -> Pr(X = 1) = 2/6 ≈ 0.33 (the fraction of 1s).
Inference: computing probabilities under the model, e.g. Pr(X = 0 | θ) + Pr(X = 1 | θ) = 1.
Likelihood: Pr(X | θ), the probability of the data given the parameters.
Likelihood/inference costs O(n) for X with n rows.
Working in log space helps prevent underflow and floating-point errors: np.log1p(t) computes log(1 + t) accurately even for tiny t.
Sampling helps us check whether a model is reasonable
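A minimal sketch (toy data, not from the course) of the two points above: computing the Bernoulli log-likelihood in log space, and sampling from the fitted model as a sanity check:

```python
import numpy as np

X = np.array([1, 0, 0, 0, 1, 0])   # toy binary data
theta = X.mean()                   # fraction of 1s (2/6 here)

# Log-likelihood: n1*log(theta) + n0*log(1 - theta), computed in log space.
n1 = X.sum()
n0 = len(X) - n1
log_lik = n1 * np.log(theta) + n0 * np.log1p(-theta)  # log1p(-theta) = log(1 - theta)
print(log_lik)

# Sampling check: data drawn from the fitted model should "look like" X.
rng = np.random.default_rng(0)
print(rng.binomial(1, theta, size=len(X)))
```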
MLE
Maximum likelihood estimation (MLE): pick the θ with the highest likelihood
MLE is asymptotically optimal as n → ∞
Can be really bad for small n.
Tends to overfit with more complex models.
Solution 1: Laplace smoothing, estimate (n1 + 1)/(n + 2), where n1 is the number of 1s.
Solution 2: use MAP estimation.
For Bernoullis:
Maximize the log-likelihood, which has the same maximizer as the likelihood and is easier to work with mathematically. At the maximum, the derivative is 0 (see the derivation below).
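A short worked version of that derivative-equals-zero step (standard notation: $n_1$ ones, $n_0$ zeros, $n = n_1 + n_0$):

```latex
\ell(\theta) = n_1 \log \theta + n_0 \log(1 - \theta), \qquad
\ell'(\theta) = \frac{n_1}{\theta} - \frac{n_0}{1 - \theta} = 0
\quad\Rightarrow\quad \hat{\theta}_{\mathrm{MLE}} = \frac{n_1}{n}.
```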
MAP (posteriors)
MAP works with p(θ | X), the reverse of MLE's p(X | θ). By Bayes' rule, p(θ | X) ∝ p(X | θ) p(θ); the denominator p(X) doesn't matter because it's the same for all θ.
A common prior for the Bernoulli is the beta distribution; it is conjugate to the Bernoulli likelihood and "flexible enough" to express many prior beliefs.
As n → ∞, the prior stops mattering and MAP → MLE. But using a prior means we behave better when n is relatively small.
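A minimal sketch of the closed-form Beta-Bernoulli MAP estimate (the Beta(α, β) prior and toy data are assumptions for illustration); note that α = β = 2 recovers the Laplace-smoothing estimate above:

```python
import numpy as np

X = np.array([1, 0, 0, 0, 1, 0])
n1, n = X.sum(), len(X)

def bernoulli_map(n1, n, alpha, beta):
    """MAP estimate of theta under a Beta(alpha, beta) prior."""
    return (n1 + alpha - 1) / (n + alpha + beta - 2)

print(n1 / n)                      # MLE
print(bernoulli_map(n1, n, 2, 2))  # Laplace smoothing = MAP with a Beta(2, 2) prior
print(bernoulli_map(n1, n, 1, 1))  # uniform Beta(1, 1) prior: MAP equals MLE
```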
Bernoulli MLE and MAP; product of Bernoullis
Multivariate models; generative classifiers
Discriminative models
Neural Networks
- SGD is not guaranteed to reach a global minimum for non-convex problems.
- For standard objectives, there is a global min function value f*, but this may be achieved at multiple different parameter values.
- These training-error global minima may have very different test errors.
- Some of these global minima may be “more regularized” than others.
- Using SGD itself regularizes the parameters ("implicit regularization").
- With small models, “minimize training error” leads to unique (or similar) global mins
- With larger models, there is a lot of flexibility in the space of global mins (a large gap between the best and worst global min in terms of test error).
- Implicit regularization of SGD increases with dimension
For binary classification with a neural network (hidden weights W, output weights v):
- With W fixed, the objective is convex in v, but with both W and v as variables it is non-convex.
For logistic regression with linearly separable data:
- Gradient descent converges (in direction) to the max-margin solution (the minimum L2-norm separator, as for an SVM).
- So using gradient descent is equivalent to encouraging a large margin (see the sketch below).
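A rough sketch of that behaviour (the toy data, step size, and iteration counts are made up): on separable data the norm of w keeps growing while the direction w/||w|| stabilizes.

```python
import numpy as np

# Toy linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
step = 0.1
for t in range(1, 20001):
    margins = y * (X @ w)
    p = np.exp(-np.logaddexp(0.0, margins))  # sigmoid(-margins), computed stably
    grad = -(X.T @ (y * p))                  # gradient of sum(log(1 + exp(-y * Xw)))
    w -= step * grad
    if t in (100, 1000, 20000):
        print(t, np.linalg.norm(w), w / np.linalg.norm(w))
```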
Deep networks. Issue: vanishing gradients (the reason we use ReLU instead of sigmoid activations).
- Away from the origin, the sigmoid's gradient is nearly zero.
- The problem gets worse when you take the sigmoid of a sigmoid (see the numeric sketch after the solutions list).
- In deep networks, many gradients can be nearly zero everywhere, and numerically they get rounded to 0.
Solutions:
- ReLU: set negative values to 0.
- skip connections
- shortcuts from input to output
- ResNet: Adds activations from “2 layers ago”.
- DenseNet: each layer can see the values from many previous layers, which significantly reduces vanishing gradients.
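A tiny numeric illustration (toy pre-activation value, weights omitted): the sigmoid's derivative is at most 0.25, so the backpropagated factor through a stack of sigmoids shrinks geometrically, whereas ReLU's derivative is exactly 1 wherever the unit is active.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 2.0          # toy pre-activation
factor = 1.0     # product of local derivatives along the chain
for layer in range(10):
    s = sigmoid(z)
    factor *= s * (1.0 - s)  # sigmoid'(z) <= 0.25
    z = s                    # feed the activation into the next sigmoid (weights omitted)
print(factor)                # shrinks toward 0: the vanishing-gradient effect

# For ReLU, the local derivative is 1 for any positive input, so the product stays 1.
print(1.0 ** 10)
```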
Tuning tricks:
- bias variables
- data standardization (center, whitening)
- Parameter initialization: “small but different", standardizing within layers.
- Step-size selection: “babysitting", Bottou trick.
- Momentum: heavy-ball and Nesterov-style modifications (see the sketch after this list).
- Step size for each coordinate: AdaGrad, RMSprop, Adam.
- Rectified linear units (ReLU): replace sigmoid with max{0,h} to avoid gradients close to 0.
- Makes objective non-differentiable, but we now know SGD still converges in this setting.
- Batch normalization: adaptive standardizing within layers.
- Often allows sigmoid activations in deep networks.
- Residual/skip connections: connect layers to multiple previous layers.
- We now know that such connections make it more likely to converge to good minima.
- Neural architecture search: try to cleverly search through the space of hyper-parameters.
- This gets expensive!
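A bare-bones sketch of the heavy-ball momentum update referenced above (the function grad_fn, starting point, and hyperparameter values are made up for illustration; Nesterov and Adam modify this basic recipe):

```python
import numpy as np

def sgd_heavy_ball(grad_fn, w0, step=0.01, momentum=0.9, iters=200):
    """Heavy-ball update: a velocity term accumulates past gradient steps."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(iters):
        v = momentum * v - step * grad_fn(w)
        w = w + v
    return w

# Toy example: minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
print(sgd_heavy_ball(lambda w: w, np.array([5.0, -3.0])))  # approaches [0, 0]
```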
Reduce overfitting tricks:
- early stopping
- dropout
- implicit regularization of SGD
- Weight decay: L2 or L1 regularization
- Recent work shows this can introduce bad local optima.
- Automatic differentiation
- input: code computing a function
- output: code computing the gradient of that function (just code, not formula)
- AD basically just writes every operation as an instance of the chain rule (see the toy sketch after this list).
- pros
- get gradient for same cost as function (pro for deep neural nets, but con if gradient can be computed at lower cost than function value)
- cons
- lots of storage (store all intermediate calculations)
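A toy reverse-mode AD sketch (a made-up minimal Var class, not a real AD library) showing both points: each operation records its local chain-rule factors, and every intermediate value is kept around for the backward pass.

```python
class Var:
    """Scalar value that remembers how to send gradients to its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent Var, local derivative) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Naive recursive accumulation; fine for this tiny expression graph.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)   # chain rule

x, y = Var(3.0), Var(2.0)
f = x * y + x                    # f = x*y + x
f.backward()
print(f.value, x.grad, y.grad)   # 9.0, df/dx = y + 1 = 3.0, df/dy = x = 3.0
```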
Convolutional neural networks:
- Include layers that apply several (learned) convolutions.
- Significantly decreases the number of parameters (see the sketch below).
- Achieves a degree of translation invariance.
- Often combined with pooling operations like max pooling.
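A small made-up example of the parameter savings: a single 3x3 filter is reused at every spatial location, instead of a dense weight matrix over all pixel pairs.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid'-mode 2D convolution (cross-correlation) with explicit loops."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(28, 28))
kernel = np.ones((3, 3)) / 9.0             # a single 3x3 filter: 9 parameters
print(conv2d_valid(image, kernel).shape)   # (26, 26)

# A fully connected layer mapping 28*28 inputs to 26*26 outputs would need:
print(28 * 28 * 26 * 26, "weights vs", 3 * 3)
```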
Autoencoders (unsupervised)
Autoencoders try to make their output the same as the input
Usually have a bottleneck layer with dimension k < input d
First layers “encode” the input into bottleneck
Last layers “decode” the bottleneck into a (hopefully valid) input
Relationship to principal component analysis (PCA):
- With squared error and linear network, equivalent to PCA
- Size of bottleneck layer gives number of latent factors k in PCA
- With non-linear transforms: a non-linear/deep generalization of PCA
Encoding-Decoding for Multi-Label Classification
- Number of parameters and cost is O(dm + mk) for k classes and m hidden units.
- If we trained a separate network for each class, the number of parameters and cost would be O(kdm) (a separate 'W' for each class).
- Might have multiple layers, convolution layers, and so on.
- No need to have a "bottleneck" layer; "encoder"/"decoder" is just terminology.
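A shape-only sketch (d, m, k and the random weights are made-up illustration values) of the shared encoder-decoder for multi-label classification and the parameter-count comparison above:

```python
import numpy as np

d, m, k = 100, 50, 10        # input dimension, hidden units, number of labels
rng = np.random.default_rng(0)

W = rng.normal(size=(m, d))  # shared "encoder" weights: d*m parameters
V = rng.normal(size=(k, m))  # "decoder" weights, one row per label: m*k parameters

def predict(x):
    h = np.maximum(0, W @ x)           # shared hidden representation (ReLU)
    return 1 / (1 + np.exp(-(V @ h)))  # one sigmoid score per label

x = rng.normal(size=d)
print(predict(x).shape)                # (10,) - one probability per label
print(d * m + m * k, "shared parameters vs", k * d * m, "for k separate networks")
```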