Advanced Machine Learning - CPSC 440 (UBC)
Author: Christina Yang
Some of my CPSC 440 (2023W) final notes. Information source: The University of British Columbia
Probability review
Binary density estimation
For binary data, e.g. X = [1 0 0 0 1 0].T -> Pr(X = 1) = 2/6 ≈ 0.33 (the fraction of 1s).
Inference: computing probabilities under the model, e.g. Pr(X = 0 | θ) + Pr(X = 1 | θ) = 1.
Likelihood: Pr(X | θ), the probability of the data given the parameters.
Likelihood/inference costs O(n) for X with n rows.
Working in log space helps prevent underflow and floating-point errors: np.log1p(t) computes log(1 + t) accurately even for tiny t.
Sampling helps us check whether a model is reasonable
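A minimal sketch (toy data, not from the course) of the two points above: computing the Bernoulli log-likelihood in log space, and sampling from the fitted model as a sanity check:

```python
import numpy as np

X = np.array([1, 0, 0, 0, 1, 0])   # toy binary data
theta = X.mean()                   # fraction of 1s (2/6 here)

# Log-likelihood: n1*log(theta) + n0*log(1 - theta), computed in log space.
n1 = X.sum()
n0 = len(X) - n1
log_lik = n1 * np.log(theta) + n0 * np.log1p(-theta)  # log1p(-theta) = log(1 - theta)
print(log_lik)

# Sampling check: data drawn from the fitted model should "look like" X.
rng = np.random.default_rng(0)
print(rng.binomial(1, theta, size=len(X)))
```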
MLE
Maximum likelihood estimation (MLE): pick the θ with the highest likelihood
MLE is asymptotically optimal as n → ∞
Can be really bad for small n.
Tends to overfit with more complex models.
Solution 1: Laplace smoothing, estimate (n1 + 1)/(n + 2), where n1 is the number of 1s.
Solution 2: use MAP estimation.
For Bernoullis:
Maximize the log-likelihood, which has the same maximizer as the likelihood and is easier to work with mathematically. At the maximum, the derivative is 0 (see the derivation below).
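A short worked version of that derivative-equals-zero step (standard notation: $n_1$ ones, $n_0$ zeros, $n = n_1 + n_0$):

```latex
\ell(\theta) = n_1 \log \theta + n_0 \log(1 - \theta), \qquad
\ell'(\theta) = \frac{n_1}{\theta} - \frac{n_0}{1 - \theta} = 0
\quad\Rightarrow\quad \hat{\theta}_{\mathrm{MLE}} = \frac{n_1}{n}.
```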
MAP (posteriors)
MAP works with p(θ | X), the reverse of MLE's p(X | θ). By Bayes' rule, p(θ | X) ∝ p(X | θ) p(θ); the denominator p(X) doesn't matter because it's the same for all θ.
A common prior for the Bernoulli is the beta distribution; it is conjugate to the Bernoulli likelihood and "flexible enough" to express many prior beliefs.
As n → ∞, the prior stops mattering and MAP → MLE. But using a prior means we behave better when n is relatively small.
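A minimal sketch of the closed-form Beta-Bernoulli MAP estimate (the Beta(α, β) prior and toy data are assumptions for illustration); note that α = β = 2 recovers the Laplace-smoothing estimate above:

```python
import numpy as np

X = np.array([1, 0, 0, 0, 1, 0])
n1, n = X.sum(), len(X)

def bernoulli_map(n1, n, alpha, beta):
    """MAP estimate of theta under a Beta(alpha, beta) prior."""
    return (n1 + alpha - 1) / (n + alpha + beta - 2)

print(n1 / n)                      # MLE
print(bernoulli_map(n1, n, 2, 2))  # Laplace smoothing = MAP with a Beta(2, 2) prior
print(bernoulli_map(n1, n, 1, 1))  # uniform Beta(1, 1) prior: MAP equals MLE
```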
Bernoulli MLE and MAP; product of Bernoullis
Multivariate models; generative classifiers
Discriminative models
Neural Networks
- SGD is not guaranteed to reach a global minimum for non-convex problems.
- For standard objectives, there is a global min function value f*, but this may be achieved at multiple different parameter values.
- These training-error global minima may have very different test errors.
- Some of these global minima may be “more regularized” than others.
- Using SGD itself regularizes the parameters ("implicit regularization").
- With small models, “minimize training error” leads to unique (or similar) global mins
- With larger models, there is a lot of flexibility in the space of global mins (a large gap between the best and worst global min in terms of test error).
- Implicit regularization of SGD increases with dimension
For binary classification with a neural network (hidden weights W, output weights v):
- With W fixed, the objective is convex in v, but with both W and v as variables it is non-convex.
For logistic regression with linearly separable data:
- Gradient descent converges (in direction) to the max-margin solution (the minimum L2-norm separator, as for an SVM).
- So using gradient descent is equivalent to encouraging a large margin (see the sketch below).
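A rough sketch of that behaviour (the toy data, step size, and iteration counts are made up): on separable data the norm of w keeps growing while the direction w/||w|| stabilizes.

```python
import numpy as np

# Toy linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
step = 0.1
for t in range(1, 20001):
    margins = y * (X @ w)
    p = np.exp(-np.logaddexp(0.0, margins))  # sigmoid(-margins), computed stably
    grad = -(X.T @ (y * p))                  # gradient of sum(log(1 + exp(-y * Xw)))
    w -= step * grad
    if t in (100, 1000, 20000):
        print(t, np.linalg.norm(w), w / np.linalg.norm(w))
```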
Deep networks. Issue: vanishing gradients (the reason we use ReLU instead of sigmoid activations).
- Away from the origin, the sigmoid's gradient is nearly zero.
- The problem gets worse when you take the sigmoid of a sigmoid (see the numeric sketch after the solutions list).
- In deep networks, many gradients can be nearly zero everywhere, and numerically they get rounded to 0.
Solutions:
- ReLU: set negative values to 0.
- skip connections
- shortcuts from input to output
- ResNet: Adds activations from “2 layers ago”.
- DenseNet: each layer can see the values from many previous layers, which significantly reduces vanishing gradients.
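A tiny numeric illustration (toy pre-activation value, weights omitted): the sigmoid's derivative is at most 0.25, so the backpropagated factor through a stack of sigmoids shrinks geometrically, whereas ReLU's derivative is exactly 1 wherever the unit is active.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 2.0          # toy pre-activation
factor = 1.0     # product of local derivatives along the chain
for layer in range(10):
    s = sigmoid(z)
    factor *= s * (1.0 - s)  # sigmoid'(z) <= 0.25
    z = s                    # feed the activation into the next sigmoid (weights omitted)
print(factor)                # shrinks toward 0: the vanishing-gradient effect

# For ReLU, the local derivative is 1 for any positive input, so the product stays 1.
print(1.0 ** 10)
```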
Tuning tricks:
- bias variables
- data standardization (center, whitening)
- Parameter initialization: “small but different", standardizing within layers.
- Step-size selection: “babysitting", Bottou trick.
- Momentum: heavy-ball and Nesterov-style modifications (see the sketch after this list).
- Step size for each coordinate: AdaGrad, RMSprop, Adam.
- Rectified linear units (ReLU): replace sigmoid with max{0,h} to avoid gradients close to 0.
- Makes objective non-differentiable, but we now know SGD still converges in this setting.
- Batch normalization: adaptive standardizing within layers.
- Often allows sigmoid activations in deep networks.
- Residual/skip connections: connect layers to multiple previous layers.
- We now know that such connections make it more likely to converge to good minima.
- Neural architecture search: try to cleverly search through the space of hyper-parameters.
- This gets expensive!
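A bare-bones sketch of the heavy-ball momentum update referenced above (the function grad_fn, starting point, and hyperparameter values are made up for illustration; Nesterov and Adam modify this basic recipe):

```python
import numpy as np

def sgd_heavy_ball(grad_fn, w0, step=0.01, momentum=0.9, iters=200):
    """Heavy-ball update: a velocity term accumulates past gradient steps."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(iters):
        v = momentum * v - step * grad_fn(w)
        w = w + v
    return w

# Toy example: minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
print(sgd_heavy_ball(lambda w: w, np.array([5.0, -3.0])))  # approaches [0, 0]
```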
Reduce overfitting tricks:
- early stopping
- dropout
- implicit regularization of SGD
- Weight decay: L2 or L1 regularization
- Recent work shows this can introduce bad local optima.
- Automatic differentiation
- input: code computing a function
- output: code computing the gradient of that function (just code, not formula)
- AD basically just writes every operation as an instance of the chain rule (see the toy sketch after this list).
- pros
- get gradient for same cost as function (pro for deep neural nets, but con if gradient can be computed at lower cost than function value)
- cons
- lots of storage (store all intermediate calculations)
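A toy reverse-mode AD sketch (a made-up minimal Var class, not a real AD library) showing both points: each operation records its local chain-rule factors, and every intermediate value is kept around for the backward pass.

```python
class Var:
    """Scalar value that remembers how to send gradients to its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent Var, local derivative) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Naive recursive accumulation; fine for this tiny expression graph.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)   # chain rule

x, y = Var(3.0), Var(2.0)
f = x * y + x                    # f = x*y + x
f.backward()
print(f.value, x.grad, y.grad)   # 9.0, df/dx = y + 1 = 3.0, df/dy = x = 3.0
```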
Convolutional neural networks:
- Include layers that apply several (learned) convolutions.
- Significantly decreases the number of parameters (see the sketch below).
- Achieves a degree of translation invariance.
- Often combined with pooling operations like max pooling.
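A small made-up example of the parameter savings: a single 3x3 filter is reused at every spatial location, instead of a dense weight matrix over all pixel pairs.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid'-mode 2D convolution (cross-correlation) with explicit loops."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(28, 28))
kernel = np.ones((3, 3)) / 9.0             # a single 3x3 filter: 9 parameters
print(conv2d_valid(image, kernel).shape)   # (26, 26)

# A fully connected layer mapping 28*28 inputs to 26*26 outputs would need:
print(28 * 28 * 26 * 26, "weights vs", 3 * 3)
```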
Autoencoders (unsupervised)
Autoencoders try to make their output the same as the input
Usually have a bottleneck layer with dimension k < input d
First layers “encode” the input into bottleneck
Last layers “decode” the bottleneck into a (hopefully valid) input
Relationship to principal component analysis (PCA):
- With squared error and linear network, equivalent to PCA
- Size of bottleneck layer gives number of latent factors k in PCA
- With non-linear transforms: a non-linear/deep generalization of PCA
Encoding-Decoding for Multi-Label Classification
- Number of parameters and cost is O(dm + mk) for k classes and m hidden units.
- If we trained a separate network for each class, the number of parameters and cost would be O(kdm) (a separate 'W' for each class).
- Might have multiple layers, convolution layers, and so on.
- No need to have a "bottleneck" layer; "encoder"/"decoder" is just terminology.
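A shape-only sketch (d, m, k and the random weights are made-up illustration values) of the shared encoder-decoder for multi-label classification and the parameter-count comparison above:

```python
import numpy as np

d, m, k = 100, 50, 10        # input dimension, hidden units, number of labels
rng = np.random.default_rng(0)

W = rng.normal(size=(m, d))  # shared "encoder" weights: d*m parameters
V = rng.normal(size=(k, m))  # "decoder" weights, one row per label: m*k parameters

def predict(x):
    h = np.maximum(0, W @ x)           # shared hidden representation (ReLU)
    return 1 / (1 + np.exp(-(V @ h)))  # one sigmoid score per label

x = rng.normal(size=d)
print(predict(x).shape)                # (10,) - one probability per label
print(d * m + m * k, "shared parameters vs", k * d * m, "for k separate networks")
```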