Advanced Machine Learning - CPSC 440 (UBC)

Author: Christina Yang

Some of my CPSC 440 (2023W) final notes. Information source: The University of British Columbia

Probability review

Binary Density estimation

For binary data,

ex. X = [1 0 0 0 1 0].T -> Pr(X = 1) = 2/6 ≈ 0.33 (the fraction of ones)

Inference: computing conditional probabilities given θ; note that Pr(X = 0 | θ) + Pr(X = 1 | θ) = 1

Likelihood: p(X | θ) = Pr(X^{(1)} = x^{(1)}, X^{(2)} = x^{(2)}, ..., X^{(n)} = x^{(n)} | θ) = ∏_{i=1}^{n} Pr(X^{(i)} = x^{(i)} | θ), assuming the examples are IID.

Likelihood/inference cost: O(n) for X with n rows.

Working in log space helps prevent underflow and floating-point errors; np.log1p(t) computes log(1 + t) accurately even for tiny t.
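
A minimal numpy sketch of evaluating the Bernoulli log-likelihood in log space (the data and θ are the illustrative values from above):

```python
import numpy as np

X = np.array([1, 0, 0, 0, 1, 0])   # binary data
theta = 0.4                        # a candidate parameter value

# log-likelihood: sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ],
# computed in log space to avoid underflow for large n
n1 = X.sum()                       # number of ones
n0 = X.size - n1                   # number of zeros
log_lik = n1 * np.log(theta) + n0 * np.log1p(-theta)   # log1p(-theta) = log(1 - theta)
print(log_lik)
```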


Sampling helps us check whether a model is reasonable

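A quick sketch of that sampling check: draw new binary data from the fitted model and see whether it looks like the real data (θ̂ = 2/6 from the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_hat = 2 / 6                                      # fitted parameter from the example above
samples = (rng.random(1000) < theta_hat).astype(int)   # 1000 Bernoulli(theta_hat) draws
print(samples.mean())                                  # close to theta_hat if the model is reasonable
```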

MLE

Maximum likelihood estimation (MLE): pick the θ with the highest likelihood

MLE is asymptotically optimal as n → ∞

Can be really bad for small n.

Tends to overfit with more complex models. Solution 1: Laplace smoothing, (n1 + 1)/(n + 2), where n1 is the number of ones. Solution 2: use MAP.

For Bernoullis

Maximize the log-likelihood, which has the same maximizer as the likelihood and is easier to work with; the maximum is where the derivative is 0.
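
A quick derivation sketch (with n1 = number of ones, n0 = number of zeros, n = n1 + n0):

```latex
\ell(\theta) = n_1 \log\theta + n_0 \log(1-\theta), \qquad
\frac{d\ell}{d\theta} = \frac{n_1}{\theta} - \frac{n_0}{1-\theta} = 0
\quad\Rightarrow\quad \hat\theta_{\text{MLE}} = \frac{n_1}{n}.
```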

MAP (posteriors)

MAP maximizes the posterior p(θ | X), the reverse of MLE, which maximizes p(X | θ). By Bayes' rule, p(θ | X) ∝ p(X | θ) p(θ); the denominator p(X) doesn't matter since it is the same for all θ.

A common prior for the Bernoulli is the beta distribution. Pro: the beta is "flexible enough", and it is conjugate to the Bernoulli, so the posterior is also a beta.

As n → ∞, the prior stops mattering and MAP → MLE. But using a prior means we behave better when n is relatively small.
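
A sketch of the Bernoulli MAP estimate under a Beta(α, β) prior (standard result, stated here for reference):

```latex
p(\theta \mid X) \;\propto\; \underbrace{\theta^{n_1}(1-\theta)^{n_0}}_{p(X \mid \theta)}\,
\underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{p(\theta)},
\qquad
\hat\theta_{\text{MAP}} = \frac{n_1 + \alpha - 1}{n + \alpha + \beta - 2}.
```

With α = β = 2 this is exactly Laplace smoothing, (n1 + 1)/(n + 2); with α = β = 1 (a uniform prior) it reduces to the MLE.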

Bernoulli MLE and MAP; product of Bernoullis

Multivariate models; generative classifiers

Discriminative models

Neural Networks

  • SGD is not guaranteed to reach a global minimum for non-convex problems.
  • For standard objectives, there is a global min function value f*, but this may be achieved at multiple different parameter values.
  • These training-error global minima may have very different test errors.
  • Some of these global minima may be “more regularized” than others.
  • Using SGD itself regularizes the parameters ("implicit regularization").
  • With small models, “minimize training error” leads to unique (or similar) global mins
  • With larger models, there is a lot of flexibility in the space of global mins (gap between best/worst)
  • Implicit regularization of SGD increases with dimension

For binary classification:

  • With W fixed, the objective is convex in v; with both W and v as variables, it is non-convex.
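
A sketch of the kind of objective meant here, assuming a one-hidden-layer network with hidden weights W, output weights v, activation h, and the logistic loss (an assumption about the course's setup):

```latex
f(W, v) = \sum_{i=1}^{n} \log\!\Big(1 + \exp\!\big(-y^{(i)}\, v^\top h(W x^{(i)})\big)\Big)
```

For fixed W the features h(W x^{(i)}) are constants, so f is a convex logistic loss in v; jointly in W and v it is non-convex.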

For Logistic Regression w/ linearly separable data:

  • Gradient descent converges (in direction) to the max-margin solution (the minimum L2-norm separator, as in an SVM).
  • So using gradient descent implicitly encourages a large margin.
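
A minimal numpy sketch of this behaviour on a tiny separable dataset (illustrative data and step size): the norm of w keeps growing, but its direction stabilizes toward the max-margin separator.

```python
import numpy as np

# Tiny linearly separable dataset (illustrative)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
step = 0.1
for t in range(20000):
    margins = y * (X @ w)
    # gradient of sum_i log(1 + exp(-margins_i))
    grad = -(X.T @ (y / (1.0 + np.exp(margins))))
    w -= step * grad

# ||w|| grows without bound, but the direction w / ||w|| converges
print(np.linalg.norm(w), w / np.linalg.norm(w))
```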

Deep networks: Issue: vanishing gradients, the reason we use ReLU instead of sigmoid (see the sketch after this list).

  • Away from the origin, the sigmoid's gradient is nearly zero.
  • The problem gets worse when you take the sigmoid of a sigmoid.
  • In deep networks, many gradients can be nearly zero everywhere, and numerically they will be set to 0.

Solutions:

  • ReLU: set negative values to 0.
  • Skip connections:
    • shortcuts from input to output
    • ResNet: adds activations from “2 layers ago”.
    • DenseNet: each layer can see all the values from many previous layers, which significantly reduces vanishing gradients.
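
A tiny numpy sketch of why stacked sigmoids vanish (illustrative scalar chain with unit weights): the chain rule multiplies together sigmoid derivatives, each at most 0.25, so the gradient shrinks geometrically with depth, while ReLU keeps a derivative of 1 on positive activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 10 stacked sigmoid "layers" (scalar input, weights fixed at 1 for illustration)
z, grad = 2.0, 1.0
for layer in range(10):
    a = sigmoid(z)
    grad *= a * (1.0 - a)   # chain rule: multiply by sigmoid'(z) <= 0.25
    z = a
print(grad)                 # ~1e-7: the gradient has effectively vanished

# A ReLU chain on positive activations multiplies by 1 each layer, so the
# gradient passes through unchanged: d/dz max(0, z) = 1 for z > 0.
```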

Tuning tricks:

  • Bias variables.
  • Data standardization (centering, whitening).
  • Parameter initialization: “small but different”, standardizing within layers.
  • Step-size selection: “babysitting”, the Bottou trick.
  • Momentum: heavy-ball and Nesterov-style modifications (sketch after this list).
  • Step size for each coordinate: AdaGrad, RMSprop, Adam.
  • Rectified linear units (ReLU): replace sigmoid with max{0, h} to avoid gradients close to 0.
    • Makes the objective non-differentiable, but we now know SGD still converges in this setting.
  • Batch normalization: adaptive standardizing within layers.
    • Often allows sigmoid activations in deep networks.
  • Residual/skip connections: connect layers to multiple previous layers.
    • We now know that such connections make it more likely to converge to good minima.
  • Neural architecture search: try to cleverly search through the space of hyper-parameters.
    • This gets expensive!
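
One of the momentum variants above, heavy-ball, as a minimal sketch on a toy quadratic (the matrix and hyper-parameter values are illustrative):

```python
import numpy as np

# f(w) = 0.5 * w^T A w, a toy ill-conditioned quadratic
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w

w = np.array([1.0, 1.0])
v = np.zeros(2)
alpha, beta = 0.05, 0.9              # step size and momentum (illustrative values)
for t in range(200):
    v = beta * v - alpha * grad(w)   # accumulate a decaying sum of past gradients
    w = w + v                        # heavy-ball update
print(w)                             # approaches the minimizer at the origin
```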

Reduce overfitting tricks:

  • early stopping
  • dropout
  • implicit regularization of SGD
  • Weight decay: L2 or L1 regularization
    • Recent work shows this can introduce bad local optima.

Automatic differentiation (AD):

  • Input: code computing a function.
  • Output: code computing the gradient of that function (just code, not a formula).
  • AD basically writes every operation as an instance of the chain rule (see the sketch after this list).
  • Pros:
    • Get the gradient for the same cost as the function (a pro for deep neural nets, but a con if the gradient could be computed at lower cost than the function value).
  • Cons:
    • Lots of storage (all intermediate calculations are stored).
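
A tiny hand-rolled reverse-mode sketch of that idea (hypothetical helper, not from the course code): the forward pass stores every intermediate value, and the backward pass applies one chain-rule step per stored operation.

```python
import numpy as np

def f_and_grad(w, x):
    """f(w) = log(1 + exp(w * x)) and its derivative with respect to w."""
    # forward pass: compute and store every intermediate value
    a = w * x
    b = np.exp(a)
    c = 1.0 + b
    f = np.log(c)
    # backward pass: one chain-rule step per stored operation
    df_dc = 1.0 / c          # d log(c) / dc
    df_db = df_dc * 1.0      # dc/db = 1
    df_da = df_db * b        # db/da = exp(a) = b
    df_dw = df_da * x        # da/dw = x
    return f, df_dw

print(f_and_grad(0.5, 2.0))  # gradient equals x * sigmoid(w * x)
```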

Convolutional neural networks:

  • Include layers that apply several (learned) convolutions.
  • Significantly decreases number of parameters.
  • Achieves a degree of translation invariance.
  • Often combined with pooling operations like max pooling.
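
A rough parameter-count sketch of why convolutions help (the image size, filter size, and number of filters are assumed for illustration):

```python
# Fully connected layer vs. convolutional layer on an image input
d = 256 * 256                    # input pixels (assumed image size)
m = 64                           # hidden units / feature maps
fully_connected = d * m          # one weight per (pixel, hidden unit) pair
conv = 5 * 5 * m                 # m learned 5x5 filters, shared across all positions
print(fully_connected, conv)     # ~4.19 million vs 1,600 parameters
```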

Autoencoders (unsupervised)

Autoencoders try to make their output the same as the input

  • Usually have a bottleneck layer with dimension k < input d

  • First layers “encode” the input into bottleneck

  • Last layers “decode” the bottleneck into a (hopefully valid) input

  • Relationship to principal component analysis (PCA):

    • With squared error and linear network, equivalent to PCA
      • Size of bottleneck layer gives number of latent factors k in PCA
    • With non-linear transforms: a non-linear/deep generalization of PCA
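
A minimal linear-autoencoder sketch showing the bottleneck structure (random weights and illustrative sizes, no training loop):

```python
import numpy as np

d, k, n = 20, 3, 100                        # input dim, bottleneck dim (k < d), examples
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))

W_enc = 0.1 * rng.standard_normal((d, k))   # encoder: project input into the bottleneck
W_dec = 0.1 * rng.standard_normal((k, d))   # decoder: map bottleneck back to input space

Z = X @ W_enc                               # n x k latent codes
X_hat = Z @ W_dec                           # n x d reconstructions
loss = np.mean((X - X_hat) ** 2)            # squared reconstruction error
print(loss)
# Minimizing this (linear layers + squared error) recovers the same subspace as
# the top-k principal components, matching the PCA connection above.
```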

Encoding-Decoding for Multi-Label Classification

  • Number of parameters and cost is O(dm + mk) for k classes and m hidden units.
    • If we trained a separate network for each class, the number of parameters and cost would be O(kdm) (a separate ‘W’ for each class).
  • Might have multiple layers, convolution layers, and so on.
  • No need to have a “bottleneck” layer; “encoder”/“decoder” is just terminology.
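
A quick arithmetic check of those counts (d, m, k are assumed example sizes):

```python
d, m, k = 1000, 256, 20
shared = d * m + m * k           # one shared network with k output heads: O(dm + mk)
separate = k * d * m             # k independent networks, one per class: O(kdm)
print(shared, separate)          # 261120 vs 5120000
```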
