Show HN: The Hessian of tall-skinny networks is easy to invert

(github.com)

14 points | by rahimiali 2 hours ago

4 comments

Lerc 26 minutes ago
I am not a mathematician, but I do enough weird stuff that I encounter things referring to Hessians, yet I don't really know what they are, because everyone who writes about them does so in terms that assume the reader knows what they are.
Any hints? The Battenburg graphics of matrices?
- Nevermark 4 minutes ago
  In the context of optimizing the parameters of a model, the gradient consists of all the derivatives of the output being optimized (i.e. the total error measure) with respect to each of the model's parameters.
  This is in effect a simplified version of the model, linearized around its current parameter values, making it easy to see which direction to take a small step in to move the ultimate output in the desired direction.
  The Hessian consists of all second-order derivatives, i.e. not just the slope but the curvature of the model around the current parameter values.
  Calculating all the first- and second-order derivatives takes more computation, but gives more information about which direction to take a learning step in: not only do we know how the output will respond linearly to a small parameter change, but also whether larger changes will produce higher or lower than linear responses.
  This can allow for much larger parameter changes with large output improvements, speeding up training considerably per training step.
  But the trade-off is that each learning step requires far more derivative calculations and memory. So a conducive model architecture and clever tricks are needed to make the Hessian worth using.
  --
  Another derivative type is the Jacobian, which is the derivative of every individual output (i.e. all those numbers we normally think of as the outputs, not just the final error measure) with respect to every parameter.
  Jacobians can become enormous matrices. For billions of parameters, on billions of examples, with hundreds of output elements, we would get a Jacobian of billions (parameters) by hundreds of billions (outputs times examples) of derivatives. So calculating the Jacobian can take enormous amounts of extra computation and memory. But there are still occasions (far fewer) when using it can radically speed up training.
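  To make the gradient/Hessian distinction concrete, here is a minimal sketch in plain Python. The two-parameter `loss`, the finite-difference helpers, and the 0.05 learning rate are all assumptions chosen for illustration, not anything from the linked project. It compares one plain gradient step with one Newton step (the gradient rescaled by the inverse Hessian).

```python
# Hypothetical two-parameter "error measure", chosen by hand for illustration.
def loss(w):
    x, y = w
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2 + x * y


def gradient(f, w, eps=1e-5):
    """Central-difference estimate of all first derivatives of f at w."""
    g = []
    for i in range(len(w)):
        up, dn = list(w), list(w)
        up[i] += eps
        dn[i] -= eps
        g.append((f(up) - f(dn)) / (2 * eps))
    return g


def hessian(f, w, eps=1e-4):
    """Finite-difference estimate of all second derivatives of f at w."""
    n = len(w)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        up, dn = list(w), list(w)
        up[i] += eps
        dn[i] -= eps
        g_up, g_dn = gradient(f, up), gradient(f, dn)
        for j in range(n):
            H[i][j] = (g_up[j] - g_dn[j]) / (2 * eps)
    return H


def solve_2x2(H, g):
    """Solve H s = g for a 2x2 system with Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(g[0] * H[1][1] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - g[0] * H[1][0]) / det]


w0 = [0.0, 0.0]
g = gradient(loss, w0)   # slope: which way is downhill
H = hessian(loss, w0)    # curvature: how fast the slope itself changes

# Plain gradient step: the same step size is used in every direction.
w_gd = [w0[0] - 0.05 * g[0], w0[1] - 0.05 * g[1]]

# Newton step: solve H s = g, so steep directions get shorter steps.
s = solve_2x2(H, g)
w_newton = [w0[0] - s[0], w0[1] - s[1]]

print("loss at start:      ", loss(w0))
print("after gradient step:", loss(w_gd))
print("after Newton step:  ", loss(w_newton))
```

  Because this toy loss is exactly quadratic, the single Newton step lands on the minimum, while the gradient step only gets part of the way there. Real models are not quadratic, so the gain per step is usually smaller.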
- stevenae 5 minutes ago
  This helped me, coming from an ml background: https://randomrealizations.com/posts/xgboost-explained/
MontyCarloHall 1 hour ago
>If the Hessian-vector product is Hv for some fixed vector v, we're interested in solving Hx=v for x. The hope is to soon use this as a preconditioner to speed up stochastic gradient descent.
Silly question, but if you have some clever way to compute the inverse Hessian, why not go all the way and use it for Newton's method, rather than as a preconditioner for SGD?
- rahimiali 1 hour ago
  Good q. The method computes Hessian-inverse on a batch. When people say "Newton's method" they're often thinking H^{-1} g, where both the Hessian and the gradient g are on the full dataset. I thought saying "preconditioner" instead of "Newton's method" would make it clear this is solving H^{-1} g on a batch, not on the full dataset.
  - hodgehog11 45 minutes ago
    Just a heads up in case you didn't know, taking the Hessian over batches is indeed referred to as Stochastic Newton, and methods of this kind have been studied for quite some time. Inverting the Hessian is often done with CG, which tends to work pretty well. The only problem is that the Hessian is often not invertible so you need a regularizer (same as here I believe). Newton methods work at scale, but no-one with the resources to try them at scale seems to be aware of them.
    It's an interesting trick though, so I'd be curious to see how it compares to CG.
    [1] https://arxiv.org/abs/2204.09266 [2] https://arxiv.org/abs/1601.04737 [3] https://pytorch-minimize.readthedocs.io/en/latest/api/minimi...
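    As a rough sketch of the CG approach described above (not the method from the submission), the plain-Python code below solves the regularized system (H + damping·I) x = g using only Hessian-vector products. The 3x3 matrix `H`, the `hvp` helper, and the damping value are hypothetical stand-ins; in practice the Hessian-vector product would come from automatic differentiation on a batch, and the Hessian would never be formed explicitly.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(hvp, g, damping=1e-3, tol=1e-10, max_iter=100):
    """Solve (H + damping * I) x = g, touching H only through hvp(v) = H v."""
    x = [0.0] * len(g)
    r = list(g)            # residual g - A x, which equals g when x = 0
    p = list(r)            # current search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        # Apply the damped operator A = H + damping * I to p.
        Ap = [h + damping * pi for h, pi in zip(hvp(p), p)]
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Hypothetical Hessian: positive semidefinite but singular (last row is zero),
# so H x = g has no solution without the regularizer.
H = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 0.0]]


def hvp(v):
    """Stand-in Hessian-vector product; here just a dense matrix multiply."""
    return [dot(row, v) for row in H]


g = [1.0, 0.0, 1.0]
x = conjugate_gradient(hvp, g, damping=0.1)
print(x)
```

    The damping term is what makes the singular direction solvable: along the third coordinate the system reduces to 0.1 * x[2] = 1.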
    - semi-extrinsic 41 minutes ago
      For solving physics equations there are also Jacobian-free Newton-Krylov methods.
  - MontyCarloHall 1 hour ago
    I'd call it "Stochastic Newton's Method" then. :-)
    - rahimiali 57 minutes ago
      fair. thanks. i'll sleep on it and update the paper if it still sounds right tomorrow.
      probably my nomenclature bias is that i started this project as a way to find new preconditioners on deep nets.
jeffjeffbear 1 hour ago
I haven't looked into it in years, but would the inverse of a block bi-diagonal matrix have some semiseparable structure? Maybe that would be good to look into?
- rahimiali 59 minutes ago
  just to be clear, semiseparate in this context means H = D + CC', where D is block diagonal and C is tall & skinny?
  If so, it would be nice if this were the case, because you could then just use the Woodbury formula to invert H. But I don't think such a decomposition exists. I tried to exhaustively search through all the decompositions of H that involved one dummy variable (of which the above is a special case) and I couldn't find one. I ended up having to introduce two dummy variables instead.
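  For reference, the Woodbury identity specialized to the hypothetical decomposition H = D + CC' is written out below. It assumes D is invertible, C is n x k with k much smaller than n, and the small k x k matrix in the middle is invertible; none of this implies that such a decomposition exists for the Hessian in question.

```latex
\[
  (D + C C^{\top})^{-1}
  = D^{-1} - D^{-1} C \,\bigl(I_k + C^{\top} D^{-1} C\bigr)^{-1} C^{\top} D^{-1}
\]
```

  The appeal is that D^{-1} can be applied block by block, and the only dense solve left is the k x k system.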
  - jeffjeffbear 52 minutes ago
    > just to be clear, semiseparate in this context means H = D + CC', where D is block diagonal and C is tall & skinny?
    Not quite, it means any submatrix taken from the upper (lower) part of the matrix has some low rank. For example, a matrix is {3,4}-semiseparable if any submatrix taken from the lower triangular part has rank at most 3 and any submatrix taken from the upper triangular part has rank at most 4.
    The inverse of an upper bidiagonal matrix is {0,1}-semiseparable.
    There are a lot of fast algorithms if you know a matrix is semiseparable.
    edit: link https://people.cs.kuleuven.be/~raf.vandebril/homepage/public...
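    The {0,1} claim can be checked numerically on a small case. The sketch below is plain Python with made-up values for the diagonal `d` and superdiagonal `e`. It inverts an upper bidiagonal matrix by back substitution and then checks that every 2x2 minor lying inside the upper triangle vanishes (rank at most 1), while everything strictly below the diagonal is zero (rank 0).

```python
def invert_upper_bidiagonal(d, e):
    """Invert the upper bidiagonal matrix with diagonal d and superdiagonal e.

    Column j of the inverse solves B x = e_j by back substitution.
    """
    n = len(d)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        inv[j][j] = 1.0 / d[j]
        for i in range(j - 1, -1, -1):
            # Row i of B x = e_j for i < j: d[i]*x[i] + e[i]*x[i+1] = 0
            inv[i][j] = -e[i] * inv[i + 1][j] / d[i]
    return inv


def max_upper_minor(a):
    """Largest |2x2 minor| over submatrices inside the upper triangle.

    Rows i1 < i2 and columns j1 < j2 with i2 <= j1, so all four entries
    sit on or above the diagonal.
    """
    n = len(a)
    worst = 0.0
    for i1 in range(n):
        for i2 in range(i1 + 1, n):
            for j1 in range(i2, n):
                for j2 in range(j1 + 1, n):
                    minor = a[i1][j1] * a[i2][j2] - a[i1][j2] * a[i2][j1]
                    worst = max(worst, abs(minor))
    return worst


# Hypothetical entries, chosen only so the matrix is invertible.
d = [2.0, 3.0, 4.0, 5.0]   # diagonal
e = [1.0, -2.0, 0.5]       # superdiagonal

inv = invert_upper_bidiagonal(d, e)
print("largest upper 2x2 minor:", max_upper_minor(inv))
print("strictly lower part is zero:",
      all(inv[i][j] == 0.0 for i in range(len(d)) for j in range(i)))
```

    A region is rank at most 1 exactly when all of its 2x2 minors are zero, which is why checking the minors is enough here.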
    - rahimiali 46 minutes ago
      thanks for the explanation! sorry i had misread the AI summary on "semiseparable".
      i need to firm my intuition on this first before i can say anything clever, but i agree it's worth thinking about!
Swoerd 1 hour ago
[dead]