4 comments

  • Lerc 19 minutes ago
    I am not a mathematician, but I do enough weird stuff that I encounter things referring to Hessians, yet I don't really know what they are, because everyone who writes about them does so in terms that assume the reader already knows what they are.

    Any hints? The Battenburg graphics of matrices?

  • MontyCarloHall 1 hour ago
    >If the Hessian-vector product is Hv for some fixed vector v, we're interested in solving Hx=v for x. The hope is to soon use this as a preconditioner to speed up stochastic gradient descent.

    Silly question, but if you have some clever way to compute the inverse Hessian, why not go all the way and use it for Newton's method, rather than as a preconditioner for SGD?
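
    (Not from the post, just a minimal sketch of the primitive being quoted above: a Hessian-vector product Hv via double backprop in PyTorch, assuming a scalar loss and a single flat parameter tensor.)

      import torch

      def hvp(loss_fn, theta, v):
          # First backward pass: gradient g, keeping the graph so we can differentiate again.
          g = torch.autograd.grad(loss_fn(theta), theta, create_graph=True)[0]
          # Second backward pass: d(g.v)/dtheta = H @ v, without ever forming H.
          return torch.autograd.grad(g @ v, theta)[0]

      theta = torch.randn(5, requires_grad=True)
      v = torch.randn(5)
      print(hvp(lambda p: (p ** 4).sum(), theta, v))  # equals 12 * theta**2 * v for this loss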

    • rahimiali 56 minutes ago
      Good q. The method computes the Hessian inverse on a batch. When people say "Newton's method" they're often thinking of H^{-1} g, where both the Hessian and the gradient g are computed on the full dataset. I thought saying "preconditioner" instead of "Newton's method" would make it clear this computes H^{-1} g on a batch, not on the full dataset.
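
      (A toy sketch of the distinction, with made-up names and a damping term I'm assuming: a Newton step uses the full-data Hessian and gradient, while the preconditioned step applies a batch-estimated inverse Hessian to a batch gradient.)

        import numpy as np

        def newton_step(H_full, g_full, damping=1e-3):
            # Classic Newton: Hessian and gradient both over the full dataset.
            return np.linalg.solve(H_full + damping * np.eye(len(g_full)), g_full)

        def preconditioned_sgd_step(H_batch, g_batch, lr=0.1, damping=1e-3):
            # Preconditioner from a batch Hessian, applied to a batch gradient.
            return lr * np.linalg.solve(H_batch + damping * np.eye(len(g_batch)), g_batch)
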
      • hodgehog11 38 minutes ago
        Just a heads up in case you didn't know: taking the Hessian over batches is indeed referred to as Stochastic Newton, and methods of this kind have been studied for quite some time. Inverting the Hessian is often done with CG, which tends to work pretty well (rough sketch after the links below). The only problem is that the Hessian is often not invertible, so you need a regularizer (same as here, I believe). Newton methods work at scale, but no one with the resources to try them at scale seems to be aware of them.

        It's an interesting trick though, so I'd be curious to see how it compares to CG.

        [1] https://arxiv.org/abs/2204.09266 [2] https://arxiv.org/abs/1601.04737 [3] https://pytorch-minimize.readthedocs.io/en/latest/api/minimi...
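
        (Rough sketch of that CG route, mine rather than anything from the papers above: solve (H + lam*I) x = v matrix-free using only Hessian-vector products, with lam as the regularizer that keeps the operator positive definite.)

          import numpy as np
          from scipy.sparse.linalg import LinearOperator, cg

          def cg_inverse_hvp(hvp, v, lam=1e-3):
              n = len(v)
              # CG only needs matvecs, so H is never formed explicitly.
              A = LinearOperator((n, n), matvec=lambda x: hvp(x) + lam * x)
              x, info = cg(A, v)
              return x

          # Toy check with an explicit, singular "Hessian".
          H = np.diag([2.0, 1.0, 0.0])
          x = cg_inverse_hvp(lambda u: H @ u, np.ones(3))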

        • semi-extrinsic 34 minutes ago
          For solving physics equations, there are also Jacobian-free Newton-Krylov methods.
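
          (The core trick there, as I understand it: the Jacobian-vector product of the residual F is approximated by a finite difference, so the Krylov solver never needs J explicitly. A one-liner sketch:)

            import numpy as np

            def jv_finite_difference(F, u, v, eps=1e-7):
                # J(u) @ v  ~=  (F(u + eps*v) - F(u)) / eps
                return (F(u + eps * v) - F(u)) / eps
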
      • MontyCarloHall 54 minutes ago
        I'd call it "Stochastic Newton's Method" then. :-)
        • rahimiali 50 minutes ago
          fair. thanks. i'll sleep on it and update the paper if it still sounds right tomorrow.

          probably my nomenclature bias is that i started this project as a way to find new preconditioners on deep nets.

  • jeffjeffbear 59 minutes ago
    I haven't looked into it in years, but would the inverse of a block bidiagonal matrix have some semiseparable structure? Maybe that would be good to look into?
    • rahimiali 52 minutes ago
      just to be clear, semiseparable in this context means H = D + CC', where D is block diagonal and C is tall & skinny?

      If so, that would be nice, because you could then just use the Woodbury formula to invert H. But I don't think such a decomposition exists. I tried to exhaustively search through all the decompositions of H that involved one dummy variable (of which the above is a special case) and I couldn't find one. I ended up having to introduce two dummy variables instead.
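
      (To illustrate why that decomposition would have been convenient, here's a throwaway Woodbury sketch: if H = D + CC' with D block diagonal and C tall & skinny with k columns, solving Hx = v only needs solves against D plus one small k x k system.)

        import numpy as np

        def woodbury_solve(D, C, v):
            # (D + CC')^{-1} = D^{-1} - D^{-1} C (I + C' D^{-1} C)^{-1} C' D^{-1}
            Dinv_v = np.linalg.solve(D, v)
            Dinv_C = np.linalg.solve(D, C)
            small = np.eye(C.shape[1]) + C.T @ Dinv_C   # k x k, cheap when C is skinny
            return Dinv_v - Dinv_C @ np.linalg.solve(small, C.T @ Dinv_v)

        # Quick check against a dense solve.
        rng = np.random.default_rng(0)
        D = np.diag(rng.uniform(1.0, 2.0, 6)); C = rng.standard_normal((6, 2))
        v = rng.standard_normal(6)
        assert np.allclose(woodbury_solve(D, C, v), np.linalg.solve(D + C @ C.T, v))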

      • jeffjeffbear 45 minutes ago
        > just to be clear, semiseparable in this context means H = D + CC', where D is block diagonal and C is tall & skinny?

        Not quite. It means any submatrix taken from the upper (lower) triangular part of the matrix has low rank. For example, a matrix is {3,4}-semiseparable if any submatrix taken from the lower triangular part has rank at most 3 and any submatrix taken from the upper triangular part has rank at most 4.

        The inverse of an upper bidiagonal matrix is {0,1}-semiseparable (a quick numerical check is sketched after the link below).

        There are a lot of fast algorithms if you know a matrix is semiseparable.

        edit: link https://people.cs.kuleuven.be/~raf.vandebril/homepage/public...
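
        (Quick numerical check of the bidiagonal claim, my own toy rather than anything from the linked book: invert a random upper bidiagonal matrix and confirm that blocks lying in the upper triangular part have rank at most 1, while the strictly lower part is zero.)

          import numpy as np

          rng = np.random.default_rng(0)
          n = 6
          B = np.diag(rng.uniform(1.0, 2.0, n)) + np.diag(rng.standard_normal(n - 1), 1)
          Binv = np.linalg.inv(B)

          # Blocks entirely inside the upper triangle (max row index <= min col index).
          for rows, cols in [(slice(0, 3), slice(2, 6)), (slice(1, 4), slice(3, 6))]:
              assert np.linalg.matrix_rank(Binv[rows, cols]) <= 1
          # The strictly lower part is exactly zero, hence the "0" in {0,1}.
          assert np.allclose(np.tril(Binv, -1), 0)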

        • rahimiali 39 minutes ago
          thanks for the explanation! sorry i had misread the AI summary on "semiseparable".

          i need to firm up my intuition on this first before i can say anything clever, but i agree it's worth thinking about!
