Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Does gradient descent really do well for deep learning when the gradient is computed with respect to the whole dataset? I assumed that the noise in SGD played an important role for escaping local minima.


There aren't really local minima in most deep networks. When you get into millions/billions of parameters, there will essentially always be some directions that point downwards. You have to get really really close to the true minimum for there to be no direction to go that improves the loss.

Incidentally this same phenomenon is IMO how evolution is able to build things like the eye. Naively you'd think that since you need so many parts arranged so well, it's impossible to find a step by step path where fitness goes up at every step, i.e. if you just have a retina with no lens or vice-versa, it doesn't work. However, due to the high dimensionality of DNA, there is essentially guaranteed to be a path with monotonically increasing fitness just because there are so many different possible paths from A to B in the high dimensional space that at least one is bound to work.

Now this isn't strictly true for every high dimensional system. You need to have a degree of symmetry or redundancy in the encoding for it to work. For example, in convolutional neural networks, you see this phenomenon where some filters get "stuck" in training, and those are local minima for that subspace. What happens though is that if one filter gets stuck, the network will just use another one that had a better initialization. This is why pruning works, lottery tickets, etc. Things like residual connections enhance this effect since you can even be stuck in a whole layer and the training process can just bypass it.

You see the same thing with life, where you could put a sequence for the same protein in different parts of the genome and it could still be produced, regardless of the position. There are also many different ways to encode the exact same protein, and many different possible proteins that will have the same shape in the critical areas. Life finds a way.


If that was the case we would be finding globally optimal solutions for complicated non-convex optimization problems.

The reality is different, you need to really explore the space to find the truly global optimal solution.

A better explanation is that for ml you don't want a globally optimal solution that overindexes on your training set. You want a suboptimal solution that might also fit an unseen data set.


That is what people thought until around 2018, but it was wrong. It turns out that deep learning optimization problems have many global optima. In fact, when the #parameters exceeds the #data, SGD reliably finds parameters that interpolate the training data with 0 loss. Surprisingly, most of these generalize well and overfitting is not a big problem.

In other words, deep learning is a very special nonconvex optimization problem. A lot of our old intuition about optimization for ML is invalid in the overparameterized regime.


Why DL generalizes well is still an open research problem AFAIK. I've read numerous papers that tries to argue one way or another why this works, and they are all interesting! One paper (that I found compelling, even though it didn't propose a thorough solution) showed that SGD successfully navigated around "bad" local minimas (with bad generalization) and ended up in a "good" local minima (that generalized well), and their interpretation was that due to the S in SGD, it will only find wide loss basins, and thus the conclusion was that solutions that generalize well seem to have wider basins (for some reason).

Another paper showed that networks trained on roughly the same dataset but initialized from different random initializations, had a symmetry relation in the loss landscape by a permutation of all the weights. You could always find a permutation that allowed you to then linearly interpolate between the two weight sets without climbing over a loss mountain. Also very interesting even if it wasn't directly related to generalization performance. It has potential applications in network merging I guess.


I have read this in several places and want to learn more. Do you have a reference handy?


[1] Was an empirical paper that inspired much theoretical follow-up.

[2] Is one such follow-up, and the references therein should point to many of the other key works in the years between.

[3] Introduces the neural tangent kernel (NTK), a theoretical tool used in much of this work. (Not everyone agrees that reliance on NTK is the right way towards long-term theoretical progress.)

[4] Is a more recent paper I haven't read yet that goes into more detail on interpolation. Its authors were well known in more "clean" parts of ML theory (e.g. bandits) and recently began studying deep learning.

---

[1] Understanding deep learning requires rethinking generalization. Zhang et al., arXiv, 2016. https://arxiv.org/abs/1611.03530

[2] Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization. Azizan et al., arXiv, 2019. https://arxiv.org/abs/1906.03830.

[3] Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Jacot et al., NeurIPS, 2018. https://proceedings.neurips.cc/paper/2018/hash/5a4be1fa34e62...

[4] A Universal Law of Robustness via Isoperimetry. Bubeck et al., NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/f197002b9a085...


Awesome, thank you!


This is something I saw a talk about a while ago. There are probably more recent papers on this topic, you might want to look browse the citations of this one

https://arxiv.org/abs/2003.00307


> There aren't really local minima in most deep networks.

How so?

If there are no local minima other than the global one there are convex optimization methods that are far faster than SGD or Adam. The only reason these methods exist is because deep networks is a non-convex optimization problem.


There are many nonconvex functions where every local minimum is global, and even many nonconvex functions with a unique local minimum (which is de facto global). Convex methods can fail on those.

The reason why GD and friends are a good choice in deep learning is that computing the gradient is cheap (and approximating it even more so). Every descent method relies on solving a subproblem of sorts, typically projecting the current iterate on a sublevel set of an approximation of the function, for some definition of projection. With GD, it's as cheap as it gets, just subtract a shrinked version of the gradient. Subproblems in other algorithms are a lot more expensive computationally, particularly at high dimensions. So more efficient as in requiring fewer function evaluations, yes, but at the cost of doing a lot more work at each step.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: