Suman Suhag:
I’m delighted to see high-caliber mathematicians and theoretical physicists getting interested in the theory behind deep learning.

One theoretical puzzle is why the type of non-convex optimization that needs to be done when training deep neural nets seems to work reliably. A naive intuition would suggest that optimizing a non-convex function is difficult because we can get trapped in local minima and get slowed down by plateaus and saddle points. While plateaus and saddle points can be a problem, local minima never seem to cause problems. Our intuition is wrong because we picture an energy landscape in low dimension (e.g. 2 or 3). But the objective function of deep neural nets often lives in 100 million dimensions or more. It’s hard to build a box in 100 million dimensions. That’s a lot of walls. There is a body of theoretical work in this direction from my NYU lab (look for Anna Choromanska as first author) and from Yoshua Bengio’s lab. It uses mathematical tools from random matrix theory and statistical mechanics.
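The random-matrix intuition behind this can be sketched numerically. Assuming, as a toy model, that the Hessian at a critical point looks like a random symmetric (GOE-style) matrix, a local minimum requires every eigenvalue to be positive, and the chance of that shrinks rapidly with dimension. This is only an illustration of the argument, not the actual analysis in the papers mentioned above; `fraction_minima` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_minima(dim, trials=2000):
    """Sample random symmetric 'Hessians' and count how often all
    eigenvalues are positive, i.e. how often a random critical point
    would be a local minimum rather than a saddle."""
    minima = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        h = (a + a.T) / 2  # symmetrize to get a valid Hessian candidate
        if np.all(np.linalg.eigvalsh(h) > 0):
            minima += 1
    return minima / trials

# In 1 dimension about half of all critical points are minima;
# by 10 dimensions essentially all of them are saddles.
for d in (1, 2, 5, 10):
    print(d, fraction_minima(d))
```

The qualitative picture matches the intuition in the paragraph above: as the number of "walls" grows, almost every critical point has at least one escape direction.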

Another interesting theoretical question is why multiple layers help. All boolean functions of a finite number of bits can be implemented with 2 layers (using the conjunctive or disjunctive normal form of the function). But the vast majority of boolean functions require an exponential number of minterms in those formulas (i.e. an exponential number of hidden units in a 2-layer neural net). As computer programmers, we all know that many functions become simple if we allow ourselves to run multiple sequential steps to compute the function (multiple layers of computation). That’s a hand-wavy argument for having multiple layers. It’s not clear how to make a more formal argument in the context of neural net-like architectures.
