DeepMath Conference 2020 – Conference on the Mathematical Theory of DNN's (deepmath-conference.com)
139 points by wavelander on Nov 7, 2020 | 31 comments


"Deep" is such a good prefix for all sorts of Deep Learning, Deep Math, Deep Thinking, Deep Engineering etc. Wonder if the networks were originally called Thick neural networks, would the ML/AI revolution as we know it still happened?


Whilst I agree with the general sentiment, in this particular instance it has to do with the depth of the networks that could be trained efficiently thanks to hardware advances. LeNet was 7 layers deep, DanNet's 9, VGG's 13, GoogLeNet's 22, etc.

There is theory w.r.t. thick networks as well (e.g. the link to Gaussian processes requires infinite width).

Deep makes sense here.


Well, except that most neural networks are not deep: they have a very low number of layers, but each layer can be tremendously wide. This should have been called wide learning. But we could imagine learning algorithms that exploit depth more than width, so a more accurate name would take both dimensions into account: depth and width.

Note that this is orthogonal to sparsity vs. density.


The depth seems to matter more than the width, at least as long as the layers are sufficiently wide. In fact, in the limit that the layer becomes infinitely wide, you just end up with a Gaussian process. In practice a width of ~100--1000 is sufficient to get behavior that is pretty close to a Gaussian process, so in general doubling the width of a layer doesn't gain you all that much compared to using those parameters for an extra layer. The real representational power seems to come from increasing depth.
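
Here is a quick numerical sketch of that limit (my own illustration, not from the thread; the widths, input dimension, and sample count are arbitrary): for a fixed input, the output of a randomly initialized one-hidden-layer net, over re-draws of the weights, becomes Gaussian as the width grows, which shows up as the excess kurtosis going to zero.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(50)                     # one fixed input, dim 50

    for width in (2, 10, 100, 1000):
        outs = []
        for _ in range(10000):                      # redraw the whole network
            W1 = rng.standard_normal((width, 50)) / np.sqrt(50)
            w2 = rng.standard_normal(width) / np.sqrt(width)
            outs.append(w2 @ np.tanh(W1 @ x))
        z = np.array(outs)
        z = (z - z.mean()) / z.std()
        print(f"width={width:5d}  excess kurtosis={np.mean(z**4) - 3:+.3f}")

The excess kurtosis should start clearly positive at width 2 (the output is a scale mixture of Gaussians there) and shrink toward 0, the Gaussian value, as the layer gets wider.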


Around the time the phrase "deep learning" came into vogue, the advances were indeed in training deeper networks, not wider. Later on it turned out that shallow wide networks are sufficient for many problems. (Also, it turned out the pre-training tricks that people came up with for training deep networks weren't really necessary either.)


It's also important to note that they work despite being wide: you can see that in the effectiveness of pruning, and in ideas such as the lottery ticket hypothesis, which states that "successful" sub-networks within the wide network account for most of the performance.

In the theory literature, if you have a K-deep network, K=1 is the shallow case and K>1 is deep. Agreed the naming could be better, but it's not like "deep work" or "deep thoughts," as the parent was suggesting.


The adjective "deep" came from deep belief networks, which are a variation on restricted Boltzmann machines. RBMs have one visible and one hidden layer; DBNs have more hidden layers - hence "deep". So it's not exactly based on a distinction between "deep" and "shallow" models.


I dunno, in the ResNet age, many and perhaps most networks are 20+ layers. I feel like the shallowest networks I see these days are RNNs being used for fast on-device ML, which tend not to be terribly wide due to the same hardware constraints.


Of all the adjectives available (deep, dense, thick, big, condensed, etc.), "deep" is definitely the one that brings the most hype - it makes things sound very advanced, and more marketable.

Deep Learning with Big Data

That stuff sells itself.


If thick neural networks did indeed exist, they would probably be a special case of something Schmidhuber invented 30 years ago.


> If thick neural networks did indeed exist, they would probably be a special case of something Schmidhuber invented 30 years ago.

Heh. A burn both pointed and subtle. A+.


it's the new e-


> ....Thick neural networks, would the ML/AI revolution as we know it still have happened?

Cancel culture will ensure that the conference doesn't happen, in the name of fat shaming.


Finger on the pulse, my man - people HATE being called thick these days.


Is there any good reason why a fully-connected network needs more than one hidden layer? Theoretically, any non-linear function can be approximated by an FCN with only one hidden layer. Does "deep" have anything to do with FCNs, or only with CNNs?


IIRC the number of hidden units required scales exponentially in the desired approximation error.
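
For reference (hedged, from memory rather than from the thread): the standard worst-case picture for one hidden layer approximating a generic Lipschitz function on the unit cube is roughly

    \sup_{x \in [0,1]^d} |f(x) - f_N(x)| \le \epsilon    needs    N = O(\epsilon^{-d})

hidden units, i.e. exponential in the input dimension d. For restricted function classes (Barron 1993) it drops to N = O(1/\epsilon^2), independent of d.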


> Is there any good reason why a fully-connected network needs more than one hidden layer?

A sufficiently (infinitely?) wide shallow fcnn can work in theory, but is basically impossible to train[0].

It might be feasible to transform or 'compile' a trained deep network into a shallow & wide one, but I'm not sure there would be any benefits, absent sufficiently wide parallel hardware[1].

[0] Then again, deep networks were impossible to train for quite a while too, due to the exploding gradient problem.

[1] Although, the Cerebras Wafer-Scale chip does exist. Hmm.
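
Roughly what that 'compiling' could look like, if anyone is curious - a distillation-style sketch where a wide shallow student regresses onto a deep teacher's outputs. All of the sizes, the optimizer settings, and the random-input training loop are invented for illustration, not taken from any particular paper:

    import torch
    import torch.nn as nn

    deep_teacher = nn.Sequential(            # pretend this is already trained
        nn.Linear(784, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 10),
    )

    wide_student = nn.Sequential(            # one very wide hidden layer
        nn.Linear(784, 8192), nn.ReLU(),
        nn.Linear(8192, 10),
    )

    opt = torch.optim.Adam(wide_student.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()                   # regress onto the teacher's logits

    for _ in range(1000):                    # unlabeled inputs are enough here
        x = torch.randn(128, 784)
        with torch.no_grad():
            target = deep_teacher(x)
        loss = loss_fn(wide_student(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

Whether the shallow copy ends up faster in practice would depend entirely on how well the hardware likes one huge matmul versus several small ones.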


If you have only a single layer, aren’t you limiting the functions you can approximate to linear combinations of your activation function on the inputs? With deep networks you can take functions of linear combinations of functions of linear combinations of...
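
Spelling that out (my paraphrase): with one hidden layer you can only represent sums of the form

    f(x) = \sum_{i=1}^{N} a_i \, \sigma(w_i \cdot x + b_i)

whereas a deep net composes them,

    f(x) = W_L \, \sigma( W_{L-1} \, \sigma( \cdots \sigma(W_1 x + b_1) \cdots ) + b_{L-1} ) + b_L

The universal approximation theorem says the first family is already dense in the continuous functions, so the real question is efficiency, not expressibility.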


This is like a workshop at a regular conference, with no proceedings, right?


As soon as I saw the word Deep, I stopped reading.

Nnets have always been multi-layer, since they were invented. That's the whole idea of progressive feature extraction, and the analogy with the biological brain. Theoreticians referred to them properly as nnets or multi-layer nnets. Later, experimentalists simulated them, thanks to the availability of computing resources, and experimentally verified that a multi-layer nnet can be more efficient than a single-layer one. They added superficial terms like "deep," "AI," "singularity," etc., which the media and tech industry amplified for obvious reasons.


That’s not wrong but does misrepresent history.

For many years a 3-layer network (1 input, 1 hidden, 1 output) was the standard and considered sufficient, thanks to several relevant but non-constructive approximation theorems (mostly one due to Kolmogorov and one due to Cybenko, if I am not mistaken).

They were also considered practically required because everyone was using sigmoids which have a vanishing gradient problem.

Several things were needed to break away to where we are now: unbounded activation functions (like ReLU) to avoid vanishing gradients; a lot more layers; a lot more parameters and compute power.
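
A back-of-the-envelope version of the sigmoid problem, for anyone following along:

    \sigma(z) = 1/(1 + e^{-z}),    \sigma'(z) = \sigma(z)(1 - \sigma(z)) \le 1/4

Backprop through L sigmoid layers multiplies L such factors (together with the weight matrices), so unless the weights compensate, the gradient reaching the early layers shrinks roughly like (1/4)^L. ReLU has derivative exactly 1 on its active half, which sidesteps that particular shrinkage.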

When Schmidhuber (and later Hinton) showed that many-layer nets work well, that was non-trivial and almost revolutionary.

It is now trivial and all nets are “deep”. But that wasn’t the case when the breakthroughs were made, and the terminology stuck.


Take a look at the papers of McCulloch and Pitts in the '40s, and of Hubel and Wiesel in the late '50s. If I recall correctly, you can see diagrams of collections of layered neurons. The architecture of a layer of simple cells and a layer of complex cells modeling the visual cortex is a layered architecture. The multi-layer perceptron was proposed right after the perceptron. That was biological inspiration, and the analogy with the layered, "deep" brain. Fukushima revisited this layered architecture in the context of convolutional nnets.

Cybenko, Hornik, etc. studied the mathematical properties of multi-layer feed-forward nnets in the late '80s.

Obviously, in the field of systems control, equivalents of "deep learning" and "reinforcement learning" have been studied since the '50s. This includes what's now called the "backpropagation algorithm."

It was all "multi-layer nnets" until they were actually simulated.


Indeed, this is all true. But do remember that the Kolmogorov-Arnold theorem says a 3-layer (n:2n+1:m) network is a universal continuous approximator (using an unknown neuron transfer function) -- people in the '80s were looking at 3-layer networks as sufficient partly because of that.
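
For anyone following along, the statement being referenced is, roughly: any continuous f on the unit cube can be written exactly as

    f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\Big( \sum_{p=1}^{n} \phi_{q,p}(x_p) \Big)

with continuous univariate functions \Phi_q and \phi_{q,p}; the 2n+1 inner sums are where the n:2n+1:m shape comes from. If I recall correctly, Sprecher's refinement replaces the inner functions with shifted copies of a single one.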

I have no time to go look at all those sources now, but having dabbled in nets since the late '80s myself, I remember vanishing gradients were sort of a surprise, because everyone was under the impression that simple backpropagation should just work, and it didn't.

A lot of that early work you refer to was also mostly about linear transfer functions, and though the exact type of non-linearity doesn't matter, some of its properties do - and as I mentioned, sigmoids - which were all the rage in the '80s - are a dead end with the wrong kind of nonlinearity.

Nothing about the *structure* of multilayer models is new. But successfully training them - which didn't happen until Schmidhuber and Hinton (depends on who you ask ...) - is relatively new; and that advance is responsible for the term "deep learning".

We do not disagree about the details; but we do seem to disagree about the historical context and narrative.


You are talking about Sprecher's modification to the original Kolmogorov-Arnold theorem, right? This version, and its implications, have been a lingering question for me for quite a while. Are you aware of any research on 3-layer networks where the unknown transfer function is also learnable? I suspect such an approach does not result in good models (otherwise we would have known about them!), but I cannot articulate why. Where exactly does the K-A reasoning fail when we try to apply it in practice?
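
Just to make concrete the kind of thing I'm imagining (a toy sketch of mine, not something from the literature; the grid size, range, and shapes are arbitrary): parameterize each univariate transfer function by its values on a fixed grid, linearly interpolated, and drop 2n+1 of them into the Kolmogorov-Arnold shape so everything trains by backprop.

    import torch
    import torch.nn as nn

    class LearnableTransfer(nn.Module):
        """A univariate function given by learnable values on a fixed grid."""
        def __init__(self, knots: int = 64, lo: float = -3.0, hi: float = 3.0):
            super().__init__()
            self.lo, self.hi = lo, hi
            self.register_buffer("grid", torch.linspace(lo, hi, knots))
            self.values = nn.Parameter(0.1 * torch.randn(knots))

        def forward(self, x):                # x: any shape
            x = x.clamp(self.lo, self.hi)
            idx = torch.searchsorted(self.grid, x).clamp(1, self.grid.numel() - 1)
            x0, x1 = self.grid[idx - 1], self.grid[idx]
            y0, y1 = self.values[idx - 1], self.values[idx]
            return y0 + (x - x0) / (x1 - x0) * (y1 - y0)   # linear interpolation

    # The n -> 2n+1 -> 1 Kolmogorov-Arnold shape, every function learnable.
    n = 4
    inner = nn.ModuleList(nn.ModuleList(LearnableTransfer() for _ in range(n))
                          for _ in range(2 * n + 1))
    outer = nn.ModuleList(LearnableTransfer() for _ in range(2 * n + 1))

    def ka_forward(x):                       # x: (batch, n)
        return sum(Phi(sum(phi(x[:, p]) for p, phi in enumerate(phis)))
                   for phis, Phi in zip(inner, outer))

My suspicion is that the theorem's inner functions are wildly non-smooth, so a smooth, learnable stand-in like this loses exactly the property the proof relies on - but that's a guess, hence the question.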


Yes. I haven't touched this since 1995, so I had to refresh my memory. I was indeed talking about Sprecher's modification. Back when I studied this, the proofs I found were not constructive.

I was unaware, but apparently Griebel gave a constructive proof in 2009 (linked from the Wikipedia article about the K-A representation theorem). I would have to read it, and hope I am not too rusty to understand it, before I could really ponder your question...

But I could offer two places I would have looked:

1. The approximation is of a continuous function, and such approximations (e.g. Chebyshev, Bernstein) usually require that you be able to sample the function at specific points - but learning usually gives you training data that does not correspond to those specific points. It's possible that the construction fails here somehow.

2. The approximation is too hard in practice. This is too often the case for Breiman's beautiful ACE (Alternating Conditional Expectations), which, if you squint hard enough, looks like a two-layer network where each neuron has its own transfer function. The algorithm is incredibly simple in theory, but very hard to use in practice.
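
For a flavor of how simple the ACE idea is in theory, here is a bare-bones single-predictor sketch (mine, and deliberately crude: the bin-average stands in for the scatterplot smoother the real algorithm uses, and the bin count, sample size, and test function are arbitrary). It alternates between estimating phi(x) = E[theta(y) | x] and theta(y) = E[phi(x) | y], renormalizing theta each pass:

    import numpy as np

    def bin_smooth(target, given, bins=20):
        """Crude estimate of E[target | given] by averaging within quantile bins."""
        edges = np.quantile(given, np.linspace(0, 1, bins + 1))
        idx = np.clip(np.digitize(given, edges[1:-1]), 0, bins - 1)
        means = np.array([target[idx == b].mean() for b in range(bins)])
        return means[idx]

    def ace_1d(x, y, iters=20):
        theta = (y - y.mean()) / y.std()
        for _ in range(iters):
            phi = bin_smooth(theta, x)                    # phi(x)   = E[theta(y) | x]
            theta = bin_smooth(phi, y)                    # theta(y) = E[phi(x)  | y]
            theta = (theta - theta.mean()) / theta.std()  # keep Var(theta) = 1
        return theta, phi

    # e.g. y = exp(x) + noise: ACE should roughly recover theta ~ log y, phi ~ x
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 2, 2000)
    y = np.exp(x) + 0.1 * rng.standard_normal(2000)
    theta, phi = ace_1d(x, y)
    print(np.corrcoef(theta, phi)[0, 1])

The "very hard to use" part shows up as soon as the smoother, the data, or the number of predictors stops being this friendly.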


You are right that deep networks were always there. But university textbooks dismissed them, and the common wisdom was that they were not necessary.


Forgive me, but that seems like an excessively pedantic, outright dismissal of what I see as a presentation of interesting results in the field. Whatever you want to call it—deep learning, multilayer nnets, whatever—it's the same thing, and the results are exciting. Also, I believe there is a difference between deep learning and multilayer nnets (though I can't quite articulate it), but throwing in things like "singularity" is a bit much, as it's not really related at all to the math results being presented.


While in practice this is true, a lot of work is being done to try to figure out how to make shallow neural networks train as fast as deep ones. We know from the UAT that, in principle, a shallow network with a parameter count similar to a deep one's should be capable of learning the same decision boundary...

https://en.m.wikipedia.org/wiki/Universal_approximation_theo...
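
The statement there, roughly in Cybenko's form: for any continuous sigmoidal \sigma, finite sums

    G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(w_j \cdot x + \theta_j)

are dense in C([0,1]^n). So one hidden layer suffices for approximation - but note the theorem puts no bound on N, which is where the practical trouble starts.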


It depends on the function that you are trying to approximate.

I can give you functions that a shallow nnet approximates better, and functions that deep nets approximate exponentially better even with just one more layer (in terms of the number of neurons n). In the limit n -> \infty, both reach arbitrarily small error (though often with very different numbers of parameters).
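
A concrete toy example of the second kind of gap, from memory (so treat the details as approximate): on [0,1], the tent map

    T(x) = 2\,\mathrm{ReLU}(x) - 4\,\mathrm{ReLU}(x - 1/2)

composed with itself k times is a sawtooth with 2^k linear pieces. A depth-k ReLU net with 2 units per layer computes it exactly, while a single-hidden-layer ReLU net with m units is piecewise linear with at most about m+1 pieces in 1-D, so it needs on the order of 2^k units just to match the shape.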


Is there a particular theorem that you're invoking here?


"Multi-layer" does not convey how many layers (3? 12? 100?), while "deep" does, somewhat. If someone is talking about networks with a certain large-ish number of layers, then "deep" seems a better shorthand to use.



