Your naive understanding is supported by at least one deep learning authority:
> I haven’t found a way to properly articulate this yet but somehow everything we do in deep learning is memorization (interpolation, pattern recognition, etc) instead of thinking (extrapolation, induction, etc). I haven’t seen a single compelling example of a neural network that I would say “thinks”, in a very abstract and hard-to-define feeling of what properties that would have and what that would look like.
> All the while I'm thinking: this thinking process this person goes through as he analyzes this data: THAT is what Machine Learning SHOULD do
-- Andrej Karpathy
Deep learning for image recognition works because our visual world is made up of structured hierarchical features: Dark/Light, Texture, Edge, Part of Object, Object, Scene. Deep learning layers create increasingly higher-level features in a computationally feasible way.
I personally prefer 'generic hashing/parsing'; deep learning excels at the automatic creation of a mapping of unstructured information to structured information, after a sufficient period of training.
Hmm... but isn't that what our brains do as well? Unstructured intensities of light hitting our retinas become a structured, recognized object.
It definitely seems to be part of what our brain does. The visual cortex is an apt comparison since that's where a lot of the structural inspiration for modern ANNs comes from. But, there does seem to be a little more than that too; it's not clear whether all the brain does is reducible to a hash function (reducible in any useful sense, at least; a very very very big, very very very sparse hash function, perhaps).
Our brain can understand that a cartoon picture of a cat is a cat. Our brain can also understand that a picture of a cat taken from a hugely different angle than any seen before is a cat. Deep learning cannot do those kinds of tricks.
There's quite probably some of that. A quote from J.S. Mill on the distinction between science and technology strikes me as useful:
"One of the strongest reasons for drawing the line of separation clearly and broadly between science and art is the following:—That the principle of classification in science most conveniently follows the classification of causes, while arts must necessarily be classified according to the classification of the effects, the production of which is their appropriate end."
-- J.S. Mill, Essays on Some Unsettled Questions of Political Economy
What are your thoughts on newer recurrent architectures like the DNC (or its predecessor, the neural Turing machine)? While the demonstrated results with DNCs so far are pretty limited, it seems that they embody a push towards allowing a neural network to actually "think" over multiple steps: storing complex information, formulating a plan, and acting on that plan.
Yes. I think these architectures are very exciting and a step in the "right" direction. Eventually we will want to move from rote memorization and pattern matching to more challenging aspects of intelligence.
As much as I dislike calling on the neural net / biological net metaphor, I do think that computer science has made some headway in how "useful codes", in the sense of semantically-meaningful interpolation, can be derived from natural scene stimuli, and therefore the onus that "we do something different" is to some extent now on the neuroscientists to think about and try to prove that "reasoning" in the human sense is anything other than an algebra of latent codes, i.e., linear or non-linear combinations of codified summaries of sensory input.
Geoff Hinton refers to thought vectors performing reasoning by analogy using algebra [1] in his Royal Society Lecture.
The other widely reported vector algebra in a semantic space was discovered by Mikolov et al. when producing ~300-dimensional vectors for a billion-word Wikipedia corpus.
Using Mikolov's vectors [3], and reading ~= as "nearest by cosine distance", vector algebra gives:
King - Man + Woman ~= Queen
Paris - France + Germany ~= Berlin
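To make the arithmetic concrete, here is a minimal sketch of the analogy lookup. The 3-d vectors are hand-made for illustration (the dimensions loosely stand for royalty, maleness, femaleness) and are not taken from word2vec; real vectors are learned and have ~300 dimensions. The `analogy` helper is a hypothetical name, not part of any library.

```python
import math

# Hypothetical toy vectors: (royalty, maleness, femaleness).
# These are assumptions for illustration, not learned embeddings.
VECTORS = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c, vectors=VECTORS):
    """Return the word whose vector is nearest (by cosine) to a - b + c.

    The three query words are excluded from the candidates, which is the
    usual convention when evaluating word analogies.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # prints "queen"
```

With learned embeddings the match is only approximate, which is why the result is read as "nearest neighbour" rather than equality.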
Surprisingly, this works for other modalities too. Radford, Metz & Chintala found a latent semantic space in images, in which adding a vector for glasses or a smile changes people's faces accordingly [4]. With a generative model, new images can be created, as outlined in this blog post by Soumith [5].
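A rough sketch of that latent-space arithmetic, under stated assumptions: the 3-d latent codes below are made up, the helper names `mean_vector` and `add_attribute` are hypothetical, and no generator is included, so the result is only the edited code that a trained generative model would then decode into an image. Averaging a few exemplar codes per concept is one way to get a steadier attribute direction than using a single pair.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length latent codes."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def add_attribute(base, with_attr, without_attr, strength=1.0):
    """Shift `base` along the direction mean(with_attr) - mean(without_attr)."""
    plus = mean_vector(with_attr)
    minus = mean_vector(without_attr)
    return [b + strength * (p - m) for b, p, m in zip(base, plus, minus)]

# Made-up latent codes: faces that smile vs. faces that do not.
smiling = [[0.5, 1.0, 0.0], [0.0, 1.0, 0.5]]
neutral = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.5]]

# A new neutral face; the edited code would be fed to a generator.
new_face = [0.75, 0.0, 0.25]
print(add_attribute(new_face, smiling, neutral))  # prints [0.75, 1.0, 0.25]
```

Only the second coordinate moves here because the toy exemplars differ only in that coordinate; in a learned latent space the attribute direction is generally spread across many dimensions.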
Karpathy shows that trained nets can be assembled like Lego across modalities: slice the classifier off an ImageNet-trained AlexNet to reveal the rich semantic 'thought vector' layer, plug in an RNN sentence generator using word2vec, and (with some oversimplification...) you get a convincing image captioner [6].
The thought vectors are akin to high-level representations of the world and can cross modalities. One example is text-to-image generation using thought vectors (from an HN discussion [7]).
So the vectors of thought are in some way an AI mentalese, or an encoding of a symbolic representation of the world derived from the data, and can (again, a drastic oversimplification) transfer across modalities and even between previously unlinked languages [8].
[2] The paper Geoff Hinton is referring to: Sequence to Sequence Learning with Neural Networks by Ilya Sutskever, Oriol Vinyals, Quoc V. Le https://arxiv.org/abs/1409.3215
[3] Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean https://arxiv.org/abs/1301.3781
[4] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks by Alec Radford, Luke Metz, Soumith Chintala https://arxiv.org/abs/1511.06434