
We've come to the consensus that large language models are just stochastic parrots... What makes us think that we can achieve a higher level of intelligence by putting them in conversation?

I think the next step in NLP will be a drastic innovation beyond today's learning models.



This is not the consensus among ML researchers. Transformers are showing strong generalisation[1] and their performance continues to surprise us as they scale[2].

The Socratic paper is not about “higher intelligence”, it’s about demonstrating useful behaviour purely by connecting several large models via language.

[1] https://arxiv.org/abs/2201.02177

[2] https://arxiv.org/abs/2204.02311


There is no such consensus. Transformers navigate problem spaces with various mechanisms that include recursion, and multi-pass inference means the depth can be arbitrary. This means that models pick up on the functions that generate answers, not simple statistical relationships you see in Markov chains.

"Stochastic parrot" is a derogatory term and I've never seen anyone who actually understands the technology use that phrase unironically. If anything, it's a shibboleth for bias or ignorance.


I asked something similar on HN previously, and a researcher in the field said that scaling size/computation actually does keep showing significant improvements.


We have not come to that consensus, and large language models display really interesting capabilities like few-shot learning, which we previously thought would require a wildly different architecture.


> We've come to the consensus that large language models are just stochastic parrots

Anyone who thinks this REALLY doesn't know how language models work. A properly trained LM will only parrot something back because of a lack of diversity in the training data. This does happen in some cases (e.g., the GPL license text) but those are fairly rare cases.

People on HN seem to think this a lot, but they are just wrong.


It's especially true for ML in general on HN, but it's generally true for a lot of areas in the public: people often mistake skepticism for expertise or knowledge. I think the phenomenon is similar to the large crowd that cries "the sample is too small" any time statistics are brought up.

It's the first thing anyone learns, and it's easy to do.

It's really unfortunate, but that's why you see so many people on HN who dismiss new technologies in ML (especially in NLP, since everyone can understand the output; that's less true in e.g. protein folding).


> people often mistake skepticism for expertise or knowledge

This is a pretty good insight.

> that's why you see so many on HN that dismiss new technologies in ML (especially in NLP, since everyone can understand the output

I think also in NLP people see output that is the same as some training data, so they think it is copying. It takes a little more thought to realize that if you asked 100 experts to write code to sort an array in Python, their code would be very similar. That doesn't mean it was copied.
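To make the point concrete, here is a minimal sketch (the function name is my own illustration): nearly every Python programmer asked to sort an array would converge on essentially this code, because the idiomatic answer is so constrained.

```python
def sort_array(arr):
    # The answer almost everyone converges on independently:
    # use the built-in sorted(), which returns a new sorted list.
    return sorted(arr)

print(sort_array([3, 1, 2]))  # [1, 2, 3]
```

Identical output here reflects a shared convention, not copying, and the same logic applies to a model that reproduces the canonical phrasing of a common task.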


"Stochastic parrots" -- have you seen, e.g., the examples in the PaLM paper of how it does on "chained inference" tasks? I don't see how you can classify that as mere parroting.


"Stochastic parrots" is a disparaging term coined by SJW propaganda. As if the brain is not stochastic, or we don't parrot from cultural sources. Language models have been accused of bias and lack of explainability, but humans are biased too and can't really explain how we make decisions.

Overall, this term says "limited to the intelligence of a parrot," which is false: models can solve math and coding problems, generate passable art, translate and converse in hundreds of languages, and beat us at board and card games. When was a parrot able to do that?


The math the models are doing is similar to rote rule chaining rather than calculation. The errors they make look like kludged-together lookups. I wonder if you could sequence the training of a model so that you could reinforce calculation over lookup, to encourage the development of an accurate and advanced mathematics module.

Neural networks can do math, but a lookup-and-memorized-value model is structurally a lot different from a calculator model. The difference between them is a matter of weights for any given architecture. Tokenizing properly for math would help, but bit-level tokenizing would be best, because that would allow multimodal domains to integrate more readily (i.e. audio/video/text models could share learned features more easily than if you are using parsed or domain-specific tokens). It's a great time to be alive.
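A toy sketch of why tokenization granularity matters for arithmetic (both functions here are my own simplified illustrations, not any real tokenizer): a BPE-style tokenizer can merge digit runs into opaque chunks, hiding place value from the model, while character-level tokenization keeps every digit visible.

```python
def bpe_like(text, vocab):
    # Greedy longest-match tokenization against a toy merged vocabulary,
    # mimicking how subword tokenizers can fuse digit runs into single tokens.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

def char_level(text):
    # One token per character: every digit stays individually addressable.
    return list(text)

vocab = {"123", "45", "678", "+"}
print(bpe_like("12345+678", vocab))  # ['123', '45', '+', '678']
print(char_level("12345+678"))       # ['1', '2', '3', '4', '5', '+', '6', '7', '8']
```

With merged tokens the model must memorize facts about arbitrary chunks like "123"; with per-digit tokens it at least has the chance to learn carrying as a general rule.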


> it does on "chained inference" tasks

To me, it is more proof of "stochastic parrot" behavior: the model has seen most of the available math information on the internet, and even with significant computational power it can solve only 58% of elementary-school-level questions, probably those with clear examples in the training data, and it can't generalize beyond those.


In limited, often zero- or one-shot probing of the model, yes. Do multiple generations and recursive passes over the output, having the model select and iterate on a target, and the utility goes way up. You can coax great output from small models, even the 125M-parameter GPT-Neo.

The process kinda goes like this -

Think of ten answers to this question: blah blah blah

From these ten answers, which are the best 3?

Of the three answers, which is the best?

Revise and edit the best answer to be simpler or more understandable.
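The generate/select/refine loop above can be sketched roughly as follows. Note that `query_model` is a hypothetical placeholder of my own, standing in for any text-generation API; the prompts and the ranking step are assumptions about how one might wire this up, not a specific library's interface.

```python
def query_model(prompt, n=1):
    # Placeholder: a real implementation would call an LM API here
    # and return n sampled completions for the prompt.
    return [f"candidate {i} for: {prompt[:40]}" for i in range(n)]

def iterate_answer(question):
    # 1. Think of ten answers to the question.
    candidates = query_model(
        f"Think of ten answers to this question: {question}", n=10)
    # 2. From those ten answers, ask the model to pick the best 3.
    best_three = query_model(
        "From these ten answers, which are the best 3?\n" + "\n".join(candidates), n=3)
    # 3. Of the three, ask which single answer is best.
    best = query_model(
        "Of the three answers, which is the best?\n" + "\n".join(best_three), n=1)[0]
    # 4. Revise and edit the best answer to be simpler.
    return query_model(
        f"Revise and edit this answer to be simpler or more understandable: {best}",
        n=1)[0]

answer = iterate_answer("Why does iterating on model output help?")
```

Each pass trades more inference compute for quality, which is exactly the point being made: a single zero-shot sample undersells what the model can do.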

Prompt engineering is a nascent field, and we haven't seen nuanced or sophisticated use of the tool yet. Most of the metrics reported in papers are barely better than a naive Turing test. It doesn't take much introspection to know that even humans endlessly iterate and revise their output, and the best extemporaneous speech doesn't match well curated and edited material. It shouldn't surprise us that similar editing and revision processes will benefit transformer output.



