Thank you for this. Technically it's not GPT-3, but GPT-NeoX-20B, although they ...

dang · on Feb 11, 2022

Ok, we've reverted the title now. Thanks!

(Submitted title was 'GPT-3's answers to arithmetic questions')

williamtrask · on Feb 11, 2022

Poor performance is more likely due to how transformer neural networks view numbers. It memorises them like words instead of modeling their numerical structure. Thus even if it’s seen the number 3456 and 3458, it knows nothing of 3457. Totally different embedding.

It’s like a kid memorising a multiplication table instead of learning the more general principle of multiplication (related: this illusion is why big models are so popular. Memorise more stuff.)

Paper (NeurIPS/DeepMind): https://arxiv.org/abs/1808.00508

Isinlor · on Feb 11, 2022

Take a look at this paper:

Deep Symbolic Regression for Recurrent Sequences https://arxiv.org/abs/2201.04600

If you look at embedding visualization it is very clear that the model learns order of numbers.

(Interactive demo: http://recur-env.eba-rm3fchmn.us-east-2.elasticbeanstalk.com... )

There is also:

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets https://arxiv.org/abs/2201.02177

Again, looking at visualizations the model very clearly grasps the structure of the function it models.

pfortuny · on Feb 11, 2022

Modulo 97 (the arxiv paper). That is what they do.

It is quite easy to grok operations modulo 97.

YeGoblynQueenne · on Feb 11, 2022

The "Deep Symbolic Regression" paper reports very poor generalisation results that break off after a small n (where n is the number of tokens in the predicted sequence). It works some of the time for n = 1 (predicts the next token) but accuracy drops off for n = 10. No results are reported for N > 10 as far as I can tell in the "Out of Domain Generalization" section (which is the meat and potatoes of the "generalization" claim).

tl;dr they can sometimes generalise to the next 1 to 10 tokens (digits or operators), but no more.

This kind of short-term "generalisation" on OOD data is standard in neural nets trying to approximate symbolic regressions or things like grammars etc as far as I know.

I do like they use 'Out of Domain" rather than "Out of Distribution" as a target though. That makes more sense.

Isinlor · on Feb 11, 2022

I don't think you will find any human that will extrapolate sequence generated with more than 10 operators. And longer input sequences are actually easier to handle - fig 1. the right most graph.

If you think you can do better than their program then:

Seq1: [0, 1, 2, 3, 6, 7, 13, 26, 32, 58, 116, 142, 258, 516]

Seq2: [2, 2, 3, 5, 10, 12, 22, 44, 54, 98, 196, 240, 436, 872]

Seq3: [3, 1, 8, 9, 18, 19, 37, 74, 92, 166, 332, 406, 738, 1476]

Their program is able to guess correct continuation with one more sequence element.

SHA1 hash for verification: bef5e213340f91258b3b9a0042c9c083dd91cb80

YeGoblynQueenne · on Feb 11, 2022

I don't think I understand what you mean. Aren't all the sequences on the Online Encyclopedia of Integer Sequences created by humans? We clearly have the tools to extrapolate sequences from examples, rather than just eyballing them and trying to guess them. For instance: we have maths. So I must have misunderstood your meaning?

Isinlor · on Feb 11, 2022

If you look at the 3 sequences I gave you, can you guess following elements of the sequence?

We can create sequences, but guessing underlying patterns is a lot more difficult.

Humans will have very hard time if you go beyond around 10 operators in a pattern used to generate a sequence.

My guess is that their model will be better at it than me or you.

YeGoblynQueenne · on Feb 11, 2022

Ah, I think I see what you mean: you are saying that because it's better than humans at predicting the next element in a sequence it's good at generalising. Is that correct, or am I misrepresenting your point?

Isinlor · on Feb 12, 2022

Yes.

Basically there are two approaches to sequence prediction.

The traditional style, linear regression, ARIMA, RNNs etc. where you directly predict the next element in a sequence. The output is on the same level of abstraction as the internal values used in the model.

There is also the new-ish style where you predict symbols instead of predicting the values directly. You can predict symbols representing numbers or you can also predict a symbolic formula that can be used to extrapolate the values perfectly. This is the way humans do it.

And my point is that when you look at the symbols embedding they do have interpretable structure that model can use to generalize. And experiments seems to suggest that DNNs models are indeed generalizing.

YeGoblynQueenne · on Feb 12, 2022

OK, thanks for the explanation. I think I understand what you mean. But this kind of generalisation takes very careful analysis to discern and I'm not convinced, yet. I'll be more easily convinced when I see something blatant, and n ≤ 10 is so far not there for me, even given the shift in what is predicted.

Isinlor · on Feb 12, 2022

I still don't know what you mean by n ≤ 10.

Have you played around with the demo?

http://recur-env.eba-rm3fchmn.us-east-2.elasticbeanstalk.com...

You can input there a sequence and it will predict the expression and next elements.

For example, for the sequence: [3, 1, 8, 9, 18, 19, 37, 74, 92, 166, 332, 406, 738, 1476]

It predicts: u_n = u_(n−1) + (u_(n−3) (n mod 3)) + u_(n−4)

It's a really decent guess, and if I give it one more element (from 14 to 15 elements) it gets it correctly.

YeGoblynQueenne · on Feb 12, 2022

>> I still don't know what you mean by n ≤ 10.

I mean the values of n_pred and n_op in tables 7 and 8. They go up to 10.

I haven't tried the demo. Maybe I'll give it a go if you say it's so good.

nicholast · on Feb 11, 2022

The cool thing about math applications is just how easy it would be to generate synthetic data. That these large language models haven't attempted to supplement their gigabytes+ scale data sets with such is an oversight.

williamtrask · on Feb 11, 2022

Or you could just use a 50cent calculator.

Note, you’d need to train such a model on data teaching it about the relationship of every number to every other number when run through every function. Yes, infinite synthetic data, but you’re just memorising stiff you can already generate

not2b · on Feb 11, 2022

Or build a model that has "peripherals". Oh, I'm being asked to do math. Let's put it in my calculator app. Everything doesn't have to be in one uniform network.

Evidently the brain works that way: the cortex is built on top of older components, so it doesn't have to figure out basic metabolism the same way it has to learn to identify people.

plutonorm · on Feb 11, 2022

It's recently been shown that even though the numbers are represented with different tokens, the network learns to form an internal representation that understands the progression from one token to the next.

nikolayasdf123 · on Feb 11, 2022

The idea that each number has to be inside ones Brain or Neural Network or Token is plainly wrong.

Network has to grasp the "abstract" number, but it clearly did not grasp that concept.

Isinlor · on Feb 11, 2022

How would you test if it grasped the concept?

plutonorm · on Feb 11, 2022

https://arxiv.org/pdf/2201.02177.pdf

This paper shows fairly conclusively that the network 'groks' modular addition.

pfortuny · on Feb 11, 2022

Modulo 97.

This is what it is. Not "general arithmetic".

catach · on Feb 11, 2022

Being able to extrapolate to numbers that were not in the training set, perhaps? At least that'd be a basic part of the requirement.

Isinlor · on Feb 11, 2022

Sure:

Deep Symbolic Regression for Recurrent Sequences https://arxiv.org/abs/2201.04600

(Interactive demo: http://recur-env.eba-rm3fchmn.us-east-2.elasticbeanstalk.com... )

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets https://arxiv.org/abs/2201.02177

Both of these models can generalize to numbers it have not seen.

YeGoblynQueenne · on Feb 11, 2022

As far as I can tell from a quick heuristic perusal, the "Generalization Beyond Overfitting" paper reports "generalisation" _on the validation set_. That's not particularly impressive and it's not particularly "generalisation" either.

Actually, I really don't grokk this (if I may). I often see deep learning work reporting generalisation on the validation set. What's up with that? Why is generalisation on the validation set more interesting than on the test set, let alone OOD data?

Isinlor · on Feb 11, 2022

The point of the paper is to show that NN can still learn long after fully memorizing the train dataset.

This behavior goes against current paradigm of thinking about training NNs. It is just very unexpected, similarly as double descent is unexpected from classical statistics point of view that more parameters lead to more over-fitting.

They could have split validation test set into validation and test sets, but I don't know what that would achieve in their case.

Fig. 1 center shows different train / validate splits. Fig 2. shows a swoop between different optimization algorithms if you are concerned about hyperparameters over-fitting.

But to me really interesting is the Fig 3. that shows that NN learned the structure of the problem.

YeGoblynQueenne · on Feb 11, 2022

>> The point of the paper is to show that NN can still learn long after fully memorizing the train dataset.

That is the claim in the paper. I don't understand how it is supported by measuring results on the validation set.

Figure 3 looks nice but it doesn't say anything on its own. I don't know what's the best way to interpret it. The paper offers some interpretation that convinces you, but not me. Sorry, this kind of work is too fuzzy for me. What happened to good, old-fasion proofs?

Isinlor · on Feb 12, 2022

The paper shows that their model first overfitted the data. By overfitting I mean 100% train dataset accuracy and ~0% validation dataset accuracy. The model never gets any feedback from the validation dataset trough the training procedure.

Everyone's expectation would be that this is it. The model is overfitted, so it is useless. The model is as good as a hash map, 0 generalization ability.

The paper provides empirical, factual evidence that as you continue training there is still something happening in the model. After the model memorized the whole training dataset and while it still has not received any feedback information from the validation dataset, it starts to figure out how to solve validation dataset.

Mind you, this is not interpretation, this is factual. Long after 100% overfitting, the model is able to keep increasing its accuracy on dataset it has not seen.

It's as we discovered that water can flow upwards.

Grokking was discovered by someone forgetting to turn off their computer.

Nobody knows why. So, nobody is able to make any theoretical deductions about it.

But I agree that fig 3. requires interpretation. By itself it does not say a lot, but similar structures appear in other models like in the one where we discuss elements sequence prediction. To me, the models figure out some underlying structure of the problem, and we are able to interpret that structure.

I tend to look at it from Bayesian perspective. This type of evidence increases my belief that the models are learning what I would call semantics. It's a separate line of evidence from looking at benchmark results. Here we can get a glimpse at how some models may be doing some simple predictions and it does not look like memorization.

YeGoblynQueenne · on Feb 12, 2022

>> The paper shows that their model first overfitted the data. By overfitting I mean 100% train dataset accuracy and ~0% validation dataset accuracy. The model never gets any feedback from the validation dataset trough the training procedure.

Yes, but the researchers get plenty of feedback from the validation set and there's nothing easier for them than to tweak their system to perform well on the validation set. That's overfitting on the validation set by proxy. It's absolutely inevitable when the validation set is visible to the researchers and it's very difficult to guard against because of course a team who has spent maybe a month or two working on a system with a publication deadline looming are not going to just give up on their work once they figure it it doesn't work very well. They're going to tweak it and tweak it and tweak it, until it does what they want it to. They're going to converge -they are going to converge- on some ideal set of hyperparameters that optimises their system's performance on its validation set (or the test set, it doesn't matter what it's called, it matters that it is visible to the authors). They will even find a region of the weight space where it's best to initialise their system to get it to perform well on the validation set. And, of course, if they can't find a way to get good performance out of their system, you and I will never hear about it because nobody ever publishes negative results.

So there are very strong confirmation and survivorship biases at play and it's not surprising to see, like you say, that the system keeps doing better. And that suffices to explain its performance, without the need for any mysterious post-overfitting grokking ability.

But maybe I haven't read the paper that carefully and they do guard against this sort of overfitting-by-proxy? Have you found something like that in the paper? If so, sorry for missing it myself.

Isinlor · on Feb 13, 2022

> And that suffices to explain its performance, without the need for any mysterious post-overfitting grokking ability.

It actually still does not suffice. It is just not expected no matter what the authors would be doing.

Just the fact that they managed to get that effect is interesting.

Granted, the phenomenon may be limited in scope. For example, on ImageNet it may require ridiculously long time scales. But maybe there is some underlying reason we can exploit to get to grokking faster.

It's basically all in fig 2.:

- they use 3 random seeds per result

- they show results for 12 different simple algorithmic datasets

- they evaluate 12 different combinations of hyperparameters

- for each hyperparameters combination they use 10+ different ratios of train to validation splits

So they do some 10*12*3*2 = 720 runs.

They conclude that hyperparameters are important. Seems like weight decay is especially important for the grokking phenomenon to happen when model has access to low ratio of training data.

Also, at least 2 other people managed to replicate that results:

https://twitter.com/sea_snell/status/1461344037504380931

https://twitter.com/lieberum_t/status/1480779426535288834

One hypothesis may be that models are just biased to randomly stumble upon wide, flat local minima. And wide, flat local minima generalize well.

YeGoblynQueenne · on Feb 13, 2022

I don't agree. Once a researcher has full control of the data, they can use it to prove anything at all. This is especially so in work like the one we discuss, where the experiments are performed on artificial data and the researchers have even more control on it than usual. As they say themselves, such effects are much harder to obtain on real-world data. This hints to the fact that the effect depends on the structure of the dataset and so it's unlikely to, well, generalise to data that cannot be strictly controlled.

You are impressed by the fact that one particular, counter-intuitive result was obtained, but of course there is an incentive to publish something that stands out, rather than something less notable. There is a well-known paper by John Ioannidis on cognitive biases in medical research:

Why most published research findings are false

https://journals.plos.org/plosmedicine/article?id=10.1371/jo...

It's not about machine learning per sé, but its observations can be applied to any field where empirical studies are common, like machine learning.

Especially in the field of deep learning where scholarly work tends to be primarily empirical and where understanding the behaviour of systems is impeded by the black-box nature of deep learning models, observing something mysterious and unexpected must be cause for suspicion and scrutiny of methodology, rather than accepted unconditionally as an actual observation. In particular, any hypothesis that tends towards magick, for example suggesting that a change in quantities (data, compute, training time) yields qualitative improvements (prediction transmogrifying into understanding, overfitting transforming into generalisation), should be discarded with extreme prejudice.

Isinlor · on Feb 14, 2022

> In particular, any hypothesis that tends towards magick, for example suggesting that a change in quantities (data, compute, training time) yields qualitative improvements (prediction transmogrifying into understanding, overfitting transforming into generalisation), should be discarded with extreme prejudice.

It does not tend towards magic. It does happen and people can replicate it. Melanie Mitchell recently brought back the point by Drew McDermott that AI people tend to use wishful mnemonic. Words like understanding or generalisation can easily be just a wishful mnemonic. I fully agree with that.

But the fact remains. A model that has ~100% training accuracy and ~0% validation accuracy on simple but non-trivial dataset is able to reach ~100% training and ~100% validation accuracy.

> This hints to the fact that the effect depends on the structure of the dataset and so it's unlikely to, well, generalise to data that cannot be strictly controlled.

Indeed, but it is still interesting. It may be that it manifests itself because there is a very simple rule underlying the dataset and the dataset is finite. But it also seems to work under some degree of noise and that's encouraging.

For example, the fact that it may help study connection of wide, flat local-minima and generalization is encouraging.

> You are impressed by the fact that one particular, counter-intuitive result was obtained

I'm impressed by double descent phenomenon as well. And this one shows up all over the place.

> There is a well-known paper by John Ioannidis on cognitive biases in medical research: Why most published research findings are false

I know about John Ioannidis. I was writing and thinking a lot about replication crisis in science in general. BTW - it's quite a pity that Ioannidis himself started selecting data towards his thesis with regard to COVID-19.

> It's not about machine learning per sé, but its observations can be applied to any field where empirical studies are common, like machine learning.

Unfortunately, it applies to theoretical findings too. For example, universal approximation theorem, no free lunch theorem or incompleteness theorems, are widely misunderstood. There are also countless less known theoretical results that are similarly misunderstood.

YeGoblynQueenne · on Feb 15, 2022

As far as I can tell the replications are on the same dataset, or at least the same task, of modular arithmetic. Until we've seen comparable results on radically different datasets, e.g. machine vision datasets, replications aren't really telling us much. Some dudes ran the same program and they got the same results. No surprise.

I confess that I'd be less suspicious if it reached less than full accuracy on the validation set. 100% accuracy on anything is a big red flag and there's a little leprechaun holding it and jumping up and down pointing at something. I'm about 80% confident that this "grokking" stuff will turn out to be an artifact of the dataset, or the architecture, or some elaborate self-deception of the researchers by some nasty cognitive bias.

Perhaps one reason I'm not terribly surprised by all this is that uncertainties about convergence are common in neural nets. See early stopping as a regularisation procedure, and also, yes, double descent. If we could predict when and how a neural net should converge, neural networks research would be a more scientific field and less a let's-throw-stuff-at-the-wall-and-see-what-sticks kind of field.

But, who knows. I may be wrong. It's OK to be wrong, even mostly wrong, as long as you 're wrong for the right reasons. Science gives us the tools to know when we're wrong, nothing more. The scientist must make peace with that. Thinking one can be always right is hubris.

Speaking of which, John Ioannidis is one of my personal heroes of science (sounds like an action figure line, right? The Heroes of Science!! dun-dun-duuunnn). I was a bit shocked that he came out so strongly sceptical against the mainstream concerns about Covid-19, and I've heard him make some predictions that soon proved to be false, like the number of people who would get Covid-19 in the USA (I think he said something like 20,000 people?). He really seemed to think that it was just another flu. Which btw kills lots of people and we're just used to it, so perhaps that's what he had in mind. But, I have the privilege of sharing my maternal language with Ioannidis (he's Greek, like me) and so I've been able to listen to him speak in Greek news channels, as well as in English-speaking ones, and he remains a true scientist, prepared to express his knowledgeable opinion, as is his responsibility, even if it may be controversial, or just plain wrong. In the end, he's an infectious disease expert and even his contrarian views lack that certain spark of madness in the eye of most others who share his opinions. I mean, because he's speaking with knowledge, rather than just expressing some random view he's fond of. He's still a role model for me. Even if he was wrong in this case.

>> Unfortunately, it applies to theoretical findings too. For example, universal approximation theorem, no free lunch theorem or incompleteness theorems, are widely misunderstood. There are also countless less known theoretical results that are similarly misunderstood.

I guess? Do you have some example you want to share? For my part, I try to avoid talking of things I don't work with on a daily basis, on the internet. I know what I know. I don't need to know -or have an opinion- on everything...

eutectic · on Feb 11, 2022

That depends on the tokenization scheme.

moffkalast · on Feb 11, 2022

> this illusion is why big models are so popular. Memorise more stuff

It's all just a compressed lookup table that can handle in-betweens.

spupe · on Feb 11, 2022

I went and checked, it turns out for this version Eleuther-AI has in fact included math problems [1]. So my earlier comment is partly incorrect.

[1] http://eaidata.bmk.sh/data/GPT_NeoX_20B.pdf

asah · on Feb 11, 2022

And isn't it trivial to generate lots of correct sample data ? :-)

throwaway4good · on Feb 11, 2022

No. The poor performance comes from the overall approach of using neural nets to solve basic math problems.

FL410 · on Feb 11, 2022

The cool part comes when the model can make the connection that

multiply 12345 by 87654

is the same as

def multiply_two_numbers(x, y):

return x * y

Which of course produces the desired result. The interesting part is that github copilot wrote the above with only the prompt "def multiply_two" as the prompt.

andreyk · on Feb 11, 2022

Oh, that's a pretty big difference, would be nice if post title was altered...