Program Synthesis with Large Language Models

tasdfqwer0897 · on Aug 18, 2021

Hey, I am one of the lead authors of this paper. Happy to answer questions. This is a twitter thread going over the main results:

https://twitter.com/gstsdn/status/1427794393373626368

criticaltinker · on Aug 18, 2021

First, congratulations to you and the other authors - this is a very informative and accessible paper with impressive results.

> On both datasets, we find that synthesis performance scales log-linearly with model size

This "scaling law" observation is starting to become a major trend in NLP [1] and other other modalities such as speech recognition [2] and protein structure/function prediction [3]. Do you have any insight or commentary to offer regarding the next step in improving program synthesis? For example, will techniques to scale up model size continue to be a primary focus (eg as in [4]), or do you see improvements in Transformer and attention based architectures as essential for pushing the limits of what has been achieved by you and your colleagues?

> We find that even our best models are generally unable to predict the output of a program given a specific input.

What do you think about leveraging unsupervised training to improve program synthesis? Could synthesized programs be executed on generated input in a way that supports contrastive learning [5]?

Thanks in advance for your time and comments here.

[1] Scaling Laws for Neural Language Models https://arxiv.org/abs/2001.08361

[2] Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition https://arxiv.org/abs/2010.10504

[3] ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning https://arxiv.org/abs/2007.06225

[4] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity https://arxiv.org/abs/2101.03961

[5] A Simple Framework for Contrastive Learning of Visual Representations https://arxiv.org/pdf/2108.00587.pdf

tasdfqwer0897 · on Aug 18, 2021

> do you see improvements in Transformer or attention based architectures as essential...

I do personally, but there is some disagreement about this in the field. In fact, I would go further and say that (in addition to using large pre-trained models) we will need methods of training that are pretty substantially different in order to elicit robust reasoning behavior.

Even supposing I'm wrong about this, if you go and look at the scaling plots in figure 3 and try to figure out how big your model would need to be in order to be solving most of these problems, you'd get a really big number. Even if you had such a big model, it would still require post-processing of the samples to actually get the right answers. From the perspective of applications, that's fine, but it's a little unsatisfying from the perspective of studying intelligence. Even with those caveats (!) these problems aren't that hard compared to general software engineering tasks...

> What do you think about leveraging unsupervised training to improve program synthesis? Could synthesized programs be executed on generated input in a way that supports contrastive learning [5]?

I think this is an interesting idea and someone should try it! I do think that, even restricting our attention to just getting neural networks to execute programs, that we will need to do something a little more drastic to robustly get the results we want.

mjburgess · on Aug 18, 2021

> it's a little unsatisfying from the perspective of studying intelligence

This power-law behaviour is exactly what you get when modelling a non-computable function.

Eg., consider approximating the mean of sin^2(x) over -pi to +pi.

If you sample x 10^s times (s = 1, 2, 3, ...) the difference from the non-computable analytic answer (0.5) scales log-linear (ie., power-law).

    { pwr : 
            np.log(0.5 - (
                np.sin(np.linspace(-np.pi, +np.pi, 10**pwr)
            ) ** 2).mean()) 
        for pwr in range(1, 8) 
    }

It is my view that intelligence isn't computable, and that any approximation method to some layer of intelligence will run-up against this problem.

nextaccountic · on Aug 18, 2021

I think you're mixing up terminology here, because sin^2(x) is computable.

I mean, if there is a process to progressively give better estimates of the output of a function, then this function is computable.

mjburgess · on Aug 18, 2021

sine is a real-valued function

There is a computable version for computable inputs, ie., we ask "what is sin(x)?" for some computable-number x_computable.

But there is no computable version for a non-computable real number, x_real.

nextaccountic · on Aug 18, 2021

Thank you for the clarification.

> But there is no computable version for a non-computable real number, x_real.

That's true[0]. But consider that you will never receive something non-computable as input to a program, ever. (if you're allowed to approximate the input until you have enough digits to compute what you need, then the input is computable)

Really, I think the best way to view sin(x) is as a function that receives a stream of digits 0.2345345323.. and returns a stream of digits 0.0040933884... - this computable version of sine completely captures every thing we could possibly do with it in a program. Operating with floating point numbers, then, is just mostly truncating the input and output stream.

[0] At least in classical logic; in intuitionistic logic, sin : Real -> Real is computable and is equivalent to this idea of receiving and returning streams of digits.

mjburgess · on Aug 18, 2021

> completely captures every thing we could possibly do with it in a program

Yes, but this is far less than nature can do with it -- which is my point, that discrete computation is extremely limited.

nextaccountic · on Aug 19, 2021

Oh, I see. Then we're in agreement, our digital computers are less powerful than analog computers with unlimited precision. What we don't know if such analog computer could exist -- or even if the universe is equivalent to one.

That is, we don't know if nature is actually continuous. Perhaps spacetime becomes discrete in the Planck scale or something like that (I know past attempts have been unsuccessful, but still, it might be). But if nature is continuous, there is probably a fundamental limitation on harnessing those precision bits past a certain limit. Nature's continuous variables might have enough symmetries as to make it Turing-complete. In this case, computation with full real numbers wouldn't ever happen in this universe.

I mean, if it happens, the Church-Turing thesis would be false, and the consequences would be far more strange.

YeGoblynQueenne · on Aug 18, 2021

Hello Augustus and thanks for posting the link to your co-authored paper on HN.

I will have to read the paper more carefully however having quickly scanned the paper it seems that it only reports empirical resuls. In particular, there seem to be no theoretical results about lernability of programs from natural language specifications using large language models. To make it more plain - how do we know these techniques work as well as reported on problems other than the ones in the dataset introduced in your work?

Note that I'm not asking why you introduced a new dataset, this seems to be motivated in the abstract. I'm asking: how do we know how well this kind of thing works ("this kind of thing" being what it says in the title) in the general case?

muds · on Aug 18, 2021

Great work! I found the results in Fig 16 pretty interesting [0]...

From response 1, it seems that the model has very little confidence in its decision but get the correct answer while, on the contrary, in response 3, the model seems very confident in its incorrect answer. Is this usually a trend that you see with large models? How hard is it, generally, to make such models "aware" of their own shortcomings?

[0]: https://arxiv.org/pdf/2108.07732.pdf#figure.caption.21

tasdfqwer0897 · on Aug 18, 2021

I think it might be a mistake to think that the model is not confident because its response is something a human might say if they were not confident. The model is 'just' completing the prefix text with something that has high likelihood from its perspective, so it may just be used to, for instance, seeing people hedge in similar conversations it has read in its training data.

More generally, whether these models are well-calibrated (that is, they know what they don't know) is an important area of research. I don't have references offhand, but I think it's true broadly speaking that these larger pre-trained models do tend to be better calibrated.

ipsum2 · on Aug 18, 2021

I didn't see it in the paper, but is the code and model going to be released? Would be great to play around with this.

tasdfqwer0897 · on Aug 18, 2021

Unfortunately not, but we do release both the programming dataset and the math questions dataset, so in principle you could try those out with one of the open-source models from e.g. huggingFace.

ipsum2 · on Aug 18, 2021

I'm sure there are legitimate reasons for not releasing the model, but it would've been a huge boon for the community :(.

rollulus · on Aug 18, 2021

Apart from the boon: reproducibility is a major principle of the scientific method.

mountainriver · on Aug 18, 2021

> We find that even our best models are generally unable to predict the output of a program given a specific input.

I’ve been thinking on this quite a bit, it’s an interesting property of language models in that they are essentially associative lookup machines.

It feels on some level that we need to be combining this kind of reasoning with something like world models.

tasdfqwer0897 · on Aug 18, 2021

I personally agree that this experiment is evidence that there are certain problems that cannot be solved simply by making the models bigger, and one of the main research questions I'm interested in is what we need to do to elicit more reasoning-like capabilities from them.

There are people who fall more on the side of bitter-lesson/scaling-law-maximalism, and I think it's probably healthy and valuable that there are people in the research community placing both types of bet.

algo_trader · on Aug 18, 2021

> elicit more reasoning-like capabilities from them.

But will a branch-and-search with some-noisy-evaluator get us there? Obviously this is easier with some domains.

Or maybe simply using increasing large "prompts" or "input specification" to specify the desired end result. There might be a while scaling law hiding there ..

qualudeheart · on Aug 18, 2021

Is this scaling trend expected to continue? If so how long?