First, congratulations to you and the other authors - this is a very informative and accessible paper with impressive results.
> On both datasets, we find that synthesis performance scales log-linearly with model size
This "scaling law" observation is starting to become a major trend in NLP [1] and other modalities such as speech recognition [2] and protein structure/function prediction [3]. Do you have any insight or commentary to offer regarding the next step in improving program synthesis? For example, will techniques to scale up model size continue to be a primary focus (e.g., as in [4]), or do you see improvements in Transformer and attention-based architectures as essential for pushing the limits of what has been achieved by you and your colleagues?
> We find that even our best models are generally unable to predict the output of a program given a specific input.
What do you think about leveraging unsupervised training to improve program synthesis? Could synthesized programs be executed on generated input in a way that supports contrastive learning [5]?
Thanks in advance for your time and comments here.
> do you see improvements in Transformer or attention based architectures as essential...
I do personally, but there is some disagreement about this in the field.
In fact, I would go further and say that (in addition to using large pre-trained models) we will need methods of training that are pretty substantially different in order to elicit robust reasoning behavior.
Even supposing I'm wrong about this, if you go and look at the scaling plots in figure 3 and try to figure out how big your model would need to be in order to be solving most of these problems, you'd get a really big number.
Even if you had such a big model, it would still require post-processing of the samples to actually get the right answers.
From the perspective of applications, that's fine, but it's a little unsatisfying from the perspective of studying intelligence.
Even with those caveats (!) these problems aren't that hard compared to general software engineering tasks...
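To make the "post-processing of the samples" point concrete, here is a minimal sketch of one common form of it: run each sampled program against a few test asserts and keep only the ones that pass. This is an illustration under my own assumptions, not the paper's actual pipeline; the names `passes_tests` and `filter_samples` and the toy `add` task are made up for the example.

```python
from typing import List


def passes_tests(program_src: str, tests: List[str]) -> bool:
    """Return True if the program runs and every test assert passes.

    WARNING: exec() on model-generated code is unsafe outside a sandbox;
    it is used here only to keep the sketch short.
    """
    namespace: dict = {}
    try:
        exec(program_src, namespace)   # define the candidate function(s)
        for test in tests:
            exec(test, namespace)      # each test is an `assert ...` statement
    except Exception:
        # Syntax errors, runtime errors and failed asserts all reject the sample.
        return False
    return True


def filter_samples(samples: List[str], tests: List[str]) -> List[str]:
    """Keep only the sampled programs that pass all tests."""
    return [s for s in samples if passes_tests(s, tests)]


# Hypothetical samples for a toy task; a real system would draw these from the model.
samples = [
    "def add(a, b):\n    return a - b",   # wrong logic
    "def add(a, b):\n    return a + b",   # correct
    "def add(a, b)\n    return a + b",    # syntax error
]
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]

survivors = filter_samples(samples, tests)
print(len(survivors), "of", len(samples), "samples pass")
```

The model itself never learns which sample was right here; the filter does that work, which is the unsatisfying part from the "studying intelligence" angle.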
> What do you think about leveraging unsupervised training to improve program synthesis? Could synthesized programs be executed on generated input in a way that supports contrastive learning [5]?
I think this is an interesting idea and someone should try it!
I do think that, even restricting our attention to just getting neural networks to execute programs, we will need to do something a little more drastic to robustly get the results we want.
> But there is no computable version for a non-computable real number, x_real.
That's true[0]. But consider that you will never receive something non-computable as input to a program, ever. (if you're allowed to approximate the input until you have enough digits to compute what you need, then the input is computable)
Really, I think the best way to view sin(x) is as a function that receives a stream of digits 0.2345345323.. and returns a stream of digits 0.0040933884... - this computable version of sine completely captures everything we could possibly do with it in a program. Operating with floating point numbers, then, is mostly just truncating the input and output streams.
[0] At least in classical logic; in intuitionistic logic, sin : Real -> Real is computable and is equivalent to this idea of receiving and returning streams of digits.
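A rough sketch of the "stream in, stream out" view of sine described above. The representation is my own assumption for illustration: a real number is a function `approx(n)` returning a rational within 10**-n of it, and `digits_of` and `sin_stream` are hypothetical names. It relies on two standard facts: sin is 1-Lipschitz, and the Taylor remainder for sin is bounded by the size of the first omitted term.

```python
import math
from fractions import Fraction
from typing import Callable

# A "real" is anything that can be approximated on demand:
# approx(n) must return a rational within 10**-n of the true value.
Real = Callable[[int], Fraction]


def digits_of(q: Fraction) -> Real:
    """View a rational as a digit stream: truncate it to n decimal digits."""
    def approx(n: int) -> Fraction:
        scale = 10 ** n
        return Fraction((q.numerator * scale) // q.denominator, scale)
    return approx


def sin_stream(x: Real) -> Real:
    """Sine on streams: asks for only as many input digits as it needs."""
    def approx(n: int) -> Fraction:
        eps = Fraction(1, 10 ** (n + 1))
        q = x(n + 1)            # input error <= eps, and |sin a - sin b| <= |a - b|
        total = Fraction(0)
        term = q                # first Taylor term: q^1 / 1!
        k = 0
        while abs(term) >= eps:
            total += term
            k += 1
            # next term: (-1)^k * q^(2k+1) / (2k+1)!
            term = -term * q * q / ((2 * k) * (2 * k + 1))
        # truncation error < eps, so total error < 2 * eps < 10**-n
        return total
    return approx


x = digits_of(Fraction(1, 3))   # the stream 0.3333...
y = sin_stream(x)
approx_10 = y(10)               # sin(1/3) to within 10**-10
print(float(approx_10), math.sin(1 / 3))
```

Asking `y` for more digits just pulls more digits from `x`, which is the sense in which floating point is a truncation of both streams.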
Oh, I see. Then we're in agreement: our digital computers are less powerful than analog computers with unlimited precision. What we don't know is whether such an analog computer could exist -- or even whether the universe is equivalent to one.
That is, we don't know if nature is actually continuous. Perhaps spacetime becomes discrete at the Planck scale or something like that (I know past attempts have been unsuccessful, but still, it might be). But if nature is continuous, there is probably a fundamental limitation on harnessing those precision bits past a certain limit. Nature's continuous variables might have enough symmetries to make it only Turing-complete. In this case, computation with full real numbers wouldn't ever happen in this universe.
I mean, if it happens, the Church-Turing thesis would be false, and the consequences would be far more strange.
Hello Augustus and thanks for posting the link to your co-authored paper on HN.
I will have to read the paper more carefully; however, having quickly scanned the
paper, it seems that it only reports empirical results. In particular, there seem
to be no theoretical results about learnability of programs from natural language
specifications using large language models. To make it more plain - how do we
know these techniques work as well as reported on problems other than the ones
in the dataset introduced in your work?
Note that I'm not asking why you introduced a new dataset, this seems to be
motivated in the abstract. I'm asking: how do we know how well this kind of
thing works ("this kind of thing" being what it says in the title) in the
general case?
Great work! I found the results in Fig 16 pretty interesting [0]...
From response 1, it seems that the model has very little confidence in its decision but gets the correct answer while, on the contrary, in response 3, the model seems very confident in its incorrect answer. Is this usually a trend that you see with large models? How hard is it, generally, to make such models "aware" of their own shortcomings?
I think it might be a mistake to think that the model is not confident because its response is something a human might say if they were not confident. The model is 'just' completing the prefix text with something that has high likelihood from its perspective, so it may just be used to, for instance, seeing people hedge in similar conversations it has read in its training data.
More generally, whether these models are well-calibrated (that is, they know what they don't know) is an important area of research. I don't have references offhand, but I think it's true broadly speaking that these larger pre-trained models do tend to be better calibrated.
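For anyone unfamiliar with what "well-calibrated" means operationally, here is a small sketch of expected calibration error (ECE), a standard way to measure it: bucket predictions by stated confidence and compare each bucket's average confidence to its actual accuracy. The function name and the toy numbers are hypothetical and are not results from the paper.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Toy, made-up data: the model claims 90% confidence but is right 3 times out of 4.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(overconfident)   # roughly 0.15
```

Note that the confidence here has to come from the model's probabilities, not from how hedged its generated text sounds, which is the distinction made above.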
Unfortunately not, but we do release both the programming dataset and the math questions dataset, so in principle you could try those out with one of the open-source models from e.g. Hugging Face.
I personally agree that this experiment is evidence that there are certain problems that cannot be solved simply by making the models bigger, and one of the main research questions I'm interested in is what we need to do to elicit more reasoning-like capabilities from them.
There are people who fall more on the side of bitter-lesson/scaling-law-maximalism, and I think it's probably healthy and valuable that there are people in the research community placing both types of bet.
> elicit more reasoning-like capabilities from them.
But will a branch-and-search with some-noisy-evaluator get us there? Obviously this is easier with some domains.
Or maybe simply using increasingly large "prompts" or "input specifications" to specify the desired end result. There might be a whole scaling law hiding there ..
https://twitter.com/gstsdn/status/1427794393373626368