
> online learning - the ability to act then see the results of your action and learn from that.

I don't think that should be necessary, if you are talking about weight updates. Offline batch-mode Q-learning achieves the same thing.
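To make the point concrete, here's a toy sketch (mine, not from the thread) of offline batch-mode Q-learning: the values are updated purely from a fixed log of transitions in a hypothetical 2-state, 2-action world, with no live act-then-observe loop.

```python
import numpy as np

# Hypothetical logged transitions: taking action 1 in state 0
# yields reward 1 and moves to state 1; everything else yields 0.
logged = [
    (0, 0, 0.0, 0),   # (state, action, reward, next_state)
    (0, 1, 1.0, 1),
    (1, 0, 0.0, 1),
    (1, 1, 0.0, 0),
]

Q = np.zeros((2, 2))
alpha, gamma = 0.1, 0.9
for _ in range(200):               # replay the same fixed batch of experience
    for s, a, r, s2 in logged:
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

best_action_in_state_0 = int(Q[0].argmax())
print(best_action_in_state_0)      # the purely offline learner still finds action 1
```

The learner never interacts with the environment during training, yet it recovers the good policy from the logged batch alone.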

By online learning, did you mean working memory? I'd agree with that. Whether it's RAG, ultra-long context, an LSTM-like approach, or something else is TBD.



By online learning I mean incremental real-time learning (as opposed to pre-training), such that you can predict something (e.g. what some external entity is going to do next, or the results of some action you are about to take), then receive the sensory feedback of what actually happened, and use that feedback to improve your predictions for next time.

I don't think there is any substitute for a predict-act-learn loop here - you don't want to predict what someone else has done (which is essentially what LLMs learn from a training set), you want to learn how your OWN predictions are wrong, and how to update them.
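A minimal sketch of that predict-act-learn loop (my toy, not anyone's proposed architecture): the agent predicts an outcome, observes what actually happened, and updates from its OWN prediction error rather than from someone else's data.

```python
import random

random.seed(42)
true_p = 0.8        # hidden probability of the event the agent is modelling
p_hat = 0.5         # agent's current prediction
lr = 0.02

for _ in range(3000):
    prediction = p_hat                                   # 1. predict
    outcome = 1.0 if random.random() < true_p else 0.0   # 2. act / observe
    p_hat += lr * (outcome - prediction)                 # 3. learn from own error

print(p_hat)        # should have drifted close to the true 0.8
```

The update signal only exists because the agent committed to a prediction first; imitating a transcript of someone else's outcomes would never expose the agent's own errors this way.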


> By online learning I mean incremental real-time learning, such that you can predict something (e.g. what some external entity is going to do next, or the results of some action you are about to take),

I used to believe this, but the recent era of LLMs has changed my mind. It's become clear that the two things are separable: you don't need to update weights in real time if you can hold context another way (attention) while predicting the next token.

The fact that we appear to remember things with one-shot, online training might be an illusion. It appears that we don't immediately update the weights (long-term memory), but store memories in short-term memory first (e.g. https://www.scientificamerican.com/article/experts-short-ter...).


The fundamental difference is that humans do learn, permanently (eventually at least), from prediction feedback, however this works. I'm not convinced that STM is necessarily involved in this particular learning process (maybe just for episodic memories?), but it makes no difference - we do learn from the feedback.

An LLM can perform one-shot in-context learning, which in conversational mode will include (up to the context limit) feedback from its actions (output), but this is never learned permanently.

The problem with LLMs not permanently learning from the feedback to their own actions is that they will never learn new skills - they are doomed to know only what they were pre-trained on, which isn't going to include the skills of any specific job unless that on-the-job experience of when to do something, and when to avoid it, was made part of the training set. The training data for this does not exist. It's not the millions of lines of code on GitHub or the bug fixes/solutions suggested on Stack Overflow. What would be needed is the inner thoughts (predictions) of developers as they tackled a variety of tasks and were presented with various outcomes (feedback), continuously, throughout the software development cycle (or the equivalent for any other job/skill one might want them to acquire).

It's hard to see how OpenAI or anyone else could provide this on-the-job training to an LLM, even if they let it loose in a programming playground where it could generate the training dataset. How fast would the context fill with compiler/link errors, debugger output, program output, etc? Once context was full you'd have to pre-train on it (very slow - months, and expensive) before the model could build on that experience, so days of human experience would take years to acquire. Maybe they could train it to write crud apps or some other low-hanging fruit, but it's hard to see this ever becoming the general-purpose "AI programmer" some people think is around the corner. The programming challenges of any specialized domain or task would require training for that domain - it just doesn't scale. You really need each individual deployed instance of an LLM/AI to be able to learn for itself - continuously and incrementally - to get the on-the-job training for any given use.


> but this is never learned permanently.

Are you sure? I think "Open"AI uses the chat transcripts to help the next training run?

> they are doomed to only learn what they were pre-trained with

Fine-tuning.

> The training data for this does not exist

What does "this" refer to? Have you read the Voyager paper? (https://arxiv.org/abs/2305.16291) Any lesson learnt in the library could be used for fine-tuning or the next training run for a base model.

> what would be needed would be the inner thoughts (predictions) of developers as they tackled a variety of tasks and were presented with various outcomes (feedback) continuously throughout the software development cycle

Co-pilot gets to watch people figure stuff out - there's no reason that couldn't be used for the next version. Not only does it not need to read minds, but people go out of their way to write comments or chat messages to tell it what they think is going on and how to improve its code.

> Days of human experience would take years to acquire

And once learnt, that skill will never age, never get bored, never take annual leave, never go to the kids' football games, never die. It can be replicated as many millions of times as necessary.

> they could train it to write crud apps

To be fair, a lot of computer code is crud apps. But instead of learning it in one language, now it can do it in every language that existed on Stack Overflow the day before its training run.


> Are you sure? I think "Open"AI uses the chat transcripts to help the next training run?

> Fine-tuning.

The learning that occurs through SGD has been shown to be less flexible and less general than what happens via context. This is due to the restricted way information flows through transformers, which is further worsened in autoregressive GPTs vs models with bidirectional encoders.

On top of that, SGD already requires a great many examples per concept, and the impact of any single example rapidly diminishes as the learning rate tapers off toward the end of training. Fine-tuning a fully trained model is far less efficient and more limited than learning from context when it comes to introducing new knowledge. It's believed that instruction tuning reduces uncertainty in token selection more than it introduces new knowledge.
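The learning-rate effect can be shown with a deliberately crude scalar "model" (my illustration, not a transformer): count how many SGD steps it takes to absorb one new fact at an early-training learning rate versus a tapered end-of-training one, with loss (target - w)^2.

```python
def steps_to_learn(lr, target=1.0, threshold=0.9):
    """Count SGD steps until the scalar weight w absorbs 90% of the target."""
    w, steps = 0.0, 0
    while w < threshold:
        w += lr * 2 * (target - w)   # gradient step on squared error
        steps += 1
    return steps

early = steps_to_learn(lr=0.1)    # early-schedule learning rate
late = steps_to_learn(lr=1e-3)    # tapered end-of-schedule learning rate
print(early, late)                # the late-stage learner needs vastly more exposures
```

With the tapered rate the same fact needs on the order of a hundred times more exposures, which is the intuition behind a single fine-tuning example having little effect late in a schedule.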

> Co-pilot gets to watch people figure stuff out

We don't actually know if that's true. It depends on how many intermediate steps Microsoft records as training data. If enough intermediate steps lead to bad results and needed backtracking, but that erasure is not captured, it will significantly harm model quality. It is not nearly as easy to do well as you make it seem.

All in all, getting online learning into models has proven very challenging. While some "infinite"-context alternatives to self-attention are promising for LTM, it would remain true that the majority of computational power and knowledge resides in the fixed feed-forward weights. If context and weights conflict, this can cause degradation during inference - you might have encountered this yourself with GPT4 getting worse when using search. Lots of research is still required to match human learning flexibility and efficiency.


> If enough intermediate steps lead to bad results and needed backtracking, but that erasure is not captured

That is a fascinating insight to me. I'm so used to the emacs undo record that I forget that others are not as lucky. I just take for granted that the entire undo history would be available.


> Co-pilot gets to watch people figure stuff out

There's a reason most jobs require hands-on experience, and can't be learnt just by reading a book about how to do it, or watching someone else work, or looking at something that someone else created.

It's one thing to have a bag full of tools, but another to know how to skillfully apply them, and when to apply them, etc, etc.

You may read a book (or as an LLM ingest a ton of training data) and think you understand it, or the lessons it teaches, but it's not until the rubber hits the road and you try to do it yourself, and it doesn't go to plan, that you realize there are all sorts of missing detail and ambiguity, and all the fine advice in that programming book or stack overflow discussion doesn't quite apply to your situation, or maybe it appears to apply but for subtle reasons really doesn't.

Maybe if developers were forced to talk about every decision they made, all day every day, throughout all sorts of diverse projects - from requirements gathering and design through coding and debugging - and an AI had access to transcriptions of these streams of thought, then that might be enough for it to generalize the thought processes and apply them to a novel situation. But even in this best-case hypothetical scenario, I doubt it'd be enough. Certainly just watching a developer's interactions with an IDE isn't going to come remotely close to giving an LLM an understanding of how to do the job of a developer, let alone to the level of detail that could hypothetically let it learn the job without ever having to try it itself.

I also think that many jobs, including developer and FSD, require AGI to backstop the job-specific skills - else what do you do when you find yourself in a situation that wasn't in the book you trained on? So it's not just a matter of how you acquire the skills to do a specific job (which I claim requires practice), but of what it will take for AI architectures to progress beyond LLMs and achieve the AGI that is also necessary.


> You may read a book (or as an LLM ingest a ton of training data) and think you understand it, or the lessons it teaches, but it's not until the rubber hits the road and you try to do it yourself, and it doesn't go to plan, that you realize there are all sorts of missing detail and ambiguity, and all the fine advice in that programming book or stack overflow discussion doesn't quite apply to your situation, or maybe it appears to apply but for subtle reasons really doesn't.

Pre-training is comparable to reading the book. RLHF, and storing all the lifetime prompts and outputs would be comparable to "learning on the job". There are also hacks like the Voyager Minecraft paper.


> storing all the lifetime prompts and outputs would be comparable to "learning on the job"

I'm not sure.

I guess we're talking about letting the LLM loose in a programming playground where it can be given requirements, then design and write programs, test and debug them, with all inputs and outputs recorded for later offline pre-training/fine-tuning. For this to be usable as training data, I guess it would have to be serialized text - basically all LLM interactions with tools (incl. editor) and the program done via the console (line editor, not screen editor!).

One major question is how the LLM would actually use this to good effect. Training data is normally used to "predict the next word", the idea being that copying the most statistically common pattern is a good thing. But a lot of the interactions between a fledgling programmer and his/her notes and tools are going to be BAD ideas that are later corrected and learnt from - not actions you'd actually want copied. Perhaps this could be combined with some sort of tree-of-thoughts approach to avoid taking actions that lead to bad outcomes, although that seems a lot easier said than done (e.g. how does one determine/evaluate a bad outcome without looking WAY ahead?).
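One crude way around the "bad ideas in the transcript" problem is filtered behaviour cloning: keep only the actions from episodes whose final outcome was good, so the model isn't trained to imitate dead ends. A hedged sketch, where the episode format and the tests-passed success signal are entirely hypothetical:

```python
# Hypothetical playground transcripts: each episode is a sequence of
# actions plus a final outcome signal.
episodes = [
    {"actions": ["write test", "write code", "run tests"], "tests_passed": True},
    {"actions": ["guess fix", "run tests"],                "tests_passed": False},
    {"actions": ["read error", "fix import", "run tests"], "tests_passed": True},
]

# Filtered behaviour cloning: drop every action from failed episodes.
training_examples = [
    action
    for ep in episodes
    if ep["tests_passed"]          # crude outcome signal; real credit
    for action in ep["actions"]    # assignment would need to look much further ahead
]

print(len(training_examples))      # only the two successful episodes survive
```

This sidesteps imitating dead ends, but it throws away the "how I recovered from a mistake" signal, which is arguably the most valuable part of on-the-job experience - so it's a partial fix at best.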



