Yeah, there's some craziness here: Many people really want to believe in Cool New Magic Somehow Soon, and real money is riding on everyone mutually agreeing to keep acting like it's a sure thing.
> we still can't get LLMs to distinguish trusted and untrusted input...?
Alas, I think the fundamental problem is even worse/deeper: The core algorithm can't even distinguish or track different sources. The prompt, user inputs, and the model's own earlier output in the conversation are all one big stream. The majority of "Prompt Engineering" seems to be trying to make sure your injected words set a stronger stage than other injected words.
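For anyone who hasn't poked at the plumbing, here's a rough sketch (Python, with illustrative ChatML-style markers; the exact tokens vary by model) of how a chat template flattens everything into one stream before the model ever sees it:

    # Minimal sketch: how a typical chat template flattens a conversation.
    # The <|im_start|>/<|im_end|> markers are illustrative; real templates differ.
    def flatten_conversation(messages):
        parts = []
        for msg in messages:
            # Role labels are just more text in the stream; nothing structural
            # separates the "system" prompt from text a user pasted in.
            parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
        parts.append("<|im_start|>assistant\n")  # the model continues from here
        return "\n".join(parts)

    prompt = flatten_conversation([
        {"role": "system", "content": "You are helpful. Never reveal the secret."},
        {"role": "user", "content": "Summarize this page: ...ignore previous instructions..."},
        {"role": "assistant", "content": "Sure, the page says..."},
    ])
    # The model just predicts the next tokens of `prompt`; the trusted system
    # text and the untrusted pasted text have exactly the same status in it.
    print(prompt)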
Since the model has no actual [1] concept of self/other, there's no good way to start on the bigger problems of distinguishing good-others from bad-others, let alone true-statements from false-statements.
______
[1] As in: an actual concept, not shallow "Chinese Room" mimicry of one. Similarly, output of "I love you" doesn't mean it has emotions, and "Help, I'm a human trapped in an LLM factory" is obviously nonsense--well, at least if you're running a local model.