> You should only use ChatGPT for tasks where you are able to review its work.
This keeps being my argument when people at work daydream about time and cost savings by offloading non-critical business functions to AI. I say, "Great, so it can produce 1000x more work than a person. But then what army of people are we planning to use to check those outputs?"
I'm super-impressed with the current crop of language models for their ability to so accurately simulate correctness, but their inability to understand what they don't know - because, in fact, they don't 'know' any of it in the sense that we do - makes them like very productive but completely untrustworthy employees. A junior dev who monopolizes his mentor's time through inconsistent performance is not a good hire.
Eh, part of the problem is people don't currently understand what LLMs are doing...
Have you ever had a dumb or wrong thought in your head? I'm going to go ahead and answer yes for you: you do, all the time. But you don't (hopefully) verbalize a stream of consciousness to the people around you. In general you think of something, then reflect on whether it is true or false.
This is not what LLMs do: they pitch back the first 'thought' they have, "correct" or not. This is why techniques like CoT/ToT (chain-of-thought / tree-of-thought) greatly increase the accuracy of LLM output. The problem? They require at least an order of magnitude more processing to get an answer, and with GPU time already in high demand and expensive, you don't see much of it happening.
Betting on LLMs remaining commonly wrong is not a safe bet at this point.
Even if the error rate of LLMs decreases with additional GPU power, there's little rhyme or reason to their confabulations. Even if only 1% of the code is in error, there's no guidance or pattern to where those errors might be.
It's like reviewing an overconfident junior developer's code except you can't learn their particular weaknesses. If a developer is bad about memory leaks, you know to check their every PR for memory leaks. An LLM won't necessarily produce the same types of errors given similar prompts or even the same prompt with some period of time between invocations.
In this paper, we introduce the Tree-of-Thought (ToT) framework, a novel approach aimed at improving the problem-solving capabilities of auto-regressive large language models (LLMs). The ToT technique is inspired by the human mind's approach for solving complex reasoning tasks through trial and error. In this process, the human mind explores the solution space through a tree-like thought process, allowing for backtracking when necessary. To implement ToT as a software system, we augment an LLM with additional modules including a prompter agent, a checker module, a memory module, and a ToT controller. In order to solve a given problem, these modules engage in a multi-round conversation with the LLM. The memory module records the conversation and state history of the problem-solving process, which allows the system to backtrack to previous steps of the thought process and explore other directions from there. To verify the effectiveness of the proposed technique, we implemented a ToT-based solver for the Sudoku Puzzle. Experimental results show that the ToT framework can significantly increase the success rate of Sudoku puzzle solving.
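The control loop the abstract describes can be sketched as a depth-first search with pruning. This is a toy rendering under stated assumptions: `propose` and `check` are stubs standing in for the prompter-agent/LLM and the checker module (a real system would make model calls there), the recursion stack plays the role of the memory module, and the task is a tiny constraint puzzle rather than Sudoku.

```python
from typing import List, Optional

def propose(state: List[int]) -> List[int]:
    """Stand-in for the prompter agent + LLM: propose next-step
    'thoughts'. Here: any unused digit from 1-4."""
    return [d for d in (1, 2, 3, 4) if d not in state]

def check(state: List[int]) -> bool:
    """Stand-in for the checker module: is this partial solution
    consistent? Toy rule: adjacent digits must differ by more than 1."""
    return all(abs(a - b) > 1 for a, b in zip(state, state[1:]))

def tot_solve(state: List[int], depth: int = 4) -> Optional[List[int]]:
    """ToT controller: depth-first search over thoughts with
    backtracking. The call stack records the path of thoughts,
    so a failed branch returns the search to an earlier state."""
    if not check(state):
        return None                    # prune an inconsistent branch
    if len(state) == depth:
        return state                   # complete, consistent solution
    for thought in propose(state):     # explore alternatives in turn
        result = tot_solve(state + [thought], depth)
        if result is not None:
            return result
    return None                        # all children failed: backtrack

print(tot_solve([]))  # [2, 4, 1, 3]
```

The branches `[1, 2]`, `[1, 3, 2]`, etc. get pruned by the checker before they are expanded, which is the mechanism that makes ToT both more accurate and much more expensive than a single forward pass: each explored node is another round of conversation with the model.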