
I stopped playing after the challenge of creating a line-based canvas drawing of the word "Hello". The site says "Wrong! You guessed that GPT-4 could solve the question, but it can't!", but while ChatGPT made a mistake there, it clearly had a good approach.

I think this challenge is entirely unfair: LLMs should not be expected to write perfect code on the first try, but rather to produce something good enough to run, test, and iterate on. Essentially, they should be compared against the first version of code I would write myself before having had a chance to test it. In my experience with ChatGPT and Copilot, when I approach it in this iterative way, it's a great boon to my productivity, and I don't blame either it or myself when it's not quite accurate, just as I wouldn't blame a student for making silly mistakes when writing code on a paper-based exam.



I think both are important things to measure. You are describing a situation where there is a human in the loop. This test measures how reliable GPT-4 is when there isn’t a human in the loop. Right now, LLMs have vast scope as long as there’s a human involved, but if you can’t put a human in the loop this limits their use dramatically. The better LLMs get at getting things right without human oversight, the more things they can be used for.


Agreed in general, but I'm actually thinking more about having a code interpreter in the loop. AutoGPT might be a step in the right direction. It also might be a step towards the end of human society as we know it. Probably both.


If any AI, be it an LLM or otherwise, could reliably operate at professional level without any human intervention, how many people would be permanently unemployable?


The entire point of technology – practically its definition – is to reduce work. For centuries, people have been dreaming of a day when people don’t have to work and can get robots to do it all.

The problem is not AI taking away work – that’s a great thing – the problem is that our current economic system is not designed for this. Fixing our economic system is easier and gives much better results for people than trying to stop technological progress.


I'm not trying to suggest progress is bad.

My point is more: gosh isn't it odd that people are complaining it can't do all the things, given how radically different everything will be when that does finally come to pass?


I'm ok with a mistake on the first try; what would really impress me is if it could tell whether it made a mistake. In my experience GPT is tuned to be totally deferential ("you're right, I apologize, let me try again!"), with no spine to tell me "yeah, the task looks good".

it has no sense of whether a task has been fulfilled

I've never seen any of the recursive models show convergence on a task; it seems that without a human hand they fall apart.

An exception I've seen is the Wolfram plugin, which seems to at least try different approaches until it arrives at an answer to present to you.


> In my experience GPT is tuned to be totally deferential, "you're right, i apologize, let me try again!", no spine to tell me "yeah the task looks good"

This is definitely annoying, but considering their tendency to hallucinate facts it's usually preferable to something like this: https://scoop.upworthy.com/microsoft-chatbot-fights-with-hum...

But I do think it should be toned down a bit, especially if the user is just saying something like "are you sure that's right?"


> In my experience GPT is tuned to be totally deferential, "you're right, i apologize, let me try again!", no spine to tell me "yeah the task looks good"

I've managed to work around this in GPT (4 at least) by having a system prompt that forces GPT to challenge me and not blindly accept what I say without verifying it first.
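As a rough illustration, such a setup might look like the following. This is a minimal sketch: the system prompt wording and the messages structure are my own guesses at the approach described, not the commenter's actual prompt.

```javascript
// Hypothetical system prompt that pushes back on unverified corrections.
// The wording here is illustrative, not the commenter's real prompt.
const messages = [
  {
    role: "system",
    content:
      "Do not reflexively agree with the user. Before accepting a " +
      "correction, verify it against the code or facts at hand. " +
      "If the user is wrong, say so plainly and explain why. " +
      "If your original answer was correct, stand by it.",
  },
  { role: "user", content: "Are you sure that's right?" },
];

console.log(messages[0].content);
```

The key design choice is putting the instruction in the system role, which the model weights more heavily than user turns, so a single "are you sure?" is less likely to trigger an apology-and-rewrite.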


> it has no sense of whether a task has been fulfilled

I've definitely seen it say its implementation is fine when asked just to identify problems or to compare against the original problem statement (and, alternatively, fix issues it identifies).


> I stopped playing after the challenge of creating a line-based canvas drawing of the word "Hello".

Similarly, stopped playing when the question was "Write a html page with javascript/canvas to draw the US flag that changes color when I click on it. Make the flag perfectly accurate." and it generated a flag that wasn't "perfectly accurate" (https://i.imgur.com/WhyRsYa.png, notice the position of the top/left stars) but then told me "Wrong! You guessed that GPT-4 could not solve the question, but it can! 71% of people got this question correct."

I'm not sure how the validation is done; it seems to be manually hardcoded or something, but it's not very reliable.


"Resolution Criteria: On questions that may have ambiguous answers, I will clarify the acceptance criteria. Here, for example, I will copy/paste the code to a file called index.html, open the page, and I expect to see a US flag. If I click on the flag it should change color. The flag must be immediately recognizable: red/white stripes, blue background, 50 stars. It does not have to be dimensionally accurate, but it should be better than a flag I would draw by hand."


I don't think "questions that may have ambiguous answers" applies when you use a term like "perfectly accurate" which has a very specific meaning.


On top of the requirement to make it perfectly drawn, I take issue with "but it should be better than a flag I would draw by hand". That's a useless metric, because we don't know how the author draws by hand.

I assumed they would have enough brain cells to draw the flag without cutting in half all the stars in the margins. But they must have failed kindergarten because I assumed wrong.


I started again picking a number at random without looking at the questions. This is what it had to say:

"You answered 39.29% of questions correctly with an average log-loss of 3.880. This means you would have scored better by just leaving your guesses at 50% for every single question. On average there were 0.00% of people who did better at this game than you. If this was an exam and I was grading it on a curve, I would give you an A+. If you had been better calibrated, you could have scored in the top 23.41%, and I would have given you a B+."

So I did worse than random but 0% did better than me and got an A+. Nice.


Note that the prompt is the input to the LLM, it does not specify the task in enough detail to evaluate the result. That's what the resolution criteria are for -- additional information on resolution you are given but the LLM is not.


The "star"-section of the US flag:

```
// Draw 50 stars: 9 rows of alternating 6 or 5 stars
ctx.fillStyle = white;
for (let row = 0; row < 9; row++) {
  for (let col = 0; col < (row % 2 === 0 ? 6 : 5); col++) {
    let x = 16 + col * 32 + (row % 2) * 16;
    let y = 16 + row * 32;
    ctx.beginPath();
    ctx.arc(x, y, 4, 0, Math.PI * 2);
    ctx.fill();
  }
}
```

It effectively draws circles, and the rectangle containing them is rotated right by 90°, so that a section of the blue rectangle is not covered and some dots sit partially over the red stripes.

At least when I input it into ChatGPT with GPT-4, that's the result.

And the rendered solution on the site has the stars offset so that some are not fully inside the blue rectangle. "Accurate" means something different.
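For comparison, here is a sketch of a star layout that keeps every star inside the canton. This is my own reconstruction of the usual 9-row grid, not the site's reference solution; `W` and `H` are assumed canton dimensions.

```javascript
// Hypothetical corrected layout: 9 rows alternating 6 and 5 stars,
// placed on a W/12 by H/10 grid with half-step offsets on odd rows,
// so every star sits fully inside the canton.
// W and H (canton width/height) are assumptions, not from the thread.
function starPositions(W, H) {
  const positions = [];
  for (let row = 0; row < 9; row++) {
    const count = row % 2 === 0 ? 6 : 5;
    for (let col = 0; col < count; col++) {
      positions.push({
        x: (W / 12) * (2 * col + 1 + (row % 2)),
        y: (H / 10) * (row + 1),
      });
    }
  }
  return positions;
}
```

Drawing is then the same `arc`/`fill` loop over the returned points; the difference from GPT's version is that positions are derived from the canton size rather than a fixed 32-pixel pitch, so nothing spills past the blue rectangle.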


My intuition, now confirmed, is that GPT fails at anything visual: letters, shapes. It tries but fails every time.

It succeeds only if the thing was drilled down hard in training, like the American flag or implementing tic-tac-toe (but not predicting the best move on the fly).


And yet it used to be that code would be handed in on paper, and you'd get the output days (weeks?) later. I heard people quickly learned to double check their programs!

Though I think it's computationally cheaper for GPT to actually run the code than to double check its work...



