GPT-4, before the RLHF phase of training, had a pretty good idea of what it "knows": its calibration curve was almost perfect. After RLHF, calibration is almost completely broken.
Nah, RLHF is what made GPT-4 outperform 3.5; the base model hasn't improved much since 3.5. Also, that calibration curve is based on a subset of MMLU, so it doesn't really reflect actual user experience.
I'm not saying that RLHF does more harm than good, just that it made this particular aspect of its performance worse. Basically there is still significant room for improvement, probably without changing the architecture.
In an ideal world "I don't know" would be considered worse than a correct answer but much better than a wrong answer.
In the UK there is a competition called the "Junior Maths Challenge", or something like that: a multiple-choice quiz where a correct answer scores +1 and an incorrect answer -6, so guessing has negative expected value. I think we need a similar scoring system here.
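To make the "guessing has negative EV" point concrete, here is a minimal sketch of that expected-value calculation. The +1/-6 scores come from the comment above; the number of answer options per question is an assumption for illustration, since the comment doesn't specify it.

```python
def guess_ev(n_options: int, correct: float = 1.0, wrong: float = -6.0) -> float:
    """Expected score of picking uniformly at random among n_options answers."""
    p = 1.0 / n_options          # probability the random guess is correct
    return p * correct + (1.0 - p) * wrong

# Under +1/-6 scoring, guessing loses on average even with only two options:
for n in (2, 4, 5):
    print(n, guess_ev(n))
# n=2 gives -2.5; more options make blind guessing even worse.
```

An "I don't know" option scoring 0 would then sit exactly where the comment above wants it: worse than a correct answer, much better than a guess.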
As a parent my guess would be that people see it as a way to introduce welcome variety and whimsy into the daily routine of reading a bedtime story. While also feeling like you're using a hobby interest to help with a real practical issue.
I have a small library of children's books and we've read them all several times, the good ones many times.
That said, I wouldn't personally turn to these language models. From what I've seen they tend to generate rather bland and boring stories. I would rather make up my own or reread "Kackel i grönsakslandet" for the hundredth time.