GPT-4, before the RLHF phase of training, had a pretty good idea of what it "knows": its calibration curve was almost perfect. After RLHF, calibration is almost completely broken.
Nah, RLHF is what made GPT-4 outperform 3.5; the base model hasn't improved much since 3.5. Also, that calibration curve is based on a subset of MMLU, so it doesn't really reflect actual user experience.
I'm not saying that RLHF does more harm than good, just that it made this particular aspect of its performance worse. Basically there is still significant room for improvement, probably without changing the architecture.
In an ideal world "I don't know" would be considered worse than a correct answer but much better than a wrong answer.
In the UK there is a competition called the "Junior Maths Challenge", or something like that: a multiple-choice quiz where a correct answer scores +1 and an incorrect answer -6, so guessing has negative expected value. I think we need a similar scoring system here.
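To make the "guessing has negative EV" point concrete, here is a minimal sketch of that expected-value calculation. The +1/-6 scores come from the comment above; the number of answer options per question is an assumption for illustration, since the comment doesn't specify it.

```python
def guess_ev(n_options: int, correct: float = 1.0, wrong: float = -6.0) -> float:
    """Expected score of picking uniformly at random among n_options answers."""
    p = 1.0 / n_options          # probability the random guess is correct
    return p * correct + (1.0 - p) * wrong

# Under +1/-6 scoring, guessing loses on average even with only two options:
for n in (2, 4, 5):
    print(n, guess_ev(n))
# n=2 gives -2.5; more options make blind guessing even worse.
```

An "I don't know" option scoring 0 would then sit exactly where the comment above wants it: worse than a correct answer, much better than a guess.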
As a parent my guess would be that people see it as a way to introduce welcome variety and whimsy into the daily routine of reading a bedtime story. While also feeling like you're using a hobby interest to help with a real practical issue.
I have a small library of children's books and we've read them all several times, the good ones many times.
That said, I wouldn't personally turn to these language models. From what I've seen they tend to generate rather bland and boring stories. I would rather make up my own or reread "Kackel i grönsakslandet" for the hundredth time.