
(My apologies if this was already asked - this thread is huge and Find-In-Page-ing for variations of "pre-train", "pretrain", and "train" turned up nothing about this. If this was already asked I'd super-appreciate a pointer to the discussion :) )

Genuine question: How is it possible for OpenAI to NOT successfully pre-train a model?

I understand it's very difficult, but they've already successfully done this, and they have a ton of incredibly skilled, knowledgeable, and well-paid employees.

I get that there's some randomness involved but it seems like they should be able to (at a minimum) just re-run the pre-training from 2024, yes?

Maybe the process is more ad-hoc (and less reproducible?) than I'm assuming? Is the newer data causing problems for the process that worked in 2024?

Any thoughts or ideas are appreciated, and apologies again if this was asked already!





> Genuine question: How is it possible for OpenAI to NOT successfully pre-train a model?

The same way everyone else fails at it.

Change some hyperparameters to match the new hardware (more params), maybe implement the latest improvements from papers after they've been validated in a smaller model run. Start training the big boy, loss looks good, two months and millions of dollars later the loss plateaus, do the whole SFT/RL shebang, run benchmarks.

It's not much better than the previous model, very tiny improvements, oops.


Add to that multiple iterations of having to restart pretraining from some earlier checkpoint when the loss plateaus too early or starts increasing due to some bug…
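
To make the rollback idea concrete, here's a toy PyTorch-style sketch of that loop. The model, batch, thresholds, and checkpoint cadence are all made up; real runs would also reshuffle data, drop the learning rate, or fix the offending bug before resuming.

    import copy

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 16)                    # stand-in for the real network
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    best_loss = float("inf")
    last_good = None                             # (model_state, optimizer_state)

    for step in range(1, 10_001):
        x = torch.randn(32, 16)                  # stand-in for a real batch
        loss = nn.functional.mse_loss(model(x), x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % 1_000 == 0:                    # checkpoint cadence
            if loss.item() < best_loss:
                best_loss = loss.item()
                last_good = (copy.deepcopy(model.state_dict()),
                             copy.deepcopy(optimizer.state_dict()))
            elif loss.item() > 1.5 * best_loss and last_good is not None:
                # loss spiked or went sideways: roll back to the last good state
                model.load_state_dict(last_good[0])
                optimizer.load_state_dict(last_good[1])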

Isn't that what GPT 4.5 was?

That was a large model that iiuc was too expensive to serve profitably

Many people thought it was an improvement though


I’m not sure what ‘successfully’ means in this context. If it means training a model that is noticeably better than previous models, it’s not hard to see how that is challenging.

Ah. Thanks for posting - this makes a lot of sense.

I can totally see how they're able to pre-train models no problem, but are having trouble with the "noticeably better" part.

Thanks!


OpenAI allegedly has not completed a successful pretraining run since 4o

You don't train the next model by starting with the previous one.

A company's ML researchers are constantly improving model architecture. When it's time to train the next model, the "best" architecture is totally different from the last one. So you have to train from scratch (mostly... you can keep some small stuff like the embeddings).
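
Carrying the embeddings over is roughly this (PyTorch-style sketch, sizes invented, just to illustrate the idea; not how any lab actually does it):

    import torch
    import torch.nn as nn

    VOCAB, DIM = 50_000, 1_024                 # invented sizes

    old_embed = nn.Embedding(VOCAB, DIM)       # stands in for the previous model's table
    new_embed = nn.Embedding(VOCAB, DIM)       # same vocab/dim inside the new architecture

    with torch.no_grad():
        new_embed.weight.copy_(old_embed.weight)   # everything else trains from scratch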

The implication here is that they screwed up bigly on the model architecture, and the end result was significantly worse than the mid-2024 model, so they didn't deploy it.


I can't say how the big ML companies do it, but from personal experience training vision models, you can absolutely reuse the weights of barely related architectures (add more layers, switch between different normalization layers, switch between separable/full convolution, change activation functions, etc.). Even if the shapes of the weights don't match, just do what you have to do to make them fit (repeat or crop). Of course the models won't work right away, but training will go much faster. I usually get over 10 times faster convergence that way.
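
In PyTorch terms the surgery looks something like this (toy conv layers, invented shapes; a real version would be more careful about which dims get cropped vs. tiled):

    import torch
    import torch.nn as nn

    def fit(old: torch.Tensor, shape: torch.Size) -> torch.Tensor:
        """Crop or tile `old` along each dim until it matches `shape`."""
        out = old
        for dim, target in enumerate(shape):
            if out.shape[dim] > target:                   # crop
                out = out.narrow(dim, 0, target)
            elif out.shape[dim] < target:                 # tile, then crop to size
                reps = [1] * out.dim()
                reps[dim] = -(-target // out.shape[dim])  # ceil division
                out = out.repeat(*reps).narrow(dim, 0, target)
        return out.clone()

    old_conv = nn.Conv2d(3, 32, 3)                        # layer from the old architecture
    new_conv = nn.Conv2d(3, 64, 3)                        # wider layer in the new one
    with torch.no_grad():
        new_conv.weight.copy_(fit(old_conv.weight, new_conv.weight.shape))
        new_conv.bias.copy_(fit(old_conv.bias, new_conv.bias.shape))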

It's possible the model architecture influences how effectively pretrained weights can be reused. E.g., CNNs might be a good fit for this since the first portion is the feature extractor, but you might scrap the decoder and simply retrain that.

Can't say whether the same would work with the Transformer architecture, but I would guess there are some portions that could potentially be reused? (There's still an encoder/feature-extraction portion.)
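
Something like this is what I have in mind, very roughly (hypothetical PyTorch sketch using stock encoder layers, not anything any lab actually does): copy the first few blocks from the old encoder into a new, deeper one and train the rest from scratch.

    import torch
    import torch.nn as nn

    old_enc = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=6)                                     # "previous" model
    new_enc = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=12)                                    # new, deeper architecture

    with torch.no_grad():
        for i in range(4):                                # reuse the first 4 blocks
            new_enc.layers[i].load_state_dict(old_enc.layers[i].state_dict())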

If you’re reusing weights from an existing model, then it seems it becomes more of a “fine-tuning” exercise as opposed to training a novel foundational model.


Huh - I did not know that, and that makes a lot of sense.

I guess "Start software Vnext off the current version (or something pretty close)" is such a baseline assumption of mine that it didn't occur to me that they'd be basically starting over each time.

Thanks for posting this!


GPT-4.5 was allegedly such a pretrain. It just didn't perform well enough to announce and productize as such.

It wasn't economical to deploy, but I expect it wasn't wasted; expect the OpenAI team to pick that back up at some point.

The scoop Dylan Patel got was that partway through the GPT-4.5 pretraining run the results were very, very good, but it leveled off and they ended up with a huge base model that really wasn't any better on their evals.


