Hacker News | tedsanders's comments

The nonprofit (OpenAI Foundation) owns ~26% of the for-profit, plus some extra warrants.

The for-profit (OpenAI Group PBC) is what's filing the draft S-1.

The OpenAI Foundation also exclusively appoints the board of the OpenAI Group PBC and can replace directors at any time.

https://openai.com/our-structure/

(I work at OpenAI, but I am not a lawyer and am not speaking on behalf of OpenAI - just sharing my personal understanding.)


> The OpenAI Foundation also exclusively appoints the board of the OpenAI Group PBC and can replace directors at any time.

Isn't it hard to write this with a straight face?


If they truly wanted it to be for the benefit of the not-for-profit and safe from interference, the foundation's ownership would be much closer to, or just over, 50%... just thinking out loud...

The magic 50% ownership isn't relevant for that purpose. There are special provisions that mean the Foundation effectively exerts full control over the company, because it appoints the entire board.

Very cool! So glad to see people building and sharing evals that are better than SWE bench.

I'm curious - any particular reason you didn't put error bars on the graphs? Seems like it could be helpful when there are only 50 unique problems in the diamond set.


*50 unique problems but 20-40 rubrics per problem (something I had to keep reminding people internally who were unimpressed with the N)

Simple answer is our reporting was pass@5. Feel like you'd need something like 50+ runs to have reasonable confidence intervals, which somehow I don't see other people do, so I also didn't insist on it.

hoping to work with <prominent third party evals shop> to get this on their infra and evaluated along with whatever the industry standard is.


Makes sense, thanks. I suppose error bars are tricky if trying to handle problem-to-problem variance, rubric-to-rubric variance, and run-to-run variance all at once.
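For anyone wanting a rough starting point on the error-bar question, here is a minimal sketch of a percentile bootstrap over problems. It is an illustration under stated assumptions, not how this eval was actually scored: the per-problem scores are randomly generated placeholders, `bootstrap_ci` is a hypothetical helper name, and the interval only captures problem-to-problem variance (not rubric-to-rubric or run-to-run variance).

```python
import random


def bootstrap_ci(per_problem_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean score, resampling problems with replacement."""
    rng = random.Random(seed)
    n = len(per_problem_scores)
    means = []
    for _ in range(n_resamples):
        # Draw n problems with replacement and record the resampled mean.
        sample = [per_problem_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder data (an assumption): 50 problems, each scored as a fraction in [0, 1].
data_rng = random.Random(1)
scores = [data_rng.random() for _ in range(50)]

mean = sum(scores) / len(scores)
low, high = bootstrap_ci(scores)
print(f"mean={mean:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Handling all three variance sources at once would likely need a hierarchical resampling scheme (resample problems, then runs within each problem), which this sketch does not attempt.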

For what it's worth, I work at OpenAI and I can guarantee you that we don't switch to heavily quantized models or otherwise nerf them when we're under high load. It's true that the product experience can change over time - we're frequently tweaking ChatGPT & Codex with the intention of making them better - but we don't pull any nefarious time-of-day shenanigans or similar. You should get what you pay for.


> we don't switch to heavily quantized models

That sounded like a press bulletin, so just to let you clarify yourself: Does that mean you may switch to lightly quantized models?


There's almost 0% chance that OpenAI doesn't quantize the model right off the bat.

I am willing to bet large amounts of money that OpenAI would never release a model served as fully BF16 in the year of our lord 2026. That would be insane operationally. They're almost certainly doing QAT to FP4 for FFN, and a similar or slightly larger quant for attention tensors.


It's ok if they never release a BF16 model, but it's less ok if they release it, win the benchmarks, then quantise it after a few weeks.


That would be REALLY easy to detect. It'll be 4x slower.

The tokens/sec of the model is basically directly proportional to the memory bandwidth of the hardware it runs on. So either OpenAI has to gimp model performance for its entire life, or somehow magically speed it up 4x on the first day.
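The 4x figure above can be reproduced with back-of-the-envelope arithmetic. This is a sketch under simplifying assumptions: decoding is treated as purely memory-bandwidth bound, every weight is read once per token, and KV cache traffic, batching, and sparsity are ignored. The parameter count and bandwidth are made-up round numbers, not figures for any real model or GPU.

```python
def decode_tokens_per_sec(n_params, bytes_per_param, mem_bandwidth_bytes_per_sec):
    """Upper-bound decode speed if each token requires reading all weights once."""
    bytes_per_token = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / bytes_per_token


# Hypothetical numbers for illustration only.
N_PARAMS = 100e9   # 100B parameters (assumption)
BANDWIDTH = 3e12   # 3 TB/s of memory bandwidth (assumption)

bf16 = decode_tokens_per_sec(N_PARAMS, 2.0, BANDWIDTH)  # BF16: 2 bytes per weight
fp4 = decode_tokens_per_sec(N_PARAMS, 0.5, BANDWIDTH)   # FP4: 0.5 bytes per weight
speedup = fp4 / bf16

print(f"BF16: {bf16:.0f} tok/s, FP4: {fp4:.0f} tok/s, ratio: {speedup:.1f}x")
```

Under these assumptions the ratio depends only on bytes per weight (2 / 0.5), which is where the 4x comes from, regardless of the model size or bandwidth plugged in.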


That is for sure what everyone does. Also, they train on evals, with the datasets that they would be benchmarked against.


What do you mean by this? We don’t train on evals, and if we did I’d quit on the spot.

(The loose version of this that’s true is that there may exist eval data contamination in pretraining. This is a hard problem to fully solve.)


It's not that loose of a version. It's the reality, and it is probably, even surely, a focus of dedicated post-training, RL-ing on these kinds of GitHub repos. Of course you would train specifically on the task. You would mix this eval data with other data across thousands of GitHub repos.


Thanks - let me clarify that we don’t switch to lightly quantized models by time of day or when under heavy load either.

(I used the adjective heavily because that’s what the original post said. I have no intention of making misleading but technically true statements.)


Thank you for your answer. I have a similar question to OP's, but regarding the GPT models in MS Copilot. My experience is that the response quality is much better when calling the API directly or through the web UI.

I know this might be a question that's impossible for you to answer, but if you can shed any light on this matter, I'd be grateful, as I am doing an analysis of which AI solutions could be suitable for my organisation.


As phrased, the only answer is the question: "as opposed to what?"


webUIs have giant system prompts built in

APIs have much smaller ones


It's very interesting to see that this only happens to American companies. What gives?


FYI, Elo isn't an acronym - it's a person's name. No need to capitalize it as ELO.



> ELO ratings

Thank you, I just looked at the chart and said to myself: ELO? YOLO!

That Elo ranking is also called chess ranking


Élő. Meaning alive (él = it lives, -ő = adjective)


Unless you've just missed your last train to London.


You're right, thanks for the heads up! Corrected (I can't edit the post on HN though).


Electric Light Orchestra anyone?


Don't bring me down…


Very cool! The demos felt fairly contrived - e.g., count things while I talk. I wonder what more useful or commercial applications look like.


In theory I would expect it to do everything the current frontier models are capable of but with the added benefit of real time interactivity for better collaboration. The biggest benefit may be the real time video input so it can take in that input in parallel with producing outputs steered by the input rather than taking in a video or all images at once and then producing a single output for all of that.


Yes! This is a big thing I've noticed in all AI demos. If the best use case you can think of to show off your tech is booking a holiday, which I could easily do myself, does your service really add much value? Or is it simply because the real uses will be nuanced and specialised, and not suited to a quick general-audience demo? I'm not sure.


To clarify the title, Daybreak is not a new AI model or a new product. It's a rebranding of OpenAI for Cyber, which is an umbrella over multiple things that OpenAI is doing with companies.


Dec 2025, actually: https://developers.openai.com/api/docs/models/gpt-5.5

(though knowledge cutoffs in practice can be a bit fuzzy)


That's what I get for trusting the output of the LLM instead of checking the docs!


Whether a problem is "good" or "bad" is not always objective or simple.

For example, you can have problems that are underspecified, with hardcoded tests for a particular solution (out of multiple possible solutions). If your solution works fine but used a different function name than the one hardcoded in the tests, you can unfairly score 0.

When an eval has underspecified problems like these, you can still score 100% if you remember the original solution from your training data or if you just have taste similar to the original human authors. And both of these qualities - good memory and good taste - are great, but they'll be rewarded unfairly relative to a model that still did exactly what it was asked but in a different way than the hardcoded tests expected.
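To make the function-name failure mode concrete, here is a toy sketch. Everything in it is hypothetical: the task wording, the names `median` and `compute_median`, and the `hidden_test` grader are invented for illustration and are not taken from any real benchmark.

```python
# Task as shown to the model (underspecified on purpose):
#   "Write a function that returns the median of a list of numbers."

def median(values):
    """A correct solution that happens to use a different name than the reference."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def hidden_test(namespace):
    """Hypothetical grader hardcoded to the reference solution's function name."""
    fn = namespace.get("compute_median")
    if fn is None:
        return 0  # Fails before the logic is ever exercised.
    return 1 if fn([3, 1, 2]) == 2 else 0


# Same logic, scored twice: once under the model's name, once under the reference name.
score = hidden_test({"median": median})
score_renamed = hidden_test({"compute_median": median})
print(score, score_renamed)
```

The two scores differ even though the implementation is identical, which is the sense in which a model with good memory or matching taste gets rewarded over one that solved the stated task differently.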


We don't want hallucinations either, I promise you.

A few biased defenses:

- I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.

- This eval only measures binary attempted vs did not attempt, but doesn't really reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."

- On the flip side, GPT-5.5 has the highest accuracy score.

- With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.

- On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.

- Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.

Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.


Honest answer is that it isn't done running yet. It takes some human bandwidth and time to run, so results weren't ready by this morning. We don't know what the score will be, but it will probably go up on the leaderboard sometime soon. I personally don't put a lot of stock in the ARC-AGI evals, as they're not relevant to most work that people do, but it should still be interesting to see as a measure of reasoning ability.

(I work at OpenAI.)

