
I've seen a rumor going around that OpenAI hasn't had a successful pre-training run since mid 2024. This seemed insane to me, but if you give ChatGPT 5.1 a query about current events and instruct it not to use the internet, it will tell you its knowledge cutoff is June 2024. Not sure if that's just the smaller model or what, but I don't think it's a good sign to get that from any frontier model today; that's 18 months ago.




SemiAnalysis said it last week and AFAIK it wasn't denied.

https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...


The SemiAnalysis article that you linked to stated:

"OpenAI’s leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024, highlighting the significant technical hurdle that Google’s TPU fleet has managed to overcome."

Given the overall quality of the article, that is an uncharacteristically convoluted sentence. At the risk of stating the obvious, "that was broadly deployed" (or not) is contingent on many factors, most of which are not of the GPU vs. TPU technical variety.


My reading between the lines is that OpenAI's "GPT-5" is really a GPT-4 generation model, which is consistent with it being unimpressive and not the leap forward Altman promised.

The only real change I noticed is that it self-censors more than GPT-4.

From what I can tell they just removed the psychosis component that was always telling you you're right.

This is misleading. They had 4.5, which was a new scaled-up training run. It was a huge model and only served to Pro users, but the biggest models are always used as teacher models for smaller models. That's how you do distillation. It would be stupid not to use the biggest model you have for distillation, and a waste, since they already have the weights.

They would have taken some time to calculate the efficiency gains of pretraining vs. RL, resumed GPT-4.5 training for whatever budget made sense, and then spent the rest on RL.

Sure, they chose not to serve the large base models anymore, for cost reasons.

But I’d guess Google is doing the same. Gemini 2.5 samples very fast and seems way too small to be their base pre-train. The efficiency gains in pretraining scale with model scale, so it makes sense to train the largest model possible. But then the models end up super sparse and oversized, and make little sense to serve for inference without distillation.

In RL the efficiency is very different, because you have to run inference on the model to draw online samples. So smaller models start to make more sense to scale.

Big model => distill => RL

Makes the most theoretical sense for training nowadays, in terms of efficient spending.

So they already did train a big model, 4.5. Not using it would have been absurd, and they have a known recipe they could return to scaling if the returns justified it.
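
For the curious, the "distill" step in that Big model => distill => RL pipeline boils down to a loss roughly like this. A minimal PyTorch sketch with hypothetical random logits, not anyone's actual training recipe:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions with a temperature, then minimize
        # KL(teacher || student) so the student mimics the teacher's token distribution.
        t = temperature
        student_logprobs = F.log_softmax(student_logits / t, dim=-1)
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        # The t^2 factor keeps gradient magnitudes comparable across temperatures.
        return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * (t * t)

    # Toy usage with random logits of shape [batch, vocab]:
    loss = distillation_loss(torch.randn(4, 32000), torch.randn(4, 32000))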


My understanding of 4.5 was that it was released long, long after the initial training run finished. It also had an older cutoff date than the newer 4o models.

Cutoff dates seem to be Oct 2024 for GPT-4.5, and Jan 2025 for the Gemini models.

It kind of explains a coding issue I had with TradingView, who update their Pine Script thing quite frequently. ChatGPT seemed to have issues with v4 vs v5.


This is a really great breakdown. With TPUs seemingly more efficient and costing less overall, how does this play for Nvidia? What's to stop them from entering the TPU race with their $5 trillion valuation?

As others mentioned, 5T isn't money available to NVDA. It could leverage that to buy a TPU company in an all stock deal though.

The bigger issue is that entering a 'race' implies a race to the bottom.

I've noted this before, but one of NVDA's biggest risks is that its primary customers are also technical, also make hardware, also have money, and clearly see NVDA's margin (70% gross!!, 50%+ profit) as something they want to eliminate. Google was first to get there (not a surprise), but Meta is also working on its own hardware along with Amazon.

This isn't a doom post for NVDA the company, but its stock price is riding a knife's edge. Any margin or growth contraction will not be a good day for its stock or the S&P.


Making the hardware is actually the easy part. Everyone and their uncle who had some cash have tried by now: Microsoft, Meta, Tesla, Huawei, Amazon, Intel - the list goes on and on. But Nvidia is not a chip company. Huang himself said they are mostly a software company. And that is how they were able to build a gigantic moat, because no one else has even come close on the software side. Google is the only one that has had some success here, because they have also spent tons of money and time on software refinement by now, while all the other chips vanished into obscurity.

Are you saying that Google, Meta, Amazon, etc... can't do software? It's the bread and butter of these companies. The CUDA moat is important to hold off the likes of AMD, but hardware like TPUs for internal use or other big software makers is not a big hurdle.

Of course Huang will lean on the software being key because he sees the hardware competition catching up.


Essentially, yes, they haven’t done deep software. Netflix probably comes closest amongst FAANG.

Google, Meta, and Amazon do “shallow and broad” software. They are quite fast at capturing new markets, they frequently repackage an open-source core and add a large amount of business logic to make it work, but they essentially follow market cycles - they hire and lay off on a few-year cycle, and the people who work there typically also jump around the industry, due to both transferable skills and relatively competitive rival employers.

NVDA is roughly in the same bucket as HFT vendors. They retain talent on 5-10y timescales. They build software stacks that range from complex kernel drivers and hardware simulators all the way to optimizing compilers and acceleration libraries.

This means they can build more integrated, more optimal and more coherent solutions. Just like Tesla can build a more integrated vehicle than Ford.


I have deep respect for CUDA and Nvidia engineering. However, the arguments above seem to totally ignore Google's Search indexing and query software stack. They are the kings of distributed software, and of hardware that scales. That is why TPUs are a thing now and why they can compete with Nvidia where AMD failed. Distributed software is the bread and butter of Google, with their multi-decade investment from day zero out of necessity. When you have to update an index of an evolving set of billions of documents daily, and do that online while keeping subsecond query capability across the globe, that teaches you a few things about deep software stacks.

These companies innovate in all of those areas and direct those resources towards building hyper-scale custom infrastructure, including CPU, TPU, GPU, and custom networking hardware for the largest cloud systems, and conduct research and development on new compilers and operating system components to exploit them.

They're building it for themselves and employ world-class experts across the entire stack.

How can NVIDIA develop "more integrated" solutions when they are primarily building for these companies, as well as many others?

Examples of these companies doing things you mention as being somehow unique to or characteristic of NVIDIA:

Complex kernel drivers or modules:

- AWS: Nitro, ENA/EFA, Firecracker, NKI, bottlerocket

- Google: gasket/apex, gve, binder

- Meta: Katran, bpfilter, cgroup2, oomd, btrfs

Hardware simulators:

- AWS: Neuron, Annapurna builds simulations for nitro, graviton, inferentia and validates aws instances built for EDA services

- Google: Goldfish, Ranchu, Cuttlefish

- Meta: Arcadia, MTIA, CFD for thermal management

Optimizing Compilers:

- Amazon: NNVM, Neo-AI

- Google: MLIR, XLA, IREE

- Meta: Glow, Triton, LLM Compiler

Acceleration Libraries:

- Amazon: NeuronX, aws-ofi-nccl

- Google: Jax, TF

- Meta: FBGEMM, QNNPACK


You're suggesting Waymo isn't deep software? Or Tensorflow? Or Android? The Go programming language? Or MapReduce, AlphaGo, Kubernetes, the transformer, Chrome/Chromium or Gvisor?

You must have an amazing CV to think these are shallow projects.


No, I just realize these for what they are - reasonable projects at the exploitation (rather than exploration) stage of any industry.

I’d say I have an average CV in the EECS world, but also relatively humble perspective of what is and isn’t bleeding edge. And as the industry expands, the volume „inside” the bleeding edge is exploitation, while the surface is the exploration.

Waymo? Maybe; but that’s an acquisition and they haven’t done much deep work since. Tensorflow is a handy and very useful DSL, but one that is shallow (builds heavily on CUDA and TPUs etc); Android is another acquisition, with rather incremental growth since; Go is an nth C-like language (so neither Dennis Ritchie nor Bjarne Stroustrup level work); MapReduce is a darn common concept in HPC (SGI had libraries for it in the 1990s) and the implementation was pretty average. AlphaGo - another acquisition, and not much deep work since; Kubernetes is a layer over Linux namespaces to solve - well - shallow and broad problems; Chrome/Chromium is the 4th major browser to reach dominance, and essentially anyone with a billion to spare can build one... gVisor is another thin, shallow layer.

What I mean by deep software is a product that requires 5-10y of work before it is useful, that touches multiple layers of the software stack (ideally all of them, from hardware to application), etc. But these types of jobs are relatively rare in the 2020s software world (pretty common in robotics and new space) - they were common in the 1990s, where I got my calibration values ;) Netscape and the Palm Pilot were a „whoa”. Chromium and Android are evolutions.


> No, I just realize these for what they are - reasonable projects at the exploitation (rather than exploration) stage of any industry.

I get that bashing on Google is fun, but TensorFlow was the FIRST modern end-user ML library. JAX, an optimizing backend for it, is in its own league even today. The damn thing is almost ten years old already!

Waymo is literally the only truly publicly available robotaxi company. I don't know where you get the idea that it's an acquisition; it's the spun-off incarnation of the Google self-driving car project that for years was the butt of "haha, software engineers think they're real engineers" jokes. Again, more than a decade of development on this.

Kubernetes is a refinement of Borg, which Google was using to do containerized workloads all the way back in 2003! How's that not a deep project?


True, for some definition of first and some definition of modern. I’d say it builds extremely heavily on the works inside XTX (and prior to that, XFactor etc) on general purpose linear algebra tooling, and still doesn’t change the fact that it remains shallow, even including JAX. Google TPUs change this equation a bit, as they are starting to come to fruition; but for them to reach the level of depth of NVDA, or even DEC to SUN, they’d have to actually own it from silicon to apps… and they eventually might. But the bulk of work at Google is narrow end-user projects, and they don’t have (at large) a deep engineering excellence focus.

Waymo is an acquihire from the ‘05 DARPA challenges, and I’d say Tesla got there too (but with a much stricter hardware-to-user stack, which ought to bear fruit).

I’d say Kubernetes would be impressive compared to 1970s mainframes ;) Jokes aside, it’s a neat tool to use crappy PCs as server farms, which was sort of Google’s big insight in 2000s when everyone was buying Sun and dying with it, but that makes it not deep, at least not within Google itself.

But this may change. I think Brin recognized this during the Code Red, and they have started leaning very heavily on building a technical moat, since OpenAI was the first credible threat to the user-behavior moat.


You think that Tesla, which has not accepted liability for a single driverless ride, has "gotten there?" I'm not even going to look up how many Waymo does in a month, I'm sure it's in the millions now.

Come on, man.

> Google's TPUs change this equation a bit

Google has been using TPUs to serve billions of customers for a decade. They were doing it at that scale before anyone else. They use them for training, too. I don't know why you say they don't own the stack "from silicon to apps" because THEY DO. Their kernels on their silicon to serve their apps. Their supply chain starts at TSMC or some third-party fab, exactly like NVIDIA.

Google's technical moat is a hundred miles deep, regardless of how dysfunctional it might look from the outside.


I think Theano takes the crown as first modern end-user library for autodiff and tensor operations.

Original Torch too. https://torch.ch/

Ok, that's fair.

Well put. I haven’t thought about it like that.

But the first example sigmoid10 gave of a company that can't do software was Microsoft.

Yeah I'm not convinced Microsoft can do software anymore. I think they're a shambling mess of a zombie software company with enough market entropy to keep going for a long time.

The prosecution presents windows 11 as evidence that Microsoft can’t do software. Actually that’s it, that’s the entirety of the case.

The prosecution rests.


Due to a clerical error, the frontend updates to GitHub were not part of discovery, so they're not allowed as evidence. Still, though.

Yeah the fact they had to resort to forking Chrome because they couldn’t engineer a browser folks wanted to use is pretty telling.

They did engineer a good browser: original Edge with the Chakra JavaScript Engine. It was faster than Google Chrome and had some unique features: a world-best, butter-smooth and customizable epub reader. I loved it for reading - it beat commercial epub readers - and then Nadella took over and said Microsoft is getting rid of it and Edge will move to Chromium and Microsoft will also get rid of Windows phone. Modern Microsoft will be Cloud/AI and Ads. That was so depressing.

I don't think that tells us anything.

Maintaining a web browser requires about 1000 full-time developers (about the size of the Chrome team at Google) i.e., about $400 million a year.

Why would Microsoft incur that cost when Chromium is available under a license that allows Microsoft to do whatever it wants with it?


You could say the same thing about all Microsoft products then. How many full time developers does it take to support Windows 11 when Linux is available, SqlServer when Postgres is available, Office when LibreOffice exists?

And so on, all under licenses that allow Microsoft to do whatever it wants with them?

They should be embarrassed into doing better, not spin it into a “wise business move”, a.k.a. transferring that money into executive bonuses.


Microsoft gets a lot of its revenue from the sale of licenses and subscriptions for Windows and Office. An unreliable source that gives fast answers to questions tells me that the segments responsible for those two pieces of software have revenue of about $13 billion and about $20 billion per quarter, respectively.

In contrast, basically no one derives any significant revenue from the sale of licenses or subscriptions for web browsers. As long as Microsoft can modify Chromium to have Microsoft's branding, to nag the user into using Microsoft Copilot and to direct search queries to Bing instead of Google Search, why should Microsoft care about web browsers?

It gets worse. Any browser Microsoft offers needs to work well on almost any web site. These web sites (of which there are 100s of 1000s) are in turn maintained by developers (hi, web devs!) who tend to be eager to embrace any new technology Google puts into Chrome, with the result that Microsoft must respond by putting the same technological capabilities into its own web browser. Note that the same does not hold for Windows: there is no competitor to Microsoft offering a competitor to Windows that is constantly inducing the maintainers of Windows applications to embrace new technologies, requiring Microsoft to incur the expense of engineering Windows to keep up. This suggests to me that maintaining Windows is actually significantly cheaper than maintaining an independent mainstream browser would be. An independent mainstream browser is probably the most expensive category of software to create and to maintain, excepting only foundational AI models.

"Independent" here means "not a fork of Chromium or Firefox". "Mainstream" means "capable of correctly rendering the vast majority of web sites a typical person might want to visit".


You don't need a Google-sized team to work on a browser. No other browser engine has a team that large.

They did incur that cost… for decades. They were in a position where their customers were literally forced to use their product and they still couldn’t create something people wanted to use.

Potentially these last two points are related.


Huang said that many years ago, long before ChatGPT or the current AI hype were a thing. In that interview he said that their costs for software R&D and support are equal or even bigger than their hardware side. They've also been hiring top SWE talent for almost two decades now. None of the other companies have spent even close to this much time and money on GPU software, at least until LLMs became insanely popular. So I'd be surprised to see them catch up anytime soon.

If CUDA were as trivial to replicate as you say then Nvidia wouldn’t be what it is today.

CUDA is not hard to replicate, but the network effects make it very hard to break through with a new product. Just like with everything where network effects apply.

Meta makes websites and apps. Historically, they haven't succeeded at lower-level development. A somewhat recent example was when they tried to make a custom OS for their VR headsets, completely failed, and had to continue using Android.

You're generalizing a failure at delivering one consumer solution and ignoring the successful infrastructure research and development that occurs behind the scenes.

Meta builds hardware from chip to cluster to datacenter scale, and drives research into simulation at every scale, all the way to CFD simulation of datacenter thermal management.


More than one failure. They had a project to make a custom chip for model training a few years ago, and they scrapped it. Now they have another one, which entered testing in March. I don't think it's going well, because testing should have wrapped up recently, right before the news that they're in serious talks to buy a lot of TPUs from Google. On the other side of the stack, Llama 4 was a disaster and they haven't shipped anything since.

They have the money and talent to do it. As you point out, they do have major successes in areas that take real engineering. But they also have a lot of failures. It will depend how the internal politics play out, I imagine.


Remind me which company originated PyTorch?

Remind me that PyTorch is not a GPU driver.

Genuine question: given LLMs' inexorable commoditization of software, how soon before NVDA's CUDA moat is breached too? Is CUDA somehow fundamentally different from other kinds of software or firmware?

Current Gen LLMs are not breaching the moat yet.

Yeah they are. llama.cpp has had good performance on cpu, amd, and apple metal for at least a year now.

The hardware is not the issue. It's the model architectures leading to cascading errors.

Nvidia has everything they need to build the most advanced GPU Chip in the world and mass produce it.

Everything.

They can easily just do this for more optimized Chips.

"easily" in sense of that wouldn't require that much investment. Nvidia knows how to invest and has done this for a long time. Their Ominiverse or robots platform isaac are all epxensive. Nvidia has 10x more software engineers than AMD


They still go to TSMC for fab, and so does everyone else.

For sure. But they also have high volume and know how to do everything.

Also certain companies normally don't like to do things themselves if they don't have to.

Nonetheless, Nvidia is where it is because it has CUDA and an ecosystem. Everyone uses this ecosystem, and then you just run that stuff on the bigger version of the same ecosystem.


> What's to stop them from entering the TPU race with their $5 trillion valuation?

Valuation isn’t available money; they'd have to raise more money, in an investment environment that is probably tighter for them, to enter the TPU race, since the money they have already raised (which that valuation is based on) is already needed to provide runway for what they are currently doing, never mind the TPU race.


Nvidia is already in the TPU race aren't they? This is exactly what the tensor cores on their current products are supposed to do, but they're just more heterogeneous GPU based architectures and exist with CUDA cores etc. on the same die. I think it should be within their capability to make a device which devotes an even higher ratio of transistors to tensor processing.

$5 trillion valuation doesn't mean it has $5 trillion cash in pocket -- so "it depends"

If you look at the history how GPUs evolved:

1. There had to be fixed-function hardware for certain graphics stages.

2. Programmable massively parallel hardware took over. Nvidia was at the forefront of this.

TPUs seem to me similar to fixed-function hardware. For Nvidia it's a step backwards, and even though they have been moving in this direction recently, I can't see them going all the way.

Otherwise you don't need CUDA, but hardware guys who write Verilog or VHDL. They don't have that much of an edge there.


Why dig for gold when you are the gold standard for the shovel already?

That is.... actually a seriously meaty article from a blog I've never heard of. Thanks for the pointer.

SemiAnalysis is great; they typically do semiconductors, but the reporting is top notch.

Wow, that was a good article. So much detail from financial to optical linking to build various data flow topologies. Makes me less aghast at the $10M salaries for the masters of these techniques.

This article about them got published just yesterday... https://news.ycombinator.com/item?id=46124883

There's a lot of misleading information in what they publish, plagiarism, and I believe some information that wouldn't be possible to get without breaking NDAs


> I believe some information that wouldn't be possible to get without breaking NDAs

…why would I care about this in the slightest?


Dylan Patel founded Semianalysis and he has a great interview with Satya Nadella on Dwarkesh Patel's podcast.

Semianalysis is great, def recommend following

Dylan Patel joined Dwarkesh recently to interview Satya Nadella: https://www.dwarkesh.com/p/satya-nadella-2

And this is relevant how? That interview is 1.5 hours, not something you just casually drop a link to and say "here, listen to this to even understand what point I was trying to make"

Sorry, this was meant to be a reply to this comment: https://news.ycombinator.com/item?id=46127942

I was trying to make the point that SemiAnalysis is semi-famous.


The video is interesting, showing Microsoft's latest data center and Nadella talking about it. I preferred the YouTube version: https://youtu.be/8-boBsWcr5A

You can now ask Gemini about a video. Very useful!

I have a few lines of "download subtitles with yt-dlp", "remove the VTT crap", and "shove it into llm with a summarization prompt and/or my question appended", but I mostly use Gemini for that now. (And I use it for basically nothing else, oddly enough. They just have the monopoly on access to YouTube transcripts ;)
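
Those few lines look roughly like this as a Python sketch. The yt-dlp flags are from its docs, and summarize() is a placeholder for whatever model or CLI you pipe the text into:

    import re
    import subprocess
    from pathlib import Path

    def fetch_transcript(url: str, workdir: str = ".") -> str:
        # Download only the auto-generated subtitles as VTT, no video.
        subprocess.run(
            ["yt-dlp", "--skip-download", "--write-auto-subs",
             "--sub-format", "vtt", "-o", f"{workdir}/transcript", url],
            check=True,
        )
        vtt = next(Path(workdir).glob("transcript*.vtt")).read_text()
        # Strip the WEBVTT header, timestamp lines, and inline cue tags,
        # and drop the duplicated lines auto-subs are full of.
        lines = [l for l in vtt.splitlines()
                 if l.strip() and "-->" not in l and not l.startswith("WEBVTT")]
        return re.sub(r"<[^>]+>", "", "\n".join(dict.fromkeys(lines)))

    # text = fetch_transcript("https://youtu.be/...")
    # print(summarize("Summarize this transcript:\n" + text))  # your LLM call here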

<insert link to 2 hour long YouTube video>

That's my reply. I assume everyone who wants to know my point has access to a LLM that can summarize videos.

Is this how internet communication is supposed to be now?


It's not a rumor, it's confirmed by OpenAI. All "models" since 4o are actually just optimizations in prompting and a new routing engine. The actual -model- you are using with 5.1 is 4. Nothing has been pre-trained from scratch since 4o.

Their own press releases confirm this. They call 5 their best new "ai system", not a new model

https://openai.com/index/introducing-gpt-5/


I can believe this; DeepSeek V3.2 shows that you can get close to "gpt-5" performance with a gpt-4 level base model just with sufficient post-training.

Deepseek scores Gold at IMO and IOI while GPT-5 scores Bronze. OpenAI now has to catch up to china.

...in a single benchmark.

No. Many benchmarks; I just mentioned those two as they were being bragged about by OpenAI and Google when their internal models achieved gold.

I don't think that counts as confirmation. We know 4.5 was a new base model. I find it very, very unlikely that the base model of 4 (or 4o) is in GPT-5. Also, 4o is a different base model from 4, right? It's multimodal, etc. Pretty sure people have leaked sizes and I don't think they match up.

Gpt-5 is a “model router”

A new "AI system" doesn't preclude new models. I thought when GPT-5 launched and users hated it, the speculation was that GPT-5 was a cost-cutting release and the routing engine was routing to smaller, specialized, dumber models that cost less on inference?

It certainly was much dumber than 4o on Perplexity when I tried it.


> and the routing engine was routing to smaller, specialized dumber models that cost less on inference?

That this was part of it was stated outright in their launch announcement, except maybe that they "cost less", which was left for you to infer (sorry).

Paying for pro, and setting it to thinking all the time, I saw what seemed like significant improvements, but if your requests got (mis-)routed to one of the dumber models, it's not surprising if people were disappointed.

I think they made a big mistake in not clearly labelling the responses with which of the models responded to a given request, as it made people complain about GPT 5 in general, instead of complaining about the routing.


I think it’s more about deciding how much to think about stuff and not a model router per se. 5 and 5.1 get progressively better calibrated reasoning token budgets. Also o3 and “reasoning with tools” for a massive consumer audience was a major advance and fairly recent

Well then 5.x is pretty impressive

Maybe this is just armchair bs on my part, but it seems to me that the proliferation of AI-spam and just general carpet bombing of low effort SEO fodder would make a lot of info online from the last few years totally worthless.

Hardly a hot take. People have theorized about the ouroboros effect for years now. But I do wonder if that’s part of the problem


Gemini 3 has a similar 2024 cutoff and they claim to have trained it from scratch. I wish they would say more about that.

Every so often I try out a GPT model for coding again, and manage to get tricked by the very sparse conversation style into thinking it's great for a couple of days (when it says nothing and then finishes producing code with an 'I did x, y and z', with no stupid 'you're absolutely right' sucking up, and it works, it feels very good).

But I always realize it's just smoke and mirrors - the actual quality of the code and the failure modes and stuff are just so much worse than claude and gemini.


I am a novice programmer -- I have programmed for 35+ years now but I build and lose the skills moving between coder to manager to sales -- multiple times. Fresh IC since last week again :) I have coded starting with Fortran, RPG and COBOL and I have also coded Java and Scala. I know modern architecture but haven't done enough grunt work to make it work or to debug (and fix) a complex problem. Needless to say sometimes my eyes glaze over the code.

And I write some code for my personal enjoyment, and I gave it to Claude 6-8 months back for improvement; it gave me a massive change log, and it was quite risky, so I abandoned it.

I tried this again with Gemini last week, I was more prepared and asked it to improve class by class, and for whatever reasons I got better answers -- changed code, with explanations, and when I asked it to split the refactor in smaller steps, it did so. Was a joy working on this over the thanksgiving holidays. It could break the changes in small pieces, talk through them as I evolved concepts learned previously, took my feedback and prioritization, and also gave me nuanced explanation of the business objectives I was trying to achieve.

This is not to downplay claude, that is just the sequence of events narration. So while it may or may not work well for experienced programmers, it is such a helpful tool for people who know the domain or the concepts (or both) and struggle with details, since the tool can iron out a lot of details for you.

My goal now is to have another project for winter holidays and then think through 4-6 hour AI assisted refactors over the weekends. Do note that this is a project of personal interest so not spending weekends for the big man.


> I was more prepared and asked it to improve class by class, and for whatever reasons I got better answers

There is a learning curve with all of the LLM tools. It's basically required for everyone to go through the trough of disillusionment when you realize that the vibecoding magic isn't quite real in the way the influencers talk about it.

You still have to be involved in the process, steer it in the right direction, and review the output. Rejecting a lot of output and re-prompting is normal. From reading comments I think it's common for new users to expect perfection and reject the tools when it's not vibecoding the app for them autonomously. To be fair, that's what the hype influencers promised, but it's not real.

If you use it as an extension of yourself that can type and search faster, while also acknowledging that mistakes are common and you need to be on top of it, there is some interesting value for some tasks.


For me the learning curve was learning to choose what is worth asking to Claude. After 3 months on it, I can reap the benefit: Claude produces the code I want right 80% of the time. I usually ask it: to create new functions from scratch (it truly shines at understanding the context of these functions by reusing other parts of the code I wrote), refactor code, create little tools (for example a chart viewer).

It really depends on what you're building. As an experiment, I started having Claude Code build a real-time strategy game a bit over a week ago, and it's done an amazing job, with me writing no code whatsoever. It's an area with lots of tutorials for code structure etc., and I'm guessing that helps. And so while I've had to read the code and tell it to refactor things, it has managed to do a good job of it with just relatively high level prodding, and produced a well-architected engine with traits based agents for the NPCs and a lot of well-functioning game mechanics. It started as an experiment, but now I'm seriously toying with building an actual (but small) game with it just to see how far it can get.

In other areas, it is as you say and you need to be on top of it constantly.

You're absolutely right re: the learning curve, and you're much more likely to hit an area where you need to be on top of it than one that it can do autonomously, at least without a lot of scaffolding in the form of sub-agents, and rules to follow, and agent loops with reviews etc., which takes a lot of time to build up, and often include a lot of things specific to what you want to achieve. Sorting through how much effort is worth it for those things for a given project will take time to establish.


I suspect the meta architecture can also be done autonomously, though no one has gotten there yet; figuring out the right fractal dimension for sub-agents and the right prompt context can itself be thought of as a learning problem.

I appreciate this narrative; relatable to me in how I have experienced and watched others around me experience the last few years. It's as if we're all kinda-sorta following a similar "Dunning–Kruger effect" curve at the same time. It feels similar to growing up mucking around with a ppp connection and Netscape in some regards. I'll stretch it: "multimodal", meet your distant analog "hypermedia".

My problem with Gemini is how token hungry it is. It does a good job but it ends up being more expensive than any other model because it's so yappy. It sits there and argues with itself and outputs the whole movie.

Breaking down requirements, functionality and changes into smaller chunks is going to give you better results with most of the tools. If it can complete smaller tasks in the context window, the quality will likely hold up. My go to has been to develop task documents with multiple pieces of functionality and sub tasks. Build one piece of functionality at a time. Commit, clear context and start the next piece of functionality. If something goes off the rails, back up to the commit, fix and rebase future changes or abandon and branch.

That’s if I want quality. If I just want to prototype and don’t care, I’ll let it go. See what I like, don’t like and start over as detailed above.


Interesting. From my experience, Claude is much better at stuff involving frontend design somehow compared to other models (GPT is pretty bad). Gemini is also good but often the "thinking" mode just adds stuff to my code that I did not ask it to add or modifies stuff to make it "better". It likes to 1 up on the objective a lot which is not great when you're just looking for it to do what you precisely asked it and nothing else.

I have never considered trying to apply Claude/Gemini/etc. to Fortran or COBOL. That would be interesting.

I was just giving my history :) but yes I am sure this could actually get us out of the COBOL lock-in which requires 70 years old programmers to continue working.

The last article I could find on this is from 2020 though: https://www.cnbc.com/2020/04/06/new-jersey-seeks-cobol-progr...


Or you could just learn cobol. Using an LLM with a language you don’t know is pretty risky. How do you spot the subtle but fatal mistakes they make?

You can actually use Claude Code (and presumably the other tools) on non-code projects, too. If you launch claude code in a directory of files you want to work on, like CSVs or other data, you can ask it to do planning and analysis tasks, editing, and other things. It's fun to experiment with, though for obvious reasons I prefer to operate on a copy of the data I'm using rather than let Claude Code go wild.

I use Claude Code for "everything", and just commit most things into git as a fallback.

It's great to then just have it write scripts, and then write skills to use those scripts.

A lot of my report writing etc. now involve setting up a git repo, and use Claude to do things like process the call transcripts from discovery calls and turn them into initial outlines and questions that needs followup, and tasks lists, and write scripts to do necessary analysis etc., so I can focus on the higher level stuff.


Side note from someone who just used Claude Code today for the first time: Claude Code is a TUI, so you can run it in any folder/with any IDE and it plays along nicely. I thought it was just another vscode clone, so I was pleasantly surprised that it didn't try to take over my entire workflow.

It's even better: It's a TUI if you launch it without options, but you can embed it in scripts too - the "-p" option takes a prompt, in which case it will return the answer, and you can also provide a conversation ID to continue a conversation, and give it options to return the response as JSON, or stream it.

Many of the command line agent tools support similar options.
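
For example, a small Python wrapper around that headless mode. The "-p" flag is the one mentioned above; treat "--output-format json" as an assumption and check `claude --help` for the exact option names in your version:

    import json
    import subprocess

    def ask_claude(prompt: str) -> dict:
        # Run Claude Code in non-interactive (print) mode and parse its JSON reply.
        result = subprocess.run(
            ["claude", "-p", prompt, "--output-format", "json"],
            capture_output=True, text=True, check=True,
        )
        return json.loads(result.stdout)

    # reply = ask_claude("List the TODO comments in this repo and suggest priorities")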


They also have a vscode extension that compares with github copilot now, just so you know.

I'm starting with Claude at work but did have an okay experience with OpenAi so far. For clearly delimited tasks it does produce working code more often than not. I've seen some improvement on their side compared to say, last year. For something more complex and not clearly defined in advance, yes, it does produce plausible garbage and it goes off the rails a lot. I was migrating a project and asked ChatGPT to analyze the original code base and produce a migration plan. The result seemed good and encouraging because I didn't know much about that project at that time. But I ended up taking a different route and when I finished the migration (with bits of help from ChatGPT) I looked at the original migration plan out of curiosity since I had become more familiar with the project by now. And the migration plan was an absolutely useless and senseless hallucination.

Use Codex for coding work

On the contrary, I cannot use the top Gemini and Claude models because their outputs are so out of place and hard to integrate with my code bases. The GPT-5 models integrate with my code base's existing patterns seamlessly.

Supply some relevant files of your codebase in the ClaudeAI project area in the right part of the browser. Usually it will understand your architecture, patterns, principles

I'm using AI in-editor, all the models have full access to my code base.

You realize on some level that all of these sorts of anecdotes, though, are simply random coincidence.

NME at all - 5.1 codex has been the best by far.

How can you stand the excruciating slowness? Claude Code is running circles around codex. The most mundane tasks make it think for a minute before doing anything.

I use it on medium reasoning and it's decently quick. I only switch to gpt-5.1-codex-max xhigh for the most annoying problems.

By learning to parallelize my work. This also solved my problem with slow Xcode builds.

Well you can’t edit files while Xcode is building or the compiler will throw up, so I‘m wondering what you mean here. You can’t even run swift test in 2 agents at the same time, because swift serializes access for some reason.

Whenever I have more than 1 agent run Swift tests in a loop to fix things, and another one to build something, the latter will disturb the former and I need to cancel.

And then there’s a lot of work that can’t be parallelized, like complex git rebases - well you can do other things in a worktree, but good luck merging that after you‘ve changed everything in the repo. Codex is really really bad at git.


Yes these are horrible pain points. I can only hope Apple improves this stuff if it's true that they're adding MCP support throughout the OS which should require better multi-agent handling

You can use worktrees to have multiple copies building or testing at once

I'm a solo dev so I rarely use some git features like rebase. I work out of trunk only without branches (if I need a branch, I use a feature flag). So I can't help with that

What I did is build an Xcode MCP server that controls Xcode via AppleScript and the simulator via accessibility & idb. For running, it gives locks to the agent that the agent releases once it's done via another command (or by pattern matching on logs output or scripting via JS criteria for ending the lock "atomically" without requiring a follow-up command, for more typical use). For testing, it serializes the requests into a queue and blocks the MCP response.

This works well for me because I care more about autonomous parallelization than I do eliminating waiting states, as long as I myself am not ever waiting. (This is all very interesting to me as a former DevOps/Continuous Deployment specialist - dramatically different practices around optimizing delivery these days...)

Once I get this tool working better I will productize it. It runs fully inside the macOS sandbox so I will deploy it to the Mac App Store and have an iOS companion for monitoring & managing it that syncs via iCloud and TailScale (no server on my end, more privacy friendly). If this sounds useful to you please let me know!

In addition to this, I also just work on ~3 projects at the same time and rotate through them by having about 20 iTerm2 tabs open where I use the titles of each tab (cmd-i to update) as the task title for my sake.

I've also started building more with SwiftWASM (with SQLite WASM, and I am working on porting SQLiteData to WASM too so I can have a unified data layer that has iCloud sync on Apple platforms) and web deployment for some of my apps features so that I can iterate more quickly and reuse the work in the apps.


Yes, that makes sense to me. I cannot really put builds in a queue because I have very fine-grained updates that I tell my agents so they do need the direct feedback to check what they have just done actually works, or they will interfere with each other’s work.

I do strive to use Mac OS targets because those are easier to deal with than a simulator, especially when you use Bluetooth stuff and you get direct access to log files and SQLite files.

Solo devs have it way easier in this new world because there’s no strict rules to follow. Whatever goes, goes, I guess.


I found Codex got much better (and with some AGENTS.md context about it) at ignoring unrelated changes from other agents in the same repo. But making worktrees easier to spin up and integrate back in might be a better approach for you.

When the build fails (rather than functional failure), most of the time I like to give the failure to a brand new agent to fix rather than waste context on the original agent resolving it, now that they're good at picking up on those changes. Wastes less precious context on the main task, and makes it easier to not worry about which agent addresses which build failures.

And then for individual agents checking their own work, I rely on them inspecting test or simulator/app results. This works best if agents don't break tests outside the area they're working in. I try to avoid having parallel agents working on similar things in the same tree.

I agree on the Mac target ease. Especially also if you have web views.

Orgs need to adapt to this new world too. The old way of forcing devs generally to work on only one task at a time to completion doesn't make as much sense anymore even from the perspective of the strictest of lean principles. That'll be my challenge to figure out and help educate that transformation if I want to productize this.


How can I get in touch?

hn () manabi.io

I use the web ui, easy to parallelize stuff to 90% done. manually finish the last 10% and a quick test

For Xcode projects?

i workshop a detailed outline w it first, and once i'm happy w the plan/outline, i let it run while i go do something else

By my tests (https://github.com/7mind/jopa) Gemini 3 is somewhat better than Claude with Opus 4.5. Both obliterate Codex with 5.1

What's - roughly - your monthly spend when using ppt models? I only use fixed priced copilot, and my napkin maths says I'd be spending something crazy like $200/mo if I went ppt on the more expensive models.

They have subscriptions too (at least Claude and ChatGPT/Codex; I don't use Gemini much). It's far cheaper to use the subscriptions first and then switch to paying per token beyond that.

Something around 500 euros.

Codex is super cheap though; even with the cheapest GPT subscription you get lots of tokens. I use Opus 4.5 at work and Codex at home; tbh the differences are not that big if you know what you are doing.

NME = "not my experience" I presume.

JFC TLA OD...


I've been getting great results from Codex. Can be a bit slow, but gets there. Writes good Rust, powers through integration test generation.

So (again) we are just sharing anecdata


You're absolutely right!

Somehow it doesn't get on my nerves (unlike Gemini with "Of course").


Can you give some concrete examples of programming tasks that GPT fails to solve?

Interested, because I’ve been getting pretty good results on different tasks using Codex.


Try to ask it to write some GLSL shaders. Just describe what you want to see and then try to run the shaders it outputs. It can output a UV map or a simple gradient right, but when it comes to slightly more complex shaders, most of the time the result will not compile or run properly; it sometimes mixes GLSL versions, and sometimes just straight makes up things which don't work or don't output what you want.

Library/API conflicts are the biggest pain point for me usually. Especially breaking changes. RLlib (currently 2.41.0) and Gymnasium (currently 0.29.0+) have ended in circles many times for me because they tend to be out of sync (for multi-agent environments). My go to test now is a simple hello world type card game like war, competitive multi-agent with rllib and gymnasium (pettingzoo tends to cause even more issues).

Claude Sonnet 4.5 was able to figure out a way to resolve it eventually (around 7 fixes) and I let it create an rllib.md with all the fixes and pitfalls and am curious if feeding this file to the next experiment will lead to a one-shot. GPT-5 struggled more but haven't tried Codex on this yet so it's not exactly fair.

All done with Copilot in agent mode, just prompting, no specs or anything.


I posted this example before but academic papers on algorithms often have pseudo code but no actual code.

I thought it would be handy to use AI to produce code from a paper, so a few months ago I tried to use Claude (not GPT, because I only have access to Claude) to recreate C++ code implementing the algorithms in this paper, as practice for me in LLM use, and it didn't go well.

https://users.cs.duke.edu/~reif/paper/chen/graph/graph.pdf


I just tried it with GPT-5.1-Codex. The compression ratio is not amazing, so not sure if it really worked, but at least it ran without errors.

A few ideas how to make it work for you:

1. You gave a link to a PDF, but you did not describe how you provided the content of the PDF to the model. It might only have read the text with something like pdftotext, which for this PDF results in a garbled mess. It is safer to convert the pages to PNG (e.g. with pdftoppm, as sketched after this list) and let the model read it from the pages. A prompt like "Transcribe these pages as markdown." should be sufficient. If you cannot see what the model did, there is a chance it made things up.

2. You used C++, but Python is much easier to write. You can tell the model to translate the code to C++ once it works in Python.

3. Tell the model to write unit tests to verify that the individual components work as intended.

4. Use Agent Mode and tell the model to print something and to judge whether the output is sensible, so it can debug the code.
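
For point 1, a rough sketch of the page-to-PNG conversion, assuming pdftoppm from poppler-utils is installed (filenames and DPI are just examples):

    import subprocess
    from pathlib import Path

    def pdf_to_pngs(pdf_path: str, out_dir: str = "pages", dpi: int = 150) -> list[Path]:
        Path(out_dir).mkdir(exist_ok=True)
        # Produces pages/page-01.png, pages/page-02.png, ...
        subprocess.run(
            ["pdftoppm", "-png", "-r", str(dpi), pdf_path, f"{out_dir}/page"],
            check=True,
        )
        return sorted(Path(out_dir).glob("page-*.png"))

    # Attach these images to the model with a prompt like
    # "Transcribe these pages as markdown." before asking for an implementation.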


Interesting. Thanks for the suggestions.

It completely failed for me at running the code it changed in a Docker container I keep running. Claude did it flawlessly. It absolutely rocks at code reviews, but it's terrible in comparison at generating code.

It really depends on what kind of code. I've found it incredible for frontend dev, and for scripts. It falls apart in more complex projects and monorepos

I find that for difficult math and design questions, GPT-5 tends to produce better answers than Claude and Gemini.

Could you clarify what you mean by design questions? I do agree that GPT5 tends to have a better agentic dispatch style for math questions but I've found it has really struggled with data model design.

At this point you are now forced to use the "AI"s as code search tools--and it annoys me to no end.

The problem is that the "AI"s can cough up code examples based upon proprietary codebases that you, as an individual, have no access to. That creates a significant quality differential between coders who only use publicly available search (Google, Github, etc.) vs those who use "AI" systems.


How would the AIs have access to proprietary codebases?

Microsoft owns github

Same experience here. The more commonly known the stuff it regurgitates is, the fewer errors. But if you venture into RF electronics or embedded land, beware of it turning into a master of bs.

Which makes sense for something that isn’t AI but LLM.


OpenAI is in the "don't look behind the curtain" stage with both their technology and finances.

I recall reading that Google had similar 'delay' issues when crawling the web in 2000 and early 2001, but they managed to survive. That said, OpenAI seems much less differentiated (now) than Google was back then, so this may be a much riskier situation.

Google didn't raise at a $500 billion valuation.

The 25x revenue multiple wouldn't be so bad if they weren't burning so much cash on R&D and if they actually had a moat.

Google caught up quick, the Chinese are spinning up open source models left and right, and the world really just isn't ready to adopt AI everywhere yet. We're in the premature/awkward phase.

They're just too early, and the AGI is just too far away.

Doesn't look like their "advertising" idea to increase revenue is working, either.


There is no moat in selling/renting AI models. They are a commoditized product now. I can't imagine what thought process led investors to pour that much money into OpenAI.

Tulip mania is a mania because it short circuits thought.

The differentiation should be open source, nonprofit, and ethical.

As a shady for-profit, there is none. That's the problem with this particular fraud.


Why is profit bad? You can be open source, ethical, and for-profit.

If you start out as a non-profit, and pull a bunch of shady shenanigans in order to convert to a for-profit, claiming to be ethical after that is a bit of a hard sell.

Yes, the story was something like Google hadn’t rebuilt their index for something like 8 months if I recall correctly

I noticed this recently when I asked it whether I should play Indiana Jones on my PS5 or PC with a 9070 XT. It assumed I had made a typo until I clarified, then it went off to the internet and came back telling me what a sick rig I have.

OpenAI is the only SOTA model provider that doesn't have a cutoff date in the current year. That's why it performs badly at writing code for any new libraries, or libraries that have had significant updates, like Svelte.

State Of The Art is maybe a bit exaggerated. It's more like an early model that never really adapted, and only got watered down (smaller network, outdated information, and you cannot see thought/reasoning).

Also their models get dumber and dumber over time.


I'm not sure why we need to go off rumours; the knowledge cutoff for each OpenAI model is clearly listed in the table:

https://platform.openai.com/docs/models/compare?model=gpt-5....


I asked ChatGPT 5.1 to help me solve a silly installation issue with the codex command line tool (I’m not an npm user and the recommended installation method is some kludge using npm), and ChatGPT told me, with a straight face, that codex was discontinued and that I must have meant the “openai” command.

"with a straight face"

Anthropomorphizing non-human things is only human.

Stop anthropomorphizing non-human things. They don't like it.


Don’t forget SemiAnalysis’s founder Dylan Patel is supposedly roommates with Anthropic's RL tech lead Sholto...

The fundamental problem with bubbles like this is that you get people like this who are able to take advantage of the Gell-Mann amnesia effect, except the details that they’re wrong about are so niche that there’s a vanishingly small group of people who are qualified to call them out on it, and there’s simultaneously so much more attention on what they say, because investors and speculators are so desperate and anxious for new information.

I followed him on Twitter. He said some very interesting things, I thought. Then he started talking about the niche of ML/AI I work near, and he was completely wrong about it. I became enlightened.


Funny, had it tell me the same thing twice yesterday and that was _with_ thinking + search enabled on the request (it apparently refused to carry out the search, which it does once in every blue moon).

I didn't make this connection that the training data is that old, but that would indeed augur poorly.


Just a minor correction, but I think it's important because some comments here seem to be giving bad information: OpenAI's model site says that the knowledge cutoff for GPT-5 is Sept 30, 2024, https://platform.openai.com/docs/models/compare, which is later than the June 01, 2024 date of GPT-4.1.

Now I don't know if this means that OpenAI was able to add that 3 months of data to earlier models by tuning or if it was a "from scratch" pre-training run, but it has to be a substantial difference in the models.


What is a pre-training run?

Pre-training is just training; it got the name because most models have a post-training stage, so to differentiate, people call the first stage pre-training.

Pre-training: You train on a vast amount of data, as varied and high-quality as possible. This determines the distribution the model can operate with, so LLMs are usually trained on a curated dataset of the whole internet. The output of pre-training is usually called the base model.

Post-training: You narrow down the task by training on the specific model needs you want. You can do this through several ways:

- Supervised Finetuning (SFT): Training on a strict high quality dataset of the task you want. For example if you wanted a summarization model, you'd finetune the model on high quality text->summary pairs and the model would be able to summarize much better than the base model.

- Reinforcement Learning (RL): You train a separate reward model that ranks outputs, then use it to rate the model's outputs, then use that feedback to train the model.

- Direct Preference Optimization (DPO): You have pairs of good/bad generations and use them to align the model towards/away from the kinds of responses you want.

Post-training is what makes the models easy to use; the most common form is instruction tuning, which teaches the model to talk in turns, but post-training can be used for anything. E.g. if you want a translation model that always translates a certain way, or a model that knows how to use tools, etc., you'd achieve all that through post-training. Post-training is where most of the secret sauce in current models is nowadays.
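
To make the DPO bullet above concrete, the core of the objective is only a few lines. A minimal PyTorch sketch with made-up log-probability tensors, not any lab's actual training code:

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # How much more the policy prefers each response than a frozen reference model does.
        chosen_reward = beta * (logp_chosen - ref_logp_chosen)
        rejected_reward = beta * (logp_rejected - ref_logp_rejected)
        # Maximize the margin between chosen and rejected via a logistic loss.
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # Toy usage with made-up summed log-probs for one preference pair:
    loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                    torch.tensor([-13.0]), torch.tensor([-14.5]))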


Want to also add that the model doesn't know how to respond in a user -> assistant style conversation after its pretraining; it's a pure text predictor (look at the open-source base models).

There’s also what is being called mid-training where the model is trained on high(er) quality traces and acts as a bridge between pre and post training


just to go off of this there is also stochastic random overfit retraining process (SRORP). Idea behind SRORP is to avoid overfitting. SRORP will take data points from -any- aspect of the past process with replacment and create usually 3-9 bootstrap models randomly. The median is then taken from all model weights to wipe out outliers. This SRORP polishing -if done carefully- is usually good for a 3-4% gain in all benchmarks

If pre-training is just training, then how on earth can OpenAI not have "a successful pre-training run"? The word successful indicates that they tried, but failed.

It might be me misunderstanding how this works, but I assumed that the training phase was fairly reproducible. You might get different results on each run, due to changes in the input, but not massively so. If OpenAI can't continuously and reliably train new models, then they are even more overvalued than I previously assumed.


Because success for them doesn't mean it works, it means it works much better than what they currently have. If a 1% improvement comes at the cost of spending 10x more on training and 2x more on inference then you're failing at runs. (numbers out of ass)

That makes sense. It's not that the training didn't complete or returned a moronic model, but the capabilities have plateaued.

Maybe this has something to do with why they're declaring "code red".

- Reinforcement learning with verifiable rewards (RLVR): instead of using a grader model you use a domain that can be deterministically graded, such as math problems.

If you've an hour to spare this Karpathy video is good at explaining how it all works https://youtu.be/7xTGNNLPyMI

The first step in building a large language model. That's when the model is initiated and trained on a huge dataset to learn patterns and whatnot. The "P" in "GPT" stands for "pre-trained."

That’s where they take their big pile of data and train the model to do next-token-prediction.

Łukasz Kaiser basically confirmed it in a podcast:

https://youtu.be/3K-R4yVjJfU?si=JdVyYOlxUbEcvEEo&t=2624

> Q: Are the releases aligned with pre-training efforts?

> A: There used to be a time not that long ago, maybe half a year, distant past, where the models would align with RL runs or pretraining runs ... now the naming is by capability. GPT5 is a capable model; 5.1 is a more capable model


I wonder if the failures to pretrain are the result of our understanding of neural networks being more akin to alchemy rather than chemistry

I doubt it's that important that their dataset of current events is up to date. At this stage, I believe private and synthetic data comprises a large fraction of pretraining. Web search substitutes for current event pretraining.

I tried OpenAI models for coding in Go, but they constantly say "your syntax is not correct, let me rewrite your whole file without `any`". `any` was introduced in 2022. It takes some time to adopt it in codebases, but they should not be doing stuff like that at the end of 2025.

That would explain why it’s so bad with new Swift features and more recent ast-grep rules.

Usually current events get taught through mid-training, so even with old pre-training current events still could be added

It has no idea what its own knowledge cutoff is.

Knowledge cutoff date is usually part of the system prompt.

Helps you get useful answers like "I don't know that's too recent" when you ask questions like "who won the basketball game last night".


I opened up the Gemini app and asked:

Who are the Dallas Stars playing tonight? Is there anyone that will be out of the lineup for the Stars? What have their recent injuries been like? Can you get the current score?

It's response:

The Dallas Stars are playing the New York Rangers tonight, Tuesday, December 2, 2025, at Madison Square Garden.

Here is the information available regarding the rest of your questions:

Out of the Lineup & Recent Injuries

- Lian Bichsel (Defenseman): He suffered a lower-body injury (ankle) in the recent game against the Senators and is expected to miss 6-8 weeks and require surgery. His absence will be felt on the blueline.

- Jake Oettinger (Goaltender): He is dealing with an illness and is not expected to start tonight; Casey DeSmith is projected to start in his place. Oettinger is expected to play tomorrow night in New Jersey.

- Thomas Harley (Defenseman): He remains week-to-week with a lower-body injury and is not with the team on their current road trip.

Other injured players who will likely be out of the lineup include Matt Duchene (Upper Body - on IR-LT, but joining the road trip), Adam Erne (Lower Body - on IR-LT), and Nils Lundkvist (Lower Body - on IR-LT).

Current Score: The game is in progress, and the score is currently New York Rangers 1 - 1 Dallas Stars.

Would you like an update on the final score later, or news on a specific player's status?

I ask it things like updates on teams all the time, and it usually pulls the right information.


The consumer apps use RAG and traditional search to give the LLM recent information in the prompt when it answers your query. This basically bridges over the knowledge gap between the end of training and today.
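
The pattern is roughly the following sketch, where web_search() and llm_complete() are hypothetical stand-ins for whatever search API and model client the app actually uses:

    def answer_with_search(question: str) -> str:
        # 1. Run a normal web search for the user's question (hypothetical helper).
        results = web_search(question, num_results=5)
        # 2. Paste the results into the prompt so the model sees post-cutoff facts.
        context = "\n\n".join(f"[{r.title}] {r.snippet}" for r in results)
        prompt = (
            "Answer the question using only the search results below.\n\n"
            f"Search results:\n{context}\n\nQuestion: {question}"
        )
        # 3. The model answers from the supplied context, not its training data.
        return llm_complete(prompt)  # hypothetical LLM call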

I'm fully aware, I'm just wanting to point out to people that the actual AI apps they'll use can and do return recent information due to integrations like that. Lots of people think AI can only answer stuff in its training set, but it can answer anything from whatever data you hand to it, including any data on the internet.

Lots of AI tools can easily answer "who won the basketball game last night".


GCP is sort of blending this into their Gemini APIs.

https://ai.google.dev/gemini-api/docs/google-search

You don’t have to do RAG or use a serp scraper MCP; just add the Search Grounding tool to the API request and it does the rest, at the model’s discretion, for $0.014 / search.

I think that’s generally a fair price for my time vs doing my own search queries at 1/100th the speed. It could get expensive for deep research type queries.
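
Per the linked docs, the request looks roughly like this with the google-genai Python SDK; treat the exact class and model names as assumptions and check the docs page above for the current API:

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="Who won the basketball game last night?",
        config=types.GenerateContentConfig(
            # The model decides when to call Google Search and grounds its answer in the results.
            tools=[types.Tool(google_search=types.GoogleSearch())],
        ),
    )
    print(response.text)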


The pricing may be reasonable, but Grounding with Google Search has extremely restrictive terms of use.

https://ai.google.dev/gemini-api/terms#grounding-with-google...


yup, I looked at GwGS and it's not interesting. I want actual results, not a summary and some links to sites (not pages) that I then have to go crawl.

Don't want Google as the middleman between the internet and my agent. Their search has become so bad I don't even use it any more


Any data after that is contaminated with vast amount of AI slop. Is anyone training on anything newer..?

Both Claude 4.x and Gemini 3 go up to early 2025. Still not that recent, but not as bad as GPT.

Interestingly enough, today I found that https://huggingface.co/openai/gpt-oss-20b, when downloaded and run locally, gave this answer when asked:

====

when have you been built?

I’m a GPT‑4‑based model that OpenAI released on March 14 2023. The underlying training data goes up to early 2023, and the model was trained in the months leading up to that release. If you’re asking about the ChatGPT product that ships the model to users, it went live in November 2022 and has since received updates (GPT‑3.5, GPT‑4, etc.) that keep it current.

====

But when supposedly running it from https://huggingface.co/chat/models/openai/gpt-oss-20b:

====

when have you been built?

I’m a language model created by OpenAI. The current generation (GPT‑4) that powers this chat was first released in March 2023 and has been updated and fine‑tuned up through the end of 2024. My training data runs up to the beginning of June 2025, so I’m built on knowledge available up to that point.

====

And that makes me think that although https://huggingface.co/chat claims to be using the models available to the public at https://huggingface.co , it doesn't seem to be true, and I raised this question here https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/discussions... , https://github.com/huggingface/inference-playground/issues/1... and https://github.com/ggml-org/llama.cpp/discussions/15396#disc... .



