Do you not run into too many false positives around "ah, this thing you used here is known to be tricky, the issue is..."
I've seen that happen when prompting it to look specifically for concurrency issues, vs. saying something more open-ended like "please inspect this rigorously to look for potential issues..."
What's more useful is to have it attempt to not only find such bugs but prove them with a regression test. In Rust, for concurrency that means writing e.g. Shuttle or Loom tests.
It would be generally good if most code made setting up such tests as easy as possible, but in most corporate codebases this second step is gonna require a huge amount of refactoring or boilerplate crap to get the things interacting in the test env in an accurate, well-controlled way. You can quickly end up fighting to understand "is the bug not actually there, or is the attempt to repro it not working correctly?"
(Which isn't to say don't do it: I think this is a huge benefit you can gain from being able to refactor more quickly. Just to say that you're gonna give yourself a lot more short-term homework to make sure you don't fix things that aren't bugs, or break other things in your quest to make them more provable/testable.)
yes but i can identify those easily. i know that if it flags something that is obviously a non-issue, i can discard it.
...because false positives are good errors. false negatives are what i'm worried about.
i feel massively more sure that something has no big oversights if multiple runs (or even multiple different models) cannot find anything but false positives
Just in case you didn't read the full article, this is how they describe finding the bugs in the Linux kernel as well.
Since it's a large codebase, they go even more specific and hint that the bug is in file A, then try again with a hint that the bug is in file B, and so on.
very interesting. i think "verbal biasing" and "knowing how to speak" in general is a really important thing with LLMs. it seems to massively affect output. (interestingly, somewhat less with Opus than with GPT-5.4 and Composer 2. Opus seems to intuit a little better. but still important.)
it's like the idea behind the book _The Mom Test_ suddenly got very important for programming
As a meta activity, I like to run different codebases through the same bug-hunt prompt and compare the number of bugs found as a barometer of quality.
I was very impressed when the top three AIs all failed to find anything other than minor stylistic nitpicks in a huge blob of what to me looked like “spaghetti code” in LLVM.
Meanwhile at $dayjob the AI reviews all start with “This looks like someone’s failed attempt at…”
I've been pretty satisfied using oh-my-openagent (omo) on opencode with both opus-4.6 and gpt-5.4 lately.
The author of omo suggests different prompting strategies for different models and goes into some detail here:
https://github.com/code-yeongyu/oh-my-openagent/blob/dev/doc...
For each agent they define, they tailor the prompt to whichever model is being used.
I wonder how much of the "x did worse than y for the same prompt" tests could be improved if the prompts were actually tailored to what the model is good at.
I also wonder if any of this matters or if it's all a crock of bologna...
i think it may matter a good bit. i definitely have to write in different styles with different models (and catch myself doing so unintentionally) now that you mention it...
Fwiw I run this eval every week on a set of known prompts and I believe the in-group differences are bigger than the out-group ones.
That is, I get more variance between Opus 4.6 and itself than I do between the SOTA models.
I don't have the budget for statistical significance but I'm convinced people claiming broad differences are just vibing, or there are times when agent features make a big difference.
Not a real solution but you could try using AquaVoice for dictation. It can gather screen context so you just say the function name out loud and it capitalizes and spells everything correctly. (Even hard cases!)
Try Mercury by Inception. It's available as autocomplete in Zed. Last time I tried it, Zed had an API key hidden in their docs that allowed you to use it for free
The crazy thing is that it's a diffusion-based LLM. That makes it very fast, like Cursor Tab, and the outputs seem very accurate in my limited testing (although I find Cursor Tab to still feel "like 10% better")
---
That said, you should really give agentic coding a la Claude Code a try. It's gotten incredibly good. I still need to check the outputs of course, but after using it for 2-3 days, I've learned to "think" about how to tackle a problem with it, much like I had to when first picking up programming.
Once I did, suddenly it didn't feel risky and weird anymore, because it's doing what I would've done manually anyways. Step by step. It might not be as blackboxy as you think it is.
why should only profits matter? if i had a killer product today that i just need to sell tomorrow, wouldn't you still invest today knowing i'll probably only start to make money tomorrow (or perhaps next week)?
the expectation is that they'll eventually make money. they can't raise forever. it's only startups that go unprofitable for a few years; most companies that have existed for a long while have been profitable
and since they're expected to make a LOT of money, everyone wants a piece of that future pie, pushing up the valuation and amount raised to admittedly somewhat delusional levels like here
It's well known everyone is making great money on inference. The cost is training.
"Whether GPT-5 was profitable to run depends on which profit margin you're talking about. If we subtract the cost of compute from revenue to calculate the gross margin (on an accounting basis), it seems to be about 30% — lower than the norm for software companies (where 60-80% is typical) but still higher than many industries."
(They go on to point out that there are other costs that might mean they didn't break even overall - although I suspect those costs should be partially amortized over the whole GPT-5.x series, not just 5.0.)
"Most of what we're building out at this point is the inference [...] We're profitable on inference. If we didn't pay for training, we'd be a very profitable company"
"There’s a bright spot, however. OpenAI has gotten more efficient at serving paying users: Its compute margin—the revenue left after subtracting the cost of running AI models for those customers—was roughly 70% in October, an increase from about 52% at the end of last year and roughly 35% in January 2024."
> It's well known everyone is making great money on inference.
That is not, in fact, "well known"; it's based entirely on the announcements of the inference providers themselves, who also get very cagey when asked to show their work and who at least look like they're soliciting a constant firehose of investment money simply to keep the lights on. In particular there's a troubling tendency to call revenue "recurring" before it actually, you know, recurs.
> based entirely on the announcements of the inference providers themselves who also get very cagey when asked to show their work
I mean sure, it's self-reported.
But the inference prices somewhere like Fireworks or TogetherAI charges are comparable to what Google/AWS/Azure charge for the same model, and we know they aren't losing money - they have public accounts that show it, e.g.:
> If someone has a subscription then yes that is pretty normal.
Not if you've substantively changed rate limits 3 times in the last 5 months while still counting those forecast revenues. In most industries that's called rug-pulling.
It doesn't matter what you call it. A recurring subscription on the books is a recurring subscription. Yes, you can cancel anytime (how generous of them); it also doesn't matter.
And why do you think twenty competitors can stay competitive for years to come?
Industries always consolidate and winners emerge. SOTA LLMs look like a natural monopoly or duopoly to me because the cost to train the next model keeps going up such that it won't make sense for 20 competitors to compete at the very high end.
TSMC is a perfect example of this. Fab costs double every 4 years (Rock's Law). It's almost impossible to compete against TSMC because no one has the customer base to generate enough revenue to build the next generation of fabs - except those propped up by governments, such as Intel and Rapidus. Samsung is basically the South Korean government.
I don't see how companies can catch OpenAI or Anthropic without similarly strong revenue growth.
Google has already surpassed them both in all areas except coding. People on HN only look at benchmarks, but Gemini's multimodal understanding - things like identifying what a plant is - normal user use cases (other than chatting), and integration with other tools are much better.
It's believable that Meta, ByteDance, etc. can catch up too. It is not certain that scaling will meaningfully increase performance indefinitely, and if it stops soon, they surely will catch up. Furthermore, other market conditions (US political instability) could enable even more labs, like Mistral, to serve as compelling alternatives.
Uber, TSMC, etc. have strong moats in the form of physical goods and factories. LLMs have nothing even remotely comparable. The main moat is in knowledge, which is easy to transfer between labs. Do you think all the money that goes into training a model goes into the actual final training run? No, it is mostly experiments and failed ideas, which do not have to be repeated by future labs and offshoots.
> Industries always consolidate and winners emerge.
no, most industries just sell boring generic products; a few industries favor monopolists. Semiconductors are one of them, but LLMs are as far removed from that business as is physically possible.
TSMC operates the most complicated machines humans have ever built; an LLM requires a few dozen nerds, a power plant, a few thousand lines of python, and chips. That's why, if you're Elon Musk, you can buy all of the above and train yourself an LLM in a month.
LLMs are comically simple pieces of software, they're just big. But anyone with a billion dollars can have one; they're all going to be commoditized and free in due time, like search. Copying a lithography machine is difficult, copying software is easy. That's why Google burrowed itself into email, and browsers, and your phone's OS. The problem for OpenAI is they don't have any of that; there are already half a dozen companies that, for 99% of people, do what they do.
The barrier to replicating TSMC isn't just cost, it's supply chain, geopolitics, and talent.
Only one company on Earth can make the EUV lithography machines TSMC buys for their highest-end fabs, and they're not selling to anyone else.
The PRC tried to brute force this supply chain backed by the full might of the Party's blank check, all red tape cut, literally the best possible duplication scenario, and they failed.
They will succeed eventually since they have proof it’s possible and their plans span decades. I expect them to have working EUV in 10 years. Whether it’ll still be bleeding edge tech is a different question I dare not guess the answer to.
I would not be surprised at all if it's vibe coded. I have seen exactly the same thing myself.
I gave instructions to Claude to add a toggle button to a website, where the value needs to be stored in local storage.
It is a very straightforward change. Just follow exactly how it is done for a different boolean setting and you are set. An intern can do that on the first day of their job.
Everything was done properly except that, on page load, the stored setting was not read.
Which can be easily discovered if the author, with or without AI tools, has a test or manually goes through the entire workflow just once. I discovered the problem myself and fixed it.
Setting all of that aside -- even if this is not AI-coded, at the least it shows the site owner doesn't care enough about its visitors to go through this important workflow and check that everything works properly.
And who cares if it's vibe-coded or not? Since when do we care more about the how than the what? Are people looking at how a tool was coded before using it, as if that would increase confidence?
if they really want me to use this lang for everything, they'd have to 1. massively improve compilation speed, 2. get the ecosystem going (what's the correct way to spin up an http server like with express?) and 3. get rid of roughly 150 of the 200 keywords there are
especially w.r.t. the last one, of course everyone frets at huge breaking changes like this, so it won't happen, so people won't use it
> 3. get rid of roughly 150 of the 200 keywords there are
I don't understand this point. Could you explain?
The new keywords enable new language features (ex: async/await, any, actor), and these features are opt-in. If you don't want to use them, you don't have to.
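To illustrate with a minimal sketch (the Counter type here is just a made-up example, not from Swift's docs): the opt-in actor keyword gives you compiler-checked serialization of mutable state that would otherwise take manual queue or lock boilerplate, and await marks where callers can suspend.

    // `actor` makes the compiler serialize all access to `value`,
    // so no manual locking is needed.
    actor Counter {
        private var value = 0

        func increment() -> Int {
            value += 1
            return value
        }
    }

    func demo() async {
        let counter = Counter()
        // cross-actor calls are potential suspension points, hence `await`
        let n = await counter.increment()
        print(n) // prints 1
    }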
What are the keywords you think should be removed?
> these features are opt-in. If you don't want to use them, you don't have to.
Using a language is more than just writing it with a pre-established knowledge of what subset of features you think is worth the tradeoffs. More keywords/features means when you try to figure out how to do something new, there may be 15 different ways and you need to analyze and figure out which is the best one for this scenario, which ones are nonstarters, etc.
That was more or less the whole design goal of Go. It was made by C++ programmers who were fed up with how many features were in the language, so they kept the feature set limited. Even the formatting is decided by the language. You may not agree with every decision, but what matters is that decisions were made and they're standardized, so everyone is on the same page. You can read anyone else's code, and you know exactly what's going on.
besides it being almost impossible to understand what "the right way of doing stuff" is with Swift (or any bloated language), i absolutely _do_ have to use the keywords.
reading someone else's code is part of working with the language (as is understanding LLM output nowadays). i can't just make others not use the keywords i don't know/need/like. especially if working within teams, or using OSS.
Focusing on the keywords rather than the macros, I think the rest of them have legitimate use cases, though they're often misused, especially fileprivate.
this is gonna sound ranty, but it's straight from the heart:
i think most of them are pointless. not every feature needs to be a new keyword. stuff could be expressed within the language. if the language is so inflexible in that regard that it's impossible to express stuff without a keyword, use macros for god's sake.
why is there a need to have a "convenience init" declaration?
why is "didSet" a keyword?
what about "actor"? most other languages don't have nearly as many keywords and manage to express the idea of actors just fine!
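for context, here's roughly what didSet does - it's a property observer that fires after every assignment to a stored property (a minimal sketch; the Settings type is just a made-up example):

    struct Settings {
        var volume = 50 {
            didSet {
                // runs after each assignment; `oldValue` is provided implicitly
                print("volume changed from \(oldValue) to \(volume)")
            }
        }
    }

    var s = Settings()
    s.volume = 80 // prints: volume changed from 50 to 80

other languages express the same idea as a library feature rather than a keyword - Kotlin's Delegates.observable, for example - which is exactly the point.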