Do you not run into too many false positives around "ah, this thing you used here is known to be tricky, the issue is..."
I've seen that happen when prompting it to look specifically for concurrency issues, vs. saying something more open-ended like "please inspect this rigorously to look for potential issues..."
What's more useful is to have it attempt to not only find such bugs but prove them with a regression test. In Rust, for concurrency that means writing e.g. Shuttle or Loom tests.
It would be generally good if most code made setting up such tests as easy as possible, but in most corporate codebases this second step is gonna require a huge amount of refactoring or boilerplate crap to get the things interacting in the test env in an accurate, well-controlled way. You can quickly end up fighting to understand "is the bug not actually there, or is the attempt to repro it not working correctly?"
(Which isn't to say don't do it: I think this is a huge benefit you can gain from being able to refactor more quickly. Just to say that you're gonna give yourself a lot more short-term homework to make sure you don't fix things that aren't bugs, or break other things in your quest to make them more provable/testable.)
yes but i can identify those easily. i know that if it flags something that is obviously a non-issue, i can discard it.
...because false positives are good errors. false negatives are what i'm worried about.
i feel massively more sure that something has no big oversights if multiple runs (or even multiple different models) cannot find anything but false positives
Just in case you didn't read the full article, this is how they describe finding the bugs in the Linux kernel as well.
Since it's a large codebase, they go even more specific and hint that the bug is in file A, then try again with a hint that the bug is in file B, and so on.
very interesting. i think "verbal biasing" and "knowing how to speak" in general is a really important thing with LLMs. it seems to massively affect output. (interestingly, somewhat less with Opus than with GPT-5.4 and Composer 2. Opus seems to intuit a little better. but still important.)
it's like the idea behind the book _The Mom Test_ suddenly got very important for programming
As a meta activity, I like to run different codebases through the same bug-hunt prompt and compare the number of bugs found as a barometer of quality.
I was very impressed when the top three AIs all failed to find anything other than minor stylistic nitpicks in a huge blob of what to me looked like “spaghetti code” in LLVM.
Meanwhile at $dayjob the AI reviews all start with “This looks like someone’s failed attempt at…”
I've been pretty satisfied using oh-my-openagent (omo) on opencode with both opus-4.6 and gpt-5.4 lately.
The author of omo suggests different prompting strategies for different models and goes into some detail here:
https://github.com/code-yeongyu/oh-my-openagent/blob/dev/doc...
For each agent they define, they tailor the prompt to whichever model is being used.
I wonder how much of the "x did worse than y for the same prompt" tests could be improved if the prompts were actually tailored to what the model is good at.
I also wonder if any of this matters or if it's all a crock of bologna...
i think it may matter a good bit. i definitely have to write in different styles with different models (and catch myself doing so unintentionally) now that you mention it...
Fwiw I run this eval every week on a set of known prompts and I believe the in-group differences are bigger than the out-group ones.
That is, I get more variance between Opus 4.6 and itself than I do between the SOTA models.
I don't have the budget for statistical significance but I'm convinced people claiming broad differences are just vibing, or there are times when agent features make a big difference.
Not a real solution but you could try using AquaVoice for dictation. It can gather screen context so you just say the function name out loud and it capitalizes and spells everything correctly. (Even hard cases!)
Try Mercury by Inception. It's available as autocomplete in Zed. Last time I tried it, Zed had an API key hidden in their docs that allowed you to use it for free
The crazy thing is that it's a diffusion-based LLM. That makes it very fast, like Cursor Tab, and the outputs seem very accurate in my limited testing (although I find Cursor Tab to still feel "like 10% better")
---
That said, you should really give agentic coding a la Claude Code a try. It's gotten incredibly good. I still need to check the outputs of course, but after using it for 2-3 days, I've learned to "think" about how to tackle a problem with it, much like I had to when first picking up programming.
Once I did, suddenly it didn't feel risky and weird anymore, because it's doing what I would've done manually anyways. Step by step. It might not be as blackboxy as you think it is.
why should only profits matter? if i had a killer product today that i just need to sell tomorrow, wouldn't you still invest today knowing i'll probably only start to make money tomorrow (or perhaps next week)?
the expectation is that they'll eventually make money. they can't raise forever. it's only startups that go unprofitable for a few years; most companies that have existed for a long while have been profitable
and since they're expected to make a LOT of money, everyone wants a piece of that future pie, pushing up the valuation and amount raised to admittedly somewhat delusional levels like here
It's well known everyone is making great money on inference. The cost is training.
"Whether GPT-5 was profitable to run depends on which profit margin you're talking about. If we subtract the cost of compute from revenue to calculate the gross margin (on an accounting basis), it seems to be about 30% — lower than the norm for software companies (where 60-80% is typical) but still higher than many industries."
(They go on to point out that there are other costs that might mean they didn't break even overall - although I suspect those costs should be partially amortized over the whole GPT-5.x series, not just 5.0.)
"Most of what we're building out at this point is the inference [...] We're profitable on inference. If we didn't pay for training, we'd be a very profitable company"
"There’s a bright spot, however. OpenAI has gotten more efficient at serving paying users: Its compute margin—the revenue left after subtracting the cost of running AI models for those customers—was roughly 70% in October, an increase from about 52% at the end of last year and roughly 35% in January 2024."
> It's well known everyone is making great money on inference.
That is not, in fact, "well known"; it's based entirely on the announcements of the inference providers themselves, who also get very cagey when asked to show their work and who at least look like they're soliciting a constant firehose of investment money simply to keep the lights on. In particular there's a troubling tendency to call revenue "recurring" before it actually, you know, recurs.
> based entirely on the announcements of the inference providers themselves who also get very cagey when asked to show their work
I mean sure, it's self-reported.
But the inference prices somewhere like Fireworks or TogetherAI charges are comparable to what Google/AWS/Azure charge for the same model, and we know they aren't losing money - they have public accounts that show it, e.g.:
> If someone has a subscription then yes that is pretty normal.
Not if you've substantively changed rate limits 3 times in the last 5 months while still counting those forecast revenues. In most industries that's called rug-pulling.
It doesn't matter what you call it. A recurring subscription on the books is a recurring subscription. Yes, you can cancel anytime (how generous of them); it also doesn't matter.
And why do you think twenty competitors can stay competitive for years to come?
Industries always consolidate and winners emerge. SOTA LLMs look like a natural monopoly or duopoly to me because the cost to train the next model keeps going up such that it won't make sense for 20 competitors to compete at the very high end.
TSMC is a perfect example of this. Fab costs double every 4 years (Rock's Law). It's almost impossible to compete against TSMC because no one has the customer base to generate enough revenue to build the next generation of fabs - except those propped up by governments, such as Intel and Rapidus. Samsung is basically the South Korean government.
I don't see how companies can catch OpenAI or Anthropic without similarly strong revenue growth.
Google has already surpassed them both in all areas except coding. People on HN only look at benchmarks, but Gemini's multimodal understanding - things like identifying what a plant is - normal user use cases (other than chatting), and integration with other tools are much better.
It's believable that Meta, ByteDance, etc. can catch up too. It is not certain that scaling will meaningfully increase performance indefinitely, and if it stops soon, they surely will catch up. Furthermore, other market conditions (US political instability) could enable even more labs, like Mistral, to serve as compelling alternatives.
Uber, TSMC, etc. have strong moats in the form of physical goods and factories. LLMs have nothing even remotely comparable. The main moat is in knowledge, which is easy to transfer between labs. Do you think all the money that goes into training a model goes into the actual final training run? No, it is mostly experiments and failed ideas, which do not have to be repeated by future labs and offshoots.
> Industries always consolidate and winners emerge.
no, most industries just sell boring generic products; a few industries favor monopolists. Semiconductors are one of them, but LLMs are as far removed from that business as is physically possible.
TSMC operates the most complicated machines humans have ever built; an LLM requires a few dozen nerds, a power plant, a few thousand lines of python, and chips. That's why, if you're Elon Musk, you can buy all of the above and train yourself an LLM in a month.
LLMs are comically simple pieces of software, they're just big. But anyone with a billion dollars can have one; they're all going to be commoditized and free in due time, like search. Copying a lithography machine is difficult, copying software is easy. That's why Google burrowed itself into email, and browsers, and your phone's OS. The problem for OpenAI is they don't have any of that; there are already half a dozen companies that, for 99% of people, do what they do.
The barrier to replicating TSMC isn't just cost, it's supply chain, geopolitics, and talent.
Only one company on Earth can make the EUV lithography machines TSMC buys for their highest-end fabs, and they're not selling to anyone else.
The PRC tried to brute force this supply chain backed by the full might of the Party's blank check, all red tape cut, literally the best possible duplication scenario, and they failed.
They will succeed eventually since they have proof it’s possible and their plans span decades. I expect them to have working EUV in 10 years. Whether it’ll still be bleeding edge tech is a different question I dare not guess the answer to.
I would not be surprised at all if it's vibe coded. I have seen exactly the same thing myself.
I gave instructions to Claude to add a toggle button to a website, where the value needs to be stored in local storage.
It is a very straightforward change. Just follow exactly how it is done for a different boolean setting and you are set. An intern can do that on the first day of their job.
Everything was done properly except that, on page load, the stored setting was not read.
Which can be easily discovered if the author, with or without AI tools, has a test or manually goes through the entire workflow just once. I discovered the problem myself and fixed it.
Setting all of that aside -- even if this is not AI-coded, at the least it shows the site owner doesn't care enough about its visitors to go through this important workflow and check that everything works properly.
And who cares if it's vibe-coded or not? Since when do we care more about the how than the what? Are people looking at how a tool was coded before using it, as if that would increase confidence?
if they really want me to use this lang for everything, they'd have to 1. massively improve compilation speed, 2. get the ecosystem going (what's the correct way to spin up an http server like with express?) and 3. get rid of roughly 150 of the 200 keywords there are
especially w.r.t. the last one, of course everyone frets at huge breaking changes like this, so it won't happen, so people won't use it
> 3. get rid of roughly 150 of the 200 keywords there are
I don't understand this point. Could you explain?
The new keywords enable new language features (ex: async/await, any, actor), and these features are opt-in. If you don't want to use them, you don't have to.
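To illustrate with a minimal sketch (the Counter type here is just a made-up example, not from Swift's docs): the opt-in actor keyword gives you compiler-checked serialization of mutable state that would otherwise take manual queue or lock boilerplate, and await marks where callers can suspend.

    // `actor` makes the compiler serialize all access to `value`,
    // so no manual locking is needed.
    actor Counter {
        private var value = 0

        func increment() -> Int {
            value += 1
            return value
        }
    }

    func demo() async {
        let counter = Counter()
        // cross-actor calls are potential suspension points, hence `await`
        let n = await counter.increment()
        print(n) // prints 1
    }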
What are the keywords you think should be removed?
> these features are opt-in. If you don't want to use them, you don't have to.
Using a language is more than just writing it with a pre-established knowledge of what subset of features you think is worth the tradeoffs. More keywords/features means when you try to figure out how to do something new, there may be 15 different ways and you need to analyze and figure out which is the best one for this scenario, which ones are nonstarters, etc.
That was more or less the whole design goal of Go. It was made by C++ programmers who were fed up with how many features were in the language, so they kept the feature set limited. Even the formatting is decided by the language. You may not agree with every decision, but what matters is that decisions were made and they're standardized, so everyone is on the same page. You can read anyone else's code, and you know exactly what's going on.
besides it being almost impossible to understand what "the right way of doing stuff" is with Swift (or any bloated language), i absolutely _do_ have to use the keywords.
reading someone else's code is part of working with the language (as is understanding LLM output nowadays). i can't just make others not use the keywords i don't know/need/like. especially if working within teams, or using OSS.
Focusing on the keywords rather than the macros, I think the rest of them have legitimate use cases, though they're often misused, especially fileprivate.
this is gonna sound ranty, but it's straight from the heart:
i think most of them are pointless. not every feature needs to be a new keyword. stuff could be expressed within the language. if the language is so inflexible in that regard that it's impossible to express stuff without a keyword, use macros for god's sake.
why is there a need to have a "convenience init" declaration?
why is "didSet" a keyword?
what about "actor"? most other languages don't have nearly as many keywords and manage to express the idea of actors just fine!
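for context, here's roughly what didSet does - it's a property observer that fires after every assignment to a stored property (a minimal sketch; the Settings type is just a made-up example):

    struct Settings {
        var volume = 50 {
            didSet {
                // runs after each assignment; `oldValue` is provided implicitly
                print("volume changed from \(oldValue) to \(volume)")
            }
        }
    }

    var s = Settings()
    s.volume = 80 // prints: volume changed from 50 to 80

other languages express the same idea as a library feature rather than a keyword - Kotlin's Delegates.observable, for example - which is exactly the point.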