
I agree with the sentiment, but these models aren't suited for that. You can run much bigger models on prem with ~$100k of hardware, and those can actually be useful in real-world tasks. These small models are fun to play with, but are nowhere close to solving the needs of a dev shop working in healthcare or banking, sadly.

> Close enough

No. These are nowhere near SotA, no matter what number goes up on a benchmark. They are amazing for what they are (runnable on regular PCs), and you can find use cases for them (where privacy >> speed / accuracy) where they perform "good enough", but they are not magic. They have limitations, and you need to adapt your workflows to handle them.


Can you share more about what adaptations you made when using smaller models?

I'm just starting my exploration of these small models for coding on my 16GB machine (yeah, puny...) and am running into issues where the solution may very well be to reduce the scope of the problem set so the smaller model can handle it.


You'd do most of the planning/cognition yourself, down to the module/method signature level, and then have it loop through the plan to "fill in the code". Need a strong testing harness to loop effectively.
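Roughly, the loop can look like the sketch below: you write the plan and the signatures, the model only fills in the bodies, and the test harness decides when to stop. This assumes a local OpenAI-compatible server (llama.cpp, Ollama and vLLM all expose one) and pytest as the harness; the model name and helper functions are just placeholders.

    import subprocess
    from openai import OpenAI

    # Local OpenAI-compatible endpoint (llama.cpp / Ollama / vLLM); names are placeholders.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    def fill_in_code(plan: str, source: str, test_output: str) -> str:
        # The plan and signatures are written by you; the model only fills in bodies.
        resp = client.chat.completions.create(
            model="local-small-model",  # placeholder name
            messages=[
                {"role": "system", "content": "Fill in the function bodies. Do not change the signatures or the plan."},
                {"role": "user", "content": f"Plan:\n{plan}\n\nCurrent code:\n{source}\n\nFailing tests:\n{test_output}"},
            ],
        )
        return resp.choices[0].message.content

    def run_tests() -> tuple[bool, str]:
        # The hand-written test suite is the harness; the model never edits it.
        proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def loop(plan: str, source_path: str, max_iters: int = 5) -> bool:
        for _ in range(max_iters):
            ok, output = run_tests()
            if ok:
                return True
            with open(source_path) as f:
                source = f.read()
            with open(source_path, "w") as f:
                f.write(fill_in_code(plan, source, output))
        return False

The key point is that the stopping condition lives in the tests you wrote, not in the model's own judgment.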

It is very unlikely that general claims about a model are useful; only very specific claims are, i.e. claims that state the exact number of parameters and the quantization method of each model being compared.

If you perform the inference locally, there is a huge space of trade-offs between inference speed and the quality of the results.

Most open weights models are available in a variety of sizes. Thus you can choose anywhere from very small models with a little more than 1B parameters to very big models with over 750B parameters.

For a given model, you can choose to run it at its native precision, which is normally BF16, or at a great variety of smaller quantized precisions, in order to fit the model in less memory or simply to reduce the time spent accessing that memory.

Therefore, if you choose big models without quantization, you may obtain results very close to SOTA proprietary models.

If you choose models so small and so quantized as to run in the memory of a consumer GPU, then it is normal to get results much worse than with a SOTA model that is run on datacenter hardware.

Choosing to run models that do not fit inside GPU memory reduces the inference speed a lot, and choosing models that do not fit even in system RAM reduces the inference speed even more.
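As a rough rule of thumb you can estimate whether a given size/quantization combination fits before downloading anything; the ~20% overhead factor below is just an assumption to cover KV cache, activations and runtime buffers.

    # Back-of-the-envelope memory estimate: parameters (in billions) * bytes-per-weight,
    # plus an assumed ~20% overhead for KV cache, activations and buffers.
    BYTES_PER_WEIGHT = {"bf16": 2.0, "q8": 1.0, "q4": 0.5}

    def fits(params_b: float, quant: str, mem_gb: float, overhead: float = 1.2) -> bool:
        needed_gb = params_b * BYTES_PER_WEIGHT[quant] * overhead
        print(f"{params_b}B @ {quant}: ~{needed_gb:.0f} GB needed, {mem_gb} GB available")
        return needed_gb <= mem_gb

    fits(70, "bf16", 24)   # ~168 GB -> nowhere near one 24 GB consumer GPU
    fits(70, "q4", 48)     # ~42 GB  -> fits across two 24 GB cards
    fits(7, "q4", 16)      # ~4 GB   -> comfortable even on a 16 GB machine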

Nevertheless, slow inference that produces better results may reduce the overall time for completing a project, so one should do a lot of experiments to determine an appropriate compromise.

When you use your own hardware, you do not have to worry about token cost or subscription limits, which may change the optimal strategy for using a coding assistant. Moreover, it is likely that in many cases it may be worthwhile to use multiple open-weights models for the same task, in order to choose the best solution.
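A minimal sketch of that multi-model approach, assuming the models are served behind a local OpenAI-compatible endpoint; the model names are placeholders for whatever you happen to host:

    from openai import OpenAI

    # One local endpoint serving several open-weights models (names are placeholders).
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
    MODELS = ["model-a-70b-q4", "model-b-32b-q8", "model-c-14b-bf16"]

    def ask_all(task: str) -> dict[str, str]:
        # Fan the same task out to every model and keep all answers for review.
        answers = {}
        for model in MODELS:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": task}],
            )
            answers[model] = resp.choices[0].message.content
        return answers

    for model, answer in ask_all("Review this function for bugs:\n...").items():
        print(f"=== {model} ===\n{answer}\n")

Since token cost is no longer the constraint, running three models and picking the best answer mostly costs wall-clock time.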

For example, when comparing older open-weights models with Mythos, with appropriate prompts all the bugs that Mythos could find could also be found by the older models. The difference was that Mythos found all the bugs on its own, while with the free models you had to run several of them in order to find all the bugs, because each model had different strengths and weaknesses.

(In other HN threads there have been some bogus claims that Mythos was somehow much smarter, but that does not appear to be true. The other company has provided the precise prompts used to find the bugs, and it would not have been too difficult to generate them automatically with a harness. Anthropic has also admitted that the bugs found by Mythos were not found with a prompt like "find the bugs", but by running Mythos many times on each file with increasingly specific prompts, until the final run only asked for confirmation of the bug rather than searching for it. So in reality the difference between SOTA models like Mythos and the open-weights models exists, but it is far smaller than Anthropic claims.)


> Anthropic has also admitted that the bugs found by Mythos were not found with a prompt like "find the bugs", but by running Mythos many times on each file with increasingly specific prompts, until the final run only asked for confirmation of the bug rather than searching for it.

Unless there's been more information since their original post (https://red.anthropic.com/2026/mythos-preview/), this is a misleading description of the scaffold. The process was:

- provide a container with running software and its source code

- prompt Mythos to prioritize source files based on the likelihood they contain vulnerabilities

- use this prioritization to prompt parallel agents to look for and verify vulnerabilities, focusing on but not limited to a single seed file

- as a final validation step, have another instance evaluate the validity and interestingness of the resulting bug reports

This amounts to at most three invocations of the model for each file, once for prioritization, once for the main vulnerability run, and once for the final check. The prompts only became more specific as a result of information the model itself produced, not any external process injecting additional information.
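For concreteness, a sketch of that three-step flow (prioritize, hunt, validate) is below. This is only an illustration of the process as described in the post, not Anthropic's actual scaffold; the prompts, model name and helper functions are made up.

    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    client = OpenAI()  # placeholder client for the model under test

    def call(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="model-under-test",  # placeholder
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def prioritize(files: list[str]) -> list[str]:
        # Invocation 1: rank source files by likelihood of containing vulnerabilities.
        ranking = call("Rank these files, most likely to contain a vulnerability first:\n" + "\n".join(files))
        return sorted(files, key=lambda f: ranking.find(f) if f in ranking else len(ranking))

    def hunt(seed_file: str) -> str:
        # Invocation 2: one agent per seed file, free to look beyond it.
        return call(f"Look for and verify vulnerabilities, focusing on but not limited to {seed_file}.")

    def validate(report: str) -> str:
        # Invocation 3: a separate instance judges validity and interestingness.
        return call("Is this bug report valid and interesting? Explain briefly.\n" + report)

    def run(files: list[str]) -> list[str]:
        ranked = prioritize(files)
        with ThreadPoolExecutor() as pool:
            reports = list(pool.map(hunt, ranked))
        return [validate(r) for r in reports]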


> If a robot unloading your dishwasher breaks one of your dishes once, this is a massive failure.

That's a bit exaggerated, no? Early Roombas would get tangled in socks, drag pet poop all over the floor, break glass stuff and so on, and yet the market accepted that, evolved, and now we have plenty of cleaning robots from various companies, including cheap spying ones from China.

I actually think that there's a lot of value in being the first to deploy bots into homes, even if they aren't perfect. The amount of data you'd collect is invaluable, and by the looks of it, can't be synth generated in a lab.

I think the "safer" option is still the "bring them to factories first, offices next and homes last", but anyway I'm sure someone will jump straight to home deployments.


> to fail / go bankrupt / flounder

This is exactly what "the Internet" said about SpaceX when they announced Starlink. Oh, it never worked before. LEO constellations were tried in the 90s, ALL of them failed. Haha, it will never work. 14k satellites, that's insane, dreams, lies, hahaha.

... and yet, they are now at ~10k satellites launched, and are serving 9+mil customers, for some unknown billions/year in revenue (should become clear in a few months when they IPO).


> for some unknown billions/year in revenue

Read here on HN yesterday: they're at $20B in revenue, but xAI is a drag.


Some explorations with an AI overlord also appear in Le Guin's "The Dispossessed"

Hard to talk about favorites in books, but there was a solid decade of my life where I'd have probably said this was my favorite sci fi book. Highly recommend to anyone reading this.

The move towards "trusted partners" also acts as a way to protect from distillation.

> can squeeze more performance out of a model with rather humble resources vs a frontier lab.

That's the idea behind distillation. They are fine-tuning it on traces produced by Opus. This is poor man's distillation (and the least efficient kind), and it still works unreasonably well for what it costs.
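A minimal sketch of what that looks like in practice, i.e. plain supervised fine-tuning of a small model on (prompt, teacher response) traces. The model and dataset names are placeholders, not any lab's actual recipe.

    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "Poor man's distillation": fine-tune a small student on traces from a stronger teacher.
    tokenizer = AutoTokenizer.from_pretrained("small-student-model")  # placeholder
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("small-student-model", torch_dtype=torch.bfloat16)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    traces = [  # collected earlier by prompting the teacher model
        {"prompt": "Write a function that ...", "teacher_response": "def solve(...): ..."},
        # ...
    ]

    def collate(batch):
        texts = [ex["prompt"] + "\n" + ex["teacher_response"] for ex in batch]
        enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        # Standard causal-LM loss on the teacher's trace; ignore padding positions.
        enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
        return enc

    model.train()
    for batch in DataLoader(traces, batch_size=2, shuffle=True, collate_fn=collate):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In practice you would also mask the prompt tokens out of the loss, but that is the whole trick: the student simply imitates the teacher's outputs.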


> what we're currently capable of.

What we're capable of != what we're doing / not doing because of political will. We are technically capable of reaching significant fractions of c with tech from the 1960s. We'll never do that because there's no will to do it, but the tech is there.

Same for self replicating stuff. We could build self-contained factories that build stuff from raw minerals, but we'll likely not do it until there's a will for it. Or need for it.


Political will is part of the Fermi paradox. So are technical reliability and cultural stability.

The idea that you can just build a thing and send out a swarm and (slow) boom - you've colonised the galaxy, and all the adjacent galaxies - is hopelessly naive. To the point I'd call it stupid and silly.

Let's say you have a replicator thing that works. You send them out in swarms.

And then what? Some die, some miss, some are destroyed by accidents.

Some work.

But "a replicator landed and made some more" is not colonisation. Colonisation implies there's some kind of to-and-fro traffic, maybe trade, some kind of information exchange at a minimum.

And that implies the source civilisation has political, technological, and cultural stability, which can survive an incredibly slow diaspora.

Colonisation worked on Earth because it didn't take long to cross the Atlantic by sailing ship. Successful colonisers landed where humans already existed and trade was easy.

It doesn't work on interstellar, never mind intergalactic time scales, because nothing stays stable for that long. Not hardware, software, politics, or culture.

Nor, on slightly longer periods, biology. On much longer periods, geology, and eventually astrophysics, because stars change, and planetary systems aren't unconditionally stable.

So a colonising wave from a unified culture is an incredibly unlikely thing, not at all an obvious necessity.


This would mean bootstrapping current advanced manufacturing technologies to a new planet. You would need so many different tools to do that that I seriously doubt we are currently capable of making it compact enough to be sent into deep space with current technologies. We're currently sending at most small capsules into deep space; my gut tells me that for self-contained factories we would need to send something on the size order of a skyscraper.

Yea, the paper discusses a probe with a mass between 50g and 500kg using a diamondoid data storage medium that holds ~6,250 exabytes per gram. Plenty of room for any blueprints you want to include, up to and including a planet full of humans. If not actually today’s tech, it is but a few years into the future. I’m sure my next computer will have a few hundred grams of that diamondoid storage.

Blueprints are the last of my concerns. What I think will be hard to do is to implement a full supply chain into a single space-travelling factory, including sourcing and refining of raw materials. But, regarding the blueprints, it now occurs to me that our "recipes" are made to work on our planet. Another one may lack some "ingredients" or have atmospheric conditions that could mess with the chemical reactions we use here. So we would need an advanced AI able to adapt production to the environment it finds.

Sourcing and refining materials are just blueprints. Adaptations for other environments are just different blueprints.

But honestly most of the work would be done in vacuum. Skip the planets, build the daughter probes out of asteroids. Most systems should have plenty of easily accessible material even if they don’t have a prominent asteroid belt, even if the probe has to scavenge the system’s oort cloud.


I've heard that removing latex gloves when breathing might help...

Classic case of PR leveraging a real, anecdotal observation from one single result, but completely flipping it to pretend it's a systematic finding, to sow doubt on all scientific findings around microplastics. The same companies behind this last story have done the same thing to slow down regulation limiting the impact of smoking, alcohol, processed foods, oil refining, global warming, lead pipes…

... or just a benign joke. Not everything is that serious, you know...

I wish the millions of people killed by decades of delayed safety legislation knew that pretending to make jokes (what became known as the "stochastic asshole" approach) was also a common tactic taught by those PR firms, to make critics sound like sourpusses.

Whatever demons you're fighting, they are not here in the room with us, friend. Be well.

Do you know what a "useful idiot" is, in the Soviet manipulation tradecraft?

Someone who repeats, jokingly or not, an argument that was placed somewhere deniable. One lab, looking at a small study, published a correction saying their estimates were wrong because they hadn't accounted for contamination from their own gloves. Do you know who knew about that? Every intern in every lab ever. This was a minor correction that should never have reached anyone except the 10 readers of their original report.

But, strangely, that story got wide coverage in the press: the usual "science" publications, the trade press, even mainstream media. Why? Because it was presented as a "they are making things up about microplastics" piece, and those can go really far. And that kind of coverage doesn't happen by accident.

So no, I don’t think you did that deliberately. But I know you read about it recently; I know you didn’t check what that original story was that triggered the coverage; I know you found that quaint—and I have no reason to think you deliberately tried to spread misinformation. But, you did. Because the people who want to sow doubt know what they are doing.


This sounds like the "acchhshually the iPhone wasn't the first touchscreen phone, we had the Motorola x34 vr34 t435 that did that one year before". Sure. Does anyone remember that phone? No? Well, the iPhone changed the world.

The iPhone was the product-ized version of the smartphone. Smartphones were not a new technology, Apple's implementation of it in the iPhone is not unique. Web browsing, caller ID and MP3 playback were not new or world-changing features for a mobile phone.

"the iPhone changed the world" and "ChatGPT changed the world" are indeed both midwit takes that will get you mocked in technical circles. Both products have a net negative impact on technological progress and directly contribute to the enshittification of their respective market segments.

