gpjanik's comments | Hacker News

Thank god we're getting rid of nuclear energy, btw.

The UK is building new nuclear (albeit slowly).

Didn't the gov just announce new reactors were being built? In any case UK has a lot of windpower.

We have at least recently accepted the Fingleton report's findings (https://www.gov.uk/government/publications/nuclear-regulator...) which is an attempt to get rid of a bunch of administrative blockers. I'm not aware of any new reactors being announced?


The price of electricity in the UK is going up right now to start paying for the next nuclear plant, 10 years (fingers crossed) before it generates any electricity.

They literally invented the idea of applying Contracts for Difference from finance to help build nuclear in the UK.

Those have massively helped renewables get built in the UK and elsewhere by letting governments cheaply subsidize financial risk in energy investments while allowing competitive developers to bid the price lower.
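
For anyone unfamiliar with the mechanism, here's a rough sketch of how a two-way CfD settles (a toy Python illustration with made-up numbers, not any real strike price):

  # Two-way Contract for Difference: the generator effectively always receives
  # the strike price, regardless of where the wholesale market price lands.
  def cfd_settlement(strike_price, market_price, mwh_generated):
      # Positive: top-up paid to the generator; negative: generator pays back.
      return (strike_price - market_price) * mwh_generated

  # Market below strike: the generator is topped up.
  print(cfd_settlement(90.0, 60.0, 1000))   # +30000.0
  # Market above strike: the generator pays the difference back.
  print(cfd_settlement(90.0, 110.0, 1000))  # -20000.0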

But it wasn't enough for nuclear, where the builders didn't want to be on the hook for the inevitable doubling (or worse) of final cost, so they wrote special rules for new nuclear to be paid in advance and not held to cost estimates.


> where the builders didn't want to be on the hook for the inevitable doubling (or worse) of final cost, so they wrote special rules for new nuclear to be paid in advance and not held to cost estimates.

Sounds relatively fair to me, since the vast majority of delays and costs are government-initiated (or enabled, e.g. by giving power to NIMBY groups). Ideally there are carve-outs for any delays or cost overruns that are solely the fault of the builders and operators. Projects like these will always have a certain amount of incompetence and graft to them, unfortunately.

Maybe once/if the nuclear industry can get un-destroyed by the government that destroyed it in the first place, these subsidies can go away. If subsidies are good for green energy, they are good for kickstarting the nuclear energy segment again. If countries can successfully hold the line and stick to a 20-year program of increasing the pace of building new plants, a robust industry just might emerge.


That's not happening right now.

"Pozsar’s argument: the moment Western nations froze Russian foreign exchange reserves, the assumed risk-free nature of these dollar holdings changed fundamentally. What had been viewed as having negligible credit risk suddenly carried confiscation risk."

This has nothing to do with the dollar. Almost all of the confiscated currency was in Europe, and it has to do with invading your bank's ally. No country in the world ever assumed that was risk-free; it wasn't being priced back then and isn't now, because nobody except Russia is stupid enough to do that.


Importantly, Russia fully expected to have access to those dollars during a war with the West, as that was the entire reason for stockpiling them in the first place, but Russia's central bank was not informed ahead of time that they were invading Ukraine and was completely unable to move the money before the West reacted.

The risk was known and expected. The risk that (hilariously) wasn't planned for was the risk that Putin is that fucking stupid.


Good point about Europe holding most of the frozen assets (~$300B vs ~$5B in the US). But that actually reinforces rather than contradicts Pozsar's framework; it's not specifically about the dollar versus other currencies, but about Western-controlled financial claims generally (whether USD, EUR, or other G7 assets) versus 'outside money' that can't be frozen by institutional decision.

The key insight, IMO, isn't 'dollars are risky,' it's 'any financial claim on a Western institution now carries confiscation risk if your geopolitical interests diverge.' Whether those claims are denominated in dollars or euros doesn't change the fundamental calculus for reserve holders. That's why we're seeing (and might see more) diversification toward gold and commodities: assets with less/different counterparty risk rather than just EUR-for-USD swaps.


There have been numerous cases of sanctioning and wealth confiscation (Afghanistan, Venezuela, Iraq, Iran, Libya), and "_now_ carries confiscation risk" is just factually incorrect. It has always carried such risk, the risk has materialized numerous times, and most importantly, no diversification is happening - literally nothing has changed: https://data.imf.org/en/news/4225global%20fx%20reserves%20de...

By the way, good luck with trusting China, Russia, or other places to store wealth more than Europe. It's total ignorance to believe there's somewhere safer to store your money than the West.


"Think in math, write in code" is the possibly worst programming paradigm for most tasks. Math notations, conventions and concepts usually operate under the principles of minimum description lenght. Good programming actively fights that in favor of extensibility, readability, and generally caters to human nature, not maximum density of notation.

If you want to put this to the test, try formulating a React component with autocomplete as a "math problem". Good luck.

(I studied maths, in case anyone is wondering where my beliefs come from; they come from actually having thought in maths while programming for a long time.)


The people I know who „think in math“ don’t think in the syntax of the written notation, including myself.


If you're adding computational, problem-breakdown, or heuristic steps on top of (or instead of) mathematical concepts, then you're doing the opposite of what the author proposes.

Scientific consensus in math is Occam's Razor, or the principle of parsimony. In algebra, topology, logic and many other domains, this means that rather than having many computational steps (or a "simple mental model") to arrive at an answer, you introduce a concept that captures a class of problems and use that. Very beneficial for dealing with purely mathematical problems, an absolute disaster for quick problem solving IMO.


> then you're doing the opposite of what the author proposes

No, it’s exactly what the author is writing about. Just check his example, it’s pretty clear what he means by “thinking in math”

> Scientific consensus in math is Occam's Razor, or the principle of parsimony. In algebra, topology, logic and many other domains, this means that rather than having many computational steps (or a "simple mental model") to arrive at an answer, you introduce a concept that captures a class of problems and use that.

I don’t even know what you mean by this.


  > Math notation, conventions and concepts usually operate under the principle of minimum description length.
I think you're attributing to math something that comes from the fact that math is typically written by hand on paper or a board. That optimization is not due to math itself; it is due to the communication medium.

If we look at older code, we'll actually see a similar thing. People were limited in character lengths, so the exact same thing happened. That's why there are still conventions like 80-character text width, though now those serve as a style guide rather than a hard rule. It also helps that we can autocomplete variables.

Your criticism is misplaced.


Language models don't "pack concepts" into the C dimension of one layer (I guess that's where the 12k number came from), nor do they have to be orthogonal to be viewed as distinct or separate. LLMs generally aren't trained to make distinct concepts far apart in the vector space either. The whole point of dense representations is that there's no clear separation between which concept lives where. People train sparse autoencoders to work out which neurons fire based on the topics involved. Neuronpedia demonstrates it very nicely: https://www.neuronpedia.org/.


The sparse autoencoder work is /exactly/ premised on the kind of near-orthogonality that this article talks about. It was originally called the 'superposition hypothesis': https://transformer-circuits.pub/2022/toy_model/index.html

The SAE's job is to try to pull apart the sparse, nearly-orthogonal 'concepts' from a given embedding vector, by decomposing the dense vector into a sparse activation over an over-complete basis. They tend to find that this works well, and even allows matching embedding spaces between different LLMs efficiently.
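
For the curious, a minimal toy sketch of the idea (assuming PyTorch; the dimensions and L1 coefficient are made up, and this is nowhere near a production SAE):

  import torch
  import torch.nn as nn

  # Toy sparse autoencoder: decompose a dense d_model-dim activation into a
  # sparse code over an over-complete basis of n_features >> d_model directions.
  class SparseAutoencoder(nn.Module):
      def __init__(self, d_model=768, n_features=16384):
          super().__init__()
          self.encoder = nn.Linear(d_model, n_features)
          self.decoder = nn.Linear(n_features, d_model)

      def forward(self, x):
          code = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
          recon = self.decoder(code)           # reconstruction from the few active features
          return recon, code

  sae = SparseAutoencoder()
  x = torch.randn(32, 768)                     # batch of dense activations
  recon, code = sae(x)
  loss = ((recon - x) ** 2).mean() + 1e-3 * code.abs().mean()  # reconstruction + L1 sparsity
  loss.backward()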


Agreed, but that's not in the C dimension of a first-layer embedding of a single token; it's across the whole model, which is what I said in the comment above.


Agreed, if you relax the requirement for perfect orthogonality, then, yes, you can pack in much more info. You basically introduce additional (fractional) dimensions clustered with the main dimensions. Put another way, many concepts are not orthogonal, but have some commonality or correlation.
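
A quick way to see this (a NumPy sketch with arbitrary sizes, nothing specific to any LLM): random directions in high-dimensional space are already nearly orthogonal, so you can fit far more "almost distinct" directions than you have dimensions.

  import numpy as np

  rng = np.random.default_rng(0)
  d, n = 1000, 20000                     # 20k random "concept" directions in 1k dims
  V = rng.standard_normal((n, d))
  V /= np.linalg.norm(V, axis=1, keepdims=True)

  # Cosine similarity of a few thousand random pairs of distinct directions.
  i = rng.integers(0, n, 5000)
  j = rng.integers(0, n, 5000)
  mask = i != j
  cos = np.einsum('ij,ij->i', V[i[mask]], V[j[mask]])
  print(np.abs(cos).mean(), np.abs(cos).max())   # typically ~0.03 mean, ~0.1 max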

So nothing earth shattering here. The article is also filled with words like "remarkable", "fascinating", "profound", etc. that make me feel like some level of subliminal manipulation is going on. Maybe some use of an LLM?


It's... really not what I meant. This requirement does not have to be relaxed, it doesn't exist at all.

Semantic similarity in embedding space is a convenient accident, not a design constraint. The model's real "understanding" emerges from the full forward pass, not the embedding geometry.


I'm speaking more in general conceptual terms, not about the specifics of LLM architecture.


"Here's the key insight: each forward pass processes ALL tokens in ALL sequences simultaneously."

This sounds incorrect; you only process all tokens once, and later incrementally. It's an auto-regressive model, after all.


Not during prefill, i.e. the very first token generated in a new conversation. During this forward pass, all tokens in the context are processed at the same time and the attention KVs are cached; you still generate a single token, but you need to compute attention from all tokens to all tokens.

From that point on, every subsequent token is processed sequentially in an autoregressive way, but because we have the KV cache, each step becomes O(N) (one token's query against all tokens) rather than O(N^2).
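
A toy single-head sketch of the two phases (plain NumPy with random stand-in projections, not any real framework's API):

  import numpy as np

  rng = np.random.default_rng(0)
  d = 64

  def attend(q, K, V):
      # One query against all cached keys/values.
      scores = (K @ q) / np.sqrt(d)
      w = np.exp(scores - scores.max())
      w /= w.sum()
      return w @ V

  # Prefill: all N prompt tokens are processed in one pass (O(N^2) attention),
  # and their key/value projections are cached.
  N = 16
  K_cache = rng.standard_normal((N, d))
  V_cache = rng.standard_normal((N, d))

  # Decode: each new token computes a single query against the whole cache -> O(N) per step.
  def decode_step(q_new, k_new, v_new):
      global K_cache, V_cache
      K_cache = np.vstack([K_cache, k_new])
      V_cache = np.vstack([V_cache, v_new])
      return attend(q_new, K_cache, V_cache)

  out = decode_step(rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d))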


I somehow missed the "decode phase" paragraph and hence was confused - it's essentially that separation I meant, you're obviously correct.


This is fine-tuned to the benchmarks and nowhere close to O1-Preview on any other task. Not worth looking into unless you specifically want to solve these problems; however, it's still impressive.


We beat O1-preview and even many other 7B models on many math benchmarks, which were TEST sets (not in the training set at all).

If you want to make the model fully generalist, feel free to train it over coding datasets (such as RL with passing unit tests as reward).


It's already a good accomplishment as it is, but I think it'd be very surprising if training such a small model as a generalist scaled to the same degree as specialized finetuning. At some point you have to fit more background data and relations into the same amount of information space... but it's hard to say how much that is the case for a given size vs. what we just haven't optimized yet. Unfortunately I think that will have to wait for someone with more compute before we can verify this (times a dozen) one way or the other :).

Side question, since it sounds like you were involved: how big is the impact on benchmarks of taking this 1.5B model down from fp32 to fp8 or similar? The focus on parameters alone sometimes feels like comparing house sizes by their lengths alone. And, if you were indeed involved, thanks for making all of this open and available!


For quantization, very big impact for small models; it can drop as much as 10% on AIME. Our model does best in bfloat16 ;)
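
For intuition on where that drop comes from, here's a naive symmetric int8 round-trip (a generic illustration, not how any particular inference stack actually quantizes):

  import numpy as np

  rng = np.random.default_rng(0)
  w = rng.standard_normal((256, 256)).astype(np.float32)   # stand-in weight matrix

  scale = np.abs(w).max() / 127.0
  w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
  w_dequant = w_int8.astype(np.float32) * scale

  # Every weight is perturbed by up to ~scale/2; a small model has less
  # redundancy to absorb that noise than a large one.
  print("max abs error:", np.abs(w - w_dequant).max())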

Come check out our repo at: https://github.com/agentica-project/deepscaler


It is a great discovery; it could even open up the next step in AI with MoM, "Mixture of Models", where small fine-tuned models each take a part of a task (instead of the current MoE).
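
A toy sketch of what such routing could look like (the model names and the keyword rule are purely hypothetical):

  # Toy "mixture of models" router: pick a specialist model per query.
  SPECIALISTS = {
      "math":    "some-1.5b-math-model",
      "code":    "some-1.5b-code-model",
      "general": "some-7b-generalist",
  }

  def route(query: str) -> str:
      q = query.lower()
      if any(w in q for w in ("equation", "prove", "integral", "aime")):
          return SPECIALISTS["math"]
      if any(w in q for w in ("function", "compile", "bug", "unit test")):
          return SPECIALISTS["code"]
      return SPECIALISTS["general"]

  print(route("Solve the equation x^2 - 5x + 6 = 0"))   # -> some-1.5b-math-model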


Check out a prior work of mine: https://stylus-diffusion.github.io/

This work scales up selection/routing over many models/LoRAs


Love it, will check, thank you for showing / sharing all of that!


o1 is more than just a math solver, and you cannot possibly train that much into a small model.

However, smaller specialized models look to be the right way to handle the world's complexity. Sort of a mixture of experts one level above. Orchestrating them will be another problem. A possible solution is a generalist model "to rule them all".


Have you considered the very practical importance of running specialized models for specialized tasks on common hardware (maybe a couple of CPU cores in a couple GB of RAM)?


Small models are just tools. Even many of them will make only a toolset. They don't evolve into AGI by themselves. But putting them together in a structure (a brain) may result in something close. Like a big smart calculator. It takes more to create a 'character' similar to, say, the Terminator.


I disagree. They demonstrated a way to dramatically reduce training costs, 18x cheaper than R1. That alone is worth attention.

Also beating O1 on any benchmark is nontrivial.


I'm not so sure it's impressive even for mathematical tasks.

When ChatGPT came out, there was a flood of fine-tuned LLMs claiming ChatGPT-level performance for a fraction of the size. Every single time this happened, it was misleading.

These LLMs were able to score higher than ChatGPT because they took a narrow set of benchmarks and fine-tuned for those benchmarks. It's not difficult to cheaply fine-tune an LLM for a few benchmarks and beat a SOTA generalist LLM on those benchmarks. Comparing a generalist LLM to a specialist LLM is like comparing apples to oranges. What you want is to compare specialist LLMs to other specialist LLMs.

It would have been much more interesting and valuable if that was done here. Instead, we have a clickbait, misleading headline and no comparisons to math specialized LLMs which certainly should have been performed.


But if that's the case - what do the benchmarks even mean then?


Automated benchmarks are still very useful. Just less so when the LLM is trained in a way to overfit to them, which is why we have to be careful with random people and the claims they make. Human evaluation is the gold standard, but even it has issues.


The question is how do you train your LLMs to not 'cheat'?

Imagine you have an exam coming up, and the set of questions leaks - how do you prepare for the exam then?

Memorizing the test problems would be obviously problematic, but maybe practicing the problems that appear on the exam would be less so, or just giving extra attention to the topics that will come up would be even less like cheating.

The more honest the approach you choose, the more indicative your training will be of exam results, but everybody decides how much cheating they allow themselves, which makes it a test of the honesty, not the skill, of the student.


I think the only way is to check your dataset for benchmark leaks and remove them before training, but (as you say) that assumes an honest actor is training the LLM, which goes against the incentive of leaving the benchmark leak in the training data. Even then, a benchmark leak can make it through those checks.

I think it would be interesting to create a dynamic benchmark. For example, a benchmark that uses a math template with random values determined at evaluation time. The correct answer would be different for each run. Theoretically, training on it wouldn't help beat the benchmark because the random values would change the answer. Maybe this has already been done.
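
A toy sketch of that idea (a hypothetical templated item, just to illustrate):

  import random

  # "Dynamic" benchmark item: the template is fixed, but the numbers (and thus
  # the correct answer) are drawn fresh at evaluation time, so memorizing a
  # past instance of the question doesn't help.
  def make_item(rng):
      a, b, c = rng.randint(2, 99), rng.randint(2, 99), rng.randint(2, 99)
      question = f"A train travels at {a} km/h for {b} hours, then another {c} km. Total distance in km?"
      answer = a * b + c
      return question, answer

  rng = random.Random()          # unseeded on purpose: different on every run
  q, ans = make_item(rng)
  print(q, "->", ans)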


A lot of people in the community are wary of benchmarks for this exact reason.


I tested it on basic long addition problems. It frequently misplaced the decimal signs, used unnecessary reasoning tokens (like restating previously done steps) and overall seemed only marginally more reliable than the base DeepSeek 1.5B.

On my own pet eval, writing a fast Fibonacci algorithm in Scheme, it actually performed much worse. It took a much longer tangent before arriving at the fast doubling algorithm, but then completely forgot how to even write S-expressions, proceeding instead to imagine Scheme uses a Python-like syntax while babbling about tail recursion.
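
For reference, the fast doubling identities the model was reaching for are F(2k) = F(k) * (2*F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2; shown here in Python rather than Scheme, just to keep the sketch short:

  def fib_fast_doubling(n):
      """Return (F(n), F(n+1)) using the fast doubling identities."""
      if n == 0:
          return (0, 1)
      a, b = fib_fast_doubling(n // 2)   # a = F(k), b = F(k+1), k = n // 2
      c = a * (2 * b - a)                # F(2k)
      d = a * a + b * b                  # F(2k+1)
      return (c, d) if n % 2 == 0 else (d, c + d)

  print(fib_fast_doubling(100)[0])       # 354224848179261915075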


> On my own pet eval, writing a fast Fibonacci algorithm in Scheme,

This model was trained on math problems datasets only, it seems. It makes sense that it's not any better at programming.


The original model, aside from its programming mistakes, also misremembered the doubling formula. I hoped to see that solved, which it was, as well as maybe a more general performance boost from recovering some distillation loss.


This model can't code at all.

It does high school math homework, plus maybe some easy physics. And it does them surprisingly well. Outside of that, it fails every test prompt in my set.

It's a pure specialist model.


It's absolutely worth looking into.

It's a great find


Oh yeah, there are 4 slightly chubby guys in "elite" sports out of 10 000, therefore we got athleticism wrong.

Also, sorry to say it so directly, but none of these guys would have a remote possibility of playing (athletically speaking, not talking about ability) in third-tier soccer in Italy or Spain, or in any actually physically difficult sport (e.g. climbing). It's a rule in these sports that people look at minimum super fit, at maximum godlike, and the few exceptions that exist show extremely visible downsides (and it's clear they'd be better off being more athletic).

For context, here's a picture of a soccer player considered unfit (consistently rated at around 70/100 physically in various rankings/video games/etc.):

https://64.media.tumblr.com/9a64e77011b24cf2aff656356587de97...


I cannot possibly comprehend how someone could think that NFL football and NBA basketball are ‘not physically difficult’ sports.


Ugh, NFL players need to run on average about half of what third-tier soccer players do. At the extreme, a goalkeeper runs about as much as the most-running NFL player. For sprinting, peak speed is similar, but soccer players run 2-3 km of sprints during a game (1-1.5 km for the NFL). I'm not bringing the NBA into this because it's just not a running sport altogether.

For explosiveness, the top speeds of the NFL and the top 5 leagues in Europe are comparable, but more consistent among soccer players. They of course have to run with a ball at their feet, rather than in hand, which makes it technically harder. For jumping, tall soccer players are closer to NBA players than to the NFL (Tomori, Ronaldo, Lewandowski, etc. jump around 80 cm).

In terms of agility, the NFL and the top 5 leagues are similar, about 3 seconds to 30 km/h, but of course the best-performing players in soccer are better.

So, with some similar parameters, soccer players do what NFL players do, but for three times as long. That's the difference between "I can do this with a bit of a belly" and "I need to look like a god to even survive this game without getting a heart attack".

Edit: my point above wasn't that it's not physically difficult at all, it was that these are not _elite_ sports in terms of physical requirements. Swimming, climbing, sprinting, soccer (mainly by virtue of how professionalized it is), bicycle racing: that's physically super difficult. Basketball is super technical and relatively chill in physical requirements compared to these sports, and the NFL is generally challenging but not nearly as much as the "top" sports, unless you specifically cherry-pick comparisons to favor heavy, fast people. I chose rather versatile metrics that focus on input, e.g. how much you need to train to become fit enough.


And a soccer player runs 1/20th of what a marathoner does. Kipchoge would smoke Messi in a long-distance race. And I don't think Messi would get a hit off of Gerrit Cole, or have a snowball's chance in hell of stopping JJ Watt.


I mean, again, even within a single sport there are role differences, but the fitness you need to build the 130 kg of muscle mass that JJ Watt carries and the fitness you need to run the 38.5 km/h that Mbappe does... are just not the same level of fitness.

On the reverse, Mbappe has less strength, and JJ Watt moves like a tank at 27 km/h. If you compare them to elite strength sports, they're both weak. If you compare them to sprinters, Mbappe is an amateur sprinter, and JJ Watt is disabled.

Sportsmen specialize in what they do, but the NFL simply doesn't require versatility, so its players are good in fewer categories, and not very good in any. Soccer players are good in multiple categories and, by virtue of soccer being the more popular and competitive sport with an insanely bigger selection pool, occasionally very good in one or two (e.g. Bale, Mbappe, and other freaks of nature who are essentially sprinters).

Also, a soccer player does not run 1/20th of what a marathoner runs. Elite wingers run just under _a third of a marathon_ each game, of which 3 km can be sprints.


You are grading American football on a world football rubric. One could just say that world football is not demanding because you don’t need much strength, you can be small and fast (easier than being big and fast), you don’t have to be physically resilient enough to get flattened by a 350 pound man and jump back up, you don’t need hand skills, etc.


NFL has a lot less running than soccer, grandparent thinks all athleticism looks like soccer.

Imagine making a weightlifter run for an hour.


What is "pirate" about this?

A lot of big companies self-host npm to avoid supply chain attacks.


Where is the EU when you need it? Subscriptions are a mess, and this is one place where the EU could've forced something, but it won't.

I also think they're mentally aligned with the idea of having to go through 20 forms to achieve something, as that's their daily job.


Hi from Germany. In case you were wondering, we regulated ourselves to the point where I can't even see the demo of SAM2 until some service other than Meta deploys it.

Does anyone know if this already happened?


It’s more like “Meta is restricting European access to models even though they don’t have to, because they believe it’s an effective lobbying technique as they try to get EU regulations written to their preference.”

The same thing happened with the Threads app which was withheld from European users last year for no actual technical reason. Now it’s been released and nothing changed in between.

These free models and apps are bargaining chips for Meta against the EU. Once the regulatory situation settles, they’ll do what they always do and adapt to reach the largest possible global audience.


> Meta is restricting European access to models even though they don’t have to

This video segmentation model could be used by self-driving cars to detect pedestrians, or in road traffic management systems to detect vehicles, either of which would make it a Chapter III High-Risk AI System.

And if we instead say it's not specific to those high-risk applications, it is instead a general purpose model - wouldn't that make it a Chapter V General Purpose AI Model?

Obviously you and I know the "general purpose AI models" chapter was drafted with LLMs (and their successors) in mind, rather than image segmentation models - but it's the letter of the law, not the intent, that counts.


> The same thing happened with the Threads app which was withheld from European users last year for no actual technical reason. Now it’s been released and nothing changed in between.

No technical reason, but legal reasons. IIRC it was about cross-account data sharing from Instagram to Threads, which is a lot more dicey legally in the EU than in NA.


It’s not like Meta doesn’t know how it works. They ship many apps that share accounts like FB + Messenger most prominently.

They’ve also had separate apps in the past that shared an Instagram account, like IGTV (2018 - 2022).

The Threads delay was primarily a lobbying ploy.


No, it really was a legal privacy thing. I worked in privacy at Meta at that time. Everybody was eager to ship it everywhere, but it wasn't worth the wrath of the EU to launch without a clear data separation between IG and Threads.


Not saying you're wrong, but in this instance it might be a regulation specific to Germany since the site works just fine from the Netherlands.


Sounds like big tech's strategy to make you protest against regulating them is working brilliantly.


Regulation in this space works exclusively in favor of big tech, not against them. Almost all of that regulation was literally written for the benefit of, and with the aid of, big tech.


Hi also from Germany - works fine here


Looking at it right now from Denmark. You must have some other problem.


Which German regulation prevents this? Is it biometric related?

It seems that https://mullvad.net is a necessary part of my Internet toolkit these days, for many reasons.

