Hacker News | GistNoesis's comments

What they need is not so much memory but memory bandwidth.

For training, their models need a certain amount of memory to store the parameters, and this memory is touched for every example of every iteration. Big models have 10^12 (>1T) parameters, and with typical values of 10^3 examples per batch and 10^6 iterations, they need ~10^21 memory accesses per run. And they want to do multiple runs.

DDR5 RAM bandwidth is 100 GB/s = 10^11 B/s; graphics RAM (HBM) is 1 TB/s = 10^12 B/s. By buying the wafers they get to choose which type of memory they get.

10^21 / 10^12 = 10^9 s = 30 years of memory access (just to update the model weights); you also need to add a factor of 10^1-10^3 to account for the memory accesses needed for the model computation.

But the good news is that it parallelizes extremely well. If you parallelize your 1T parameters 10^3 times, your run time is brought down to 10^6 s = 12 days. But you then need 10^3 * 10^12 = 10^15 bytes of RAM per run for the weight updates and 10^18 for the computation (your 120 billion gigabytes is 10^20, so not so far off).
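
Roughly, in code (orders of magnitude only, using the assumptions above):

    # Orders of magnitude only, using the assumptions stated above.
    params = 1e12              # ~1T parameters
    batch = 1e3                # examples per batch
    iters = 1e6                # iterations per run
    accesses = params * batch * iters          # ~1e21 parameter touches per run

    hbm_bandwidth = 1e12       # ~1 TB/s for HBM
    seconds_single = accesses / hbm_bandwidth  # ~1e9 s on one memory interface
    seconds_parallel = seconds_single / 1e3    # 10^3-way parallelism

    print(seconds_single / 3.15e7, "years on a single memory interface")
    print(seconds_parallel / 86400, "days with 10^3-way parallelism")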

Are all these memory accesses technically required? No, if you use other algorithms, but more compute and memory is better if money is not a problem.

Is it strategically good to deprive your competitors of access to memory? Yes, in a very short-sighted way.

It's a textbook cornering of the computing market to prevent the emergence of local models, because customers won't be able to buy the minimal RAM necessary to run the models locally, even just for inference (not training). Basically a war on people, where little Timmy won't be able to get a RAM stick to play computer games at Xmas.


Thanks - but this seems like fairly extreme speculation.

> if money is not a problem.

Money is a problem, even for them.


In the video, in the continuous version, the game never ends and highlights the "loser" strategy.

When you are behind, the optimal play is to make a gamble, which will most likely leave you even worse off. From the naive winning side it looks like the loser is just playing a stupid strategy by not following the optimal dichotomy strategy, and that's why they are losing. But in fact they are a "player" doing not only their best, but the best that can be done.

The infinite sum of ever smaller probabilities, like in Zeno's paradox, converges towards a finite value. The inevitable outcome is that a large fraction of the time you are playing catch-up and will never escape.

You are losing while playing optimally, slowly realising from the probabilities that you are a loser, as evidenced by the score, which will most likely go down even more next round. Most likely the entire future is an endless sequence of ever more desperate-looking losing bets, just hoping for the one strike that will most likely never come.

In economics such things are called "traps"; for example the poverty trap exhibits similar mechanics. Even though you display incredible ingenuity by playing the optimal game strategy, most of the time you will never escape, and you will need to take even more desperate measures in the future. That's separating the wheat from the chaff, seen from the chaff's perspective, or how you make good villains: because, like Bane in Batman, there are some times (the probability is slim but finite) where the gamble pays off, and you escape the hell hole you were born in and become legend.

If you don't play this optimal strategy you will lose slower but even more surely. The optimal strategy is to bet just enough to go from your current situation to the winning side. It's also important not to overshoot: this is not always taking moonshots, but betting just enough to escape the hole, because once out, the probabilities play in your favor.


TLDR: 1B particles ~ 3s per iteration

For examples like particle simulations, on a single node with a 4090 GPU, with everything running on the GPU and no memory transfers to the CPU:

- The main bottleneck is memory usage: 24GB available; storing each particle's 3 position coordinates + 3 velocity coordinates at 4 bytes per number (float32) gives 24 bytes per particle = max ~1B particles.

- Then GPU memory bandwidth: if everything stays on the GPU you get between 1000GB/s for global memory accesses and 10000GB/s when shared memory caches are hit. The number of memory accesses is roughly proportional to the number of effective collisions between your particles, which is proportional to the number of particles, so around 12-30 times it (see the optimal sphere-packing number of neighbors in 3D, multiplied by your overlap factor). All in all, for 1B particles, you can collide them all and move them in 1 to 10 s.

If you have to transfer things to the CPU, you become limited by the PCI Express 4.0 bandwidth of 16GB/s, so you can move 1B particles to and from the GPU at most 0.7 times per second.

Then if you want to store the particles on disk instead of RAM, because your system is bigger, you can either use an M.2 SSD (but you will burn through them quickly), which has a theoretical bandwidth of 20GB/s, so not a bottleneck, or use network storage over 100Gb/s (= 12.5GB/s) ethernet, via two interfaces to your parameter server, which can be as big as you can afford.

So to summarize so far: 1B particles takes 1 to 10 s per iteration per GPU. If you want a smarter integration scheme like RK4, you divide by 6. If you need 64-bit precision you divide by 2. If you only need 16-bit precision you can multiply by 2.
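
Restating those numbers as a rough budget (assumptions only, not measurements; 20 neighbors is the middle of the 12-30 range above):

    # Rough budget restating the numbers above; assumptions, not measurements.
    bytes_per_particle = 6 * 4                    # 3 positions + 3 velocities, float32
    vram = 24e9                                   # RTX 4090
    max_particles = vram / bytes_per_particle     # ~1e9 particles

    bandwidth = 1e12                              # ~1000 GB/s global memory
    neighbors = 20                                # middle of the 12-30 collision-partner range
    bytes_touched = max_particles * bytes_per_particle * neighbors
    print(f"{max_particles:.0e} particles, ~{bytes_touched / bandwidth:.1f} s per iteration (lower bound)")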

The number of particles you need: volume of the box / h^3, with h the diameter of a particle = the finest detail you want to be able to resolve.

If you use an adaptive scheme, most of your particles are close to the surface of objects, so O(surface of objects / h^2) with h = the average resolution of the surface mesh. But an adaptive scheme is 10 times slower.

The precision of the approximation can be bounded with the Taylor formula. SPH is typically order 2, but has issues with boundaries, so to represent a sharp boundary h must be small.

If you want higher order and sharp boundaries, you can do Finite Element Method, instead. But you'll need to tessellate the space with things like Delaunay/Voronoi, and update them as they move.


Might be worth starting with a baseline where there’s no collision, only advection, and assume higher than 1fps just because this gives higher particles per second but still fits in 24GB? I wouldn’t be too surprised if you can advect 100M particles at interactive rates.


The theoretical maximum rate for 1B particle advection (just doing p[] += v[]*dt) is 1000GB/s / 24GB = 42 iterations per second. If you only have 100M particles you can have 10 times more iterations.

But that's without any rendering, and with non-interacting particles, which are extremely boring unless you like fireworks. (You can add a term like v[] += g*dt for free.) And you don't need to store colors for your particles if you can compute the colors from the particle index with a function.
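
A minimal sketch of that baseline in PyTorch (assuming a CUDA device; in eager mode the two updates are separate bandwidth-bound passes, so measured throughput will sit below the theoretical maximum):

    import time
    import torch

    n = 100_000_000                                     # 100M particles, ~2.4GB for p and v
    p = torch.rand(n, 3, device="cuda")                 # positions, float32
    v = torch.randn(n, 3, device="cuda") * 0.01         # velocities
    g = torch.tensor([0.0, -9.81, 0.0], device="cuda")
    dt = 1e-3

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    steps = 100
    for _ in range(steps):
        v += g * dt      # "free" in a fused kernel; a separate pass over v here
        p += v * dt      # pure advection: memory-bandwidth bound
    torch.cuda.synchronize()
    print(f"{steps / (time.perf_counter() - t0):.1f} iterations/s")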

Rasterizing is slower, because each pixel of the image might get touched by multiple particles (which means concurrent accesses to the same memory address, which GPUs don't like).

Obtaining the screen coordinates is just a matrix multiply, but rendering the particles in the correct depth order requires multiple passes, atomic operations, or z-sorting. Alternatively you can slice your point cloud, weighting the particles with a peak-shaped function around the desired depth value and using an order-independent reduction like a sum, but memory accesses are still concurrent.

For the rasterizing, you can also use the space-partitioning indices of the particles to render parts of the screen independently without concurrent-access problems. That's called "tile rendering": each tile renders the subset of particles that may fall in it. (There is plenty of literature in the Gaussian Splatting community.)
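
A toy CPU-side sketch of the binning step (hypothetical screen and tile sizes; a real renderer does this on the GPU from the particles' space-partitioning indices):

    import numpy as np

    W, H, TILE = 1920, 1080, 16
    xy = np.random.rand(1_000_000, 2) * [W, H]           # hypothetical projected positions
    tiles_x = W // TILE

    tile_id = (xy[:, 1] // TILE).astype(int) * tiles_x + (xy[:, 0] // TILE).astype(int)
    order = np.argsort(tile_id)                           # group particles by tile
    starts = np.searchsorted(tile_id[order], np.arange(tiles_x * (H // TILE) + 1))
    # Each tile now owns the contiguous slice order[starts[i]:starts[i+1]] and can
    # rasterize it without concurrent writes to pixels owned by other tiles.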


> The theoretical maximum rate for 1B particle advection (just doing p[] += v[]*dt) is 1000GB/s / 24GB = 41.667 ≈ 42 iterations per second.

Just to clarify, the 24GB comes from multiplying 1B particles by 24 bytes? Why 24 bytes? If we used float3 particle positions, the rate would presumably be mem_bandwidth / particle_footprint. If we use a 5090, then the rate would be 1790GB/s / 12B = 146B particles / second (or 146fps of 1B particles).

> non interacting particles which are extremely boring

You assumed particle-particle collision above, which is expensive and might be over-kill. The top comment asked simply about the maximum rate of moving particles. Since interesting things take time & space, the correct accurate answer to that question is likely to be less interesting than trading away some time to get the features you proposed; your first answer is definitely interesting, but didn’t quite answer the question asked, right?

Anyway, I’m talking about other possibilities, for example interaction with a field, or collision against large objects. Those are still physically interesting, and when you have a field or large objects (as long as they’re significantly smaller footprint than the particle data) they can be engineered to have high cache coherency, and thus not count significantly against your bandwidth budget. You can get significantly more interesting than pure advection for a small fraction of the cost of particle-particle collisions.

Yes if you need rendering, that will take time out of your budget, true and good point. Getting into the billions of primitives is where ray tracing can sometimes pay off over raster. The BVH update is a O(N) algorithm that replaces the O(N) raster algorithm, but the BVH update is simpler than the rasterization process you described, and BVH update doesn’t have the scatter problem (write to multiple pixels) that you mentioned, it’s write once. BVH update on clustered triangles can now be done at pretty close to memory bandwidth. Particles aren’t quite as fast yet, AFAIK, but we might get there soon.


PyTorch's choice of a dynamic computation graph [1] made it easier to debug and implement models, leading to higher adoption, even though running speed was initially slower (and therefore training cost higher).
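
For illustration, the kind of data-dependent forward pass that a dynamic graph makes trivial to write and debug (toy shapes, nothing specific):

    import torch

    def forward(x, w):
        h = x @ w
        if h.norm() > 1.0:                                    # branch decided by the data at runtime
            h = torch.tanh(h)
        for _ in range(int(h.abs().mean().item() * 3) + 1):   # data-dependent loop length
            h = h + 0.1 * torch.relu(h)
        return h.sum()

    x, w = torch.randn(4, 8), torch.randn(8, 8, requires_grad=True)
    forward(x, w).backward()                                  # autograd records whichever path actually ran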

Other decisions follow from this one.

TensorFlow started static and had to move to dynamic graphs in version 2.0, which broke everything and fragmented the ecosystem between TensorFlow 1, TensorFlow 2, Keras, and JAX.

PyTorch's later compilation of this computation graph erased TensorFlow's remaining edge.

Is the battle over? From a purely computational point of view, PyTorch's solution is very far from optimal and billions of dollars of electricity and GPUs are burned every year, but the major players are happy with circular deals to entrench their positions. So at the pace of current AI code development, probably one or two years before PyTorch is old history.

[1] https://www.geeksforgeeks.org/deep-learning/dynamic-vs-stati...


Someone’s got to prototype the next generation of architectures.


> at the pace of current AI code development, probably one or two years before Pytorch is old history.

Ehhh, I don’t know about that.

Sure, new AI techniques and new models are coming out pretty fast, but when I go to work with a new AI project, they’re often using a version of PyTorch or CUDA from when the project began a year or two ago. It’s been super annoying having to update projects to PyTorch 2.7.0 and CUDA 12.8 so I can run them on RTX 5000 series GPUs.

All this to say: If PyTorch was going to be replaced in a year or two, we’d know the name of its killer by now, and they’d be the talk of HN. Not to mention that at this point all of the PhDs flooding into AI startups wrote their grad work in PyTorch, it has a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at. I don’t even know what that would be.

Bear in mind that it took a few years for Tensorflow to die out due to lock in, and we all knew about PyTorch that whole time.


> a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at

The cost of migrating higher-level code to a newer framework is going to 0. You ask your favorite agent (or intern) to port the code and check that the migration is exact. We already see this across the multitude of deep-learning frameworks.

The day an optimization trick appears that PyTorch can't do but another framework can, one that reduces your training cost 10x, PyTorch goes the way of the dodo.

The day an architecture that can't be implemented in PyTorch gets superior performance, it's bye-bye Python.

We see this with architectures which require real-time rendering like Gaussian Splatting (Instant NeRF), or the caching strategies for LLM sequence generation.

PyTorch has 3 main selling points:

- Abstracting away the GPU- (or device-) specific code, which exists because of Nvidia's mess of custom optimized kernels, which you are forced to adapt to if you don't want to write custom kernels yourself.

This matters less if you don't mind writing optimized kernels because the machine writes them for you; or if you don't need CUDA because you can't use Nvidia hardware, for example because you are in China; or if you use custom silicon, like Groq, and need your own kernels anyway.

- Automatic differentiation. It's one of its weak points, because they went for easy instead of optimal, and shut themselves out of some architectures. Some languages like Julia, thanks to dynamic low-level compilation, can do things PyTorch won't even dream about (but Julia has its own problems, mainly related to memory allocations). With PyTorch's introduction of the "scan" function [2] we have come full circle back to Theano, TensorFlow's/Keras's ancestor; scan is usually where the pain of PyTorch's chosen automatic differentiation strategy shows.

The optimal solution, as all physics PhDs who wrote simulations know, is writing custom adjoint code via 'source code transformation' or symbolically: it's not hard but very tedious, so it's now a great fit for your LLM (or intern, or PhD candidate running 'student gradient descent'), provided you prove or check that your gradient calculation is OK (see the finite-difference sketch at the end of this comment).

- Cluster orchestration and serialization: a model can be shared with fewer security risks than arbitrary source code, because you only share weights. A model can be split between machines dynamically. But this is also a big weakness, because your code rusts as you become dependent on versioning: you are locked to the specific version your model was trained with.

[2] "https://docs.pytorch.org/xla/master/features/scan.html


What would stop PyTorch from implementing whatever optimization trick becomes important? Even if it requires a different API.


There are two types of stops : soft stops, and hard stops.

- A soft stop is when the dynamic graph computation overhead is too much, which means you can still calculate, but if you were to write the function manually or with a better framework, you could be 10x faster.

Typical examples involve manually unrolling a loop, or doing kernel fusion. Another typical example is when you have lots of small objects or need to do loops in Python because the problem doesn't vectorize well. Or exploiting sparsity efficiently by ignoring the zeros.
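
A toy illustration of the overhead (the exact ratio depends on hardware; the Python loop creates one tiny op per element while the vectorized form is a single kernel):

    import time
    import torch

    x = torch.randn(100_000)

    t0 = time.perf_counter()
    s = torch.zeros(())
    for xi in x:                 # one tiny op (and graph node) per element
        s = s + xi * xi
    t_loop = time.perf_counter() - t0

    t0 = time.perf_counter()
    s2 = (x * x).sum()           # one vectorized kernel
    t_vec = time.perf_counter() - t0

    print(t_loop / t_vec)        # typically several orders of magnitude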

- A hard stop is when computing the function becomes impossible, because the memory needed to do the computation in a non-optimal way explodes. Sometimes you can get away with just writing customized kernels.

The typical example where you can get away with it is custom attention layers.

Typical examples where you can't are physics simulations. For example, the force is the gradient of the energy, but you have n^2 interactions between the particles, so if you preserve anything more than 0 memory per interaction during the forward pass, your memory consumption explodes. And typically, with things like Lagrangian or Hamiltonian neural networks, where you are looking to discover the dynamics of an energy-conserving system, you need to be able to differentiate at least three times in a row.
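
A toy version of the n^2 problem (hypothetical 1/r pair potential): naive autograd keeps the whole (n, n, 3) intermediate alive for the backward pass, so memory grows quadratically with n even though the force itself only needs O(n) storage.

    import torch

    n = 2000                                        # already ~50MB of intermediates; scale n and watch it explode
    x = torch.randn(n, 3, requires_grad=True)

    def energy(x):
        diff = x[:, None, :] - x[None, :, :]        # (n, n, 3) intermediate kept for backward
        r2 = (diff * diff).sum(-1) + torch.eye(n)   # pad the diagonal so rsqrt stays differentiable there
        inv_r = torch.rsqrt(r2)
        return inv_r.triu(diagonal=1).sum()         # sum the 1/r potential over unique pairs

    E = energy(x)
    # create_graph=True so the force can itself be differentiated again (Lagrangian/Hamiltonian setups).
    force = -torch.autograd.grad(E, x, create_graph=True)[0]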

There are also energy-expending stops, where you need to find workarounds to make it work, for example if you want your parameters to change shape during the optimization process, like learning point clouds of growing size; these cases spread you thin, so they won't be standardized.


>computing the exact gradient using calculus

First of all, gradient computation with back-prop (aka reverse-mode automatic differentiation) is exact to numerical precision (except for edge-cases that are not relevant here) so it's not about the way of computing the gradient.

What Andrej is trying to say is that when you create a model, you have freedom of design in the shape of the loss function, and that in this design what matters for learning is not so much the value of the loss function but its slopes and curvature (peaks and valleys).

The problematic case being flat valleys, surrounded by straight cliffs, (picture the grand canyon).

Advanced optimizers in deep learning like Adam are still first-order, with a diagonal approximation of the curvature, which means that in addition to the gradient, the optimizer has an independent estimate of the scale sensitivity of each parameter. So the cheap thing it can reasonably do is modulate the gradient with this scale.
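
A stripped-down sketch of that per-parameter scaling (bias correction omitted; the idea behind Adam rather than its exact published update):

    import numpy as np

    def adam_like_step(theta, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad                  # running average of the gradient (direction)
        v = b2 * v + (1 - b2) * grad**2               # running average of grad^2: diagonal scale estimate
        theta = theta - lr * m / (np.sqrt(v) + eps)   # modulate each coordinate by its own scale
        return theta, m, v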

The length of the gradient vector being often problematic, what optimizers classically did was something called "line search", which means determining the optimal step size along this direction. But the cost of doing that is usually between 10 and 100 evaluations of the cost function, which is often not worth the effort in the noisy stochastic context compared to just taking a smaller step multiple times.

Higher-order optimizers require the loss function to be twice differentiable, so non-linearities like ReLU, which are cheap to calculate, can't be used.

Lower-order global optimizers don't even require the gradient, which is useful when the energy-function landscape has lots of local minima (picture an egg box).


How do I write my custom attractor equation ?


Can measurements and calibration be done easily at home to see if your sound system is well calibrated ?

I was thinking of playing pink noise on the speaker and recording it with a cheap microphone, or displaying it with the Spectroid app, but the microphone probably has its own frequency response.

Is there an app for that? One with each phone model's microphone factory-calibrated?

Is there a way to use known facts about physics, like harmonics having a specific shape (timbre), that would help equalize frequencies with respect to each other? Or, from various microphone positions, calibrate things so that any cheap microphone can do the trick?
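
Something like this rough sketch is what I had in mind (assuming the sounddevice and scipy packages and a single speaker/mic; whatever deviates from the expected 1/f pink-noise slope is the combined speaker + room + microphone response, so the uncalibrated mic is indeed baked into the measurement):

    import numpy as np
    import sounddevice as sd
    from scipy.signal import welch

    fs, dur = 48000, 10
    # Pink-ish noise: shape white noise by 1/sqrt(f) in the frequency domain.
    spectrum = np.fft.rfft(np.random.randn(fs * dur))
    f = np.fft.rfftfreq(fs * dur, 1 / fs)
    pink = np.fft.irfft(spectrum / np.sqrt(np.maximum(f, 1.0)))
    pink = 0.3 * pink / np.abs(pink).max()

    rec = sd.playrec(pink.astype(np.float32), samplerate=fs, channels=1)
    sd.wait()

    freqs, psd = welch(rec[:, 0], fs=fs, nperseg=8192)
    # For ideal pink noise the PSD falls as 1/f; deviations from that slope are the
    # speaker + room + microphone chain, mic response included.
    response_db = 10 * np.log10(psd * freqs + 1e-20)   # flatten out the expected 1/f slope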


It's harder than that, because your ears have their own frequency response that can only be measured by your consciousness.

I've found a lot of enjoyment in equalizing as best you can with hardware, and then doing the finishing touches against your ears.

Take a frequency generator, I use a web one, and play a tone that matches your eq knob, and work your way through making sure all the tones are at the same volume. You'll have to do this a few times.

You might be surprised at how good it will sound. Especially the upper ranges which will usually be far too low if you only calibrated with mics and hardware and not your wetware.


> but the microphone probably has its own frequency response.

Spot on. When you buy a measurement microphone, you can download its specific response curve from the manufacturer.


A UMIK-1 is pretty cheap these days. Also check out REW.


Thanks, that's the same approach I was suggesting in my other comment in this thread https://news.ycombinator.com/item?id=45628245 . But I couldn't find literature specific to Bézier curves to help bridge the communication gap, and my specific knowledge of Bézier curves isn't deep enough.

I am happy to see other people use the approach I consider more natural.

It's a generic global optimization approach, and geometry is always full of pathological edge cases, so it's hard to tell if you missed any. Getting it to work in the average case is usually easy, but being sure it always works is much harder.


Your comment motivated me to also comment :) I agree: Bézier curves and b-splines have a lot of rich geometry baked in, so it makes sense to use it, especially if you can avoid second derivatives. I think I misunderstood your comment about the extra control point work to find an initial guess. It looks like we’re saying something pretty close though: you can split the curve smartly to remove the single crossing in this case, which skips all the recursion that I’m suggesting, which is probably better.

Yes, the real trouble is true optimality guarantees. I remember that there were edge cases where the above approach needed a lot of subdivision steps to succeed for general-degree curves, so it might end up worse than a more rigorously justified approach in those cases.


This is fundamentally a geometric problem and the author completely missed the geometry aspect by transforming everything into polynomials and root finding.

The naive generic way of finding distances from point P to curve C([0,1]) is a procedure quite standard for global minimization : repeat "find a local minimum on the space constrained to be better than any previous minimum"

- Find a point P0 local minimum of d(P,P0) subject to the constraint P0=C(t0) (aka P0 is on C)

- define d0 = d(P,P0)

- Bisect the curve at t0 into two curves C1 = C([0,t0]) and C2 = C([t0,1])

- Find a point P1 local minimum of d(P,P1) subject to the constraint P1=C(t1) with 0 < t1 < t0 (aka P1 is on C1) and the additional constraint d(P,P1) < d0 (note the inequality is strict so that we won't find P0 again), if it exists.

- define d1 = d(P,P1)

- Find a point P2 local minimum of d(P,P2) subject to the constraint P2=C(t2) with t0 < t2 < 1 (aka P2 is on C2) and the additional constraint d(P,P2) < d0 (note the inequality is strict so that we won't find P0 again), if it exists.

- define d2 = d(P,P2)

Here, in the case of a cubic Bézier, the curve has only one loop so you don't need to bisect any further. If the curve were higher order, like a spiral, you would need to cascade the problems with a shrinking distance constraint (i.e., looking only for points of R^2 inside the disk around P whose radius is the best distance found so far).

So the distance is the min over all local minima found = min(d0, d1, d2).

Here the local minimization problems are in R^n (and inside disks centered around P), with n the dimension of the space, and because this is numerical approximation, you can either use slack (dual) variables for the t's that express the on-curve constraints, or barrier methods to express the disk constraints once a t parametrization has been chosen.

Multivariate Newton is guaranteed to work, because distances are positive, so the Sequential Quadratic Programming subproblems are convex (scipy minimize "SLSQP"). (Whereas the author needed 5 polynomial roots, you only need to solve for 3 points, because each solve solves for two coordinates.)

A local minimum is a point which satisfies the KKT conditions.
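
A simplified sketch of one round of the procedure with scipy's SLSQP (hypothetical control points; for brevity it minimizes over t directly with bound constraints, rather than over points with on-curve constraints and slack variables as described above):

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical cubic Bézier control points and query point P.
    P0, P1, P2, P3 = map(np.array, ([0.0, 0.0], [1.0, 2.0], [3.0, 3.0], [4.0, 0.0]))
    P = np.array([2.0, 0.5])

    def C(t):
        return ((1 - t) ** 3 * P0 + 3 * (1 - t) ** 2 * t * P1
                + 3 * (1 - t) * t ** 2 * P2 + t ** 3 * P3)

    def local_min(t_lo, t_hi, d_best=np.inf):
        # One subproblem: a local minimum of d(P, C(t)) on [t_lo, t_hi], constrained
        # (strictly, up to a small margin) to beat the best distance found so far.
        obj = lambda t: float(np.sum((C(t[0]) - P) ** 2))
        cons = ([{"type": "ineq", "fun": lambda t: d_best ** 2 - 1e-12 - obj(t)}]
                if np.isfinite(d_best) else [])
        res = minimize(obj, x0=[0.5 * (t_lo + t_hi)], bounds=[(t_lo, t_hi)],
                       constraints=cons, method="SLSQP")
        return res.x[0], np.sqrt(res.fun)

    t0, d0 = local_min(0.0, 1.0)                 # first local minimum
    _, d1 = local_min(0.0, t0, d_best=d0)        # bisect at t0 and look for strictly better ones
    _, d2 = local_min(t0, 1.0, d_best=d0)
    # If a constrained subproblem is infeasible, its result is simply not better and min() ignores it.
    print("distance ≈", min(d0, d1, d2))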

This procedure is quite standard: it's for example used to find the eigenvalues of matrices iteratively, or to find all the solutions of a minimization problem.

Where this procedure shines is when you have multiple splines and you want to find the minimal distance to all of them: you can partition the space efficiently and not compute distances to the parts of curves which are too far away to have a chance of being the minimum. This scales independently of the resolution of your spline chain. (Imagine a spiral: the number of local minima you need to visit is proportional to the complexity of the scene, not to the resolution of the spiral.)

But when you are speaking about computing the whole signed distance field, you should often take the step of solving the Eikonal equation over the space instead of computing individual distances.


I'm interested in the approach you're describing but it's hard to follow a comment in the margin. Is there a paper or an implementation example somewhere?


The general technique is not recent; I was taught it in a global optimisation class at school more than 15 years ago.

Here there is a small number of local minima; the idea is to iterate over them in increasing order.

Can't remember the exact name but here is a more recent paper proposing "Sequential Gradient Descent" https://arxiv.org/abs/2011.04866 which features a similar idea.

Sequential convex programming : http://web.stanford.edu/class/ee364b/lectures/seq_notes.pdf

There is not really anything special to it; it's just standard local nonlinear minimization with constraints, Sequential Least Squares Quadratic Programming (SLSQP).

It's just about framing it as an optimization problem looking for "points" with constraints, applying the standard optimization toolbox, and recognizing which type of problem your specific problem is. You can write it as basic gradient descent if you don't care about performance.

The problem of finding the minimum of a quadratic function inside a disk is commonly known as the "Trust Region Subproblem" https://cran.r-project.org/web/packages/trust/vignettes/trus... but in this specific case of distances to a curve we are in the easy positive-definite case.


What you described in your first message seemed similar to the approach used in the degree N root solving algorithm by Cem Yuksel; splitting the curve in simpler segments, then bisect into them. I'd be happy to explore what you suggested, but I'm not mathematically literate, so I'll be honest with you; what you're saying here is complete gibberish to me, and it's very hard to follow your point. It will take me weeks to figure out your suggestion and make a call as to whether it's actually simpler or more performant than what is proposed in the article.


I have written a gist to illustrate the approach I suggest. The code runs but there may be bugs, and it doesn't use the appropriate optimizer. The purpose is to illustrate the optimisation approach.

https://gist.github.com/unrealwill/1ad0e50e8505fd191b617903b...

Point 33, "intersection between a Bézier curve and a circle", may be useful to find the feasible regions of the subproblems: https://pomax.github.io/bezierinfo/#introduction

The approach I suggest will need more work, and there are probably problematic edge cases to consider and numerical stability issues. Proper proofs have not been done. It's typically a high-work, low-reward situation.

It's mostly interesting because it highlights the link between roots and local minima, and because it respects the structure of the problem more.

To find roots, we can find a first root, then divide the polynomial by (x - root), and find a root again.

If you are not mathematically literate, it'll probably be hard to work out the details necessary to make it performant. But if you use a standard black-box optimizer with constraints, it should be able to do it in a few iterations.

You can simplify the problem by considering piecewise segments instead of splines. The extension to chains of segments is roughly the same, and the spatial acceleration structures based on branch-and-bound are easier.


The answer is adaptability.

In both cases for "Question Answering" it's about similarity search but there are two main orthogonal differences between RAG and Non-RAG :

-Knowing the question at the time of index building

-Higher order features : the ability to compare fetched documents with one another and refine the question

Non-RAG, aka a multi-layer (non-causal) transformer with infinite context, is the more generic version; it is fully differentiable, meaning you can use machine learning to learn how to Non-RAG better. Each layer of the transformer can use the previous layer to reason and refine the similarity search. (A causal transformer knows the question at the time it is fed the question, and can choose to focus its attention on different parts of the previously computed features of the provided documents, but it may benefit from some reflection tokens, or better: being given the question before being presented the documents, provided you've trained it to answer like that.)

RAG is an approximation of the generic case to make it faster and cheaper. Usually it breaks end-to-end differentiability by using external tools, which means that if you want to use machine learning to learn how to RAG better, you will need some variant of Reinforcement Learning, which is slower at learning things. RAG usually doesn't know the question at the time of index building, and documents are treated independently of each other, so no (automatic) higher-order features (embeddings are fixed).

A third usual approximation is to feed the output of RAG into Non-RAG, to hopefully get the best of both worlds. You can learn the Non-RAG-given-RAG part with machine learning (if you train it on conversations where it used RAG), but the RAG part won't improve by itself.
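
A minimal sketch of that third setup (all names here are hypothetical stand-ins: embed() and llm_generate() for a real embedding model and LLM, doc_vectors for a precomputed, fixed index, which is exactly the non-differentiable part that won't improve by itself):

    import numpy as np

    def retrieve(question_vec, doc_vectors, docs, k=3):
        # Fixed-embedding similarity search: documents scored independently of each other.
        sims = doc_vectors @ question_vec / (
            np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(question_vec) + 1e-9)
        return [docs[i] for i in np.argsort(-sims)[:k]]

    def answer(question, docs, doc_vectors, embed, llm_generate):
        context = retrieve(embed(question), doc_vectors, docs)   # RAG: non-differentiable step
        prompt = "\n".join(context) + "\nQuestion: " + question
        return llm_generate(prompt)                              # Non-RAG: the trainable part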

Non-RAG needs to learn, so it needs a big training dataset, but fortunately it can pick up question-answer pairs in an unsupervised fashion when you feed it the whole web, and you only need a small instruction-tuning and preference-optimization dataset to shape it to your needs. If performance isn't what you expect in a specific case, you can provide more specific examples and retrain the model until it gets it, and you get better performance for the case you were interested in. You can improve the best case, but it's hard to improve the worst case.

RAG gives you more control over what you feed it, but the content should be structured. You can prevent worst cases more easily, but it's hard to improve the good case.

