TLDR : 1B particles ~ 3s per iterations For examples like particle simulations, ...

dahart · 2025-11-12T14:58:12 1762959492

Might be worth starting with a baseline where there’s no collision, only advection, and assume higher than 1fps just because this gives higher particles per second but still fits in 24GB? I wouldn’t be too surprised if you can advection 100M particles at interactive rates.

GistNoesis · 2025-11-12T16:01:34 1762963294

The theoretical maximum rate for 1B particle advection (Just doing p[] += v[]dt), is 1000GB/s / 24GB = 42 iteration per second. If you only have 100M you can have 10 times more iteration.

But that's without any rendering, and non interacting particles which are extremely boring unless you like fireworks. (You can add a term like v[] += gdt for free.) And you don't need to store colors for your particles if you can compute the colors from the particle number with a function.

Rasterizing is slower, because each pixel of the image might get touched by multiple particles (which mean concurrent accesses in the GPU to the same memory address which they don't like).

Obtaining the screen coordinates is just a matrix multiply, but rendering the particles in the correct depth order requires multiple pass, atomic operations, or z-sorting. Alternatively you can slice your point clouds, by mixing them up with a peak-shaped weight function around the desired depth value, and use an order independent reduction like sum, but memory accesses are still concurrent.

For the rasterizing, you can also use the space partitioning indices of the particle to render to a part of the screen independently without concurrent access problems. That's called "tile rendering". Each tile render the subset of particles which may fall in it. (There are plenty of literature in the Gaussian Splatting community).

dahart · 2025-11-13T15:28:01 1763047681

> The theoretical maximum rate for 1B particle advection (Just doing p[] += v[]ddt), is 1000GB/s / 24GB =41.667/s 42 iteration per second.

Just to clarify, the 24GB comes from multiplying 1B particles by 24 bytes? Why 24 bytes? If we used float3 particle positions, the rate would presumably be mem_bandwidth / particle_footprint. If we use a 5090, then the rate would be 1790GB/s / 12B = 146B particles / second (or 146fps of 1B particles).

> non interacting particles which are extremely boring

You assumed particle-particle collision above, which is expensive and might be over-kill. The top comment asked simply about the maximum rate of moving particles. Since interesting things take time & space, the correct accurate answer to that question is likely to be less interesting than trading away some time to get the features you proposed; your first answer is definitely interesting, but didn’t quite answer the question asked, right?

Anyway, I’m talking about other possibilities, for example interaction with a field, or collision against large objects. Those are still physically interesting, and when you have a field or large objects (as long as they’re significantly smaller footprint than the particle data) they can be engineered to have high cache coherency, and thus not count significantly against your bandwidth budget. You can get significantly more interesting than pure advection for a small fraction of the cost of particle-particle collisions.

Yes if you need rendering, that will take time out of your budget, true and good point. Getting into the billions of primitives is where ray tracing can sometimes pay off over raster. The BVH update is a O(N) algorithm that replaces the O(N) raster algorithm, but the BVH update is simpler than the rasterization process you described, and BVH update doesn’t have the scatter problem (write to multiple pixels) that you mentioned, it’s write once. BVH update on clustered triangles can now be done at pretty close to memory bandwidth. Particles aren’t quite as fast yet, AFAIK, but we might get there soon.