If memory is exposed in CUDA more like allocators than pre-allocated static memory, look into using the alloc crate with custom allocators that hand out global or shared memory.
No. This is a fundamental misunderstanding of __shared__ memory and how kernels work.
__shared__ memory is a declaration, much like .bss in an ELF file: its size must be known at compile time, well before runtime. When a kernel is launched, the GPU reserves that much __shared__ memory for it.
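A minimal sketch of what that looks like in CUDA (the kernel and tile size here are made up for illustration). The array size must be a compile-time constant, and the compiler records the kernel's __shared__ footprint in the binary:

```cuda
#include <cstdio>

// Sketch: statically-sized __shared__ allocation. 256 is a compile-time
// constant; the GPU reserves 256 * sizeof(float) per block at launch.
__global__ void tileSum(const float* in, float* out) {
    __shared__ float tile[256];            // size baked in at compile time

    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                       // make all loads visible block-wide

    if (tid == 0) {
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) sum += tile[i];
        out[blockIdx.x] = sum;
    }
}
```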
----
Each SM has a fixed pool of __shared__ memory, commonly 64kB. Different kernels may use different chunks of that 64kB for themselves. For example, if Kernel Foo uses 30kB of __shared__ memory and is currently running, and Kernel Bar only uses 20kB, then 30kB + 20kB = 50kB fits in the 64kB pool, so the GPU will allow Foo + Bar to run simultaneously on the same SM.
The fundamental "unit of compute" is the OpenCL workgroup / CUDA block. The GPU will run many, many kernels (even different kernels) _ON THE SAME CORE_, as long as there are enough registers and enough __shared__ memory available for them.
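You can actually ask the CUDA runtime how many blocks of a kernel fit per SM, given its register and __shared__ footprint. A sketch (the kernel, block size, and 30kB figure are made up to mirror the Foo example above):

```cuda
#include <cstdio>

// Hypothetical kernel with a ~30kB static __shared__ footprint.
__global__ void foo(float* x) {
    __shared__ float buf[30 * 1024 / 4];   // 7680 floats = 30kB
    buf[threadIdx.x] = x[threadIdx.x];
    __syncthreads();
    x[threadIdx.x] = buf[threadIdx.x];
}

int main() {
    int blocksPerSM = 0;
    // 128 threads per block, no dynamic shared memory. The runtime does
    // the same per-SM resource accounting the scheduler does.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, foo, 128, 0);
    printf("blocks of foo per SM: %d\n", blocksPerSM);
    return 0;
}
```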
-----
"Thread local" variables compile into registers in practice. GPUs have ~256 registers per core, so if Kernel Foo uses 40 registers, and Kernel Bar uses 70, the GPU SMs (Symmetric Multiprocessors, what you'd call a "core" in the CPU world) could run 4 copies of Foo + 1 copy of Bar.
This is because Foo#1 will eventually hit a memory-latency stall (VRAM reads/writes have well over 100 cycles of latency, maybe 300+ cycles on older GPUs). Instead of waiting for that memory operation to complete, the GPU switches to another workgroup (like Foo#2, Foo#3, or Bar#1) to keep the GPU cores utilized.
__shared__ memory works kinda like these registers: it's divvied out at runtime, with the OpenCL workgroup / CUDA block as the unit of allocation.
------
So __shared__ memory has to be preallocated under this model. It's an important per-unit resource, tracked by the GPU at the lowest level so that multiple kernels can run concurrently on the same cores. (GPUs are like SMT or Hyperthreading: capable of running 8+ kernels per core, as long as you have enough registers / __shared__ memory to launch them all.)
This could still work if Shared::new(...) were a compile-time function, or a language extension that looked like a compile-time function.
EDIT: You wouldn't be allowed to have Shared::new inside a loop or a recursive function, though. But as long as you had assurances that any such Shared::new call ran exactly once throughout the code, it might work?
Or maybe that's too ugly. "static" probably captures the idea better
Yeah, I'm not 100% sure what to say in English though, lol.
There's compile time and runtime. But there's also kernel-launch time? Dynamic shared memory is sized before kernel launch: during CPU runtime, but before GPU runtime.
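That's exactly how CUDA's dynamic shared memory works, for what it's worth. A sketch (kernel and sizes are illustrative): the size isn't in the binary; it's passed as the third <<<...>>> launch parameter at kernel-launch time:

```cuda
// Sketch: dynamically-sized __shared__ memory. The unsized extern
// declaration compiles fine; the actual byte count is supplied at launch.
extern __shared__ float dynBuf[];          // size decided at launch time

__global__ void scale(float* data, float s) {
    dynBuf[threadIdx.x] = data[threadIdx.x] * s;
    __syncthreads();
    data[threadIdx.x] = dynBuf[threadIdx.x];
}

// Host side, with a block size chosen at CPU runtime:
//   int threads = pickBlockSize();   // hypothetical runtime decision
//   scale<<<1, threads, threads * sizeof(float)>>>(d_data, 2.0f);
```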
-------
Things get crazier when you see OpenCL paradigms like... #define constants during CPU runtime, invoke the OpenCL compiler (under the assumption that the compiler will optimize the constant directly into the code), and then kernel-launch.
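CUDA has the same paradigm via NVRTC (runtime compilation). A sketch, with the kernel source and TILE value made up: TILE gets "#define'd" at CPU runtime through a -D compiler flag, so the compiler folds it into the code as a true constant before the kernel is ever launched:

```cuda
#include <nvrtc.h>
#include <string>

// Sketch: compile a kernel at CPU runtime with a runtime-chosen TILE.
nvrtcProgram buildWithTile(int tile) {
    const char* src =
        "extern \"C\" __global__ void k(float* x) {"
        "  __shared__ float buf[TILE];"             // TILE comes from -D below
        "  buf[threadIdx.x] = x[threadIdx.x];"
        "  __syncthreads();"
        "  x[threadIdx.x] = buf[TILE - 1 - threadIdx.x];"
        "}";
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "k.cu", 0, nullptr, nullptr);

    std::string def = "-DTILE=" + std::to_string(tile); // decided at runtime
    const char* opts[] = { def.c_str() };
    nvrtcCompileProgram(prog, 1, opts);    // then fetch PTX, load, and launch
    return prog;
}
```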