If memory is exposed in CUDA more like allocators than pre-allocated static memory, look into using the alloc crate with custom allocators that hand out global or shared memory.
No. This is a fundamental misunderstanding of __shared__ memory and how kernels work.
__shared__ memory is a declaration, much like .bss in an ELF file: its size must be known at compile time, well before runtime. When a kernel is launched, the GPU reserves that much __shared__ memory for it.
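A minimal sketch of what that looks like in CUDA (the kernel and tile size here are made up for illustration). The array size must be a compile-time constant, and the compiler records the kernel's __shared__ footprint in the binary:

```cuda
#include <cstdio>

// Sketch: statically-sized __shared__ allocation. 256 is a compile-time
// constant; the GPU reserves 256 * sizeof(float) per block at launch.
__global__ void tileSum(const float* in, float* out) {
    __shared__ float tile[256];            // size baked in at compile time

    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                       // make all loads visible block-wide

    if (tid == 0) {
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) sum += tile[i];
        out[blockIdx.x] = sum;
    }
}
```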
----
Each SM has a fixed pool of __shared__ memory, commonly 64kB. Different kernels may use different chunks of that 64kB for themselves. For example, if Kernel Foo uses 30kB of __shared__ memory and is currently running, and Kernel Bar only uses 20kB, then 30kB + 20kB = 50kB fits in the 64kB pool, so the GPU will allow Foo + Bar to run simultaneously on the same SM.
The fundamental "unit of compute" is the OpenCL workgroup / CUDA block. The GPU will run many, many kernels (even different kernels) _ON THE SAME CORE_, as long as there are enough registers and enough __shared__ memory available for them.
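You can actually ask the CUDA runtime how many blocks of a kernel fit per SM, given its register and __shared__ footprint. A sketch (the kernel, block size, and 30kB figure are made up to mirror the Foo example above):

```cuda
#include <cstdio>

// Hypothetical kernel with a ~30kB static __shared__ footprint.
__global__ void foo(float* x) {
    __shared__ float buf[30 * 1024 / 4];   // 7680 floats = 30kB
    buf[threadIdx.x] = x[threadIdx.x];
    __syncthreads();
    x[threadIdx.x] = buf[threadIdx.x];
}

int main() {
    int blocksPerSM = 0;
    // 128 threads per block, no dynamic shared memory. The runtime does
    // the same per-SM resource accounting the scheduler does.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, foo, 128, 0);
    printf("blocks of foo per SM: %d\n", blocksPerSM);
    return 0;
}
```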
-----
"Thread local" variables compile into registers in practice. GPUs have ~256 registers per core, so if Kernel Foo uses 40 registers, and Kernel Bar uses 70, the GPU SMs (Symmetric Multiprocessors, what you'd call a "core" in the CPU world) could run 4 copies of Foo + 1 copy of Bar.
This is because Foo#1 will eventually hit a memory-latency stall (VRAM reads/writes have well over 100 cycles of latency, maybe 300+ cycles on older GPUs). Instead of waiting for that memory operation to complete, the GPU switches to another workgroup (like Foo#2, Foo#3, or Bar#1) to keep the GPU cores utilized.
__shared__ memory works kinda like these registers: it's divvied out at runtime, with the OpenCL workgroup / CUDA block as the unit of allocation.
------
So __shared__ memory has to be preallocated under this model. It's an important per-unit resource, tracked by the GPU at the lowest level so that multiple kernels can run concurrently on the same cores. (GPUs are like SMT or Hyperthreading: capable of running 8+ kernels per core, as long as you have enough registers / __shared__ memory to launch them all.)
This could still work if Shared::new(...) were a compile-time function, or a language extension that looked like a compile-time function.
EDIT: You wouldn't be allowed to have Shared::new inside a loop or a recursive function, though. But as long as you had assurances that any such Shared::new call ran exactly once throughout the code, it might work?
Or maybe that's too ugly. "static" probably captures the idea better
Yeah, I'm not 100% sure what to say in English though, lol.
There's compile time and runtime. But there's also kernel-launch time? Dynamic shared memory is sized before kernel launch: during CPU runtime, but before GPU runtime.
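That's exactly how CUDA's dynamic shared memory works, for what it's worth. A sketch (kernel and sizes are illustrative): the size isn't in the binary; it's passed as the third <<<...>>> launch parameter at kernel-launch time:

```cuda
// Sketch: dynamically-sized __shared__ memory. The unsized extern
// declaration compiles fine; the actual byte count is supplied at launch.
extern __shared__ float dynBuf[];          // size decided at launch time

__global__ void scale(float* data, float s) {
    dynBuf[threadIdx.x] = data[threadIdx.x] * s;
    __syncthreads();
    data[threadIdx.x] = dynBuf[threadIdx.x];
}

// Host side, with a block size chosen at CPU runtime:
//   int threads = pickBlockSize();   // hypothetical runtime decision
//   scale<<<1, threads, threads * sizeof(float)>>>(d_data, 2.0f);
```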
-------
Things get crazier when you see OpenCL paradigms like... #define constants during CPU runtime, invoke the OpenCL compiler (under the assumption that the compiler will optimize the constant directly into the code), and then kernel-launch.
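CUDA has the same paradigm via NVRTC (runtime compilation). A sketch, with the kernel source and TILE value made up: TILE gets "#define'd" at CPU runtime through a -D compiler flag, so the compiler folds it into the code as a true constant before the kernel is ever launched:

```cuda
#include <nvrtc.h>
#include <string>

// Sketch: compile a kernel at CPU runtime with a runtime-chosen TILE.
nvrtcProgram buildWithTile(int tile) {
    const char* src =
        "extern \"C\" __global__ void k(float* x) {"
        "  __shared__ float buf[TILE];"             // TILE comes from -D below
        "  buf[threadIdx.x] = x[threadIdx.x];"
        "  __syncthreads();"
        "  x[threadIdx.x] = buf[TILE - 1 - threadIdx.x];"
        "}";
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "k.cu", 0, nullptr, nullptr);

    std::string def = "-DTILE=" + std::to_string(tile); // decided at runtime
    const char* opts[] = { def.c_str() };
    nvrtcCompileProgram(prog, 1, opts);    // then fetch PTX, load, and launch
    return prog;
}
```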