As I said, these features are often not needed. You can implement, say, a neural network library without atomic operations.
> How do you plan to do it, if not through __shared__ memory?
Can't you replace __shared__ memory with global memory plus workgroup barriers? It might be slower, but good caching should make it comparable, which should be the case for prefix sum (you read right after writing, so you should get a good cache-hit rate).
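To make the global-memory-plus-barriers idea concrete, here is a CPU sketch (plain Python, illustrative names, not any real GPU API) of a Hillis-Steele inclusive scan. Each loop iteration is one parallel step over a "global" buffer, and the double-buffering stands in for the barrier between steps; no per-block __shared__ staging is involved.

```python
# CPU sketch of a Hillis-Steele inclusive prefix sum, structured the way the
# GPU version would be: each step is a full parallel pass over global memory,
# with a barrier between steps. Double-buffering (src -> dst) plays the role
# of the barrier that separates the read phase from the write phase.
def hillis_steele_scan(data):
    n = len(data)
    src = list(data)  # "global memory" buffer
    stride = 1
    while stride < n:
        # All "threads" read src, then write dst; on a GPU a workgroup
        # barrier would sit between these two phases.
        dst = [src[i] + src[i - stride] if i >= stride else src[i]
               for i in range(n)]
        src = dst
        stride *= 2
    return src
```

Since each element is read again one step after it was written, the access pattern is exactly the read-after-write locality that should cache well.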
A global parallel hash table is probably the fastest way to implement collision detection. It's pretty fundamental to manipulating 3D space, be it graphics, physics, or other such simulations.

Which, of course, all run _GREAT_ on GPUs.
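For a rough idea of what such a table looks like, here is a sequential Python sketch of spatial hashing for broad-phase collision detection (all names, constants, and the hash primes are illustrative): each object's grid cell is hashed into a global open-addressed table, so objects sharing a cell become collision candidates. On a GPU the compare-and-set marked below would be an `atomicCAS`, letting thousands of threads insert concurrently without locks.

```python
# Sequential sketch of a GPU-style open-addressed spatial hash table.
# Illustrative assumptions: fixed table size, linear probing, int cell keys.
TABLE_SIZE = 1024
EMPTY = -1

def cell_hash(x, y, z):
    # Classic 3D spatial hash: XOR of cell coordinates scaled by large primes.
    return ((x * 73856093) ^ (y * 19349663) ^ (z * 83492791)) % TABLE_SIZE

def insert(keys, vals, cell, obj_id):
    slot = cell
    while True:
        # On a GPU this check-and-claim would be atomicCAS(&keys[slot], EMPTY, cell).
        if keys[slot] == EMPTY:
            keys[slot] = cell
            vals[slot] = obj_id
            return slot
        slot = (slot + 1) % TABLE_SIZE  # linear probing on collision

def objects_in_cell(keys, vals, cell):
    """Gather every object stored under this cell: the collision candidates."""
    slot, found = cell, []
    while keys[slot] != EMPTY:
        if keys[slot] == cell:
            found.append(vals[slot])
        slot = (slot + 1) % TABLE_SIZE
    return found
```

The atomic CAS is the only synchronization the insert phase needs, which is why this structure maps so well onto GPU atomics.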
--------------
I've written "Inner Join" on a GPU for fun. Yes, the SQL operator. It's pretty fast. Databases probably could run on GPUs and parallelize easily, but any database would need globally consistent reads/writes. Sure, GPUs have less RAM than a CPU, but GPU RAM is much faster, so that might actually be a net benefit if your data is between 200MB and 4GB in size.
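For context, the standard way to parallelize an inner join is a hash join: build a hash table over the smaller relation, then probe it from the other one. Both phases parallelize per row (the build needs atomic inserts, the probe is read-only). A minimal Python sketch, with illustrative names, not the actual GPU implementation:

```python
# Hash-join sketch: build phase over the smaller ("left") relation,
# probe phase over the other. On a GPU, one thread per row in each phase;
# the build inserts would use atomics, the probe needs no writes at all.
from collections import defaultdict

def hash_inner_join(left, right):
    """left/right are lists of (key, payload) tuples; returns matched rows."""
    buckets = defaultdict(list)
    for key, payload in left:          # build phase: one "thread" per row
        buckets[key].append(payload)
    return [(key, l, r)                # probe phase: one "thread" per row
            for key, r in right
            for l in buckets.get(key, [])]
```

The probe phase touching the table read-only is what makes this embarrassingly parallel once the build is done.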
Use your imagination. Anywhere you'd use an atomic on a CPU is where you might use an atomic on a GPU.