> Since threads are processed in parallel, tiny-gpu assumes that all threads "converge" to the same program counter after each instruction - which is a naive assumption for the sake of simplicity.
> In real GPUs, individual threads can branch to different PCs, causing branch divergence where a group of threads initially being processed together has to split out into separate execution.
Whoops. Maybe this person should try programming for a GPU before attempting to build one out of silicon.
Not to mention the whole SIMD that... isn't.
(This is the same person who stapled together other people's circuits to blink an LED and claimed to have built a CPU)
No, that effectively syncs all warps in a thread group. This implementation isn't doing any synchronization: it does PC/decode independently per thread and just assumes the threads won't diverge. That's... a baffling combination of decisions; why give each thread its own PC and decode if they're never going to diverge? It reads as a basic failure to understand the fundamental value of a GPU. And this isn't secret GPU architecture lore. Here's a slide deck from 2009 going over the actual high-level architecture of a GPU. Notice how fetch/decode are shared between threads.
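For anyone who hasn't run into it: divergence isn't exotic, a single data- or lane-dependent branch in a kernel triggers it. A minimal sketch in CUDA (kernel name and the even/odd split are mine, purely illustrative, nothing to do with tiny-gpu):

```cuda
// Minimal example of warp divergence.
__global__ void divergent_kernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Adjacent lanes of the same warp take different paths here. Because the
    // warp shares one fetch/decode unit, the hardware serializes the two
    // paths and masks off the inactive lanes; it cannot just assume every
    // thread "converges" to the same PC after each instruction.
    if (threadIdx.x % 2 == 0) {
        out[i] = i * 2;   // executed first, odd lanes masked off
    } else {
        out[i] = i + 1;   // executed second, even lanes masked off
    }
}
```

That masking/serialization is exactly the case the shared fetch/decode in those slides has to handle, and exactly the case tiny-gpu waves away.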