Unviable. GPUs shine at what they do because they hide memory latency. They do that by keeping N threads in flight for M compute units, with N > M by a sufficiently large factor (for instance N = 5M, or more like 10M ...). "In flight" means all state, including all register values, is kept inside the GPU, usually in its on-die register files and caches.
So at any point in time, among the N threads, there are M that are ready to execute (i.e. not waiting on memory), and those do execute. Any time a thread needs to access memory, like reading a texture, the memory access is scheduled and the thread is put to sleep, but its state stays in the GPU. Overall, there are always M threads not waiting. This is why, if you do performance analysis on a GPU, in some basic cases memory accesses look like they are free (i.e. take no time).
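A toy way to see this: the sketch below simulates M compute units shared by N resident threads, where each thread computes for one cycle and then waits on a memory access of fixed latency. All names and the latency figure are made up for illustration; switching between threads is modeled as free, which is the point of keeping their state on die.

```python
# Toy discrete-time simulation of GPU-style latency hiding (illustrative only).
# Assumptions: M compute units, N resident threads; each thread alternates
# between 1 cycle of compute and a memory access taking MEM_LATENCY cycles.
# Thread state stays "on die" (in the list below), so switching costs nothing.

def simulate(n_threads, m_units, mem_latency=100, cycles=10_000):
    wake = [0] * n_threads  # wake[i] = cycle at which thread i is ready again
    busy = 0                # compute-unit-cycles actually used
    for t in range(cycles):
        ready = [i for i in range(n_threads) if wake[i] <= t]
        for i in ready[:m_units]:          # at most M threads execute per cycle
            busy += 1
            wake[i] = t + 1 + mem_latency  # 1 cycle of compute, then wait on memory
    return busy / (cycles * m_units)       # utilization of the compute units

# With N = M the units sit idle waiting on memory; with N >> M the latency
# is hidden and utilization approaches 100%.
print(f"N=4,   M=4: {simulate(4, 4):.0%}")
print(f"N=400, M=4: {simulate(400, 4):.0%}")
```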
On the CPU, when you context switch between threads, the kernel saves the CPU state (all register values and ancillary state) to main memory. That means switching threads on a CPU core means writing the current thread's context to memory and reading the next thread's context from memory. So it even worsens the memory latency issue.
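As a rough back-of-envelope (my numbers, not from the comment above): just the architectural register file on a modern x86-64 core with wide vector registers is on the order of a couple of kilobytes, all of which has to be written out and the next thread's copy read back in.

```python
# Back-of-envelope size of the register state a kernel saves/restores per
# context switch on a hypothetical x86-64 core with AVX-512. These are
# architectural register sizes only; cache and TLB effects come on top.
gprs = 16 * 8    # 16 general-purpose registers, 8 bytes each
zmm  = 32 * 64   # 32 ZMM vector registers, 64 bytes each
misc = 3 * 8     # rip, rflags, and a few control fields (rough guess)

total = gprs + zmm + misc
print(total)  # on the order of 2 KB written out, plus the same read back
```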
If you design a CPU capable of holding many thread contexts in silicon (on-die memory), you might get closer to a GPU. But you do not get much from doing so for CPU workloads. Also, at which point is your design more a GPU than a CPU?
> If you design a CPU capable of holding many thread contexts in silicon (on-die memory), you might get closer to a GPU.
From a pure programming model POV, this is just SMT, which RISC-V supports quite handily: the native "core"-like abstraction is specifically called a 'hardware thread', "hart" for short. Now, clearly GPGPU adds some features that are not encompassed by this model, such as various sorts of "scratchpads"/"memories", often with restricted addressing. But the general feature is accounted for.