The GPU itself can send PCIe 4.0 messages out. So why not have the GPU make I/O requests on behalf of itself? It's a bit obscure, but this feature has been around for a number of years now. The idea is to remove the CPU and DDR4 from the loop entirely, because those just bottleneck and slow down the GPU.
--------
From an absolute performance perspective, it seems good. But CPUs are really good at accessing I/O in efficient, standardized ways. I'm personally of the opinion that blocking and/or event-driven I/O from the CPU (with the full benefit of threads and OS-level concepts) would be easier to think about than high-performance GPU code.
But still, it's a neat concept, and it seems like there's a big demand for it (see PS5 / Xbox Series X).
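To illustrate the point about CPU-side event-driven I/O being easy to reason about: here is a minimal, hypothetical sketch using Python's standard `selectors` module (not tied to any GPU I/O API). The OS notifies us when a descriptor is ready, and plain sequential control flow handles the event.

```python
import selectors
import socket

def echo_once():
    # A socketpair stands in for any readable descriptor (file, pipe, NIC).
    sel = selectors.DefaultSelector()
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)
    sel.register(b, selectors.EVENT_READ)

    a.send(b"hello")                  # make data available on b
    events = sel.select(timeout=1.0)  # block until the OS says b is readable
    (key, _mask), = events
    data = key.fileobj.recv(1024)     # ordinary read, ordinary control flow

    sel.close()
    a.close()
    b.close()
    return data

print(echo_once())
```

The contrast with GPU-driven I/O is that here the scheduling, readiness notification, and error handling all come from the OS for free; a GPU kernel issuing its own requests has to reimplement those pieces.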
The CPU is still acting as the PCIe controller though (right?), which kind of makes the CPU act like a network switch. PCIe is a point-to-point protocol, kind of like Ethernet. Old-school PCI was a shared bus, so devices might have been able to talk to each other directly, but I don't think that was ever actually used.
As you can see, the GPU is attached to the x16 slot, and the 4x NVMe SSDs are attached to the GPU. When the CPU wants to store data on the SSDs, it communicates first with the GPU, which then passes the data through to the four SSDs.
Nvidia's GPUs would command the PCIe switch to fetch data directly, without the switch sending the data to the CPU (where it would most likely land in DDR4, or maybe L3 in an optimized situation).
My understanding matches yours, but it's worth noting that (IIUC) memory and PCIe are (last time I checked?) a separate I/O subsystem that just happens to reside within the same package as the CPU on modern chips. So P2PDMA avoids burning CPU cycles and RAM bandwidth shuffling data around that you never wanted to use on the CPU anyway. (Also see: https://lwn.net/Articles/767281/)
https://www.nvidia.com/en-us/geforce/news/rtx-io-gpu-acceler...
https://www.amd.com/en/products/professional-graphics/radeon...