I'm surprised the accelerators-of-yore trick actually worked. Is balancing a trio really only trivially harder than a duo? I enjoy the idea of having tons of VRAM and system RAM, loading a big model, and getting responses a few times per hour, as long as it's high quality.
Yeah, I was equally surprised. I'm using a patched version of ollama to run the models: https://github.com/austinksmith/ollama37 which has a trivial change that allows it to run on old CUDA compute capabilities (3.5, 3.7). Obviously this was before tensor cores were a thing, so you're not going to be blown away by the performance, but it was cheap: I got 3x K40s for $75 on eBay. They are passively cooled, so they do need to be in a server chassis.
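For anyone wanting to try this, here's a rough sketch of how I'd expect the setup to go. This assumes the fork builds the same way as upstream ollama (Go toolchain plus an installed CUDA toolkit) and the model name is just an example, not a recommendation:

```shell
# Sketch only: assumes the ollama37 fork follows upstream ollama's
# standard source build (go generate + go build) and that Go and the
# CUDA toolkit with Kepler support are already installed.
git clone https://github.com/austinksmith/ollama37
cd ollama37
go generate ./...
go build .

# Start the server, then pull/run a model from another terminal.
# The model tag below is a hypothetical example; pick one that fits
# your combined VRAM.
./ollama serve &
./ollama run llama3.1:70b
```

One thing worth checking up front is your GPUs' compute capability (recent drivers expose it via `nvidia-smi --query-gpu=name,compute_cap --format=csv`); the K40 reports 3.5, which is exactly what the patch re-enables.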