
Does anyone have any tips on how to spin up services that can efficiently perform inference with the HuggingFace weights of models like this?

I would love to switch to something like this over OpenAI's GPT-3.5 Turbo, but so far this weekend I've been struggling to get reasonable inference speed on reasonably priced machines.
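
For reference, a minimal sketch of what loading the weights with the transformers library looks like. The model ID is a placeholder, device_map="auto" needs the accelerate package installed, and fp16 roughly halves memory versus fp32:

    # Minimal sketch: serving HuggingFace causal-LM weights locally.
    # The model ID below is a placeholder; substitute the real checkpoint.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "your-org/your-model"  # hypothetical placeholder

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # half precision: ~2 bytes per parameter
        device_map="auto",          # spread layers across available GPU/CPU memory
    )

    prompt = "Explain the difference between a list and a tuple in Python."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Even with this, throughput on cheap hardware tends to be the bottleneck, hence the question.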



Have you tried llama.cpp? It doesn't need a GPU, so it's generally cheaper to run, and inference speed is decent (roughly 1-10 tokens per second, depending on model size and hardware). Not sure if it's been set up to work with the Open Assistant models yet, but it should be soon given how fast things are moving.
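
If you'd rather drive it from Python than the CLI, a rough sketch using the llama-cpp-python bindings (pip install llama-cpp-python). The model path is a placeholder; you first have to convert and quantize the weights with the scripts that ship with llama.cpp:

    # Rough sketch: CPU inference via llama-cpp-python bindings.
    # model_path is a hypothetical placeholder pointing at a quantized model file.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/model-q4_0.bin")

    # The call returns an OpenAI-style completion dict.
    result = llm(
        "Q: What is the capital of France? A:",
        max_tokens=32,
        stop=["Q:"],  # stop before the model invents the next question
    )
    print(result["choices"][0]["text"])

The 4-bit quantization is what makes it fit in ordinary RAM; that's the main trade-off versus running the full-precision HuggingFace weights on a GPU.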



