
Does anyone have any tips on how to spin up services that can efficiently perform inference with the HuggingFace weights of models like this?

I would love to switch to something like this over OpenAI's GPT-3.5 Turbo, but so far this weekend I've been struggling to get reasonable inference speed on reasonably priced machines.
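
For reference, a minimal sketch of what loading the weights with the transformers library looks like. The model ID is a placeholder, device_map="auto" needs the accelerate package installed, and fp16 roughly halves memory versus fp32:

    # Minimal sketch: serving HuggingFace causal-LM weights locally.
    # The model ID below is a placeholder; substitute the real checkpoint.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "your-org/your-model"  # hypothetical placeholder

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # half precision: ~2 bytes per parameter
        device_map="auto",          # spread layers across available GPU/CPU memory
    )

    prompt = "Explain the difference between a list and a tuple in Python."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Even with this, throughput on cheap hardware tends to be the bottleneck, hence the question.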



Have you tried llama.cpp? It doesn't need a GPU, so it's generally cheaper to run, and inference speed is decent (roughly 1-10 tokens per second, depending on model size and hardware). Not sure if it's been set up to work with the Open Assistant models yet, but it should be soon given how fast things are moving.
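
If you'd rather drive it from Python than the CLI, a rough sketch using the llama-cpp-python bindings (pip install llama-cpp-python). The model path is a placeholder; you first have to convert and quantize the weights with the scripts that ship with llama.cpp:

    # Rough sketch: CPU inference via llama-cpp-python bindings.
    # model_path is a hypothetical placeholder pointing at a quantized model file.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/model-q4_0.bin")

    # The call returns an OpenAI-style completion dict.
    result = llm(
        "Q: What is the capital of France? A:",
        max_tokens=32,
        stop=["Q:"],  # stop before the model invents the next question
    )
    print(result["choices"][0]["text"])

The 4-bit quantization is what makes it fit in ordinary RAM; that's the main trade-off versus running the full-precision HuggingFace weights on a GPU.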



