The gold standard for local-only inference with LLaMA, Alpaca, and friends is llama.cpp: https://github.com/ggerganov/llama.cpp No dependencies, no GPU needed; just point it at a model snapshot that you download separately, e.g. over BitTorrent. Simple CLI tools that are (somewhat) usable from shell scripts.
There's also oobabooga's text-generation-webui, which includes the llama.cpp backend, and a lot more.
Unfortunately, despite billing itself as the "Automatic1111" of text generation, it doesn't support any of the prompt engineering capabilities available in Automatic1111 (e.g. negative prompts, prompt weights, prompt blending), even though they're not difficult to implement: https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...
Luckily for Ooga Booga, no one else supports these features either. Why this is, I can't explain, except that the NLP community doesn't know jack about prompt engineering, which is Kafkaesque.
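For the curious, here's a minimal sketch of how two of those features can work at the next-token-logit level, assuming you can run the model twice per step (once on the positive prompt, once on the negative/unconditional prompt). The function names and the logit lists are illustrative, not any library's actual API; negative prompting is shown as classifier-free guidance over logits, which is one common approach:

```python
def cfg_logits(cond, uncond, scale=1.5):
    """Negative prompting via classifier-free guidance:
    push the next-token logits toward the positive prompt (cond)
    and away from the negative prompt (uncond).
    scale=1.0 reduces to plain conditional sampling."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def blend_logits(logit_sets, weights):
    """Prompt blending / weighting: a normalized weighted mix of the
    next-token logits produced by several different prompts."""
    total = float(sum(weights))
    return [
        sum((w / total) * logits[i] for w, logits in zip(weights, logit_sets))
        for i in range(len(logit_sets[0]))
    ]
```

You'd apply one of these to the raw logits each decoding step, before softmax/sampling; nothing about it is specific to diffusion models, which is the commenter's point.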
Llama 2's 7B and 13B models use the same architecture as the original LLaMA, so llama.cpp already supports them. (Source: running 13b-chat myself with llama.cpp GPU offload.) It's only 70B that has additional extensions llama.cpp would need to implement.
Llama 2 7B and 13B GGML models are up and work with existing llama.cpp, no changes needed! The 70B does require a llama.cpp change, but I'm sure it won't take long.
Hoping they add support for Llama 2 soon!