
Yay fun to see it make its way to HN :) It turns out that my original checkpoint runs _way_ faster than I expected (100 tok/s) on a MacBook Air M1 with -O3 when compiling, so I am now training a bigger 44M model, which should still run interactively. Maybe the 7B Llama model is within reach... :thinking_emoji:


I did use a tweaked nanoGPT to pretrain a 12M model on TinyStories (2 GB of text produced by GPT-4), and the results are pretty amazing. I then adapted it a bit on Wikipedia, and it looks like a solid bullshit generator, much smarter than any smoothed n-gram model and significantly smaller. My bet is that small LLMs will be predominant in multiple areas. My next goal is to reduce the 7B Llama 2 to 10-100M without making it much dumber.


I also trained nanoGPT on TinyStories and produced about a 32M model. The results are amazing, especially considering I opted for a character-level model similar to the toy dataset in the repo. I'm writing about the experience while also doing a deep dive into the code on Medium (username oaguy1). Smaller LLMs are definitely worth considering with the right quality training data. I also recently tweaked the Standardized Project Gutenberg Corpus (~11GB) to be more modern; once I finish playing with TinyStories, I want to see what I can do with it using nanoGPT and then maybe Hugging Face's libraries.


>My next goal is to reduce 7B llama2 to 10-100M without making it much dumber.

That is going to be hard as the 7B model was trained on 2T tokens. Maybe if you heavily restrict the range in which the model should operate.


1. It’s faster and cheaper to train a smaller model

2. Better than training on tokens is training on probability distributions (distillation), or on trees of probability distributions (see the sketch below)
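
To make point 2 concrete, here is a minimal distillation sketch; this is my own illustration rather than anything from a specific paper, and `teacher`, `student`, and the temperature `T` are placeholders. The student is trained to match the teacher's full next-token probability distribution instead of a single hard token.

    # Minimal knowledge-distillation loss sketch (PyTorch).
    # Assumes student_logits and teacher_logits have shape (batch, seq, vocab).
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # Flatten (batch, seq, vocab) -> (batch*seq, vocab) so every position
        # is a separate distribution to match.
        vocab = student_logits.size(-1)
        s = student_logits.reshape(-1, vocab)
        t = teacher_logits.reshape(-1, vocab)
        teacher_probs = F.softmax(t / T, dim=-1)          # soft targets
        student_log_probs = F.log_softmax(s / T, dim=-1)
        # KL(teacher || student), averaged over positions; the T^2 factor is the
        # standard correction so gradients stay comparable across temperatures.
        return F.kl_div(student_log_probs, teacher_probs,
                        reduction="batchmean") * (T * T)

    # Usage sketch inside a training loop:
    #   with torch.no_grad():
    #       teacher_logits = teacher(input_ids)
    #   student_logits = student(input_ids)
    #   loss = distillation_loss(student_logits, teacher_logits)
    #   loss.backward(); optimizer.step()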


I've never seen anything about training on probability distributions or trees of them. Do you have articles with examples you could share with us?

I did try a quick search and found some interesting papers; links below in case anyone wants to take a look.

https://arxiv.org/abs/2212.11481

https://towardsdatascience.com/a-new-way-to-predict-probabil...

https://arxiv.org/pdf/1912.07913.pdf

https://dukespace.lib.duke.edu/dspace/bitstream/handle/10161...


Would love to read more about your time with nanoGPT. I've been getting familiar with it myself lately, and the output is still pretty much gibberish at 16M, but the dataset is admittedly trash right now as well.


How do you adapt it on Wikipedia? Do you just add it to the dataset and continue training?


Your work is an inspiration as always!! My n00b question is: what do you think is currently the most practical path to running a reasonably-sized (doesn't have to be the biggest) LLM on a commodity linux server for hooking up to a hobby web app ... i.e., one without a fancy GPU. (Renting instances with GPUs on, say, Linode, is significantly more expensive than standard servers that host web apps.) Is this totally out of reach, or are approaches like yours (or others you know of) a feasible path forward?


I would use textsynth (https://bellard.org/ts_server/) or llama.cpp (https://github.com/ggerganov/llama.cpp) if you're running on CPU.

- I wouldn't use anything higher than a 7B model if you want decent speed.
- Quantize to 4-bit to save RAM and run inference faster.

Speed will be around 15 tokens per second on CPU (tolerable), and 5-10x faster with a GPU.
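
To show why 4-bit helps so much with RAM, here is a simplified sketch of symmetric block-wise quantization; this is only an illustration of the general idea, not llama.cpp's actual on-disk format.

    # Simplified 4-bit block quantization sketch (NOT llama.cpp's exact format).
    # Each block of 32 float32 weights becomes one float32 scale plus 32 signed
    # 4-bit values: roughly 5 bits per weight instead of 32.
    import numpy as np

    BLOCK = 32

    def quantize_q4(weights):
        w = weights.reshape(-1, BLOCK)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map block to [-7, 7]
        scale[scale == 0] = 1.0                               # avoid divide-by-zero
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale.astype(np.float32)

    def dequantize_q4(q, scale):
        return (q.astype(np.float32) * scale).reshape(-1)

    w = np.random.randn(4096).astype(np.float32)
    q, s = quantize_q4(w)
    print("mean abs error:", np.abs(w - dequantize_q4(q, s)).mean())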


I've been playing with running some models on the free-tier Oracle VM machines (24 GB RAM, Ampere CPU) and it works pretty well with llama.cpp. It's actually surprisingly quick: speed doesn't scale too well with the number of threads on CPU, so even the 4 ARM64 cores on that VM (with NEON) run at a similar speed to my 24-core Ryzen 3850X, roughly half reading speed for the smaller models. It can easily handle Llama 2 13B, and if I recall correctly I did manage to run a 30B model in the past too.

It's a shame the current Llama 2 jumps from 13B to 70B. In the past I tried running larger stuff by making a 32GB swap volume, but it's just impractically slow.


Prompt ingestion is too slow on the Oracle VMs.

Also, it's really tricky to even build llama.cpp with a BLAS library to make prompt ingestion less slow. The Oracle Linux OpenBLAS build isn't detected out of the box, and it doesn't perform well compared to x86 for some reason.

LLVM/GCC have some kind of issue identifying the Ampere ARM architecture (-march=native doesn't really work), so maybe this could be improved with the right compiler flags?


Not sure if that's still the case. I remember having trouble building it a couple of months ago (had to tweak the Makefile because IIRC it assumed ARM64 <=> Mac), but I recently re-cloned the repo and started from scratch, and it was as simple as `make DLLAMA_BLAS=1`. I don't think I have any special setup other than having installed the apt OpenBLAS dev package.


IDK. A bunch of basic development packages like git were missing from my Ubuntu image when I tried last week, and I just gave up because it seemed like a big rabbit hole to go down.

I can see the ARM64 versions on the Ubuntu web package list, so... IDK what was going on?

On Oracle Linux, until I changed some env variables and lines in the makefile, the openblas build would "work," but it was actually silently failing and not using OpenBLAS.


Is it any easier when using Ubuntu on ARM Oracle servers?


Nah, I tried Ubuntu too.

The OpenBLAS package was missing on ARM, along with some other dependencies I needed for compilation.

At the end of the day, even with many tweaks and custom compilation flags, the instance was averaging below 1 token/sec as a Kobold Horde host, which is below the threshold to even be allowed as an LLM host.


If you're running on Ampere, using llama.cpp is probably not ideal. While it's optimized for ARM, Ampere has native acceleration for workloads like this: https://cloudmarketplace.oracle.com/marketplace/en_US/adf.ta...


It might be more expensive to get a GPU instance, but at a guess I'd say it's more cost-effective, considering that the CPU computation will be less efficient and take much longer. I bet someone's worked this out with real numbers; I just haven't seen it.


This only matters if you're scaling to meet demand and demand is higher than your spare resources, which often isn't the case for hobby projects. The 10€/mo VPS I've had for over 6 years now still has a few cores and GBs of RAM spare, so running a small model on the CPU for a personal project that only me and a few friends occasionally use wouldn't cost me a cent more.


FYI, the going rate for "smallest possible VPS" is now more like 3€/mo.


It depends on your use case, correct? If you do not have a heavy inferencing requirement, then CPU is good enough.


Great job, thanks! Do you have any early impressions of the relative quality/performance of small Llama 2 models vs the small GPT-2 models?


Do you think it's also possible to create a trainer in pure C, instead of using Python?


Of course it's possible. The question is whether anyone finds it worth doing.

ML algorithms are, at their core, not particularly complicated code. But they are still tricky code, because if you get them wrong you will find that you spent 500 GPU-years turning random numbers that cause the model to output gibberish into other random numbers that cause the model to output different yet semantically identical gibberish.

Writing them in a more abstract language has advantages, like automatic differentiation. You could explicitly tell the computer how to compute the output and its derivative, or you could tell the computer only how to compute the output, and let it also compute the derivative by itself.

Having all your weights in one object is also awfully convenient; you can write something like `weights -= error * deriv * learning_rate` instead of iterating over each individual weight (and a large model contains many different sets of weights, not just a single NxMxPxQ matrix).

This is good for the rapid iteration that ML research demands. However, once you have selected a model, I'm sure you can get performance advantages by coding it at a lower level and eliminating inefficiencies. For example, you should be able to implement the weight update above with fused multiply-accumulate instructions, and the Python framework might not realize that.
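
As a toy illustration of both points, autodiff plus the one-line vectorized update, here is a minimal PyTorch sketch (a made-up single-matrix "model", not nanoGPT):

    import torch

    # Toy "model": one weight matrix; a real model has many such tensors.
    weights = torch.randn(64, 64, requires_grad=True)
    x, target = torch.randn(32, 64), torch.randn(32, 64)
    learning_rate = 1e-2

    out = x @ weights                      # we only describe the forward pass...
    loss = ((out - target) ** 2).mean()
    loss.backward()                        # ...autodiff fills in weights.grad

    with torch.no_grad():
        # The whole update in one vectorized line, no loop over individual weights.
        weights -= learning_rate * weights.grad
        weights.grad.zero_()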


This is C++ rather than C, but a substantial portion of PyTorch is written in C++, and they provide a C++ interface:

https://pytorch.org/tutorials/advanced/cpp_frontend.html

In other words, you can absolutely use PyTorch without Python.


In principle easy and possible, just not exactly useful. Would just involve adding the backward pass. But I’m not sure that this is something many people would want.
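
For a sense of what "adding the backward pass" means without autodiff (i.e., the math a pure-C trainer would have to spell out for every layer), here is a hand-derived forward/backward for a single linear layer, sketched in Python as the shape of what you'd port to C:

    import numpy as np

    # y = x @ W with mean-squared-error loss; gradients derived by hand.
    def forward(x, W):
        return x @ W

    def backward(x, W, y, target):
        dy = 2.0 * (y - target) / y.size   # dL/dy for L = mean((y - target)^2)
        dW = x.T @ dy                      # gradient w.r.t. the weights
        dx = dy @ W.T                      # gradient passed to the previous layer
        return dW, dx

    # One hand-rolled training step:
    x, W, target = np.random.randn(8, 16), np.random.randn(16, 4), np.random.randn(8, 4)
    y = forward(x, W)
    dW, _ = backward(x, W, y, target)
    W -= 0.01 * dW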


Just compile the Python


Are you training these things on your home rig, M1, or in the cloud?


Could you post the 44M model somewhere we can download it?


Still training. I will put it in the README.


Oh wow, I didn't realize you are the creator. I should really learn how to read one of these days.



