It's possible for me to learn enough math to download a Hugging Face model and trace my prompt all the way through: tokenize it, convert the tokens to embeddings, add position embeddings, pass them through 32 layers of attention (softmax) and MLP with layer norm, and write out the equation that computes every intermediate floating-point number, until the model gives me the probability of each output token so I can sample one and continue the sentence autoregressively. Computing any one of those 100 billion 16-bit floating-point multiplications? I can do it in decimal, or look up the IEEE 754 fp16 format and work it out in binary by hand, or, given enough time, draw a circuit out of AND and NOT gates.
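To make the "compute fp16 by hand" point concrete, here's a minimal sketch that decodes an IEEE 754 binary16 value from its raw bits exactly the way you would on paper: peel off the sign, exponent, and mantissa fields and apply the format's rules. (The function name is mine, just for illustration; the decoding rules are the standard's.)

```python
def fp16_bits_to_float(bits):
    """Decode a 16-bit IEEE 754 binary16 pattern by hand."""
    sign = (bits >> 15) & 0x1     # 1 sign bit
    exp = (bits >> 10) & 0x1F     # 5 exponent bits, bias 15
    frac = bits & 0x3FF           # 10 mantissa bits
    if exp == 0:                  # subnormal: no implicit leading 1
        val = (frac / 2**10) * 2**-14
    elif exp == 0x1F:             # all-ones exponent: inf or NaN
        val = float("inf") if frac == 0 else float("nan")
    else:                         # normal: implicit leading 1
        val = (1 + frac / 2**10) * 2**(exp - 15)
    return -val if sign else val

# 0x3C00 is 1.0 (exp field 15, empty mantissa);
# 0xC000 is -2.0 (sign set, exp field 16)
print(fp16_bits_to_float(0x3C00))  # 1.0
print(fp16_bits_to_float(0xC000))  # -2.0
```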
These are the low-level operations. From a higher-level mathematical standpoint? I can prove to you analytically that SGD on a strongly convex surface converges to the global minimum at an exponential rate, building up from either set theory or dependent type theory and the construction of the real numbers from sets of rationals.
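The convergence claim can be watched numerically, too. Here's a toy sketch (my own, not the analytic proof): gradient descent on the strongly convex function f(x) = x², where the error shrinks by the same constant factor every step, which is exactly what "exponential rate" means.

```python
def gradient_descent(x0, lr=0.25, steps=20):
    """Run gradient descent on f(x) = x^2, whose gradient is 2x."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # update rule: x <- x - lr * f'(x)
        xs.append(x)
    return xs

traj = gradient_descent(8.0)
# each step is x <- (1 - 2*lr) * x, so the error contracts by 0.5 per step
ratios = [b / a for a, b in zip(traj, traj[1:])]
print(ratios[0])   # 0.5
print(traj[-1])    # 8.0 * 0.5**20, vanishingly close to the minimum at 0
```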
None of this tells me how or why an LLM works.