I unironically love this code. We used it as a starting point (or rather, the same design; we wrote in Go) for reversing challenges at our last startup, symbolically evaluating the stack machine bytecode to generate AVR.
The trick to reading it:
* next() is the lexer
* expr() is a precedence-climbing expression parser
* It never bothers to produce a machine code. The use of stack-based VM simplifies the single-pass compilation, and it never makes use of unknown library functions so no machine-specific knowledge is required (like dlsym in Bellard's OTCC [1]).
* It chose its primitives wisely. It never implements structs and returning with values, for example, but the code is carefully structured that the lack of them doesn't make it harder to read.
* And yet it has tons of little tricks. Switching from r-value to l-value (triggered when infix `=` or postfix `++`/`--` are read) is a single opcode fix. Reserved words are initialized from an imaginary source code. The type is represented by a single number 2n+k where n is the number of indirections.
p = "char else enum if int return sizeof while "
"open read close printf malloc free memset memcmp exit void main";
i = Char; while (i <= While) { next(); id[Tk] = i++; } // add keywords to symbol table
i = OPEN; while (i <= EXIT) { next(); id[Class] = Sys; id[Type] = INT; id[Val] = i++; } // add library to symbol table
next(); id[Tk] = Char; // handle void type
next(); idmain = id; // keep track of main
Throughout the entire source code p is a source code pointer, but at the very beginning of the program it is a string containing all reserved words and library functions, and they are read with the same lexing function `next` to the symbol table before the memory for the actual source code is allocated.
I think the beauty of it is the that's it!? feeling you get once you've looked through all ~500 lines --- based on the amount of functionality one would expect it to be much longer; but it is unexpectedly short, yet complete.
In practice, it's not "~500 lines". You have whole control-flow statements with several semicolon-separated statements in their body, all crammed on a single line. The short variable names allow for the physical line not becoming too long and there is even some symmetry to it, but it's not like each line is a single statement with many nested subexpressions.
The real size of this code, after putting each statement on its own line, would be on the order of 300%.
Imagine this:
if (tk == Mul) { next(); *++e = PSH; expr(Inc); *++e = MUL; ty = INT; }
Turning into this:
if (tk == Mul) {
next();
*++e = PSH;
expr(Inc);
*++e = MUL;
ty = INT;
}
The line count isn't not "real" just because it isn't how a mindless autoformatter would do it. The formatting conveys actual information. A line expresses one "thought". Laying it out horizontally allows the vertical direction to be used to visually convey the repeating pattern
else if (tk == Mul) { next(); *++e = PSH; expr(Inc); *++e = MUL; ty = INT; }
else if (tk == Div) { next(); *++e = PSH; expr(Inc); *++e = DIV; ty = INT; }
else if (tk == Mod) { next(); *++e = PSH; expr(Inc); *++e = MOD; ty = INT; }
You know what else would communicate that? A function or macro.
else if (tk == Mul) { applyOperator(MUL); }
else if (tk == Div) { applyOperator(DIV); }
else if (tk == Mod) { applyOperator(MOD); }
But then you're not conforming to their arbitrary idea of "minimalism = fewer functions".
I definitely have some admiration for their picking a goal and following through on it, and there are a few tricks in there that are downright brilliant, but let's not pretend this is about effective communication.
I can agree that there's a repeating pattern in handling each of the binary operators (they are around a dozen), but I'm failing to see a pattern in fragments like these:
if (tk == ']') next(); else { printf("%d: close bracket expected\n", line); exit(-1); }
if (t > PTR) { *++e = PSH; *++e = IMM; *++e = sizeof(int); *++e = MUL; }
else if (t < PTR) { printf("%d: pointer type expected\n", line); exit(-1); }
This code is implementing pointer indexing: p[i]. The first line is reading the closing `]`. The second line is computing the pointer offset, which is equal to `i * sizeof(int)`. The third line is producing an error if `p` does not have a pointer type.
I think I agree with you that this part could be refactored a bit. I would be tempted to put the "PSH" corresponding to the "i" next to when we parse the "i". I would also write the check that "p" has a pointer type before the code that indexes it.
This is why I mostly hate (but partly love for the sake of reducing bikeshedding) it when teams add an autoformatter as a mandatory part of a code pipeline -- it destroys relevant spatial information.
If we are going to force autoformatters, we might as well just use annotated ASTs instead of text so we all see our own chosen view of the code.
The number of lines isn't really the point; it's C in 4 Functions, not C In 500 Lines.
What's more impressive is that it's self-hosted and implements just the subset of C required to compile itself, which makes it harder to keep the code short, but it manages anyways.
This is fun to look at and brings back the fond memories of compiler construction and grad school. But the project could use a better overview page. It is not clear from the description if this is a plastic explosives building guide, a C compiler, a decomposition of food supplements or something else...
The trick to reading it:
* next() is the lexer
* expr() is a precedence-climbing expression parser
* stmt() parses statements and generates code
* main() has the virtual machine loop.