I unironically love this code. We used it as a starting point (or rather, the sa...

lifthrasiir · on Feb 18, 2020

C4 really has some good ideas by its own.

* It never bothers to produce a machine code. The use of stack-based VM simplifies the single-pass compilation, and it never makes use of unknown library functions so no machine-specific knowledge is required (like dlsym in Bellard's OTCC [1]).

* It chose its primitives wisely. It never implements structs and returning with values, for example, but the code is carefully structured that the lack of them doesn't make it harder to read.

* And yet it has tons of little tricks. Switching from r-value to l-value (triggered when infix `=` or postfix `++`/`--` are read) is a single opcode fix. Reserved words are initialized from an imaginary source code. The type is represented by a single number 2n+k where n is the number of indirections.

[1] https://bellard.org/otcc/

ash · on Feb 18, 2020

> Reserved words are initialized from an imaginary source code.

What does it mean? Could you point to a place in the source?

lifthrasiir · on Feb 18, 2020

I mean this part:

    p = "char else enum if int return sizeof while "
        "open read close printf malloc free memset memcmp exit void main";
    i = Char; while (i <= While) { next(); id[Tk] = i++; } // add keywords to symbol table
    i = OPEN; while (i <= EXIT) { next(); id[Class] = Sys; id[Type] = INT; id[Val] = i++; } // add library to symbol table
    next(); id[Tk] = Char; // handle void type
    next(); idmain = id; // keep track of main

Throughout the entire source code p is a source code pointer, but at the very beginning of the program it is a string containing all reserved words and library functions, and they are read with the same lexing function `next` to the symbol table before the memory for the actual source code is allocated.

userbinator · on Feb 18, 2020

I think the beauty of it is the that's it!? feeling you get once you've looked through all ~500 lines --- based on the amount of functionality one would expect it to be much longer; but it is unexpectedly short, yet complete.

For a just-as-interesting "sequel", look at C4x86: https://github.com/EarlGray/c4

bluetomcat · on Feb 18, 2020

In practice, it's not "~500 lines". You have whole control-flow statements with several semicolon-separated statements in their body, all crammed on a single line. The short variable names allow for the physical line not becoming too long and there is even some symmetry to it, but it's not like each line is a single statement with many nested subexpressions.

The real size of this code, after putting each statement on its own line, would be on the order of 300%.

Imagine this:

    if (tk == Mul) { next(); *++e = PSH; expr(Inc); *++e = MUL; ty = INT; }

Turning into this:

    if (tk == Mul) {
        next();
        *++e = PSH;
        expr(Inc);
        *++e = MUL;
        ty = INT;
    }

edflsafoiewq · on Feb 18, 2020

The line count isn't not "real" just because it isn't how a mindless autoformatter would do it. The formatting conveys actual information. A line expresses one "thought". Laying it out horizontally allows the vertical direction to be used to visually convey the repeating pattern

    else if (tk == Mul) { next(); *++e = PSH; expr(Inc); *++e = MUL; ty = INT; }
    else if (tk == Div) { next(); *++e = PSH; expr(Inc); *++e = DIV; ty = INT; }
    else if (tk == Mod) { next(); *++e = PSH; expr(Inc); *++e = MOD; ty = INT; }

kerkeslager · on Feb 18, 2020

You know what else would communicate that? A function or macro.

    else if (tk == Mul) { applyOperator(MUL); }
    else if (tk == Div) { applyOperator(DIV); }
    else if (tk == Mod) { applyOperator(MOD); }

But then you're not conforming to their arbitrary idea of "minimalism = fewer functions".

I definitely have some admiration for their picking a goal and following through on it, and there are a few tricks in there that are downright brilliant, but let's not pretend this is about effective communication.

bluetomcat · on Feb 18, 2020

I can agree that there's a repeating pattern in handling each of the binary operators (they are around a dozen), but I'm failing to see a pattern in fragments like these:

      if (tk == ']') next(); else { printf("%d: close bracket expected\n", line); exit(-1); }
      if (t > PTR) { *++e = PSH; *++e = IMM; *++e = sizeof(int); *++e = MUL;  }
      else if (t < PTR) { printf("%d: pointer type expected\n", line); exit(-1); }

ufo · on Feb 18, 2020

This code is implementing pointer indexing: p[i]. The first line is reading the closing `]`. The second line is computing the pointer offset, which is equal to `i * sizeof(int)`. The third line is producing an error if `p` does not have a pointer type.

I think I agree with you that this part could be refactored a bit. I would be tempted to put the "PSH" corresponding to the "i" next to when we parse the "i". I would also write the check that "p" has a pointer type before the code that indexes it.

    else if (tk == Brak) {
      next(); *++e = PSH; expr(Assign); *++e = PSH;
      if (tk == ']') next(); else { printf("%d: close bracket expected\n", line); exit(-1); }
      if (t < PTR) { printf("%d: pointer type expected\n", line); exit(-1); }      
      *++e = IMM; *++e = sizeof(int); *++e = MUL; *++e = ADD;
      *++e = ((ty = t - PTR) == CHAR) ? LC : LI;
    }

nitrogen · on Feb 19, 2020

This is why I mostly hate (but partly love for the sake of reducing bikeshedding) it when teams add an autoformatter as a mandatory part of a code pipeline -- it destroys relevant spatial information.

If we are going to force autoformatters, we might as well just use annotated ASTs instead of text so we all see our own chosen view of the code.

tptacek · on Feb 18, 2020

The number of lines isn't really the point; it's C in 4 Functions, not C In 500 Lines.

What's more impressive is that it's self-hosted and implements just the subset of C required to compile itself, which makes it harder to keep the code short, but it manages anyways.

mannykannot · on Feb 18, 2020

Splitting on non-line-end semicolons (other than in literals) gives you a functioning program of 942 lines, 62 of which are either blank or comments.

ptero · on Feb 18, 2020

This is fun to look at and brings back the fond memories of compiler construction and grad school. But the project could use a better overview page. It is not clear from the description if this is a plastic explosives building guide, a C compiler, a decomposition of food supplements or something else...

mzs · on Feb 18, 2020

I mean I guess, it's only a small subset of C though (no floating point or structs even):

  $ cc float.c && ./a.out && ./c4 -s float.c && ./c4 float.c && echo 'struct { int a; } s;' >struct.c && cc -c struct.c && ./c4 -s struct.c
  hello, world 1.250000
  1: #include <stdio.h>
  2:
  3: int main()
  4: {
  5:   printf("hello, world %f\n", 1.25);
      ENT  0
      IMM  83656704
      PSH
      IMM  1
      PSH
      IMM  25
      PSH
      PRTF
      ADJ  3
  6:   return 0;
      IMM  0
      LEV
  7: }
      LEV
  hello, world 0.000000
  exit(0) cycle = 13
  1: bad global declaration
  $

dkersten · on Feb 19, 2020

So not really C in 4 functions then. Just some mini language that’s a subset of C, in four functions. That’s a much simpler target to reach, I guess.

pron · on Feb 18, 2020

Yeah, it's quite nice. The only thing that bothers me is that it doesn't do bounds checking.

sansnomme · on Feb 18, 2020

Ooh explain more about your challenges?

wolf550e · on Feb 18, 2020

Starfighter was shut down but you can Google it.