This is a set of manually created hex programs in a Cthulhu Path to madness fashion, whose only goal
is creating a bootstrapping path to a C compiler capable of compiling GCC, with the sole explicit
requirement of a single binary of 1 KByte or less.
Additionally, all code must be understandable by 70% of the population of programmers. If the code
cannot be understood by that share, it needs to be altered until it satisfies the above requirement.
Very cool project, but I guess I won't hold out for compiling a working web browser from this project.
With a working gcc, you should be able to compile all that is needed to build Chromium and Firefox.
There are a few programming languages that cannot be bootstrapped from a C or C++ compiler, so you need a binary executable of a previous compiler version for that language, but I do not think that Chromium uses any of those.
The bootstrapping of Rust is complex, but AFAIK it is possible using mrustc. Then you should be able to also compile Firefox.
While everyone asks about lisp or forth for this task, I have to wonder what one has to leave out of C to make it "minimal" enough for this task. The subset of features left out might teach us something about the language.
It's not exactly the things you have to leave out of C. It's that for a compiler bootstrapper, C gives you extremely little bang for your buck.
If you go the Lisp route, you'll soon be talking about embedding a grammar DSL or a validator. With C, you'll be happy to write a parser and, please, let's get to the next project already.
GCC 11 and newer require a C++11 compiler; 4.8 was the first to require C++ at all, and I think there are some very old versions of GCC that can be built with a pre-standard C compiler.
> "We propose to collectively maintain a subset of GCC 4.7 to ensure that we can build the foundation of free software distributions starting with a simple C compiler (such as tinyCC, pcc, etc)."
"Just compiling" gcc means having all its SDK dependencies satisfied too...
g++ 4.7.4 is the last g++ written in simple C. In theory, you should be able to compile all gccs above 4.7.4 (with their SDKs) with g++ 4.7.4 (last time I tried, I succeeded with a gcc 10 for x86).
I wonder who the "geniuses" were who pushed the gcc steering committee to move gcc to C++98.
That said, the C based dialect required to compile a linux/*bsd kernel is no better either.
As for my personal opinion: much of open source software is now corpo-grade trash, but unlike toxic abominations like windoz/rottenfruitOS/etc, it is free as in free beer and heavily customizable.
It appears that this project is in competition with Mes[1], which is also running under the http://bootstrappable.org/ 'umbrella', and which implements both a Scheme interpreter (in C) and a C compiler (in Scheme) that it is designed to run.
The C compiler that Mes was written for, MesCC, appears to be some 600 lines of code excluding libraries (I haven't checked how large those are). It might be smaller than cc_x86.s[2] posted here (or it might not, once the libs are included), but if we're concerned about bootstrapping size, the size of the Scheme interpreter would have to be included as well, and then the total is surely larger than cc_x86.s. I'm also unclear on how Mes solves the mutual dependency issue of Scheme <-> C.
Mes is authored primarily by Jan (janneke) Nieuwenhuizen. Judging from a comment by Janneke on IRC[3] collaboration seems amicable, if I'm not mis-attributing this quote:
<janneke> some routes we try just are a dead end...
Not having any ties to either project I welcome corrections of any mistakes I may have made here.
Ah no, there is no competition, at least not yet ;)
Stage0---of which cc_x86.c is a part---as a project started off from the
bottom (hex0) and is working up. When stage0 had just started, I created
GNU Mes, soon to be followed by MesCC, which aimed to build a version of
TinyCC and the remaining bootstrap for GNU Guix. The first versions of
mescc had their own linker, which was later, in a joint effort, replaced
by M1 and hex2 from MesCC-Tools (also part of Stage0).
My initial idea for Mes was to translate it by hand to assembly, so the
M1 assembler would come in handy. As it turned out, cc_x86.c and its
next step M2-Planet were developed as part of Stage0.
Starting from a 357-byte hex0 provided by the bootstrap-seeds,
Stage0-posix builds hex0, kaem, hex1, catm, hex2, M0, cc_x86, M1, M2,
get_machine, (mescc-tools), and M2-Planet. M2-Planet in turn builds
Mes, which builds a bootstrappable TinyCC, etc.
I'll try out some bootstrapping paths and will make an effort to get a full understanding of the whole picture, and I hope to write something up about what I find.
x86 asm in a very different syntax (neither Intel nor AT&T), to be fair. For a second, I thought it was ARM. Then I started to parse some of the syntax and realised it could be made even more minimal, e.g. this function
;; in_set function
;; Receives a Char in R0, char* in R1
;; Return result in R0
based on what it seems to do, can almost certainly be replaced with a simple repnz scasb.
Trusting trust all the way down: How do you trust your assembler to output the exact machine code corresponding to the source? How do you trust whatever program you might use to do the validation? Ok, let's skip the assembler and write machine code directly. How do you trust the program you use to write that machine code? How do you even trust the firmware will execute your machine code in the way you intend?
Your joke is funny, but it's also not a complete joke. I worked at a company that was running out of funds shortly after firing me (well, I kind of fired myself, but I was gone). They had a beautiful electron microscope, and figured: why don't we pivot to something involving chips?
Now stare back at the https://github.com/mkmik/echo-server and try to figure out what's going on (no, it's not a buffer overflow, that's just a misdirection, and a quite effective one btw, so in order to not waste your time, I'm going to reveal that this is a supply chain attack)
How do you detect a backdoor inserted in the login program with a test? You'd have to know the backdoor password to write the test. Perhaps you haven't read Thompson's Turing Award lecture and so you have no idea what we're talking about.
Not sure if it would work, but wouldn't it greatly complicate creating a backdoor when the host's memory is very heavily constrained? E.g. an AVR-based "computer" that would hardly do much more than actually translate assembly to machine code?
It would probably be easier to detect, but when Thompson deployed the attack in practice, nobody noticed, even though the PDP-11 address space was the same size as an AVR's, and the CPU was a lot slower. (AVRs invariably have much less physical RAM installed than a PDP-11, but you'd need some external memory to get a C compiler to build.)
This is what I don't understand about Stallman. Even if we all agree to use free software, that's only the beginning. The next step is nobody is going to use web applications anymore. Okay now that we're all liberated from corporate espionage what about how the software is made. You can't trust package maintainers any more than you can trust companies. So we all use GNU/Gentoo. As you said, the compiler may be compromised too. Well you learn the dark art of compiling a compiler. But wait, your entire computer is a black box with no schematics whatsoever. Rinse and repeat.
No, you've missed the point. Don't you understand? The software cannot be trusted whatsoever unless you control the entire stack. So you haven't solved anything. Also, if you think a paragraph is long winded then I'm sure you have never listened to Stallman.
You can't control when an apocalypse happens, so you might as well just not prepare for it.
That's essentially the same argument, which is ridiculous. Of course you can. And you should. Just because you can't control the entirety of a system, doesn't mean you can't take steps to minimize risks and exposure, and shrink the attack surface.
That is a stupid analogy. I'm talking about a machine. You are not in control unless you build the machine yourself. Even if you were provided schematics, you cannot be sure they are accurate. The point is free software does not minimize risk. It's a false sense of security.
For things you actually care about, such as surveillance: what if I told you there is a hardware backdoor in your CPU allowing the government to spy on you? Do you realize that is already known? What about the fingerprint scanner? How can you be sure the same is not true of the hardware storing that information?
I built a CPU myself out of 74xx logic, but I can't be sure that every input on every chip isn't also connected internally to an integrated computer that stores everything it sees and can transmit it to a van at 20 metres range (easily enough for someone to drive up to my house and retrieve the contents periodically).
The attacker can then reconstruct everything that happened on the CPU and work out what I've been up to. Or, more likely, throw away almost everything and only look at what was printed to the console, because everything interesting goes to the console anyway.
In fact this would be easier-than-average to achieve because the clock speed is low, I don't run the machine for very long or very often, and the density of actual electronics to "free space" on the ICs is very low (more space for shenanigans), and all my schematics are on my github project so even though you don't know which chip is connected to which other, you shouldn't have too much trouble deriving it.
So what now? Do we just give up? Is all computing fundamentally impossible? Or do we accept that nothing is perfect and just get on with it?
> The next step is nobody is going to use web applications anymore.
Well, we should certainly be wary of falling into the trap known as "Service as a Software Substitute (SaaSS)"[0], but in principle it is possible for a web app to work entirely on the client side and be distributed under a Free licence (and for the browser to enforce that, if the web developer is careful).[1]
> You can't trust package maintainers any more than you can trust companies.
The point is, if you have the source code, you don't have to trust the package maintainers. Instead you can trust whoever you choose to audit the source code for you, which might be yourself (if you're very skilled, and very untrusting), or it could be the community.
I admit that "trust the community" usually means "Assume that someone somewhere will find any critical bugs before they affect you, and assume that developers won't destroy their reputation when they know they'll eventually get caught", but we are slowly moving towards a system of community code reviews for all Free software[2]. (Obviously reproducible builds, and bootstrappable builds, are necessary steps to take full advantage of this.)
> But wait, your entire computer is a black box with no schematics whatsoever. Rinse and repeat.
Nope, that's the final step (unless you think the aliens that built this simulation put backdoors into the laws of physics). As for how we trust hardware, fortunately there are projects to make computers out of chips that can be safely reasoned about. You have to ask yourself what your threat model is.
Do you think the NSA is hiding a hardware backdoor in every FPGA, which detects when someone is running a compilation process and makes sure to install a software backdoor in any compiler or kernel it detects? What if multiple people on multiple homebrew computers all carried out the same build process and hashed the results, and all the hashes agreed? What if you ran these processes in virtual machines that implement custom architectures that have never been seen before? Eventually you start to run into information-theoretic problems trying to explain how such a backdoor can remain hidden and effective.
> if you have the source code, you don't have to trust the package maintainers
I wouldn't go that far. It's possible for a skilled malicious developer to conceal malicious behaviour in their code, to make it hard to detect and plausibly deniable as a bug.
Speaking of which, I've never understood why SELinux is adopted so uncritically considering it was developed by the government agency behind [0].
> I'm surprised they wrote a minimal C compiler in assembly rather than any LISP dialects, then make a C compiler in LISP
I recommend you check out the author's GitHub page. It includes Forth and Lisp projects, explains why C was a success while those projects failed, and challenges any Forth and Lisp fan to actually substantiate their baseless claims with actual work instead of pushing beliefs that don't have any bearing on reality.
I think that the takeaway is familiarity trumps technical 'superiority'.
That said, I don't know why having a working GC for Lisp is that important, because typically you'd use the Lisp only to compile your C compiler, so memory efficiency doesn't really matter.
> I think that the takeaway is familiarity trumps technical 'superiority'.
No, that's not what's stated in the wiki.
The section on Forth points out the lack of useful programs actually written in Forth (a nudge to how no one bothers with the language as alternatives are always found to be preferable), and the lack of developers available to help with the work.
The section on Lisp states quite clearly that "LISP is not an easy language to implement in LISP, C and definitely not an easy task to implement in assembly".
> LISP is not an easy language to implement in LISP, C and definitely not an easy task to implement in assembly
I can't speak to the ease of implementing Lisp in assembly, but I found it easy to do in C. It's trivial to implement Lisp in Lisp if you're willing to let the hosting Lisp do the heavy lifting. The person who I'm quoting isn't a very clear writer though, and his usage of "LISP" suggests he's not exactly up to date on the state of Lisp, so perhaps I'm missing some unstated context that makes his statement sensible.
While I do find the "put up or shut up" message entertaining, it's a category error to confuse what has been done with what can be done. Still, I would consider it a win if some Lisper rises to the challenge.
Since the C language and modern CPUs basically co-evolved, it's not surprising that they're a good fit for each other. The story might be different on a Symbolics machine.
> While I do find the "put up or shut up" message entertaining, it's a category error to confuse what has been done with what can be done. Still, I would consider it a win if some Lisper rises to the challenge.
The whole point is that there are an awful lot of talkers talking a big game, but none of them ever manages to actually show something working.
I read it and I gotta say it sounds ignorant to me: lots of people have bootstrapped systems using Forth. It's not as common these days but it still happens.
And I think the wiki page predates sectorlisp, a project which (in my opinion) answers the challenge, no?
And just after that "Never ending garbage collection bugs plague every assembly implementation of LISP."
So I was reacting to that: for a Lisp that will be used once to compile a Lisp-based C compiler, which in turn will be used once to compile a C-based C compiler, I'm not sure a GC is needed; just never free the memory.
Are claims of easy bootstrapping in Forth really baseless, though? Asm makes sense for a C subset, but if one wanted to bootstrap a more complete language directly from scratch, things would probably be different.
Forth is simple to bootstrap, but maybe not so easy to learn or reason about, because it isn't really taught in any schools or popular online materials. There are tons of resources for thinking about and writing compilers for C-like languages, and there's a decent chance that someone with a CS degree has done so during their coursework.
"Minimal" being almost 5KLOC, sheesh. Actually, I'm surprised that a C compiler fits in that space.
Also:
---
Goal
This is a set of manually created hex programs in a Cthulhu Path to madness fashion. Which only have the goal of creating a bootstrapping path to a C compiler capable of compiling GCC, with only the explicit requirement of a single 1 KByte binary or less.
It's a compiler wth. 5k is nothing. If you strip out the comments, the constants, the labels and some other cruft, you're left with 2450 instructions. That should be well under 1000 lines in other languages.
Writing a C compiler in assembly that supports structs, unions, arrays, inline assembly and a bunch more was done in less than 24 hours by an inexperienced C programmer, who then started doing bootstrapping speed runs to demonstrate how trivial a problem it is to implement that level of functionality in a C compiler.
In the decades that Lisp and FORTH have existed, why didn't they solve such a trivial problem?
Or better yet, now that you can see how it is done, could anyone actually produce a C compiler with the same level of functionality in Lisp or FORTH in the same amount of time (or less)?
It is easy to talk a big game and words are cheap; we have the entire cc_* family of C compilers written in assembly for multiple architectures, even in cross-platform arrangements. If your language were any good at bootstrapping, you'd be able to beat that. Then show your language written in fewer lines of assembly than cc_x86 to prove the point.
It is time for Lisp and FORTH to either deliver or stop talking about something they were never good at in the first place, and to admit that Assembly and C won not because "worse is better" but because they are objectively better languages for bootstrapping new and better tools.
That an inexperienced C programmer wrote what appears to be 5000 lines of assembly in 24 hours implementing a fully working C compiler seems... a rather extraordinary claim. The file size of around 130K would indicate typing non-stop for 24 hours at something like 100 characters per minute.
> typing non-stop for 24 hours at something like 100 characters per minute
Not only that. It would be typing at that speed without testing anything. In assembly. Haha. In reality, what would happen is that you would type for 24 hours (not in one day, I assume) and then spend the next days finding the places where you wrote R5 instead of R4.
And, of course, without losing any time on refactoring already written code (changing order of parameters, finding a better name for a function, etc.)
People normally touch-type between 300 and 800 characters per minute (50–130 wpm), so even if all of that was typed by hand we're only talking about five hours or so of typing, but you've forgotten about copy and paste.
I think they are talking about the later stages of bootstrapping ('stage2'), where they are writing a more functional, but still small, C compiler in a C subset that can be compiled by this one.
I mean there are Forths that fit inside a boot sector (<512 bytes). My reading of the goal was a minimal kernel of code from which the larger system can be bootstrapped.
> writing a C compiler in assembly that supports structs, unions, arrays, inline assembly and a bunch more was done in less than 24 hours by an inexperienced C programmer.
That’s a weird commit comment (“Implemented C version of cc_knight-native”) for a commit that adds a C compiler.
Also, how do you conclude from that that it took less than 24 hours to write that commit (or any)?
https://bootstrappable.org/