So, this is actually the first time in Conan's history that I personally think it has become a useful tool, and I'm using it professionally in several projects for my clients. It still has a lot of quirks, though, and one can easily get into trouble when setting up dependencies across different platforms.
I've looked more closely into Conan several times in the past and filed bug reports here and there, but it has finally become somewhat usable (for me and my projects, at least). Also, I'm quite picky when it comes to CMake integration, because I don't want my package manager to interfere with my build system configuration. Conan finally reached a state where one can simply check a conanfile.txt into the repository and run `conan install` with the `cmake_paths` generator to fetch all dependencies before calling CMake to generate the project.
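A minimal sketch of what such a conanfile.txt can look like (Conan 1.x; zlib is just a placeholder dependency, not something the setup above necessarily uses):

```
[requires]
zlib/1.2.11

[generators]
cmake_paths
```

Running `conan install . --install-folder build` then fetches the dependencies and generates `build/conan_paths.cmake`, which you can pass to CMake via `-DCMAKE_TOOLCHAIN_FILE=build/conan_paths.cmake` so that `find_package` resolves against the Conan packages without any Conan-specific code in the CMakeLists.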
I also decided to set up my own Artifactory server, as the official Bintray comes with a maximum size limit of 500 MB per artifact, which is way too low considering that Qt5 with Core, Gui, and Widgets statically built on Windows (release build) easily outgrows it.
However, once you have your infrastructure set up, it's easy to add build flavors such as sanitizers for a given set of third-party libraries.
Do you have examples of project configuration using cmake and conan the way you use it? Maybe an open source project, or a short description of the setup?
In my experience this is only true for C/C++ (and even then it takes a decent amount of work to set up CROSSTOOL properly). As soon as you start to get into Python, and Python<->C++ interop, it becomes very leaky.
I heard that Google has some tools internally that build the Python interpreter with Bazel and use that in order to guarantee hermeticity, but that doesn't seem to be possible with public tooling (at least not without some major hacks, for example https://github.com/bazelbuild/bazel/issues/4286 )
It would be interesting to see how Google manages languages such as Python at scale (and other languages that have similarly leaky package management).
And not only that, but it's also been doing this for well over a decade now, at a massive, multi-billion LOC multi-petabyte artifact scale at Google. Google would not be able to function without it.
Reproducible builds are extremely important. Most people use pre-compiled code so it's important to have a way to verify that the pre-compiled code was generated from the expected source code.
We had a Cygnus customer back in the 1990s, DSC, whose customer SLA was measured in minutes/decade (I believe it was less than three minutes of downtime per decade). They paid extra for long term support on a specific GCC release. When they submitted a bug report and got a fix they would examine the binaries to make sure that every change in the binaries could be traced back to that one patch and nothing else. No upgrades, general bug fixes, or anything like that.
Ironically, a few years later one of their customers had a multi-hour outage, and that was the end of DSC (quite a large company).
Another benefit of deterministic builds is that build output caching becomes more reliable. Microsoft has a system for this that is heavily used internally and was recently open-sourced: https://github.com/microsoft/BuildXL/blob/master/README.md
I work in medical devices and we could make heavy use of reproducible builds. A lot of tools spit out different binaries on each run, and it's really hard to justify the claim that the source files you say you built from were actually the ones used to create the build.
How do you know your configuration management plan is doing what you think it's doing if you're getting different output from the same source? It's hard to tell if the sources of variation in the output are harmless or meaningful.
Well, if the target is to get the same binary from the same set of sources every time you do a build, and you don't, then the configuration management plan is not working. A good configuration management plan ensures that everything that can change is managed. If it is not working, it needs to be revised.
In some of the plans I have seen, some deviations are tolerated but those deviations are spelled out in excruciating detail.
I've dealt with it by wishing I had reproducible builds.
If I'm debugging a production issue and want to create a debug build from the same source to step through things, it would be nice if I could just build both debug and release, check the hash on the release to confirm it matches production, then start debugging, without reference to any external records or other systems.
Reproducible builds reduce the number of links in the chain needed to verify what you're really doing.
I don't get it. Are you saying it's too hard or it doesn't solve the problem? It's worked in my experience.
I mean, my response isn't entirely a rhetorical question. If there's a reason it wouldn't work I'd be curious what makes his situation different from mine.
So I should have said, "Is there a reason you can't just build debug binaries at the same time as you build release"? I mean, I guess, but it's a little disappointing that this is the nugget that you've been dancing around.
The word "just" is my pet peeve. Before you say "why don't you just xxx", make sure you actually understand the problem thoroughly. Otherwise it's impolite to suggest "just do ...". If there is one thing I have learned, it's that good devs make sure they really understand the situation before making suggestions.
I think this is related to "the grass is greener on the other side". Other people's problems seem eminently more solvable than our own. Sometimes that's even true but often it isn't.
That's pretty paranoid if you're already trusting the source or binary. Have you previously ensured that the same source acts non-maliciously on all malicious inputs and runs on a secure OS? (Those are the sources of most attacks.)
In medical devices you validate the binary. The problem is: if you rebuild the code and now it's different, then what? Consider that compilers sometimes produce the wrong code. I've had compilers do this to me. Build, run tests. Test fails. Investigate; the assembly looks 'weird'. Rebuild, and now it's correct. GAH!
That makes sense. It could help there depending on what validation requirements are.
High-assurance systems is my thing btw. There's a few ways we approach this issue. One is a certifying compiler (eg CompCert C) that's verified and tested to basically never screw up. CompCert got SIL qualification recently with Solid Sands partnering to validate them, too.
Another I like is equivalence testing across multiple binaries from multiple compilers with a large, automatically generated test suite plus fuzzers. Such test generators are already great for bug hunting. You just rerun them on the same app compiled with other compilers, which helps catch both app and compiler bugs. If you have CompCert, I have another trick: re-compile with CompCert, re-test each case that failed, and any that suddenly pass were likely compiler bugs. The multiple-compilers approach should catch them, though.
Now, I should point out that what reproducible builds address is often changes in hashes that aren't changes in functionality: just things like timestamps or symbols. You should be able to show the regulators that that kind of thing is all that changed. Maybe we need tools that automate that, too, showing diffs annotated by what does or doesn't affect execution.
Anyway, what exactly do they require of you for source-to-object correspondence? And is this U.S. F.D.A. or a non-U.S. regulator?
“Anyway, what exactly do they require of you for source-to-object correspondence?”
In my case this is not really clear. You have quite a bit of leeway, but you never know when you will get rejected. We had things that were approved a few years ago get rejected now because some rules have been tightened.
With reproducible builds you could save a lot of time validating build chains. It would be a huge timesaver, and it would probably also let us be more creative with the build chain, because if the binary is the same, you know it worked.
If I have deterministic builds then the user can verify that the build matches the source. Otherwise it's potentially compromised (or the wrong version).
Assuming you develop on linux (or macOS) -- have you had a look at nix? There is a steep learning curve, but once you're there you'll never want to go back.
Is there a reason you can't just achieve deterministic builds using docker containers as your build environment (which have the appropriate sysroot, compilers, dependencies, etc. inside)?
Containers can be used as part of a deterministic build process, but they're not a silver bullet on their own:
>which have the appropriate ... dependencies
What are your dependencies? Have you specified them precisely?
Does anything in the build process use timestamps at any point? If so, your build isn't deterministic.
Furthermore, if your goal is reproducible builds rather than simple determinism, you may then care about where e.g. the GCC binary in your Docker container came from.
The article: "watch out for these super obscure corner cases that could impede you from making deterministic builds. Use our tool to mitigate them!"
Me: "who cares about those corner cases, almost nobody would run into them. 99% of developers can achieve deterministic builds by standardizing compilers/dependencies inside of a docker image and using that to build the binaries."
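Even that pragmatic approach hinges on pinning the image precisely: a mutable tag like `latest` can point at different images on different days. A hedged sketch of what "standardizing" has to mean (the digest and package versions are placeholders, not real values):

```dockerfile
# Pin the base image by digest, not by mutable tag.
FROM debian:bookworm@sha256:<digest-of-the-exact-image>
# Pin the toolchain packages to exact versions.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        gcc=<exact-version> \
        make=<exact-version>
```

Without the digest and version pins, "the same Dockerfile" can quietly produce a different build environment over time.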
In the "The importance of deterministic builds" section they write:
> Security. Modifying binaries instead of the upstream source code can make the changes invisible for the original authors. This can be fatal in safety-critical environments such as medical, aerospace and automotive. Promising identical results for given inputs allows third parties to come to a consensus on a correct result.
I don't understand this argument, or rather what deterministic builds have to do with this. Isn't modifying the binary instead of the source just a very bad idea in a high stakes scenario?
Authors can publish official builds with a signature, but if you can't build the same binary yourself, you can't be sure the published source and binaries match.
Say you check out "libmagnificent" from GitHub, browse the source, and like what you see. You can build a binary and see that it matches the upstream build. This gives you some (more) confidence in upstream builds.
It also lets you know you're starting from a known good source, if you want to make modifications; the changes in the build come from the changes you made to the source - not through some side effect in the build process.
Yeah, the author’s example is an odd and unlikely one.
The real security concern is “you’ve audited the source code for some level of safety criticalness and validated it. What guarantee do you now have that the binary you’re told was built from that exact state of source code, actually was?”
A concrete example being the UK’s audit of some Huawei firmware source code, which basically concluded “in addition to the issues we found, the builds are highly non-deterministic and we have no way of verifying whether a given firmware blob was actually built from this source or may have differed in any number of ways”
My personal experience is that you need to sandbox everything, and that can make the dev experience cumbersome. For example, you might need to bring an external tool that signs your binaries into the sandbox, and depending on the build system and the add-on you need, this can be a pain in the neck.
Luckily, there are some good tools out there like bazel [1] (the open source version of what we use internally at Google) that can help you get this right.
There's no good reason to make builds deliberately nondeterministic, but making them deterministic is time-consuming to set up and maintain, and it forces tooling choices along the way.
I suppose it rules out including build date in the binaries.
That sometimes it's just not necessary? The main advantage IMO is security (i.e., you can verify that a given build environment produced exactly the binary you downloaded, so the NSA or FSB probably hasn't embedded a Trojan in it). If you need this, then you obviously need reproducible builds.
The other advantages, and why you may not need them:
- Build acceleration: there are ways to do this without caching binaries. Incredibuild, for example, does an excellent job (at least on Windows; I haven't used their Linux version). It costs money, so you have to figure out your priorities there. In my experience their support is excellent and it usually works very well.
- Storage savings: the premise here is that if a binary hasn't changed, there is no need to store it again with a new build. But maybe I'm getting rid of my old builds every 30 days (or whatever), so I don't really care about this either.
Absolutism is silly. You should analyze the advantages and costs of doing something instead of doing it because Google (or insert SV darling here) does it.
It was originally about stopping backdoors inserted post-compile but before software delivery. They were specifically worried about attacks like Paul Karger's compiler-compiler attack that Ken Thompson made famous. The thing is, almost all attacks are in the code, configuration, etc., with transport security and reputation covering most of the risk they're trying to address. It's enormous effort with tiny security gain. I explain here:
Another argument is that reproducible builds, at least the final form being reproducible, creates a monoculture that lets people be hit with exploits more easily. Cutting-edge research in CompSci has produced numerous techniques to use compilers to automatically secure source either by addressing root causes (eg Softbound + CETS) or obfuscating the program (eg instruction set randomization by Barrantes et al.). The latter methods are affected in that the hashes won't match when it's working. If you're trying to understand, they usually do things similar to the kinds of mitigations in OpenBSD (below) just with compilers. Personally, the state of software security would make me prioritize running it through a hardening tool over it being reproducible.
The main way reproducible builds help is with debugging. Design-by-contract with runtime checks on can do that, too, on top of a lot of other benefits. Teams can have a debug version with checks on that people use just for diagnosing problems. At this point, reproducible builds might be easier than getting that going, though, due to all the work already put in. It also looks like some compiler modifications upstream could help a lot, too.
It means there is some metadata you can't include in the output: the time it was built, the user who built it, the host and path where it was built. Sometimes that stuff is rather useful.
I spent the better part of a year trying to get these switches and changes working in windows when I was at MS.
It wasn't full time for a year, but there was a lot of "try to build; oops, compiler error; contact them and then wait for a new compiler build; oh look, found more non-determinism".
The fun was when we changed the PE header and started getting pushback from random teams.