So, this is actually the first time in Conan's history that I personally think it has become a useful tool, and I'm using it professionally in several projects for my clients. It still has a lot of quirks, though, and one can easily get into trouble when setting up dependencies across different platforms.
I've looked more closely into Conan several times in the past and filed bug reports here and there, but it has finally become somewhat usable (for me and my projects, at least). Also, I'm quite picky when it comes to CMake integration, because I don't want my package manager to interfere with my build system configuration. Conan finally reached a state where one can simply check a conanfile.txt into the repository and run `conan install` with the `cmake_paths` generator to fetch all dependencies before calling CMake to generate the project.
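A minimal sketch of what such a conanfile.txt can look like (Conan 1.x; zlib is just a placeholder dependency, not something the setup above necessarily uses):

```
[requires]
zlib/1.2.11

[generators]
cmake_paths
```

Running `conan install . --install-folder build` then fetches the dependencies and generates `build/conan_paths.cmake`, which you can pass to CMake via `-DCMAKE_TOOLCHAIN_FILE=build/conan_paths.cmake` so that `find_package` resolves against the Conan packages without any Conan-specific code in the CMakeLists.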
I also decided to set up my own Artifactory server, as the official Bintray comes with a maximum size limit of 500 MB per artifact, which is way too low considering that Qt5 with Core, Gui, and Widgets statically built on Windows (release build) easily outgrows it.
However, once you have your infrastructure set up, it's easy to add build flavors such as sanitizers for a given set of third-party libraries.
Do you have examples of project configuration using cmake and conan the way you use it? Maybe an open source project, or a short description of the setup?
In my experience this is only true for C/C++ (and even then it takes a decent amount of work to set up CROSSTOOL properly). As soon as you start to get into Python, and Python<->C++ interop, it becomes very leaky.
I heard that Google has some tools internally that build the Python interpreter with Bazel and use that in order to guarantee hermeticity, but that doesn't seem to be possible with public tooling (at least not without some major hacks, for example https://github.com/bazelbuild/bazel/issues/4286 )
It would be interesting to see how Google manages languages such as Python at scale (and other languages that have similarly leaky package management).
And not only that, but it's also been doing this for well over a decade now, at a massive, multi-billion LOC multi-petabyte artifact scale at Google. Google would not be able to function without it.
Reproducible builds are extremely important. Most people use pre-compiled code so it's important to have a way to verify that the pre-compiled code was generated from the expected source code.
We had a Cygnus customer back in the 1990s, DSC, whose customer SLA was measured in minutes/decade (I believe it was less than three minutes of downtime per decade). They paid extra for long term support on a specific GCC release. When they submitted a bug report and got a fix they would examine the binaries to make sure that every change in the binaries could be traced back to that one patch and nothing else. No upgrades, general bug fixes, or anything like that.
Ironically, a few years later one of their customers had a multi-hour outage, and that was the end of DSC (quite a large company).
Another benefit of deterministic builds is that build output caching becomes more reliable. Microsoft has a system for this that is heavily used internally and was recently open-sourced: https://github.com/microsoft/BuildXL/blob/master/README.md
I work in medical devices and we could make heavy use of reproducible builds. A lot of tools spit out different binaries on each run, and it's really hard to justify the claim that the source files you say you built from were actually the ones used to create the build.
How do you know your configuration management plan is doing what you think it's doing if you're getting different output from the same source? It's hard to tell if the sources of variation in the output are harmless or meaningful.
Well, if the target is to get the same binary from the same set of sources every time you do a build, and you don't, then the configuration management plan is not working. A good configuration management plan ensures that everything that can change is managed. If it is not working, it needs to be revised.
In some of the plans I have seen, some deviations are tolerated but those deviations are spelled out in excruciating detail.
I've dealt with it by wishing I had reproducible builds.
If I'm debugging a production issue and want to create a debug build from the same source to step through things, it would be nice if I could just build both debug and release, check the hash on the release to confirm it matches production, then start debugging, without reference to any external records or other systems.
Reproducible builds reduce the number of links in the chain needed to verify what you're really doing.
I don't get it. Are you saying it's too hard or it doesn't solve the problem? It's worked in my experience.
I mean, my response isn't entirely a rhetorical question. If there's a reason it wouldn't work I'd be curious what makes his situation different from mine.
So I should have said, "Is there a reason you can't just build debug binaries at the same time as you build release"? I mean, I guess, but it's a little disappointing that this is the nugget that you've been dancing around.
The word "just" is my pet peeve. Before you say "why don't you just xxx", make sure you actually understand the problem thoroughly. Otherwise it's impolite to suggest "just do ...". If there is one thing I have learned, it's that good devs make sure they really understand the situation before making suggestions.
I think this is related to "the grass is greener on the other side". Other people's problems seem eminently more solvable than our own. Sometimes that's even true but often it isn't.
That's pretty paranoid if you're already trusting the source or binary. Have you previously ensured that the same source acts non-maliciously on all malicious inputs and runs on a secure OS? (Those are the sources of most attacks.)
In medical devices you validate the binary. The problem is: if you rebuild the code and now it's different, then what? Consider that compilers sometimes produce the wrong code. I've had compilers do this to me. Build, run tests. Test fails. Investigate; the assembly looks 'weird'. Rebuild, and now it's correct. GAH!
That makes sense. It could help there depending on what validation requirements are.
High-assurance systems is my thing btw. There's a few ways we approach this issue. One is a certifying compiler (eg CompCert C) that's verified and tested to basically never screw up. CompCert got SIL qualification recently with Solid Sands partnering to validate them, too.
Another I like is equivalence testing across multiple binaries from multiple compilers with a large, automatically generated test suite plus fuzzers. Such test generators are already great for bug hunting. You just rerun them on the same app compiled with other compilers, which helps catch both app and compiler bugs. If you have CompCert, I have another trick: re-compile with CompCert, re-test each case that failed, and any that suddenly pass were likely compiler bugs. The multiple-compilers approach should catch them, though.
Now, I should point out that what reproducible builds address is often changes in hashes that aren't changes in functionality: just things like timestamps or symbols. You should be able to show the regulators that that kind of thing is all that changed. Maybe we need tools that automate that, too, showing diffs annotated by what does or doesn't affect execution.
Anyway, what exactly do they require of you for source-to-object correspondence? And is this U.S. F.D.A. or a non-U.S. regulator?
“Anyway, what exactly do they require of you for source-to-object correspondence?”
In my case this is not really clear. You have quite a bit of leeway, but you never know when you will get rejected. We had things that were approved a few years ago get rejected now because some rules have been tightened.
With reproducible builds you could save a lot of time validating build chains. It would be a huge timesaver, and it would probably also let us be more creative with the build chain, because if the binary is the same, you know it worked.
If I have deterministic builds then the user can verify that the build matches the source. Otherwise it's potentially compromised (or the wrong version).
Assuming you develop on linux (or macOS) -- have you had a look at nix? There is a steep learning curve, but once you're there you'll never want to go back.
Is there a reason you can't just achieve deterministic builds using docker containers as your build environment (which have the appropriate sysroot, compilers, dependencies, etc. inside)?
Containers can be used as part of a deterministic build process, but they're not a silver bullet on their own:
>which have the appropriate ... dependencies
What are your dependencies? Have you specified them precisely?
Does anything in the build process use timestamps at any point? If so, your build isn't deterministic.
Furthermore, if your goal is reproducible builds rather than simple determinism, you may then care about where e.g. the GCC binary in your Docker container came from.
The article: "watch out for these super obscure corner cases that could impede you from making deterministic builds. Use our tool to mitigate them!"
Me: "who cares about those corner cases, almost nobody would run into them. 99% of developers can achieve deterministic builds by standardizing compilers/dependencies inside of a docker image and using that to build the binaries."
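Even that pragmatic approach hinges on pinning the image precisely: a mutable tag like `latest` can point at different images on different days. A hedged sketch of what "standardizing" has to mean (the digest and package versions are placeholders, not real values):

```dockerfile
# Pin the base image by digest, not by mutable tag.
FROM debian:bookworm@sha256:<digest-of-the-exact-image>
# Pin the toolchain packages to exact versions.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        gcc=<exact-version> \
        make=<exact-version>
```

Without the digest and version pins, "the same Dockerfile" can quietly produce a different build environment over time.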
In the "The importance of deterministic builds" section they write:
> Security. Modifying binaries instead of the upstream source code can make the changes invisible for the original authors. This can be fatal in safety-critical environments such as medical, aerospace and automotive. Promising identical results for given inputs allows third parties to come to a consensus on a correct result.
I don't understand this argument, or rather what deterministic builds have to do with this. Isn't modifying the binary instead of the source just a very bad idea in a high stakes scenario?
Authors can publish official builds with a signature, but if you can't build the same binary yourself, you can't be sure the published source and binaries match.
Say you check out "libmagnificent" from GitHub, browse the source, and like what you see. You can build a binary and see that it matches the upstream build. This gives you some (more) confidence in upstream builds.
It also lets you know you're starting from a known good source, if you want to make modifications; the changes in the build come from the changes you made to the source - not through some side effect in the build process.
Yeah, the author’s example is an odd and unlikely one.
The real security concern is “you’ve audited the source code for some level of safety criticalness and validated it. What guarantee do you now have that the binary you’re told was built from that exact state of source code, actually was?”
A concrete example being the UK’s audit of some Huawei firmware source code, which basically concluded “in addition to the issues we found, the builds are highly non-deterministic and we have no way of verifying whether a given firmware blob was actually built from this source or may have differed in any number of ways”
My personal experience is that you need to sandbox everything, and that can make the dev experience cumbersome. For example, you might need to bring an external tool that signs your binaries into the sandbox, and depending on the build system and the add-on you need, this can be a pain in the neck.
Luckily, there are some good tools out there like bazel [1] (the open source version of what we use internally at Google) that can help you get this right.
There's no good reason to make builds deliberately nondeterministic, but making them deterministic is time-consuming to set up and maintain, and it forces tooling choices along the way.
I suppose it rules out including build date in the binaries.
That sometimes it's just not necessary? The main advantage IMO is security (i.e., you can verify that a given build environment produced exactly the binary you downloaded, so the NSA or FSB probably hasn't embedded a Trojan in it). If you need this, then you obviously need reproducible builds.
The other advantages, and why you may not need them:
- Build acceleration: there are ways to do this without caching binaries. Incredibuild, for example, does an excellent job (at least on Windows; I haven't used their Linux version). It costs money, so you have to figure out your priorities there. In my experience their support is excellent and it usually works very well.
- Storage savings: the premise here is that if a binary hasn't changed, there is no need to store it again with a new build. But maybe I'm getting rid of my old builds every 30 days (or whatever), so I don't really care about this either.
Absolutism is silly. You should analyze the advantages and costs of doing something instead of doing it because Google (or insert SV darling here) does it.
It was originally about stopping backdoors inserted post-compile but before software delivery. They were specifically worried about attacks like Paul Karger's compiler-compiler attack that Ken Thompson made famous. The thing is, almost all attacks are in the code, configuration, etc., with transport security and reputation covering most of the risk they're trying to address. It's enormous effort with tiny security gain. I explain here:
Another argument is that reproducible builds, at least the final form being reproducible, creates a monoculture that lets people be hit with exploits more easily. Cutting-edge research in CompSci has produced numerous techniques to use compilers to automatically secure source either by addressing root causes (eg Softbound + CETS) or obfuscating the program (eg instruction set randomization by Barrantes et al.). The latter methods are affected in that the hashes won't match when it's working. If you're trying to understand, they usually do things similar to the kinds of mitigations in OpenBSD (below) just with compilers. Personally, the state of software security would make me prioritize running it through a hardening tool over it being reproducible.
The main way reproducible builds help is with debugging. Design-by-contract with runtime checks on can do that, too, on top of a lot of other benefits. Teams can have a debug version with checks on that people use just for diagnosing problems. At this point, reproducible builds might be easier than getting that going, though, due to all the work already put in. It also looks like some compiler modifications upstream could help a lot, too.
It means there is some metadata you can't include in the output: the time it was built, the user who built it, the host and path where it was built. Sometimes that stuff is rather useful.
I spent the better part of a year trying to get these switches and changes working in windows when I was at MS.
It wasn't full time for a year, but there was a lot of "try to build; oops, compiler error; contact them and then wait for a new compiler build; oh look, found more non-determinism".
The fun was when we changed the PE header and started getting pushback from random teams.