What I don't understand is why (at least in places I worked at) CI/build systems...

gizmo686 · on July 31, 2021

If something goes wrong it is very hard to debug unless the relevant information is in the logs. Unfortunately, the build system doesn't know what information will end up being relevant, so it just shows everything.

Needing to sift through a giant log is annoying, but you can get pretty fast at it for "normal" errors. Needing to debug an issue that is not adequately logged is a nightmare.

dudeman13 · on July 31, 2021

I agree with this assessment.

Sure there are things such as 'too much data' and 'trash data'. The problem is discerning what you need and what you don't need.

The answer isn't always obvious, but it is possible and 'easy' to filter data, whereas getting missing data can be impossible/real hard.

not2b · on July 31, 2021

The build system for the product that I work on produces a giant log with all the commands that are run and a separate filtered version that only shows the errors (if any). So for a build breaking change the issue is quickly found in the filtered log. But to answer questions about why the build is taking so long we have the full log.
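
A minimal sketch of that two-log idea, assuming error lines can be recognized by a simple pattern. `ERROR_PATTERN`, `filter_errors`, and the sample log lines are made up for illustration; a real build would need patterns matched to its own tools.

```python
import re

# Hypothetical pattern: which lines count as "errors" depends on the tools in the build.
ERROR_PATTERN = re.compile(r"\b(error|fatal)\b", re.IGNORECASE)


def filter_errors(full_log_lines):
    """Return (line_number, line) pairs for lines that look like errors."""
    return [
        (number, line)
        for number, line in enumerate(full_log_lines, start=1)
        if ERROR_PATTERN.search(line)
    ]


# Invented sample standing in for the full log.
full_log = [
    "gcc -c foo.c -o foo.o",
    "gcc -c bar.c -o bar.o",
    "bar.c:12: error: unknown type name 'widget_t'",
    "make: *** [bar.o] Error 1",
]

for number, line in filter_errors(full_log):
    print(f"{number}: {line}")
```

Keeping the original line numbers in the filtered view makes it easy to jump back into the full log when the filtered one isn't enough.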

jagged-chisel · on July 31, 2021

Because something will fail later on in the process based on an [unintentional?] change early in the process.

Like the App Store helpfully upgraded Xcode over the weekend, and the new version won’t compile the 67th dependency in your project. So you learn to have Xcode dump its version number at the start so it’s in the log. And you tell the junior DevOps engineer about how to prevent automatic upgrades on the Mac mini he never touches. And also how it’d be good to have downloaded a few versions of Xcode and make selecting the one you need a step in the build process… but I digress

gregmac · on July 31, 2021

What often happens is you get a log like (contrived example):

    Warning: http timeout to package server 1
    Warning: could not find x in package server 2
    ..(several lines later)..
    Warning: could not find dependency x in cache
    ....
    Compile Error: unknown reference to "x"

The actual problem isn't fatal by itself -- the tool doesn't yet know if the package is in server 2, and it might already have a suitable package in cache -- so it is only reported as a warning. Later, the part of the compiler that needs the dependency has no context about the other failures that happened; all it knows is something is missing.

This is a universal problem in almost every application: very rarely can a single log line provide enough information to fix a problem or identify a bug.

In my contrived example, the partial fix is that the package bit should fail. However, the final log message needs all the context of previous errors as well or it isn't useful either. It's turtles all the way down.

It's not that this isn't a solvable problem, it's that people building compilers and other tools make assumptions about context, people putting together the build don't spend the time to learn the nuance of every tool they use (rightfully so - who has time for that?), and everyone falls back on the crutch of giant log files. In other words, through the whole stack (language, compiler, tools, build system, and individual application build itself), no one has optimized or built anything based on the idea that the CI server or build system should have a single, useful error message.
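
As a rough sketch of what "the final message carries its context" could look like, here is a post-processor that attaches every earlier warning to the first error. It assumes lines are prefixed with `Warning:` or contain `Error:` as in the contrived log; `summarize_failure` and the `Info:` line are invented for illustration.

```python
def summarize_failure(log_lines):
    """Return the first error together with every warning logged before it."""
    warnings = []
    for line in log_lines:
        if line.startswith("Warning:"):
            warnings.append(line)
        elif "Error:" in line:
            return {"error": line, "context": warnings}
    return None  # no error found


# The contrived log from above, plus one invented Info line as noise.
log = [
    "Warning: http timeout to package server 1",
    "Warning: could not find x in package server 2",
    "Info: compiling module a",
    "Warning: could not find dependency x in cache",
    'Compile Error: unknown reference to "x"',
]

summary = summarize_failure(log)
print(summary["error"])
for warning in summary["context"]:
    print("  preceded by:", warning)
```

This only works when the tools agree on severity prefixes, which is exactly the assumption that tends not to hold across a real toolchain.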

marcosdumay · on July 31, 2021

The problem isn't the warnings and non-fatal errors. In fact, the largest problem is that the noise obscures them.

The problem is the pages and pages of "did that, everything is fine". Since Linux distros stopped optimizing disk usage by omitting the update time of files a couple of decades ago, I haven't seen build tools lose any single step. The failure is never because something that should be done was ignored. Instead, the failure is often because of some non-fatal error that you have no hope of finding because it's mixed with 10k lines of "build tool got here" spam.

gregmac · on July 31, 2021

Actually I totally agree with that. I was being generous in my example, and often what I showed as "Warning" is just an informational-level message. This leads to much more insidious problems like:

    ....(thousands of lines)...
    Info: Package X version 1.2.3 successfully added
    ....(thousands of lines)...
    Compile error: Unknown type Z

Actual problem: Z was added in Package X version 1.3.0, but some other dependency or reference that wasn't updated caused the incorrect version to be loaded.

marcosdumay · on July 31, 2021

If your code has any size, you won't get all that information from a builder log. Even if it is there, if you knew where to look you would have set the version limits correctly from the beginning.

Fixing this on your own computer is quite straightforward: you just need to check the dependencies, you don't need the info line. Fixing that kind of problem on the build server is one of the reasons we have all those practices about reproducible compilation. In this specific case, it's a matter of freezing your dependencies.

formerly_proven · on July 31, 2021

Literally every single sibling comment is missing the point.

Yes, this data should be recorded.

Yes, it should be viewable.

Should it be viewed as a continuous 34000 line log file?

Of course not. But this is what many (most?) CI tools do. The correct way is to group output at least by step, make log output from various parts of the system collapsible, etc. For example, dumping all environment variables and component versions at every step CAN be helpful to diagnose an issue, but usually isn't. So it should be collapsed by default.

Edit: I have the exact same gripe with all of the log4j-inspired logging libraries, which in my opinion provide very little value over a print. Too little structure, no way to hold back output until you have an error, temporary redirects are difficult and have to modify global state and so on. Uninspired design providing minimal utility.
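
For what it's worth, "hold back output until you have an error" can be approximated in Python's standard `logging` module with `logging.handlers.MemoryHandler`. A minimal sketch, where the logger name `build` and the messages are only placeholders and a `StringIO` stands in for the console:

```python
import io
import logging
import logging.handlers

output = io.StringIO()  # stands in for the console or CI log
target = logging.StreamHandler(output)
target.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

# Buffer up to 1000 records; flush them to `target` only when an ERROR
# (or higher) record arrives, or when the buffer fills up.
buffered = logging.handlers.MemoryHandler(
    capacity=1000, flushLevel=logging.ERROR, target=target, flushOnClose=False
)

log = logging.getLogger("build")  # hypothetical logger name
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(buffered)

log.info("Package X version 1.2.3 successfully added")
quiet_so_far = output.getvalue()  # still empty: nothing has been flushed

log.error("Compile error: Unknown type Z")
print(output.getvalue())  # now contains the held-back INFO line and the ERROR
```

The caveat is the capacity limit: once the buffer fills, everything is flushed regardless of level, so this is a sketch of the idea and not a full answer to the gripe.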

Nextgrid · on July 31, 2021

> The correct way is to group output at least by step, make log output from various parts of the system collapsible, etc. For example, dumping all environment variables and component versions at every step CAN be helpful to diagnose an issue, but usually isn't. So it should be collapsed by default.

Which means test tools should conform to some agreed-upon universal output format for their console (and make sure none of the underlying libraries print anything to stdout) or have an out-of-band way to communicate with the CI system in real-time (xUnit/JUnit isn’t good enough as those are typically only written at the end of the process).

formerly_proven · on July 31, 2021

Most CI tools support some notion of multiple stages and tasks; it would already be helpful if they were able to separate the output of those instead of barfing it all into one log file.

IIRC, Travis has been doing this since at least 2016 where they collapse most of the setup steps by default - and if you are interested, it only takes one click.

This of course only applies to the UI, the raw log is complete.
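
A toy sketch of that kind of view, assuming the CI system already knows which output belongs to which step. `Step`, `render`, and the sample lines are invented here and not taken from any particular CI tool:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    lines: list = field(default_factory=list)
    failed: bool = False


def render(steps):
    """Render each step as one summary line, expanding only failed steps."""
    out = []
    for step in steps:
        marker = "v" if step.failed else ">"
        status = "FAILED" if step.failed else "ok"
        out.append(f"{marker} {step.name} ({len(step.lines)} lines, {status})")
        if step.failed:
            out.extend(f"    {line}" for line in step.lines)
    return out


# Invented sample build with one passing and one failing step.
steps = [
    Step("setup", ["PATH=/usr/bin", "compiler version 1.0"]),
    Step("compile", ["cc -c a.c", "a.c:3: error: unknown type Z"], failed=True),
]

print("\n".join(render(steps)))
```

The raw per-step lines are kept in `Step.lines` untouched, so the collapsed view loses nothing; it only changes what is shown first.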

maccard · on July 31, 2021

Out of curiosity, what tools _don't_ do this by default? I've been using TeamCity for a decade and it has supported that feature. I used (shudder) Electric Commander before that and it also supported it.

formerly_proven · on July 31, 2021

The company which makes Jira also makes a CI server.

goguy · on July 31, 2021

This is the case with the GitLab CI runner too. Only logs related to the specific job are present in the UI.

amelius · on July 31, 2021

Instead of collapsing, you can grep through them.

exdsq · on July 31, 2021

Assuming you know the word you're looking for. At least collapsible errors are easier to explore.

lumost · on July 31, 2021

But then I can’t grep/search.

cortesoft · on July 31, 2021

“Collapsible” implies using some sort of interface for looking at the logs… that interface can also include grep/search functionality that searches all the lines. It could also include functionality to let you download the entire raw log to grep locally.

amatecha · on July 31, 2021

Diagnostic details so you can see the innocuous process steps that led to a failure/error :) It's entirely possible that a certain value/step early in the process is what ended up resulting in a breakage way later in the process. At least, in my experience that has been the case for all kinds of software/services.

jayd16 · on July 31, 2021

The black box records everything a plane does. Who reads that?

scraptor · on July 31, 2021

If a plane crashes you can't retry with higher log level

PretzelPirate · on July 31, 2021

This is the same in the case of an official build: if some attacker was able to modify what happens in a build that makes it to production, I need logs to build an understanding and explain to auditors/governments exactly what happened, what the effect was, and how I’ll mitigate it.

Since all logs should be immediately persisted off box, they are protected from tampering.

Of course there’s no reason the default view of logs in the CI/CD system needs to be so verbose.

emsy · on July 31, 2021

Apples and oranges.

kevincox · on July 31, 2021

I don't think it is. These are both logging devices that you look at after the "incident". Sure there are differences, the CI job is likely easier and cheaper to reproduce to get more data, but at the end of the day you want to be able to figure out what happened by looking at the logs.

alpaca128 · on July 31, 2021

It's fine to log all the info somewhere, but there's no reason to throw it all into one big chunk and then expect the developer to dig through the data. Especially when it's perfectly possible for the software to produce more structured output in the first place.

Looking for a dependency issue shouldn't be remotely comparable to analyzing an airplane's crash logs in difficulty.

exdsq · on July 31, 2021

Difference is one will have multiple trained engineers working through the logs to investigate according to a standardised method, the other is one person skimming to work out what dependency is missing (in a repeatable environment).

There must be a better abstraction than "here are all the things, have fun".

dec0dedab0de · on July 31, 2021

It's because you're configuring it using a language you barely know, so when you're troubleshooting you don't know if it's logic or syntax.. and unless you're lucky someone else is actually owning the CI system and you have to worry about their permissions.

So you end up debugging everything, and when it works you're afraid to remove the debugging because the last time you did it took hours, and you ended up leaving it in.

Edit: also, because all the important messages are being sent to a group chat or other appropriate place.

danmur · on July 31, 2021

I know what you mean, it should store everything but not print it all out. That's kind of on application developers too though, so many programs I've worked on had terrible logging (too much information, the right information not at the right level).

aspaceman · on July 31, 2021

Wait...

This comment isn't serious. I read that. You just tail the file to find the error. But it's very useful to go through it all. It's not garbage to me. I find most logs quite readable.

ithinkso · on July 31, 2021

It was serious. You haven't seen what I've seen.

ENV variables at every step? print them, build has a parser/codegen step? print everything it does, sanity check scripts? I always need to know what they did, setup scripts? That's the most important part, every .o file compiled and linked? show me, with the compile command for good measure so all the -Ilibs's are there? of course, set -x? why not, need to wget/curl something? output goes to the log, is it in parallel? just append as it goes, build failed? start the teardown but make sure you add everything it does to the log

And as I wrote, I hope it all just terminates at the first error so I can tail the logs... but it's not always that pretty. I don't believe you've read those instead of tailing for an error or doing the fast scroll to find something looking somewhat off

I mean, I'm used to it now, I'll grep, less, tail and I'll find what I need, I'm just baffled that it has so much unnecessary data for every single build. At one company I got good enough at it that I could press-and-hold pageup from the bottom in less and stare at the streaming bytes Matrix-style, looking for a place of interest

UweSchmidt · on July 31, 2021

Ideally, debugging any issue should include an improvement of the relevant logging. Then, over time, the system can perform relevant self-checks and cross-references, and spit out a list of potential causes, while retaining detailed logs nested deeper to the right or to the bottom.

numpad0 · on July 31, 2021

It tells what it tried and what went right.