In a complex C codebase, valgrind is absolutely indispensable for finding the last few bits of memory that leaked because you got lazy and didn't free(), or for catching the spot where you wrote somewhere in memory you shouldn't have. It really is like magic sometimes. I know that more recent languages (and some not so recent, looking at you, Ada!) have a lot of this built in by default, but when you need to do these things, you know you need to, and it's nice to have a handy all-in-one tool to help out. It has saved me from myself many times when I was writing network code in C and couldn't figure out where the leaks were coming from, or couldn't find a performance regression by inspecting the code alone. A quick run of valgrind had that fixed in minutes.
IME the main problem with Valgrind is that it reduces performance too much (I remember something around a 10x slowdown or worse). Clang's address sanitizer OTOH finds most of the same problems, but with a much smaller performance hit, which makes it also usable for things like games.
Also, for leak detection, some IDEs have excellent realtime debugging tools (for instance, there's a very nice memory debugger in Visual Studio since VS2015, and Xcode has Instruments).
Good point, there definitely is a performance hit for the observability data it gives you, but for me that's a fair trade for the results it gets me. I can see how work with more realtime-ish constraints, like games, would be more greatly affected. Still, it works super well for systems software, so I continue to use it.
I like Valgrind, but day-to-day I find myself typically reaching for sanitizers[1] instead (ASan et al.), especially since they're built into many compilers these days, and are a bit faster IME.
Are there any use cases that people here have experienced where Valgrind is their first choice?
For uninitialized memory reads, which is one of the biggest classes of issues, valgrind can still be invaluable. MSan is one of the more difficult sanitizers to get set up and to clear of false positives. You typically need to transitively compile everything including dependencies, or annotate/hint lots of things, to get the signal-to-noise ratio right. Sometimes it's just easier/better/faster to run it under valgrind.
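To make that concrete, here's a toy uninitialized read (invented for illustration, not from real code). Compile it normally with `cc -g uninit.c -o uninit` and run the unmodified binary under `valgrind ./uninit`: Memcheck prints "Conditional jump or move depends on uninitialised value(s)" with a stack trace, whereas getting the same report from MSan means rebuilding the program (and ideally its dependencies) with `-fsanitize=memory`.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *flags = malloc(4 * sizeof *flags);   /* allocated but never initialized */
    if (flags == NULL)
        return 1;
    if (flags[2])                             /* branch depends on garbage */
        puts("flag set");
    free(flags);
    return 0;
}
```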
IME, sanitizers don't find enough issues compared to valgrind, and valgrind doesn't require me to recompile the world.
If I have anything that feels like UB, especially that I can trigger in a unit test, I run it through valgrind first. If the issue can only be triggered by running the whole system and valgrind slows it down too much, there's a good chance that sanitizers will also slow it down too much.
Valgrind works on the same binary executable that you'll be shipping, not some special build.
Among the sanitizers, the valuable ones are the general undefined behavior ones. Valgrind won't tell you about your integer overflows and whatnot, like that in b >> a, a turned out to be 32 or greater at run-time.
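A minimal sketch of that shift case (the file name, compile line, and default value are mine): UBSan flags it at run-time, while a valgrind run stays silent because no invalid memory is touched.

```c
/* clang -fsanitize=undefined shift.c -o shift && ./shift 40 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    unsigned b = 1u;
    int a = (argc > 1) ? atoi(argv[1]) : 40;  /* pretend this only shows up at run-time */

    unsigned r = b >> a;   /* undefined behaviour once a >= 32 (the width of b) */
    printf("%u\n", r);     /* UBSan: "shift exponent 40 is too large for 32-bit type" */
    return 0;
}
```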
I think they are complementary. Valgrind is especially useful because you don't need to recompile everything. In other words, you can test what you ship (albeit slowly), and that includes dependent libraries that may have their own bugs. On the other hand, ASAN is essential for fuzzing code.
Callgrind is certainly dispensable; just use a profiler. They are much, much faster, and more accurate as well. (Callgrind is using an idealized model of CPUs as they were around 1995, which doesn't match up all that well with how they work today.)
There are some situations where I find myself using Callgrind, in particular in situations where stack traces are hard to extract using a regular profiler. But overall, it's a tool that I find vastly overused.
perf does not provide me with the complete callstack; it's a sampling profiler.
In effectively all latency-sensitive contexts, sampling is worthless. 99.999999% of the time the program is waiting for IO, and then for a handful of microseconds there's a flurry of activity. That activity is the only part I care about and perf will effectively always miss it and never record it to completion.
I need to know the exact chain of events that leads to an object cache miss causing an allocation to occur, or exactly the conditions which led to a slow path branch, or which request handler is consistently forcing buffer resizes, etc.
I never need a profiler to tell me "memory allocation is slow" (which is what perf will give me). I know memory allocation is slow, I need to know why we're allocating memory.
perf is of course a sampling profiler, but perf record -g most definitely does provide you with a complete callstack, provided you have all your debug info in place.
perf or Intel VTune are the two standard choices AFAIK. Both have a certain learning curve, both are extremely capable in the right hands. (Well, on macOS you're pretty much locked to using Instruments; I don't know if Callgrind works there but would suspect it's an uphill battle.)
Callgrind is a CPU simulator that can output a profile of that simulation. I guess it's semantics whether you want to call that a profiler or not, but my point is that you don't need a simulator+profiler combo when you can just use a profiler on its own.
(There are exceptions where the determinism of Callgrind can be useful, like if you're trying to benchmark a really tiny change and are fine with the bias from the simulation diverging from reality, or if you explicitly care about call count instead of time spent.)
perf on the whole system, with the whole software stack compiled with frame pointers, plus flamegraphs for visualisation, is an essential starting point for understanding real-world performance problems.
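For reference, a typical whole-system capture of that kind looks something like this (the sampling rate and duration are arbitrary, and the last two scripts come from Brendan Gregg's FlameGraph repo, not from perf itself):

```sh
# sample all CPUs at 99 Hz for 30 seconds, recording call graphs
perf record -F 99 -a -g -- sleep 30

# fold the stacks and render an interactive SVG flame graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg
```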
Fortune 500’s rarely contribute but use a lot. I work for one. It’s usually a tussle between which vendor has managed to convince c-suite that software engineering is a dying discipline and their new genai tool is the utopia.
Most Fortune 500 c suite are bean counters with abysmal engineering or product know how. They can’t see past the next quarter earnings report. I doubt long term contribution to meaningful open source is on their list.
Eventually they brought in a tool so bad that we were able to show how much less productive it was. I'm not supposed to talk about it publicly though.
I remember adding 64-bit fixes to Valgrind to run it on the iOS Simulator when Macs went 64-bit, and I succeeded, it was incredible. Valgrind is one of the wonders created by humans.
(Well, its original Nordic pronunciation probably has a more o-like or anyway different 'a', a pointier 'i' sound, and a stronger 'r' than in German. But it seems to be accepted practice between Germanic languages to pronounce everything as if it were your native language - it's easy and ~always intelligible.)
If the “a” is supposed to be as in “value” (i.e. closer to ä in some but not all German dialects), that doesn't match typical German pronunciation, does it? (The -grind part is fine.)
at $OLD_JOB I literally joked that we used a fork of valgrind just to fix the pronunciation. of course, our fork was also years behind the official version.
these days, since I am doing bare metal embedded, I barely use valgrind; but it is game changing in the situations where it's useful.
A key difference, to me at least, is that it's not a made up word (well, no more than any other established word). Ambiguity over its pronunciation comes from lack of familiarity with the source language. That's fine, we all need to learn, but to insist the native speaker is pronouncing it wrong seems a bit odd to me.
One thing I've never understood regarding valgrind - other than intentional leaks, are suppression files ever actually used due to an actual false positive, or is it just due to the bug being in someone else's code that's annoying to fix?
"We implement our own memory allocator" is no excuse; the primitives you use should be hooked by valgrind so at most there should be false negatives due to your allocations being larger than the user-facing ones ...
> are suppression files ever actually used due to an actual false positive,
There used to be one in LuaJIT because it had an optimized string comparison that compared outside of the allocation (which is allowed by the OS as long as you don't cross a page boundary, which LuaJIT's allocation algorithm made sure it never did)
First and foremost, suppressions can be used for more than leaks. For example, for "Conditional jump or move depends on uninitialised value(s)", which, yes, there very much are false positives for, because of e.g. tricky optimizations LLVM performs that valgrind can't handle.
But even for leaks, you can also have intentional leaks that valgrind will flag but that you can't really do anything about. One example is how using `putenv` can lead to you having to leak memory on purpose. There are many other cases.
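For concreteness, a suppression for an intentional leak of that kind looks roughly like this (the name and frames below are placeholders; in practice you let --gen-suppressions=all write it for you, and the kind would be Memcheck:Cond rather than Memcheck:Leak for the uninitialised-value warnings):

```
{
   intentional-putenv-leak
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   fun:set_some_env_var
   ...
}
```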
Yes, we use suppression files to cover expected leaks, as well as bugs in other code that is annoying to fix.
For example the OCaml suppressions here are because OCaml (as expected) doesn't free static allocations at program exit. You can also see some real bugs we found:
Maybe it's just me, but I think people tend to drive right past memory leaks nowadays because the footprint of binaries has grown, but the size of RAM has grown even more. Proper big.
I routinely run on 1TB memory 128 core racks, and I don't worry about free() much.
I'm not humblebragging; I actually think this is lazy (!) and I would benefit from thinking more explicitly about the memory consequences of what I do. But there are things I used to freak out about growing to GBs that I now regard as an investment in future me, running the same thing: it's very likely I've already got it in a hash structure of some kind, so I just add columns to the dict() elements.
Down the other end, I recall some friends taking code which I expected would have to run on a major rack host and building it for a small-memory rPi, and they said Rust did that: it let them shed the overhang of other languages' expectations about runtime size and be explicit about use and free on the heap.
In short, in my discipline, if you don’t free your memory you can eat a TB in 10 minutes. Moreover, if you fragment memory too much, then you’ll start to get segfaults when you want large segments of contiguous memory.
Lastly, the same fragmentation will cause advanced libraries like Eigen to use many small memory areas, and jumping between them kills locality, hence causing you performance losses.
I’m running tasks on (and administering) clusters with similar resources to yours, yet I always treat them like 486DXs with 4MB RAM, because I can’t restart them every week.
I think the relative popularity of managed memory languages has grown and continues to grow. Even C++ is largely "managed" via mechanisms like unique_ptr, shared_ptr, or the standard collections. (Using new/delete directly is more or less a code smell these days.) C will always have a place, but despite working in it from 2011-2021 or so, I just don't use it at all these days. With "owned" allocations, you just don't need explicit free(). You're more likely to see memory leaks via unbounded collection size than actual unreferenced pointers.
I get it. And I do wish some systems were quicker to let the OS clean up after them on exit. For instance, the other day I loaded a CSV with about 20 columns and 10M rows into a list of Python tuples so I could poke at them a bit. That took a short while and I was OK with that. I was more surprised when I closed the REPL and it hung for way longer than expected as it freed a couple hundred million objects.
Now it could’ve been the case that some of those objects had a __del__ method that needed to be called or something. Absent that case, I’d have preferred the process just exited and been done with it.
If that were a program that ran synchronously from a shell script, the shutdown GC time would’ve been nearly as long as the data loading time. Maybe Python could benefit from a fast_shutdown_GC function that only calls free() if an object is something with a non-trivial delete method. Otherwise, skip it and let the OS do its voodoo it does so well.
I’m picking on Python here because that’s where I last saw this. The basic idea applies in lots of other cases though.
Well not at that scale, but it isn't uncommon just to malloc/HeapAlloc a whole pile of RAM up front, spew your data structures into it and free it when it crashes / exits. It is a perfectly valid strategy.
In fact I did this a couple of years back on a project I was maintaining in win32. It had a complicated memory management scheme for a GUI process that kept screwing up. It turned out it couldn't ever use more than 2MB of RAM, so I just allocated 10MB up front at the start of the process, wrote a simple incrementing-pointer allocator in that block, and freed it at the end, like a short-lived pool. Solved all the issues. The machines it's running on all have at least 16GB of RAM so it's not exactly breaking the bank!
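Roughly this shape, I'd imagine; a bump allocator like that is only a handful of lines (the names and alignment choice are illustrative, not the original win32 code):

```c
#include <stdlib.h>

static char  *pool;
static size_t pool_size, pool_used;

void pool_init(size_t bytes) {          /* one upfront allocation */
    pool = malloc(bytes);
    pool_size = bytes;
    pool_used = 0;
}

void *pool_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;         /* keep returned pointers 16-byte aligned */
    if (pool == NULL || pool_used + n > pool_size)
        return NULL;
    void *p = pool + pool_used;
    pool_used += n;                     /* just bump the counter; no per-object free */
    return p;
}

void pool_destroy(void) {               /* the only free(), at the end of the run */
    free(pool);
    pool = NULL;
}
```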
For short-lived programs perhaps. For things like servers that may run for months, leaks are absolutely important.
Even memory fragmentation is a problem for us. We once had a chat server that would crash every 3 months, which we tracked down to memory being fragmented by a less than optimal malloc implementation in glibc. (Since the glibc fix was in the "too hard" category, we reluctantly decided to schedule a restart every month.)
No. Purify was the commercial tool that inspired Valgrind. It came from Pure Software, Reed Hastings' first company, and I came across it working at a bank in 1996.
I moved to Java with the millennium, and missed the rise of Valgrind. I'm just happy that everyone has good tools now. Writing C in those days was a rough business.
It's about time to have hardware support for valgrind, i.e. the synthetic CPU provided by libVEX being partially moved to the hardware level, eliminating the dynamic translation.
It works okay with Rust, but it's not really needed, except when integrating C code with Rust.
For verifying unsafe code, Rust has a MIRI interpreter that catches UB more precisely, e.g. it knows Rust's aliasing rules, precise object boundaries, and has access to lifetimes (they don't survive compilation).
Non-deliberate leaking of memory in Rust is not possible for the majority of Rust types. In safe Rust it requires a specific combination of a refcounted type that uses interior mutability which contains a type that makes the refcounted smart pointer recursive. Types that meet all three conditions at once are niche.
The only annoyance/incompatibility is that Valgrind complains that global variables are leaked. Rust does that intentionally, because static destructors have the same problem as SIOF[1] in reverse, plus tricky interactions with atexit mean there's no reliable way to destruct arbitrary globals.
Hard disagree that it's not needed. MIRI's ability to verify non-trivial examples of code is quite limited, so I generally just discount it. Valgrind also goes beyond ASan+MSan with heap and CPU profiling, which can be useful if you're trying to extract every last bit of performance.
> As a result, I don’t have much use for Memcheck, but I still use Cachegrind, Callgrind, and DHAT all the time. I’m amazed that I’m still using Cachegrind today, given that it has hardly changed in twenty years. (I only use it for instruction counts, though. I wouldn’t trust the icache/dcache results at all given that they come from a best-guess simulation of an AMD Athlon circa 2002.)
The few times I used the cache simulator and compared it to (Linux) perf, I found reasonable results. Can someone recommend a better cache simulator?