I found that a lot of the problems I had been having with mutexes, stem from the...

torginus · 2025-11-18T10:09:13 1763460553

Mutexes suffer from a host of problems, and imo are not a very good concurrency primitive - they were designed to turn single-threaded code into multi-threaded. With today's 8+ cores in most systems, usually a single point of contention quickly becomes a problem.

They're liable to deadlocks/livelocks, and sometimes not only with other explicitly Mutex-like things (it might happen some library you use has a lock hidden deep inside).

They're also often backed by OS primitives (with big overheads) with inconsistent behaviors between platforms (spinlocks, waiting etc). We've run into an issue with .NET where its version of Mutex didn't wake up the blocked thread on Linux as fast as on Windows, meaning we needed about 100x the time to serve a request because the thread was sleeping too long.

There are questions like when to use spinlocks and when to fall back to a wait/sleep, which unfortunately the developer has to answer.

Not assigning blame here, just pointing out that threading primitives and behaviors don't translate perfectly between OSes.

Multi-threading is hard, other solutions like queues suffer from issues like backpressure.

That's why I'm skeptical about Rust's fearless concurrency promise - none of these bugs are solved by just figuring out data races - which are a huge issue, but not the only one.

adwn · 2025-11-18T10:51:08 1763463068

Your view on mutex performance and overhead is outdated, at least for the major platforms: The Rust standard library mutex only requires 5 bytes, doesn't allocate, and only does a syscall on contention. The mutex implementation in the parking_lot library requires just 1 byte per mutex (and doesn't allocate and only does a syscall on contention). This enables very fine-grained, efficient locking and low contention.

torginus · 2025-11-18T11:59:07 1763467147

These are OS primitives I'm talking about - I haven't checked out the standard library version but the parking_lot version uses a spinlock with thread sleep when the wait times get too high - it has no way of getting notified when the mutex gets unblocked nor does it support priority inversion.

It seems it's optimized for scenarios with high performance compute heavy code, and short critical sections.

These assumptions may let it win benchmarks, but don't cover the use cases of all users. To illustrate why this is bad, imagine you have a Mutex-protected resource that becomes available after 10us on average. This lock spins 10 times checking if it has become available (likely <1us), then yields the thread. The OS (let's assume Linux) won't wake up the thread until the next scheduler tick, and it's under no obligation to do so even then (and has no idea it should). But even best-case, you're left waiting 10ms, which is a typical scheduler tick.

In contrast, OS-based solutions are expensive but not that expensive; let's say they add 1us to the wait. Then you would wait 11us for the resource.

A method call taking 10ms and one taking 15 us is a factor of 60x, which can potentially kill your performance.

You as the user of the library are implicitly buying into these assumptions which may not hold for your case.

There's also nothing in Rust that protects you from deadlocks with 100% certainty. You can fuzz them out, and use helpers, but you can do that in any language.

So you do need to be mindful of how your mutex works, if you want to build a system as good as the one it replaces.

adwn · 2025-11-18T12:35:36 1763469336

> […] but don't cover the use cases of all users.

No single concurrency primitive covers all use cases. I was addressing your misconceptions about mutex performance and overhead, not whether mutexes are the best solution to your particular problem.

> […] it has no way of getting notified when the mutex gets unblocked […] The OS (let's assume Linux) won't wake up the thread until the next scheduler tick, and it's under no obligation to do so even then (and has no idea it should).

You've misunderstood the parking_lot implementation. When thread B tries to lock a mutex that's currently locked by thread A, then, after spinning a few cycles, thread B "parks" itself, i.e., it asks the kernel to remove it from the Runnable task queue. On Linux, this is done using the futex syscall. When thread A unlocks the mutex, it detects that another thread is waiting on that mutex. Thread A takes one thread from the queue of waiting threads and "unparks" it, i.e., it asks the kernel to move it into the Runnable task queue. The kernel is notified immediately, and if there's a free CPU core available, will tend to dispatch the thread to that core. On a non-realtime OS, there's no guarantee how long it takes for an unblocked thread to be scheduled again, but that's the case for all concurrency primitives.
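The park/unpark handshake can be sketched with the standard library's own `std::thread::park` and `Thread::unpark`. This is not parking_lot's internal code, only an analogous mechanism; the function name and the 5 ms hold time are assumptions made for illustration:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical demo: thread B parks until thread A signals it.
/// Returns how long B spent waiting.
fn wait_until_unparked() -> Duration {
    let released = Arc::new(AtomicBool::new(false));
    let released_b = Arc::clone(&released);

    // Thread B: the waiter. It parks instead of spinning or sleeping
    // for a fixed interval.
    let waiter = thread::spawn(move || {
        let start = Instant::now();
        // park() may wake spuriously, so re-check the flag in a loop.
        while !released_b.load(Ordering::Acquire) {
            thread::park();
        }
        start.elapsed()
    });

    // Thread A: "holds the lock" for a while, then releases it and
    // explicitly wakes the waiter.
    thread::sleep(Duration::from_millis(5));
    released.store(true, Ordering::Release);
    waiter.thread().unpark();

    waiter.join().unwrap()
}
```

The waiter is woken by an explicit notification from the releasing thread, not by a timer. How quickly it actually runs again is still up to the scheduler.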

Someone · 2025-11-18T13:06:57 1763471217

> A method call taking 10ms and one taking 15 us is a factor of 60x

667 (a thousand 15μs calls take 15ms)

galangalalgol · 2025-11-18T12:18:03 1763468283

The best practices I adopt for rust avoid the use of mutex whenever possible precisely because of how easy a deadlock is. It turns out it is always possible. There are entire languages that disallow any mutable state, much less shared mutable state. The question becomes how much performance are you willing to sacrifice to avoid the mutex. By starting with no shared mutable state and adding it when something is too slow, you end up with very few mutexes.

adwn · 2025-11-18T12:42:33 1763469753

> avoid the use of mutex […] It turns out it is always possible

How would you handle the archetypical example of a money transfer between two bank accounts, in which 100 units of money need to be subtracted from one account and atomically added to another account, after checking that the first account contains at least 100 units?
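For reference, a minimal sketch of the mutex-based version of that transfer, assuming one `std::sync::Mutex` per account and a made-up `Account` type with a numeric `id`. Locks are always taken in ascending `id` order so that two opposite transfers cannot deadlock each other:

```rust
use std::sync::Mutex;

// Hypothetical account type, for illustration only.
struct Account {
    id: u32,
    balance: Mutex<u64>,
}

/// Moves `amount` from `from` to `to` if `from` has enough funds.
/// Returns false (and changes nothing) otherwise.
fn transfer(from: &Account, to: &Account, amount: u64) -> bool {
    if from.id == to.id {
        return false; // locking the same mutex twice would self-deadlock
    }
    // Always lock the account with the lower id first.
    let (mut from_bal, mut to_bal) = if from.id < to.id {
        let f = from.balance.lock().unwrap();
        let t = to.balance.lock().unwrap();
        (f, t)
    } else {
        let t = to.balance.lock().unwrap();
        let f = from.balance.lock().unwrap();
        (f, t)
    };
    // Check and update happen while both locks are held.
    if *from_bal < amount {
        return false;
    }
    *from_bal -= amount;
    *to_bal += amount;
    true
}
```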

galangalalgol · 2025-11-18T13:32:52 1763472772

The simplest pure functional way would be to copy the whole database instantiating a new copy with the desired change if the condition was met. That obviously doesn't scale, which is where the performance thing comes in. A still pure way would be to use a persistent tree or hash mapped trie that allows efficient reuse of the original db. There are times a purely functional approach doesn't perform well enough, but even with large scale entity component type systems in both rust and c++, the number of times I've had to use a mutex to be performant is small. Atomic is much more common, but still not common. Persistent data structures alleviate most of the need.
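A minimal sketch of the "copy the whole database" variant, assuming the database is just a `HashMap` from account id to balance (the types and function name are made up for illustration). A persistent map would replace the `clone()` with structural sharing; the shape of the function stays the same:

```rust
use std::collections::HashMap;

/// Pure transfer: never mutates `db`. Returns a new snapshot if the
/// transfer is allowed, or None if it is rejected.
fn transfer(
    db: &HashMap<u32, u64>,
    from: u32,
    to: u32,
    amount: u64,
) -> Option<HashMap<u32, u64>> {
    let from_balance = *db.get(&from)?;
    let to_balance = *db.get(&to)?;
    if from == to || from_balance < amount {
        return None;
    }
    let new_to_balance = to_balance.checked_add(amount)?;

    // Full copy: simple, but O(n) per transfer, which is the scaling
    // problem mentioned above.
    let mut next = db.clone();
    next.insert(from, from_balance - amount);
    next.insert(to, new_to_balance);
    Some(next)
}
```

Readers holding the old snapshot keep seeing a consistent state. Publishing the new snapshot to other threads is a separate step that this sketch does not cover.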

pas · 2025-11-18T17:46:46 1763488006

pure or not eventually this comes down to durability, no?

and the way to do it is to either have some kind of single-point-of-control (designated actor or single-threaded executor) or mark the data (ie. use some concurrency control primitive either wrapping the data or in some dedicated place where the executors check [like JVM's safepoints])

using consistent hashing these hypothetical accounts could be allocated to actors and then each transaction is managed by the actor of the source (ie. where the money is sent from, where the check needs to happen), with their own durable WAL, and periodically these are aggregated

(of course then the locking is hidden in the maintenance of the hashring as eating philosophers are added/removed)

kragen · 2025-11-18T21:16:14 1763500574

Eliminating the durability constraint doesn't make it any easier to program, just easier to get good performance on.

Distributing accounts among different actors, without two-phase commit or its moral equivalent, enables check kiting.

vrmiguel · 2025-11-18T20:48:12 1763498892

Since the thread mentions Rust: in Rust, you often replace Mutexes with channels.

In your case, you could have a channel where the Receiver is the only part of the code that transfers anything. It'd receive a message Transfer { from: Account, to: Account, amount: Amount } and do the required work. Any other threads would therefore only have copies of the Sender handle. Concurrent sends would be serialized through the queue's buffering.
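A minimal sketch of that setup using `std::sync::mpsc`. The `Transfer` message mirrors the one described above; the type aliases, the function name, and the rule that under-funded transfers are silently skipped are assumptions for illustration:

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

type Account = u32;
type Amount = u64;

struct Transfer {
    from: Account,
    to: Account,
    amount: Amount,
}

/// Runs one "bank" thread that owns all balances, plus one client
/// thread per batch. Returns the balances after every batch is sent.
fn run_transfers(
    mut balances: HashMap<Account, Amount>,
    batches: Vec<Vec<Transfer>>,
) -> HashMap<Account, Amount> {
    let (tx, rx) = mpsc::channel::<Transfer>();

    // The receiver is the only code that ever touches `balances`.
    let bank = thread::spawn(move || {
        for t in rx {
            let available = balances.get(&t.from).copied().unwrap_or(0);
            if available >= t.amount {
                *balances.entry(t.from).or_insert(0) -= t.amount;
                *balances.entry(t.to).or_insert(0) += t.amount;
            }
        }
        balances // channel closed: hand the final state back
    });

    // Clients only hold clones of the Sender.
    let clients: Vec<_> = batches
        .into_iter()
        .map(|batch| {
            let tx = tx.clone();
            thread::spawn(move || {
                for t in batch {
                    tx.send(t).expect("bank thread should still be running");
                }
            })
        })
        .collect();

    drop(tx); // the bank loop ends once every Sender is gone
    for c in clients {
        c.join().unwrap();
    }
    bank.join().unwrap()
}
```

Two clients racing to move the same 100 units are serialized by the queue, so only the first one to arrive succeeds.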

I'm not suggesting this is an ideal way of doing it

adwn · 2025-11-19T06:49:43 1763534983

What you're describing is called the "Actor model"; in your example, the receiver is an actor that has exclusive control over all bank accounts.

The actor model reaches its limits as soon as you need transactions involving two or more actors (for example, if you need to atomically operate on both the customers actor and the bank accounts actor). Then you can either pull all involved concerns into a single actor, effectively giving up on concurrency, or you can implement a locking protocol on top of the actor messages, which is just mutexes with extra steps.

yencabulator · 2025-11-25T03:11:31 1764040291

> These are OS primitives I'm talking about - I haven't checked out the standard library version but the parking_lot version uses a spinlock with thread sleep when the wait times get too high - it has no way of getting notified when the mutex gets unblocked nor does it support priority inversion.

Uhh no, everyone in Linux userspace uses futexes these days to wait on a contended lock.

https://github.com/Amanieu/parking_lot/blob/03d36d62fecbd85c...

charleslmunger · 2025-11-19T01:38:10 1763516290

Unfortunately the standard library mutex is designed in such a way that condition variables can't use requeue, and so require unnecessary wakeups. I believe parking_lot doesn't have this problem.

ahoka · 2025-11-18T14:19:15 1763475555

It's called a futex and has been supported by both Linux and Windows for ages.

adwn · 2025-11-18T14:23:53 1763475833

The 1-byte-per-mutex parking_lot implementation works even on systems that don't provide a futex syscall or equivalent.

magicalhippo · 2025-11-18T11:26:40 1763465200

How does it avoid cache contention with just a few bytes per mutex? That is, multiple mutex instances sharing a cache line. Say I have a structure with multiple int32 counters protected by their own mutex.

adwn · 2025-11-18T11:59:37 1763467177

Cache contention is (mostly) orthogonal to your locking strategy. If anything, fine-grained locking has the potential to improve cache contention, because

1) the mutex byte/word is more likely to be in the same cache line as the data you want to access anyway, and

2) different threads are more likely to write to mutex bytes/words in different cache lines, whereas in coarse-grained locking, different threads will fight for exclusive access over the cache line containing that one, global mutex.

@magicalhippo: Since I'm comment-rate-throttled, here's my answer to your question:

Typically, you'd artificially increase the size and alignment of the structure:

    #[repr(align(64))]
    struct Status {
        counter: Mutex<u32>,
    }

This struct now has an alignment of 64, and is also 64 bytes in size (instead of just the 4+1 required for Mutex<u32>), which guarantees that it's alone in the cache line. This is wasteful from a memory perspective, but can be worth it from a performance perspective. As often when it comes to optimization, it very heavily depends on the specific case whether this makes your program faster or slower.
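A quick way to check the layout claim, sketched with `std::sync::Mutex` and assuming a 64-byte cache line (typical on x86-64, but not universal). The helper function is made up for illustration:

```rust
use std::mem::{align_of, size_of};
use std::sync::Mutex;

#[repr(align(64))]
struct Status {
    counter: Mutex<u32>,
}

/// Returns (size of one Status, alignment of Status, size of four
/// Status values stored back to back).
fn status_layout() -> (usize, usize, usize) {
    (
        size_of::<Status>(),
        align_of::<Status>(),
        size_of::<[Status; 4]>(),
    )
}
```

Because the size is rounded up to the alignment, consecutive array elements start 64 bytes apart, so each one gets its own cache line when lines are 64 bytes.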

magicalhippo · 2025-11-18T12:19:26 1763468366

> different threads are more likely to write to mutex bytes/words in different cache lines

If you got small objects and sequential allocation, that's not a given in my experience.

Like in my example, the ints could be allocated one per thread to indicate some per thread status, and the main UI thread wants to read them every now and then hence they're protected by a mutex.

If they're allocated sequentially, the mutexes end up sharing cache lines and hence lead to effective contention, even though there's almost no "actual" contention.

Yes yes, for a single int you might want to use an atomic variable but this is just for demonstration purposes. I've seen this play out in real code several times, where instead of ints it was a couple of pointers say.

I don't know Rust though, so just curious.

gpderetta · 2025-11-18T12:47:46 1763470066

The issue might be allocating the int contiguously in the first place. No language magic is going to help you avoid thinking about mechanical sympathy.

And allocating the int contiguously might actually be the right solution if the cost of sporadic false sharing is less than the cost of wasting memory.

There's no silver bullet.

magicalhippo · 2025-11-18T12:50:25 1763470225

But the mutex encapsulates the int, so if the mutex ensured it occupied a multiple of cache lines, there would be no contention. At the very small cost of a few bytes of memory.

gpderetta · 2025-11-18T12:54:27 1763470467

the mutex forcing alignment would be extremely wasteful. FWIW, I have used 1-bit spin locks.

torginus · 2025-11-18T12:02:31 1763467351

By not avoiding it. And a year later you get to write a blog post about how you discovered and fixed this phenomenon hitherto unknown to computer science.

kiitos · 2025-11-18T22:14:33 1763504073

this is an overly simplistic and somewhat reductive perspective on a pretty fundamental concept/primitive

> they were designed to turn single-threaded code into multi-threaded

not really

> usually a single point of contention quickly becomes a problem.

not generally, no

> They're liable to deadlocks/livelocks,

deadlocks/livelocks are orthogonal to any specific primitive

> They're also often backed by OS primitives (with big overheads) with inconsistent behaviors between platforms (spinlocks, waiting etc).

the mutex as a primitive is orthogonal to any specific implementation...

etc. etc.

kragen · 2025-11-18T13:23:14 1763472194

Traditionally traditionally, monitors were declared together with the data they contained, and the compiler enforced that the data was not accessed outside the monitor. Per Brinch Hansen wrote a rather bitter broadside against Java's concurrency model when it came out.

csb6 · 2025-11-18T17:10:40 1763485840

Was this the article?

http://brinch-hansen.net/papers/1999b.pdf

kragen · 2025-11-18T17:55:29 1763488529

This is a toned-down, but still scathing, version of what I remember reading.

Nauxuron · 2025-11-18T09:09:54 1763456994

> You can't even access the data without locking the mutex.

It's even nicer than that: you can actually access data without locking the mutex, because while you hold a mutable borrow to the mutex, Rust statically guarantees that no one else can acquire locks on the mutex.

https://doc.rust-lang.org/std/sync/struct.Mutex.html#method....

jstimpfle · 2025-11-18T09:54:33 1763459673

Given a data item of non-thread safe type (i.e. not Mutex<T> etc), the borrow checker checks that there's only ever one mutable reference to it. This doesn't solve concurrency as it prevents multiple threads from even having the ability to access that data.

Mutex is for where you have that ability, and ensures at runtime that accesses get serialized.
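A minimal sketch of that runtime serialization, assuming `std::sync::{Arc, Mutex}` (the function name is made up for illustration). Each thread gets a shared handle, and the mutex is what lets each of them mutate the value in turn:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns `threads` workers that each add 1 to a shared counter
/// `increments` times, then returns the final count.
fn count_in_parallel(threads: usize, increments: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..increments {
                    // The only route to the u64 is through lock().
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}
```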

dwattttt · 2025-11-18T10:41:10 1763462470

The maybe unexpected point is that if you know you're the only one who has a reference to a Mutex (i.e. you have a &mut), you don't need to bother locking it; if no one else knows about the Mutex, there's no one else who could lock it. It comes up when you're setting things up and haven't shared the Mutex yet.

This means no atomic operations or syscalls or what have you.

jstimpfle · 2025-11-18T10:47:07 1763462827

Do you have an example? I don't program in Rust, but I imagine I'd rarely get into that situation. Either my variable is a local (in a function) in which case I can tell pretty easily whether I'm the only one accessing it. Or, the data is linked globally in a data structure and the only way to access it safely is by knowing exactly what you're doing and what the other threads are doing. How is Rust going to help here? I imagine it's only making the optimal thing harder to achieve.

I can see that there are some cases where you have heap-data that is only visible in the current thread, and the borrow checker might be able to see that. But I can imagine that there are at least as many cases where it would only get in the way and probably nudge me towards unnecessary ceremony, including run-time overhead.

adwn · 2025-11-18T11:03:00 1763463780

When you construct an object containing a mutex, you have exclusive access to it, so you can initialize it without locking the mutex. When you're done, you publish/share the object, thereby losing exclusive access.

    struct Entry {
        msg: Mutex<String>,
    }
    ...
    // Construct a new object on the stack:
    let mut object = Entry { msg: Mutex::new(String::new()) };
    // Exclusive access, so no locking needed here
    // (with std::sync::Mutex, get_mut() returns a LockResult, hence the unwrap):
    let mutable_msg = object.msg.get_mut().unwrap();
    format_message(mutable_msg, ...);
    ...
    // Publish the object by moving it somewhere else, possibly on the heap:
    global_data.add_entry(object);
    // From now on, accessing the msg field would require locking the mutex

jstimpfle · 2025-11-18T14:17:57 1763475477

Initialization is always special. A mutex can't protect that which doesn't exist yet. The right way to initialize your object would be to construct the message first, then construct the composite type that combines the message with a mutex. This doesn't require locking a mutex, even without any borrow checker or other cleverness.

adwn · 2025-11-18T14:36:16 1763476576

Dude, it's a simplified example, of course you can poke holes into it. Here, let me help you fill in the gaps:

    let mut object = prepare_generic_entry(general_settings);
    let mutable_msg = object.msg.get_mut().unwrap();
    do_specific_message_modification(mutable_msg, special_settings);

The point is that there are situations where you have exclusive access to a mutex, and in those situations you can safely access the protected data without having to lock the mutex.

jstimpfle · 2025-11-18T14:48:58 1763477338

Sorry, I don't find that convincing but rather contrived. This still seems like "constructor" type code, so the final object is not ready and locking should not happen before all the protected fields are constructed.

There may be other situations where you have an object in a specific state that makes it effectively owned by a thread, which might make it possible to forgo locking it. These are all very ad-hoc situations, most of them would surely be very hard to model using the borrow checker, and avoiding a lock would most likely not be worth the hassle anyway.

Not sure how this can help me reduce complexity or improve performance of my software.

imtringued · 2025-11-18T11:54:37 1763466877

>I don't program in Rust, but I imagine I'd rarely get into that situation.

Are you sure? Isn't having data be local to a thread the most common situation, with data sharing being the exception?

>Or, the data is linked globally in a data structure and the only way to access it safely is by knowing exactly what you're doing and what the other threads are doing.

That's exactly what the borrow checker does. It tracks how many mutable references you have to your data structure at compile time. This means you can be sure what is local and what is shared.

Meanwhile without the borrow checker you always have to assume there is a remote probability that your mental model is wrong and that everything goes wrong anyways. That's mentally exhausting. If something goes wrong, it is better to only have to check the places where you know things can go wrong, rather than the entire code base.

jstimpfle · 2025-11-18T14:05:47 1763474747

I use lots of locals but only to make my code very "local", i.e. fine-grained, editable and clear, using lots of temporary variables. No complicated expressions. That's all immutable data (after initialization). I rarely take the address of such data but make lots of copies. If I take its address, then as an immutable pointer, maybe not in the type system but at least in spirit.

I keep very little state on the stack -- mostly implicit stuff like mutex lock / mutex unlock. By "state" I mean object type things that get mutated or that need cleanup. I always have a "database schema" of my global state in mind. I define lots of explicit struct types instead of hiding state as locals in functions. I've found this approach of minimizing local state to be the right pattern because it enables composability. I'm now free to factor functionality into separate functions. I can much more freely change and improve control flow. With this approach it's quite rare that I produce bugs while refactoring.

So yes, I have lots of locals but I share basically none of them with other threads. Also, I avoid writing any code that blocks on other threads (other than maybe locking a mutex), so there's another reason why I would not intentionally share a local with another thread. Anything that will be shared with another thread should be allocated on the heap just for the reason that we want to avoid blocking on other threads.

In that sense, the borrow checker is a tool that would allow me to write code more easily that I never wanted written in the first place.

dwattttt · 2025-11-18T20:47:05 1763498825

It's relevant when you have more complex objects, such as ones that contain independent mutexes that lock different sections of data.

You want the object to present its valid operations, but the object could also be constructed in single or multithreaded situations.

So you'd offer two APIs; one which requires a shared reference, and internally locks, and a second which requires a mutable reference, but does no locking.

Internally the shared reference API would just lock the required mutexes, then forward to the mutable reference API.
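A minimal sketch of that two-API shape, assuming `std::sync::Mutex`; the `Counters` type and its method names are made up for illustration. Since a `&self` method cannot literally call a `&mut self` method, both paths here forward to a shared helper that works on the unlocked data:

```rust
use std::sync::Mutex;

// Hypothetical object with two independently locked sections.
struct Counters {
    hits: Mutex<u64>,
    misses: Mutex<u64>,
}

// The actual operation; it knows nothing about locking.
fn bump(slot: &mut u64) {
    *slot += 1;
}

impl Counters {
    fn new() -> Self {
        Counters { hits: Mutex::new(0), misses: Mutex::new(0) }
    }

    /// Shared-reference API: locks the relevant mutex, then forwards.
    fn record(&self, hit: bool) {
        let field = if hit { &self.hits } else { &self.misses };
        bump(&mut field.lock().unwrap());
    }

    /// Mutable-reference API: exclusive access is proven at compile
    /// time, so get_mut() reaches the data without locking.
    fn record_mut(&mut self, hit: bool) {
        let field = if hit { &mut self.hits } else { &mut self.misses };
        bump(field.get_mut().unwrap());
    }

    /// Reads both sections without locking; needs exclusive access.
    fn totals(&mut self) -> (u64, u64) {
        (*self.hits.get_mut().unwrap(), *self.misses.get_mut().unwrap())
    }
}
```

During single-threaded setup you would call `record_mut`; once the object is shared (for example behind an `Arc`), only `record` is reachable.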

mgaunard · 2025-11-18T09:01:16 1763456476

I find it better to model that as an Actor than a mutex, but I guess it's inherently the same thing, except the actor also allows asynchronous operations.

gpderetta · 2025-11-18T09:46:53 1763459213

You can go full circle and also make operations on a mutex asynchronous. Hence the realization that message passing and shared memory are truly dual.

mgaunard · 2025-11-18T10:21:22 1763461282

The very idea of a mutex is that it is synchronous. You wait until you can acquire the mutex.

If it's asynchronous, it's not a mutex anymore, or it's just used to synchronously set up some other asynchronous mechanism.

gpderetta · 2025-11-18T11:34:41 1763465681

A mutex is a way to guarantee mutual exclusion, nothing more, nothing less. You can recover synchronous behaviour if you really want:

    synchronized<Something> something;
    ...
    co_await something.async_visit([&](Something& x) {
        /* critical section here */ 
    });

mgaunard · 2025-11-18T12:04:31 1763467471

that isn't a mutex, that's delegating work asynchronously and delegating something else to run when it is complete (the implicitly defined continuation through coroutines).

In systems programming parlance, a mutex is a resource which can be acquired and released, acquired exactly once, and blocks on acquire if already acquired.

gpderetta · 2025-11-18T12:07:47 1763467667

Do a CPS transform of your typical std::mutex critical section and you'll find they are exactly the same.
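As a rough illustration in Rust rather than C++ (a blocking sketch using `std::sync::Mutex`, with made-up function names, not the async version discussed above): the same critical section written in direct style and with the body passed in as a continuation.

```rust
use std::sync::Mutex;

/// Direct style: lock, run the critical section inline, unlock on drop.
fn push_direct(m: &Mutex<Vec<u32>>) -> usize {
    let mut v = m.lock().unwrap();
    v.push(1);
    v.len()
}

/// Continuation-passing style: the critical section is handed in as a
/// closure and runs while the lock is held.
fn with_lock<T, R>(m: &Mutex<T>, k: impl FnOnce(&mut T) -> R) -> R {
    let mut guard = m.lock().unwrap();
    k(&mut *guard)
}

/// Same critical section as push_direct, expressed through with_lock.
fn push_cps(m: &Mutex<Vec<u32>>) -> usize {
    with_lock(m, |v| {
        v.push(1);
        v.len()
    })
}
```

Both versions give the same mutual exclusion; only where the "rest of the critical section" lives differs.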

mgaunard · 2025-11-18T12:17:56 1763468276

They're not, the interactions with the memory model are different, as are the guarantees.

CPS shouldn't be able to deadlock for example?

gpderetta · 2025-11-18T12:28:34 1763468914

CPS can trivially deadlock for all meaningful definitions of deadlock.

Would you consider this a mutex?

   async_mutex mux;

   co_await mux.lock();
   /* critical section */
   co_await mux.unlock();

What about:

   my_mutex mux;

   {
      std::lock_guard _{mux};
      /* critical section */
   }

where the code runs in a user space fiber.

Would you consider boost synchronized a mutex?

Don't confuse the semantics with the implementation details (yes async/await leaks implementation details).

mgaunard · 2025-11-19T11:13:14 1763550794

You only achieved a deadlock by re-introducing mutexes.

gpderetta · 2025-11-19T11:33:45 1763552025

Given:

    Something something;
    async_mutex mtx;
    void my_critical_section(Something&);

1:

    await mtx.lock();
    my_critical_section(something);
    await mtx.unlock();

2:

    auto my_locked_critical_section() {
      await mtx.lock();
      my_critical_section(something);
      await mtx.unlock();
    }
    ...    
    await my_locked_critical_section();

3:

    auto locked(auto mtx, auto critical_section) {
      await mtx.lock();
      critical_section();
      await mtx.unlock();
    }

    ...    
    await locked(mtx, [&]{ my_critical_section(something); });

4:

    template<class T>
    struct synchronized {
       async_mutex mtx;
       T data;
       auto async_visit(auto fn) { locked(mtx, [fn,&data]{ fn(data); }); }
    };

    synchronized<Something> something;
    await something.async_visit([](Something& data) { my_critical_section(data); });

If 1 is a mutex, at which point does it stop being a mutex? Note that 4 is my initial example.

mgaunard · 2025-11-19T21:14:10 1763586850

it's a mutex iff it's acquiring a resource exclusively.

which you don't need to do for synchronization of coroutines since you can control in which order things are scheduled and whether that's done concurrently or not.

gpderetta · 2025-11-20T09:35:37 1763631337

Not if you have multiple schedulers. Case in point: asio.strand or execution::on [1].

And even with one scheduler it makes sense to explicitly mark your critical sections.

Really, at the end of the day the primary purpose of a mutex is serialization of all operations on some data. The blocking behaviour is just a way to implement it.

[1] https://en.cppreference.com/w/cpp/execution/on.html

mgaunard · 2025-11-20T19:38:24 1763667504

Mutexes are a problematic pattern that doesn't compose, see the article.

indigo945 · 2025-11-18T09:11:44 1763457104

This doesn't solve the deadlock problem, however.

dist-epoch · 2025-11-18T09:08:09 1763456889

Sounds like the Java synchronized class.

masklinn · 2025-11-18T13:21:09 1763472069

No. It’s not a property of the type so you can have multiple items under a mutex and you’re not at the mercy of whoever wrote it, it works fine with POD types, it does not force a lock / unlock on each method call (instead the compiler essentially ensures you hold the lock before you can access the data), and the borrow checker is there to ensure you can not leak any sort of sub-states, even though you can call all sorts of helpers which have no requirement to be aware of the locking.
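A minimal sketch of those points, assuming `std::sync::Mutex`; the `Stats` type and function names are made up for illustration. Several plain fields sit under one lock, and the helper that updates them has no idea a mutex is involved:

```rust
use std::sync::Mutex;

// Plain data with no locking logic of its own.
#[derive(Default)]
struct Stats {
    count: u64,
    total: u64,
}

// Helper that is unaware of any locking: it just needs &mut Stats.
fn record(stats: &mut Stats, sample: u64) {
    stats.count += 1;
    stats.total += sample;
}

/// Records all samples under a single lock acquisition and returns
/// the running mean.
fn record_all(shared: &Mutex<Stats>, samples: &[u64]) -> f64 {
    let mut guard = shared.lock().unwrap();
    for &s in samples {
        // One lock for the whole batch, not one per call.
        record(&mut guard, s);
    }
    // Returning a reference into `guard` from here would be rejected
    // by the borrow checker, because it cannot outlive the guard.
    guard.total as f64 / guard.count as f64
}
```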

It’s what synchronized classes wish they had been, maybe.

the_gipsy · 2025-11-18T11:14:01 1763464441

Not at all. With rust you cannot accidentally leak a reference, and here's the killer: it guarantees these properties at compile time.