Scipio: A Thread-per-Core Crate for Rust and Linux (datadoghq.com)
167 points by akshaykumar90 on Nov 3, 2020 | 88 comments


I am a fan of the thread-per-core model, but I do not share their views on sharding data per core

While it increases data locality, I have seen a few pieces of software following this sharding model (notably Scylla) that perform really badly once the load is not evenly distributed across all shards

When that happens it can be a huge waste of resources and can give lower performance (depending on the type of load)

Imho unless you are absolutely sure about the type of load, leave the sharding to dividing data between servers, or have some mechanism that can shift to sharing the load between threads if the system imbalance is too great


> While it increases data locality, I have seen a few pieces of software following this sharding model (notably Scylla) that perform really badly once the load is not evenly distributed across all shards

Serial access to hot keys is a hard thing to design around, and you're right that sharding doesn't solve that problem. Worse, it exposes other keys that just happen to share the shard (or the core) to poor performance.

There are a couple of well-understood solutions to this problem. The obvious one is to dynamically re-balance shards based on heat, either by moving some keys or by splitting/merging shards. This is the same tactic that many distributed databases use, and while it's complex to do, it's easier to do locally than distributed. Another option is stochastic re-balancing, like the Stochastic Fairness Queuing (https://ieeexplore.ieee.org/document/91316) model used in networking. Here, shards are randomly re-shuffled occasionally. Doesn't fix the noisy neighbor problem, but does mean that the noisiness moves around. That might seem silly, but it's pretty much what's going to happen under the covers of the non-sharded version of the code when the scheduler gets involved, only easier to reason about.
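
To make the stochastic idea concrete, here is a minimal sketch of mine (not from the article): the key-to-shard mapping is salted with an epoch counter, and bumping the epoch occasionally moves every key to a pseudo-random new shard. A real system would of course also have to migrate the state that moves.

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Map a key to a shard, salting the hash with the current epoch.
    // Bumping `epoch` occasionally re-shuffles every key, so a hot key
    // does not stay pinned to one unlucky shard forever.
    fn shard_for<K: Hash>(key: &K, epoch: u64, num_shards: u64) -> u64 {
        let mut h = DefaultHasher::new();
        epoch.hash(&mut h);
        key.hash(&mut h);
        h.finish() % num_shards
    }

    fn main() {
        for epoch in 0..3 {
            println!("epoch {} -> shard {}", epoch, shard_for(&"user:42", epoch, 8));
        }
    }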

> When that happens it can be a huge waste of resources and can give lower performance (depending on the type of load)

Lower apparent performance for neighbors of the heavy hitter, sure. Under which other circumstances does it reduce performance?

> Imho unless you are absolutely sure about the type of load, leave the sharding to dividing data between servers, or have some mechanism that can shift to sharing the load between threads if the system imbalance is too great

I'm a bit puzzled by this. Distributed systems have exactly the same problem, and solving that problem is much harder there because the cost of contention is higher, data movement is more expensive, and you have to deal with a lot more failure cases.

The statistics may be better for distributed systems, because a tenant has to be a lot hotter to make a box hot than to make a core hot. But that's a very specific kind of bet, and if you end up with a tenant that does cause a hot box, you have an even harder problem to solve.

Dynamo (https://www.allthingsdistributed.com/files/amazon-dynamo-sos...) solves this by moving the key space around, as do many similar kinds of systems. It's not easy, though, and filled with caveats. If you're scared of sharding, distributed sharding should be scarier than on-box sharding.


There are a few mitigations that can be done for poorly sharded data, observability being the first step in the process. We've added a toppartitions command to nodetool to spot such monsters in the wild. The idea that you can just autobalance your way out of a poor sharding scheme is difficult to realize in practice. Your "hot partition" could be the result of a stochastic event ("the dress that broke the Internet") vs. a perennially "hot key" that is going to be hot no matter where you place it.

https://docs.scylladb.com/operating-scylla/nodetool-commands...


I've been building data sharding architectures for at least a decade, so the problems you raise are ones I am intimately familiar with. Many of the data models I've worked with have unavoidably unpredictable dynamic distributions of both data and load, so it is a problem that can't be ignored. You are correct that this will essentially break naive data sharding architectures.

It is possible to design data sharding architectures where balancing of both data and load across cores is continuous and smooth, with surprisingly minimal overhead and coordination cost. In fact, there are multiple ways of doing it, depending on your workload characteristics. At this point, the design idioms for this style of architecture are refined and robust, so there is no computer science reason the problems you raise need to exist in a real implementation. The reason it seems "difficult" in practice is that so many designs insist on loosely coupling and weakly scheduling storage, execution, network, etc. If your architecture concept is slapping a thin layer on top of RocksDB, it won't be feasible. Every part of the stack needs to understand and be designed to the model. The end result is actually quite elegant in my view, and delivers unmatched throughput.

People design distributed systems with simple static sharding schemes because they are obvious and easy, and because they let you cut a lot of corners on the rest of your system design. It is not the only way to design a distributed system: continuous adaptive resharding and load shedding is a demonstrably viable option, and it is much easier to implement within a single server than on an actual cluster of networked computers.


> At this point, the design idioms for this style of architecture are refined and robust

Would you be willing to point us toward at least the zipcode of some citations which could outline these refined and robust design idioms?


Some examples:

Methods and apparatus for optimizing resource utilization in distributed storage systems

https://patents.google.com/patent/US9990147B2/en?inventor=Ja...

Load rebalancing for shared resource

https://patents.google.com/patent/US8539197B1/en?q=monitorin...

Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

"Since our system has a high tolerance to failures, we can leverage this for maintenance operations that cause segment unavailability. For example, heat management is straightforward. We can mark one of the segments on a hot disk or node as bad, and the quorum will be quickly repaired by migration to some other colder node in the fleet."

http://news.cs.nyu.edu/~jinyang/ds-reading/aurora-sigmod17.p...


> zipcode

The parent developed a high perf distributed GIS database. I assume you knew this.


Which GIS database? His github account only has hash functions... MetroHash, AquaHash


Interesting, can you give pointers to some resources on this? What do you consider state of the art?


Shard per core has challenges. Scylla solves it partially by making the client shard-aware (topology-aware), so the client submits requests directly to the right CPU core using a different port. This way no data gets blocked due to hot keys or imbalance in other shards. You do need to have a good key distribution. We're implementing a new mechanism to split key ranges dynamically. It will be a fast and parallel mechanism.

If the number of connections is not small, Scylla will outperform any other implementation that relies on traditional locking.
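
As a rough illustration of shard-aware routing (my sketch, not Scylla's actual driver logic, and with an invented per-shard port scheme): the client computes the owning shard from the key and connects to that shard's dedicated port, so the request lands on the right core with no cross-core hop.

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Hypothetical node description; a real driver learns the topology
    // (and the real token/shard mapping) from the server's metadata.
    struct Node {
        addr: &'static str,
        base_port: u16, // assume shard i listens on base_port + i
        num_shards: u16,
    }

    // Pick the address:port that owns `key`.
    fn route(key: &str, node: &Node) -> (&'static str, u16) {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let shard = (h.finish() % node.num_shards as u64) as u16;
        (node.addr, node.base_port + shard)
    }

    fn main() {
        let node = Node { addr: "10.0.0.1", base_port: 19042, num_shards: 8 };
        let (addr, port) = route("user:42", &node);
        println!("send request to {}:{}", addr, port);
    }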


I'll go against the grain here and say that async/await, whether implemented by one thread per core like here or by stackless coroutines, is not the solution.

Async/await will make complexity explode because of the colored function problem [1].

The solution to expensive context switches is cheap context switches, plain and simple. User-mode lightweight threads, like Go's or Java's upcoming ones with Loom [2], have proven that this is possible.

Yes, it does mean that it can only happen in a language that controls its stack (so that you can slice it off and pop a continuation on it). I sincerely believe this is Rust's ballpark; hell they even started the project with that idea in mind.

[1] https://journal.stuffwithstuff.com/2015/02/01/what-color-is-...

[2] https://jdk.java.net/loom/


The "Colored function problem" is a complete fallacy: aside from the fact that async and non-async fns are interoperable in Rust (block_on) and other languages, the same argument could be made against functions with any preconditions.

> The solution to expensive context switches is cheap context switches, plain and simple.

Except the performance difference between a kernel-mode context switch and a user-mode one is only going to narrow in the future. The overhead that cannot be eliminated from context switches is their effect on the cache, since you start to run into the laws of physics at that point...

The real solution to expensive context switches is to just do fewer of them... No context switch is always faster than a "fast" context switch.

> I sincerely believe this is Rust's ballpark

I think it's plausible that Rust could get a library-level solution for fibers that does not rely on unstable details of the compiler. Rust will never again have that baked in to the language, as it would make the language completely unsuitable for many low-level tasks.

Fibers, especially the way they are implemented in Go, come with a lot of their own complexity.

Just look at this issue: https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/

This is just one of the segfaults caused by Go's complex stack control. I don't want to rely on a runtime that contains these sorts of bugs, and the best way to avoid that is to avoid having a runtime in the first place.


> are interoperable in Rust (block_on)

Blocking an OS thread as a means to be compatible is not exactly what we're trying to do here.

> Except the performance difference between a kernel-mode context switch and a user-mode one is only going to narrow in the future

OS overhead can be minimized, but program stacks are a function of the language's design. And if you're not right-sizing preemption points in your stack, you'll be switching large parts of it. This means you _must_ have stackful coroutines if you want to keep switching threads.

> The real solution to expensive context switches is to just do fewer of them... No context switch is always faster than a "fast" context switch.

Sure, but writing the perfect assembly and using gotos has always been the fastest. Abstraction has a cost, and some runtimes/languages are currently proving that they can reduce this cost to zero under the current conditions, where IO is much costlier than a few hundred nanoseconds. We just happen to be at a time when the compiler is starting to be smarter than the user. But I guess the benchmarks will settle all this.

> Just look at this issue: https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/

So that's a bug in the Go compiler. They can either fix it or pay "a small speed penalty (nanoseconds)" as a workaround, which the author qualifies as "acceptable".

Yes, that's not the absolute performance possible. But why care about that? At some point it all comes down to TCO (except for latencies in HFT); and TCO tells you that it's ok. Development complexity and maintainability matters. Especially when you can max out your IO usage for the decades to come.


> Blocking an OS thread as a mean to be compatible is not exactly what we're trying to do here.

It's what Go does whenever you call into a C function or make a system-call. As long as those blocking functions are not the bottleneck then it works fine.

My problem with the "colored function" analogy is that it implies that the problem is somehow due to the surface syntax, when in reality the problem still exists in all languages that support procedural IO: some of those languages like to just pretend that the problem doesn't exist.

The only language I'm aware of which truly solves that is Haskell, since all IO happens via a monad.

> Yes, that's not the absolute performance possible. But why care about that?

This point was not about performance. It's about the pitfalls of writing all of your code on top of a complex and buggy runtime.

Programming is a lot simpler, and development is a lot faster, when I don't have to worry about that.

There are also several things that are a lot more complicated when you bake a complex runtime into the language like Go does. Thread-local storage is completely broken for one. If you do any kind of GUI programming, you may need to use `runtime.LockOSThread` as most GUIs expect function calls from a single thread. etc. etc.


I don't know too much about Rust, but in .NET, blocking on Task.Result is considered an anti-pattern and a Bad Thing To Do. Not in the least because it very easily leads to deadlocks.


This is considered bad in .NET because callbacks can "Post" back to the original SynchronizationContext which depends on what async/await is being used for. For example, a call to await on a WPF UI thread will join to the calling thread so if you call Task.Result without configuring the task to not join back to the calling thread then you can deadlock the callback processing queue. To avoid this you would use ConfigureAwait(false) depending on your situation. It's the source of a lot of confusion in .NET. I don't believe that Rust has this "feature" and if you somehow wrote code to achieve the same thing as .NET does then it probably wouldn't compile in Rust due to the ownership rules.

Source: https://devblogs.microsoft.com/dotnet/configureawait-faq/


Wait, Rust doesn't associate executors with futures? If it does, and there is a single-threaded executor available, then it's absolutely possible to deadlock just like in .NET.


That is a good question. It is possible to avoid deadlocks in a single threaded executor with higher priority interrupts but I’m no authority in this area. Maybe someone else can comment. Most of my understanding in this area comes from reading this article: https://os.phil-opp.com/async-await/


In Rust, a task which is started on a given executor never leaves it. It cannot switch executors like it can in C#, where the continuations could be called on an arbitrary thread. In Rust, wakeups for an async task essentially just schedule it for running again, but the execution happens on the previous executor.

Anyway, in both models you can have deadlocks. And even if there is no deadlock, blocking the event loop is still an antipattern, since it prevents other tasks which might be able to make progress from running.


`block_on` does not block the calling thread: it runs the closure on a separate thread-pool based executor designed for CPU-bound tasks.

The calling thread immediately yields until the blocking work task completes and wakes it up again.


When I read "Colored function problem" and "make complexity explode" I thought it was some weird NP-complete scheduling issue having to do with graph coloring or something, but it turns out it's just a fancy term for not wanting to add async to everything that calls async. Basically just an ergonomics issue.

First, in Rust this isn't really a problem. You can always turn async calls into blocking ones in Rust by calling block_on [0]. In some languages block_on doesn't exist, like in-browser JS, because there code is supposed to be async. But in Rust there is no such requirement, so there's no colored function problem here.

Second, I don't think it's a big problem in the first place. In one of my projects, I'm using an async library and have isolated the async-ness by creating a dedicated thread that communicates with the library. The thread provides a queue of messages that the remaining code of my project can handle.

[0]: https://docs.rs/tokio/0.3.3/tokio/runtime/struct.Runtime.htm...
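
As a minimal sketch (my example, assuming a current tokio; the 0.3 API linked in [0] is similar), bridging from synchronous code into an async library looks roughly like this:

    // Assumes tokio = { version = "1", features = ["full"] } in Cargo.toml.
    use tokio::runtime::Runtime;

    async fn fetch_answer() -> u32 {
        // stand-in for a call into some async library
        42
    }

    fn main() {
        // Build a runtime and block the current thread until the future resolves.
        let rt = Runtime::new().expect("failed to build tokio runtime");
        let answer = rt.block_on(fetch_answer());
        println!("answer = {}", answer);
    }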


> Basically just an ergonomics issue.

A viral one, and a number-of-possible-states-exploding one at that.

> First, in Rust this isn't really a problem. You can always turn async calls into blocking ones in Rust by calling block_on

That's not exactly what we're trying to do here.


I must admit I had no idea what that was either and the first thing that came to my mind was some combinatorial explosion thing.

I truly don't care about this issue...


The issue is turning blocking calls (or non-async calls) into non-blocking ones, or simply yielding from a callback deep in a call stack (the usual example is turning an internal iterator into an external one).

Of course you can add async to the whole call stack, but it could be third-party code, and it might require code duplication if async adds a penalty compared to non-async code.

Ideally the fixed stack size/conversion to a state machine would be an optimization that the compiler would apply if it can prove that the coroutine only ever yields from the top level (or from a well-known and fixed stack depth), and resort to dynamic stacks otherwise. I have been thinking a lot about this, and I think the key is reifying the incoming continuation: as long as it doesn't escape the called coroutine, the optimization can be guaranteed. I believe that Rust's lifetime machinery might help, but it is something I'm not familiar with.


It's not that hard, it's just pointless. Async is more general, so the optimization would have to go into the opposite direction: everything starts as implicitly async and things that provably don't need to be can just be converted to regular synchronous stackful code as an optimization pass.


That's what many functional programming languages do (or did, I think it went a bit out of style), but it is expensive and the interoperability story with C is not good.

edit: it also requires heap allocating activation frames in the most general case, which is slow.


After full CPS transformation, you absolutely can allocate the activation frames on the stack. Cf. CHICKEN Scheme that does precisely that, and more generally, uses the stack as the 0-generation. When the stack hits a certain depth, it longjmp's to the GC, copies whatever is alive to the heap and restarts the current continuation on the now-trimmed stack.


Yes, you can even use the original C stack as a bump allocator (that's Cheney on the MTA, right?), but then you need GC, which is not appropriate for Rust.


In-browser JS can drop down to the old Promise callbacks (.then) to avoid coloring synchronous callers.


If any part of your function depends on the result of a promise then it has to return a promise; this is true whether you use async/await syntax or then(). Unfortunately JS doesn't have something like block_on, which now that I think about it is probably because it's single-threaded, so any block would block the entire app.


There is an easy solution to the colour problem: make all functions async. All of them. Of course, having code like this

   async func foo(x int) task[int] { return await(2 * x); }

   // No "async" only because cpu_bound_future_factory() returns a task
         func bar(y int) task[int] { return cpu_bound_future_factory(async func() task[int] { return await(y - 5); }); }

   async func baz(a,b int) task[int] { return await(await foo(a) + await bar(b)); }
is not fun, especially when you have to await the result of the arithmetic operators, and write all those now-redundant async and await, so let's also drop "async" and make "await" implicit--although we'll need something for non-awaiting. Let's call it "nowait". Now we can write code like this:

   func foo(x int) task[int] { return 2 * x; }

   func bar(y int) task[int] { return nowait cpu_bound_future_factory(func() task[int] { return y - 5; }); }

   func baz(a, b int) task[int] { return foo(a) + bar(b); }
A-a-and we're back to multithreading, basically. So yeah, looks like cheap context switches is the solution.


Agree :). Semantically there is no difference, but in practice async functions are compiled differently from normal functions. What you are proposing in practice is to do a global CPS transform and possibly reconstruct the original function when the continuation is not needed (i.e. it is always invoked in strict LIFO order). There are functional languages where this is the norm, but I don't think this is appropriate for a systems language like Rust (it would greatly complicate interop with C, for example).

My idea is to make the continuation explicit and CPS transform all and only the functions that have a continuation as parameter (and any generic function with a type that is a continuation or contains a continuation). Fall back to dynamic stacks if the continuation escapes.

It is always continuations all the way down.


No, it's not necessary to perform a CPS transformation. After all, the OS manages to pre-emptively schedule any non-cooperating processes, right?

So, leaving aside the problem of interacting with the OS for I/O, you can do interleaved CPU-bound work in a single thread by instrumenting your code with basically an instruction counter. When it grows a bit too large, stop, reset it, and switch to another task. Erlang does this, although being a functional language, it counts only function calls (there is no other way to make a loop than for a function to (tail-)call itself). No CPS needed.
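
A minimal Rust sketch of the same idea (my example, using tokio's yield_now; in Erlang the reduction counting is built into the VM): a CPU-bound task keeps a work counter and cooperatively yields once it exceeds its budget.

    // Assumes tokio = { version = "1", features = ["full"] } in Cargo.toml.
    async fn crunch(items: Vec<u64>) -> u64 {
        const BUDGET: usize = 1_000; // "reductions" allowed before yielding
        let mut since_yield = 0;
        let mut sum: u64 = 0;
        for item in items {
            sum = sum.wrapping_add(item.wrapping_mul(item)); // stand-in for real work
            since_yield += 1;
            if since_yield >= BUDGET {
                since_yield = 0;
                // Give the executor a chance to run other tasks on this thread.
                tokio::task::yield_now().await;
            }
        }
        sum
    }

    #[tokio::main]
    async fn main() {
        let total = crunch((0..10_000).collect()).await;
        println!("total = {}", total);
    }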


Yes of course, a stack per thread is always an option.


Maybe I am part of the problem by emphasizing context switching too much, but as penberg noticed, there is more to thread per core than context switching.


I'm quite tired of “What Color is Your Function”. In the five years since it was written I have yet to care about this supposed explosive problem in any language that uses async/await. It just seems like something that gets trotted out whenever anyone dares to suggest there are benefits to the approach or to talk about Rust. I'm not sure why I am supposed to accept it as meaningful truth.


That doesn't work because async functions need to be compiled for minimal stack usage (i.e. segmented stacks, packed variables), while non-async functions need to be compiled for minimal CPU usage (i.e. no stack checking or allocation, no variable packing, only one stack adjustment per function).

If you compile both for the same goal the system will have suboptimal memory usage or performance, and in general you don't know whether a function will be interrupted in advance if you don't have colored functions (also you can't move the stack if you have references unless you have a GC, which is terrible).


So thread-per-core is not just about eliminating context switches. It's also about partitioning application-level data to reduce inter-core synchronization and let speculative, out-of-order cores run independently as much as possible. Don't get me wrong, user-level threads are great, and arguably a much simpler programming model than async/await. But they're not _the solution_ either.


They're not the solution to well-sliced data processing, but I can't see how they would hurt the performance.


User-level threads do not solve the problem of synchronization and data movement. That is, with a “thread-per-core” model, you eliminate most needs to synchronize between multiple CPU cores and, therefore, eliminate the overhead of acquiring and releasing a lock and allow the out-of-order CPU cores to run at full speed. Furthermore, the “thread-per-core” model ensures that memory accesses are always CPU-local, which eliminates the need to move data between CPU caches and avoids the expensive CPU cache consistency protocol (which makes scaling to multiple cores more difficult).

That said, I am not claiming thread-per-core is _the solution_ either, just saying that if you can partition data at application-level, you can make things run plenty fast. Of course, you’re also exposing yourself to other issues, like “hot shards” where some CPUs get disproportionately more work than others. However, as we scale to more and more cores, it seems inevitable that we must partition our systems at some level.
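
A toy sketch of that partitioning (my own illustration, plain std, not Scipio's API): each worker thread owns a disjoint slice of the key space, and other threads hand it work over a channel instead of touching its data directly, so the owning thread never needs a lock around its map.

    use std::collections::HashMap;
    use std::sync::mpsc;
    use std::thread;

    fn main() {
        let num_shards: usize = 4;
        let mut senders = Vec::new();
        let mut workers = Vec::new();

        for shard in 0..num_shards {
            let (tx, rx) = mpsc::channel::<(String, u64)>();
            senders.push(tx);
            workers.push(thread::spawn(move || {
                // This thread is the only one touching `data`: no lock needed.
                let mut data: HashMap<String, u64> = HashMap::new();
                for (k, v) in rx {
                    data.insert(k, v);
                }
                println!("shard {} ended up with {} keys", shard, data.len());
            }));
        }

        // Route each key to its owning shard. A real system would also pin
        // each worker thread to a core, which std alone cannot do.
        for i in 0..100u64 {
            let shard = (i as usize) % num_shards;
            senders[shard].send((format!("key{}", i), i)).unwrap();
        }

        drop(senders); // closing the channels lets the workers finish
        for w in workers {
            w.join().unwrap();
        }
    }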


It's pretty much common knowledge that if you want linear performance scaling on a multiprocessor, then each workload has to be effectively independent from the others. Synchronization and communication are expensive and only take away from your per-core performance; they never make a given core faster. So yes, sharding is a critical part of this "thread per core" architecture.


To be fair, concurrent writes do not scale, but a single writer with multiple readers works just fine, so you only need a hard partition of writers and can allow all threads (or the appropriate NUMA subset) to access any data.


Binding user-level threads to hardware cores without preemption is completely equivalent to thread-per-core + coroutines, except that user-level threads allow deep stacks (with all the benefits and issues that implies).

In fact, a well-designed generic executor can be completely oblivious to whether it is scheduling plain closures, async functions, or fibers. See for example boost.asio.


I agree -- I was just around in the last days of Windows 3.1 and co-operative multitasking (which is close to async with a different coat), and it was a horrible experience, and fighting hangs was a never-ending task.

I would go further, and say assuming your program is doing "interesting work" (not just calling a database / reading a file and returning the results), use a threadpool. In almost all applications threads are fine, certainly when you are comparing threaded Rust to people using Python/Ruby.


>I sincerely believe this is Rust's ballpark; hell they even started the project with that idea in mind.

The problem with custom stacks is you essentially need to have a custom runtime and it makes calling into C functions a lot more difficult (cgo isn’t a cakewalk)

This runs counter to Rust’s C++-inherited belief that you don’t pay for what you don’t use, and would have made Rust less feasible for all kinds of other projects.

async/await might not have the best developer ergonomics, but it does have the best implementation ergonomics from a language point of view


Is there a mechanism to group some goroutines onto a single CPU thread to avoid locking? In the environments that have to rely on thread-per-core (sensitive to even 50us), blocking functions will become obvious in the profiling report.


> Is there a mechanism to group some goroutines onto a single CPU thread to avoid locking?

Python's GIL comes to mind :p


Nothing stops a model with lightweight threads from giving a thread its own separate pool of fibers running on it. It would probably require a construct like a nursery configured to run the fibers on a dedicated thread.
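
In Rust terms, a sketch of mine (assuming tokio): a current-thread runtime plus a LocalSet pins a group of tasks to one OS thread, so they can share !Send state like Rc/RefCell without any locking.

    // Assumes tokio = { version = "1", features = ["full"] } in Cargo.toml.
    use std::cell::RefCell;
    use std::rc::Rc;
    use tokio::runtime::Builder;
    use tokio::task::LocalSet;

    fn main() {
        // Single-threaded executor: every task spawned on the LocalSet runs on
        // this one OS thread, so Rc/RefCell (not Send) work and no Mutex is needed.
        let rt = Builder::new_current_thread().enable_all().build().unwrap();
        let local = LocalSet::new();
        let counter = Rc::new(RefCell::new(0u64));

        local.block_on(&rt, async {
            let mut handles = Vec::new();
            for _ in 0..10 {
                let counter = Rc::clone(&counter);
                handles.push(tokio::task::spawn_local(async move {
                    *counter.borrow_mut() += 1;
                }));
            }
            for h in handles {
                h.await.unwrap();
            }
        });

        println!("count = {}", counter.borrow()); // prints 10
    }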


It would be good if Rust had as many programming models as possible. Different applications are likely to benefit from different approaches, so having a wide array of choices is good.


No, the point of async "coloring" is to indicate which function call does context switching, so that you would know in which piece of code you could do synchronization without something like a semaphore, as you'd need a semaphore if you synchronize shared memory access across async calls, but you don't if you do it in between async calls. And also to know which piece of code blocks event loop (i.e. everything in between async calls), so that you could optimize for performance: throughput, latency. Assuming, of course, that you run one event loop per thread, not doing dumb things like some Rust executors scheduling tasks to different cores.

Goroutine (loom) style concurrency only makes things worse both ergonomically and performance-wise, pushing programmers towards slow buggy lock-riddled code as the only way to use such models.


You turn a real engineering issue into a programming language dogma war. Really?

Don't use async/await then. It's just syntactic sugar (minus a few details that make it so ergonomic that you can do things that otherwise would be more work than they are worth). Just use plain futures. Be joyful, as there will be no "colored" functions in your editor - only 10x more code (and less readable!).

I think there is a place for arguing about async await syntax and whether fibers are the better solution (try implementing them without a large runtime! Rust got rid of theirs.). Clearly, the benefits provided by continuations are worthwhile in avoiding context switches - to argue otherwise is wild. Context switches are very expensive and, in performance critical code, a huge red flag.


Finally, a Seastar clone for Rust! Really impressed by some of the work coming out of Datadog. Curious if one of the popular Rust web frameworks will choose it as an I/O backend; that could be a great way to push down latencies even further.


Seastar folks are really smart.


With a name like Scipio I would imagine it also uses a thread-per-core of neighboring computers as well.


Scipio has seen some action on HN before in "C++ vs Rust: an async Thread-per-Core story": https://news.ycombinator.com/item?id=24444347


Why not use multiple processes instead?


To turn this question on its head, why use multiple processes? That's not just rhetorical; I don't understand why you'd ask that question. The main advantages of separate processes are (1) any global variables are not shared between them, which is sometimes what you want (like Python's GIL), and (2) there is memory protection. Neither of those seems relevant to a Rust data-processing program.

To actually answer your question a bit: threads allow all the benefits of processes (except the two I just mentioned), with the added benefits of lower overhead and of allowing information to be shared more easily (but you can still use full-blown IPC to communicate between them if you wish).


Not all the benefits.

Threads also introduce security bugs and possible instability (one thread crash brings the whole thing down).


I want the whole thing to come down. To me, that's preferable to a process silently crashing. Worse still, Linux's OOM killer can kill anything it wants. This is all fine if your processes are truly independent, but if they're not then you've not really gained much by using processes.


Processes do provide well-defined boundaries that can guarantee the integrity of each component separately, so it is realistic to design an application that can survive the loss of one component even in a memory-unsafe language.


I didn't say "all the benefits", I said

> all the benefits ... (except the two I just mentioned)

One of those two exceptions was indeed memory protection.


Fair enough.


Excellent question! I think VoltDB, for example, uses the same application-level data partitioning approach, but with processes instead of threads. One advantage of the "thread-per-core" approach is that it allows fast communication between shards because the underlying threads share the same address space. In contrast, processes need to use a more heavy-weight approach of inter-process communication (IPC). Of course, process-based systems will have a reliability advantage, because processes cannot easily mess with each other's state.


You can still use shared memory with processes.


For a more constructive reply than a downvote: using multiple threads allows you to post closures between executors, which is a very flexible and powerful message-passing model. I haven't looked deeply enough into Scipio, but as it is inspired by Seastar, I assume it allows that as well.

Multiple processes work better if you have little or no sharing or communication between the workers.
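
A bare-bones sketch of that pattern (mine, plain std; Seastar and Scipio expose richer submit-to-another-shard APIs): each executor thread drains a channel of boxed closures, so any thread can run code "on" another executor simply by sending it a closure.

    use std::sync::mpsc::{channel, Sender};
    use std::thread;

    type Job = Box<dyn FnOnce() + Send + 'static>;

    // Spawn an executor thread that runs every closure posted to it, in order.
    fn spawn_executor(name: &'static str) -> (Sender<Job>, thread::JoinHandle<()>) {
        let (tx, rx) = channel::<Job>();
        let handle = thread::spawn(move || {
            for job in rx {
                job(); // run the posted closure on this executor's thread
            }
            println!("executor {} shutting down", name);
        });
        (tx, handle)
    }

    fn main() {
        let (tx, handle) = spawn_executor("shard-1");

        // Post a closure "to" the other executor; it captures its data by move.
        let payload = vec![1, 2, 3];
        tx.send(Box::new(move || {
            println!("running on shard-1 with {:?}", payload);
        }))
        .unwrap();

        drop(tx); // closing the channel ends the executor loop
        handle.join().unwrap();
    }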


Exactly right. Scipio doesn't support that yet, but keep in mind that when Hannibal was out there putting everyone to shame, Scipio was still a child. There is already a PR to support that, which I intend to merge soon.

Indeed, without that you might as well use multiple processes.


Thread-per-core has been around for a while! It's how a lot of modern multicore languages work. You've probably heard of "green" or "lightweight" threads before... it's all the same idea. Typically, you'd want to use a scheduler (probably work-stealing https://en.wikipedia.org/wiki/Work_stealing) under the hood to dynamically assign tasks to processors, which is much more effective at load-balancing than "sharding".

All of these languages/libraries use a dynamic scheduler for load-balancing (a small Rayon example follows the list):

* Rayon (Rust) [https://github.com/rayon-rs/rayon]

* Goroutines (Go) [https://golangbyexample.com/goroutines-golang/]

* OpenMP [https://www.openmp.org/]

* Task Parallel Library (.NET) [https://docs.microsoft.com/en-us/dotnet/standard/parallel-pr...]

* Thread Building Blocks (C++) [https://software.intel.com/content/www/us/en/develop/tools/t...]

* Cilk (C/C++) [http://cilk.mit.edu/]

* Java Fork-Join and Parallel Streams [https://docs.oracle.com/javase/tutorial/collections/streams/...]

* ParlayLib (C++) [https://github.com/cmuparlay/parlaylib]
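
For instance, a minimal Rayon snippet (my example, not from the article): the parallel iterator splits the range into tasks, and the work-stealing pool balances them across cores automatically, with no manual sharding.

    // Assumes rayon = "1" in Cargo.toml.
    use rayon::prelude::*;

    fn main() {
        // Sum of squares computed in parallel; Rayon's work-stealing scheduler
        // dynamically distributes the chunks across the available cores.
        let sum: u64 = (0..1_000_000u64).into_par_iter().map(|x| x * x).sum();
        println!("sum = {}", sum);
    }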


If it's not partitioning data per processing unit then it would not be considered a "thread-per-core" architecture based on the definition the article provided. Work stealing means more than one thread can be responsible for a single piece of data. This could result in crossing NUMA nodes or servers.


Hmm I was separating the concept of "thread-per-core" from sharding. I would argue that typical cooperative task schedulers (e.g. work-stealing) get the performance benefits of thread-per-core without requiring any static partitioning.

But if thread-per-core is fundamentally tied to the idea of sharding, then I think I see what you're saying.


So this sounds a lot like an application of Virding's First Rule of Programming.

Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.

http://rvirding.blogspot.com/2008/01/virdings-first-rule-of-...


How is this in any way related to erlang?


State being assigned to a single thread removing the need to use locks is a major selling point of Erlang/BEAM.


This post is about an architecture that pins multiple tasks (read: Erlang processes) that use the same data shard (read: shared data) onto a single thread to avoid locks and, furthermore, never blocks on that thread to avoid context switches.


The only requirement the actor model demands is that actors only have access to their own data but how actors are scheduled is up to the runtime. It could be that Erlang/BEAM does pin actors to a single thread and then pins those threads to a single core but it is not 100% necessary.


Nonetheless, the BEAM does do that, and only revokes actors from a core under certain conditions (power saving). Also, it's important to note that the BEAM is not "the actor model", so it is not bound by the requirements thereof; it just "looks like an actor if you squint at it a little bit".


Thread-per-core with userland coroutine scheduling has very much been an Erlang thing since... the late 1980s (though Erlang is much, much more powerful than just that).


This only makes sense if the local I/O is really fast; in many cases you contact external databases, where the context switches are negligible compared to the I/O time spent on a query.


Sure, but this is a good architecture for writing databases in the first place.


Can we make it work in Windows by any chance?


In principle you could likely do it. Either call SetProcessAffinityMask on every process on the system to keep them off your dedicated cores (and add a hook to call it on every newly started process), or simply start your threads with the highest possible priority, with affinity for one core, and never use blocking APIs. In the Windows scheduler a thread of higher priority always runs in preference to threads of lower priority, so if you set your priority sufficiently high and never tell the kernel that you are waiting on something, you shouldn't get interrupted.
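
A rough sketch of the second option (mine; it declares the two Win32 calls directly rather than assuming any particular crate, and the constant mirrors the Windows headers): pin the current thread to one core and raise its priority so the scheduler keeps it running.

    // Minimal FFI declarations mirroring the Win32 signatures; a real project
    // would typically pull these from the windows or winapi crates instead.
    #[cfg(windows)]
    mod win {
        use std::os::raw::{c_int, c_void};

        pub type Handle = *mut c_void;
        pub const THREAD_PRIORITY_TIME_CRITICAL: c_int = 15;

        #[link(name = "kernel32")]
        extern "system" {
            pub fn GetCurrentThread() -> Handle;
            pub fn SetThreadAffinityMask(thread: Handle, mask: usize) -> usize;
            pub fn SetThreadPriority(thread: Handle, priority: c_int) -> c_int;
        }
    }

    #[cfg(windows)]
    fn pin_and_boost(core: usize) {
        unsafe {
            let me = win::GetCurrentThread();
            // Restrict this thread to a single core...
            win::SetThreadAffinityMask(me, 1usize << core);
            // ...and make it outrank everything at normal priority.
            win::SetThreadPriority(me, win::THREAD_PRIORITY_TIME_CRITICAL);
        }
    }

    #[cfg(windows)]
    fn main() {
        pin_and_boost(2); // dedicate core 2 to this thread (illustrative choice)
        // ... run the per-core event loop here, never calling blocking APIs ...
    }

    #[cfg(not(windows))]
    fn main() {}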


In practice, we've experienced thread-interruption issues of 20 microseconds to 100 milliseconds on Windows. The resources we've found are inadequate to answer our questions. No such problems on Linux. Windows scheduler experts appear to be rare. We wonder if Windows is simply not amenable to certain types of high-performance applications.


Probably, but you will have to rip out all the io_uring specific bits and replace them with their Windows equivalent. Probably something with I/O completion ports.

If you are interested, look into the new Windows I/O scheduler that was implemented for GHC 9.0 (in Haskell, obv). There is also some preliminary work to integrate io_uring into the Haskell runtime if it's available, but the Haskell runtime philosophy is rather different from the Rust approach.


Worth noting that having number of IOCP worker threads == number of cores has always been the official suggestion for building IOCP applications.

Eg https://docs.microsoft.com/en-us/windows/win32/fileio/i-o-co...

>The concurrency value of a completion port is specified when it is created with CreateIoCompletionPort via the NumberOfConcurrentThreads parameter. This value limits the number of runnable threads associated with the completion port.

>[...]

>The best overall maximum value to pick for the concurrency value is the number of CPUs on the computer.


Do it!


Would a thread-per-core architecture help mitigate security vulnerabilities like spectre and meltdown?


This prevents the application's thread from moving cores, but doesn't prevent an unrelated thread from using that core (especially in a hypervised environment). That unrelated thread could still perform those attacks against the application's thread.


Threads are still going to make syscalls into kernel code, so... not as much, no.


You are better off with processes per core, with no threads.



