Video games can scale just fine with many cores, including MMOs and games like M...

forrestthewoods · on Sept 5, 2021

> There are many hard problems in video game performance, but "games just can't make good use of multiple cores" is not one of them.

That's not quite what I said.

I'm happy you're using specs to work on an open source game. Specs and bevy and all the ECS work being done is super exciting and fun. I <3 Rust.

In the meantime there are no major games that effectively scale to, let's say, 64 cores. I don't know of anything shipped that can saturate a 12-core/24-thread Ryzen. ECS alone will not get us to that level of scaling.

Yes it's trivial to to throw audio, networking, and a few other subsystems onto separate threads. Modern games definitely leverage 4 cores. Although several of those cores will be severely under utilized.

Modern ECS designs are rapidly evolving and rapidly improving our ability to better leverage multiple cores. But we're not yet to a point where games can easily and efficiently saturate 10+ cores.

Personally I'd love to see a game like Eve Online that can effectively simulate a universe with tens of thousands of players either spread across the universe or all in one places during one giant battle.

> It is certainly far far easier to see real performance wins with multicore than by using multiple servers, which introduce very heavy coordination costs.

This is extremely true.

Jweb_Guru · on Sept 5, 2021

If you're just using ECS to parallelize disjoint subsystems then no, it won't get you there. But if you're using it (as Veloren is, and as I hope more people do) in a more deliberate way, to further parallelize within a system, you can indeed scale quite well to large numbers of cores. I've done some theoretical bottleneck calculations and we will still have tons of work to do with 64 cores available, if the game is written properly. We can already get decent utilization out of 32 threads at busy times, and the server was not really close to peak load in terms of player count (and we are very far from done optimizing): https://media.discordapp.net/attachments/539518074106413056/....

That's just for our server. Our clients can make use of cores in even more ways, although they have less work to do and generally have fewer cores available, and you can see Veloren taking advantage of 16 client threads with similar utilization here: https://twitter.com/sahajsarup/status/1431837669391142916.

The most important thing I want to note is that in both cases, you are not seeing tremendous imbalance between the cores most of the time. While there are definitely single-threaded bottlenecks in games, you have to be working pretty hard before they start bottlenecking the workload! Instead, we are just suffering from a combination of general inefficiency and lack of work to do.

So no, I'm going to push back against this notion that multicore scaling for games is some sort of crazy intractable problem. It's not. Like any other kind of parallel scaling, it's trivial in some places, more challenging in others, and depends a lot on your workload (including you actually having enough work to saturate the cores in the first place!). But there's nothing special about games here.

forrestthewoods · on Sept 5, 2021

The original question was:

> Is there a specific reason why Minecraft hasn’t been “adapted” from the original game to a robust design that could scale to larger worlds and player populations before your project ?

The answer is because it's a lot of hard work.

I am happy that a bunch of smart and talented people are working really hard to optimize Veloren. Good for you. I hope you help push the state of the art.

Jweb_Guru · on Sept 5, 2021

We have plenty of contributors who have never even programmed before, and certainly have no experience with parallel programming, and the bulk of our physics parallelization (easily the trickiest part to make concurrent) has been done by people who aren't professional programmers. And I don't think I can recall a single change (proposed or implemented) by any of these contributors that resulted in a significant reduction in parallelization opportunities, nor any that found the difficulty of contributing to have been increased by the fact that much of the game is parallelized--so this is not primarily a result of benevolent guardians gatekeeping unfriendly features, or anything like that. I can also count on one hand the number of bugs we've had due to race conditions caused by parallelizing things that were previously single-threaded. I think that modern tooling, languages, and libraries have gotten good enough that correct parallel programming is no longer nearly as hard as it's reputed to be.

Nor is scaling well on multicore the "point" of the game or even an explicit goal (though handling lots of players is)--taking advantage of multiple CPU cores is just one of many ways to improve performance, which we try to tackle on multiple fronts (including increased utilization of the GPU, explicit SIMD, smarter algorithms, allocation reduction, structure compression for improved cache locality and network utilization, etc. etc.). We haven't made any special effort to parallelize at the expense of single-threaded optimizations, and generally only do parallelization within a system, move things to the background, etc. where it is revealed as a bottleneck. And we've mostly done so by utilizing existing Rust libraries like crossbeam, specs, rayon, and wgpu, not rolling our own stuff. So again, there is nothing at all special about Veloren's design or focus here that makes it more amenable to parallelization than any other game would be, despite it being in a genre that is supposedly difficult to make scale.

And that's the thing I'm specifically trying to push back on--the idea that multicore scaling for games is only possible if you have some dedicated cabal of programming wizards who want to push the state of the art. We live in the age of libraries, and wizards are only needed deep in the guts of the implementations of those libraries (just as they always have been, and probably will be until the end of time). A game programmer does not need to understand how a lockfree work stealing queue is implemented (or even what it is!) in order to use "parallel for" to beat the snot out of a carefully optimized single-threaded version of the same task, and it's usually far easier to do the former than the latter.

I certainly understand why for a game like Minecraft, with a lot of legacy mechanics and mods that were never designed to be threadsafe, or a game engine like Unity, Unreal, or Roblox, that similarly have lots of plugins and customer code they would like to keep working, it would be very challenging to parallelize after the fact. And naturally, there are limits to what you can do on a single system, and your game design options become far more restricted once you're talking about 10k rather than 1k concurrent players. But for a brand new game without any legacy baggage, there's really no reason why it should scale poorly on multicore systems.

imtringued · on Sept 7, 2021

I apologize for the late response but here is a challenge to you.

Try to to parallelize a simplified form of applied energistics.

Applied energistics is a mod that lets you create an item transportation network. There are storage containers and machines with an inventory (for the sake of simplicity make them hold exactly 1 item and let the machines just turn A into B, B into C, C into A). The network interacts with inventories through interfaces. A storage interface makes items in that inventory accessible to every machine. Machines receive inputs through exporter interfaces and send outputs through importer interfaces.

It effectively is a database for items and that is exactly what makes it difficult to parallelize. The vast majority of games have 1:1 interactions between entities. In this system interactions can be as bad as n:m. That's also why it lags so badly with large networks. A hundred machines periodically scan hundreds of inventories.

Jweb_Guru · on Sept 8, 2021

So, firstly, I'm not entirely clear on what you're asking for here. If you just need to transfer ownership in parallel between different machines, the answer is to use a channel from one machine to the other. There are very efficient channels provided in crossbeam, and we commonly use them for tasks like this. If a channel between every machine would be too costly, a hub-spoke model can be used pretty easily, with routing performed between regions.

Similarly, fine-grained parallelism can be employed by storing each storage container behind a mutex or reader-writer lock, or even avoiding locking entirely and just using copy-on-write to update the item state when it is changed (we can either do this by executing all our state changes for each tick at once, in parallel, using Arc::make_mut, which is usually fastest, or if we need to do it asynchronously by using a crate like arcswap, which is slower). This is less efficient than a channel, but it has the advantage that the current inventory of a machine can be read without extracting the item (something you didn't specify as a requirement, but which I'm including for completeness).

Note from what I said previously that we don't actually need to continuously scan inventories for updates at all. The obvious optimization to perform is instead to have channel writes push changes directly to a change queue (this can be parallelized or sharded with some difficulty, but from experience a single channel usually suffices). The change queue can then be read or routed (in parallel or otherwise) to the appropriate storage devices to deliver its payload. If need be (since you haven't given a lot of details), we can also track which storage interfaces are being read by players, and each tick (in parallel) iterate through any players attached to the interface to notify them of new updates to that interface. There are other crates that automatically implement the incremental updates I mentioned, such as Frank McSherry's https://github.com/TimelyDataflow/timely-dataflow, for when you have something more complex to do; however, I have never had to reach for this because (which is why I wrote this post) it's actually uncommon to have something super complicated to parallelize!

From what I understand, this does not sound like it has nearly the complexity of a database :) The major thing that makes database performance harder to parallelize (though to be clear--they parallelize extremely well!) is not knowing what transactions are needed. In this case, though, we have perfect forward knowledge of what kinds of transactions there could be; the only things we would likely want to serialize would be attaching and detaching storage interfaces, and we can batch them up very easily on each tick due to the relatively "low" concurrent transaction count (keep in mind that some databases can process millions of transactions per second on a 16 core machine). And even if we did need to parallelize attaching and removing storage interfaces, it's not a strict requirement that we do that serially--crates like dashmap provide parallel reads, insertions, and deletions, and are basically an off-the-shelf, in-memory key-value database.

Finally, the kind of load you're talking about (hundreds of machines and hundreds of inventories) does not sound remotely sufficient to lag the game if it's optimized well, particularly since if we did do the naive scan strategy, it parallelizes easily (to see why: each scan tick, we first parallelize all imports into storage, then parallelize all scans from storage).

I suspect the problem here is not that the challenge you've provided is difficult to parallelize, or that it implements the functionality of a database or is M:N (by the way--something that is M:N in a hard to address way are entity-entity collisions!), but that the solution is designed in a very indirect way on top of existing Minecraft mechanics. As far as I can tell from what I've read about Redstone, it's completely possible to parallelize for most purposes to which it's put, since blocks can only update other blocks in very limited, local ways on each tick--it might even be amenable to GPU optimizations (in our own game, we would make sure that updates commuted on each tick to avoid needing to serialize merging operations on adjacent Redstone tiles). However, I could easily be misunderstanding both what you're asking for, and how Redstone works. If this is the case, please let me know!

Even more speculatively: I think a lot of game designers, when they think about parallelizing something, think about doing it in the background, or running things concurrently at different rates. While this can be done, this is primarily useful for performing a long-running background operation without blocking the game, not for improving the game's overall performance! In fact, running in the background in this way is often slower than just running single threaded, especially if it interacts with other world state. Many game developers therefore conclude that the task can't be profitably parallelized and move on. But the best (and simplest) solutions often involve keeping a sequential algorithm, but rewriting it so that each step of the algorithm can be performed massively in parallel, as in several of the possible solutions I outlined above. This is the bulk synchronous parallel model, which is the most commonly used parallelization strategy in HPC environments, and is also the primary parallelization strategy for GPU programming. It allows mixing fine-grained state updates with partitioning to maximize your utilization of all your cores, and because you're parallelizing a single workload and partitioning by write resources, it usually has far less contention with other threads than if you were trying to parallelize many workloads at once, each hitting the same stuff. This is the model we almost always turn to to parallelize things unless it's extremely obvious that we don't want them blocking a tick (like chunk generation, for example) and it reliably gives us great speedups without making the algorithm incomprehensible.

Kiro · on Sept 5, 2021

You're way too humble. Why are you downplaying your achievements with Veloren? It's a miracle made by geniuses and should be advertised as such.

Jweb_Guru · on Sept 5, 2021

Not sure if you're being sarcastic, but either way, I'm not saying there aren't some very impressive developers contributing to Veloren, or some really tricky and highly optimized code. But for the parallelization part, we're really not doing anything special. Pretty much all the cleverness there is in the libraries we use, and newer programmers can parallelize stuff about as easily as the more experienced devs.

Of course, if you wanted to say that libraries like crossbeam or rayon are "miracles made by geniuses" then I'd be more inclined to agree :) But there are similar facilities in other languages too, e.g. folly and OpenMP for C++.