I think it is so important to not arbitrarily throw words like 'RAM' in such dis...

timClicks · on Oct 8, 2018

But wouldn't the consequence of spawning millions of native threads be to spill those virtual addresses into many, many pages and incur many more TLB misses and page faults generally?

nostrademons · on Oct 8, 2018

Only the portions of the stack that are actually touched trigger a TLB miss and page fault. If you've got a million threads but each only touches the first 4K of stack space, you end up only touching 4G of RAM even though you've mapped 1T of it.

cryptonector · on Oct 9, 2018

When you have terabytes of RAM you want large pages anyways.

gpderetta · on Oct 9, 2018

As each stack will need to use [1] at least one page of physical memory, large pages are not great for stacks as they will normally waste memory.

[1] for classic contiguous stacks of course. That doesn't apply for separately allocated frames or segmented stacks.

cryptonector · on Oct 9, 2018

Sure, you can have millions of 1MB stacks, but unless you have terabytes of RAM performance is going to suffer.

If you write your app with more explicit state rather than implicit state (bound up in the stack), such as by writing a C10K style callback-hell, thread-per-CPU application, you're going to get much much better performance. The reason is that you'll be using less memory per-client, which means fewer cache fills to service any one client, which means less cache pressure, all of which means more performance.

The key point is that thread-per-client applications are just very inefficient in terms of memory use. The reason is that application state gets bound up in large stacks. Whereas if the programmer takes the time to make application state explicit (e.g., as in the context arguments to callbacks, or as in context structures closed over by callback closures) then the program's per-client footprint can shrink greatly.

Writing memory-efficient code is difficult. But it is necessary in order to get decent throughput and scale.

apoorvgarg · on Oct 9, 2018

Nobody is suggesting that you create a million threads. But to think that it is not possible without TBs of physical memory is a fallacy (for 64-bit machines).

The thread switching costs itself could be prohibitive to performance. IIRC, goroutine switching cost is roughly 1/10th of linux thread switching costs.

cryptonector · on Oct 9, 2018

It's "possible". If you care about performance, then it's not. Obviously I'm not defining "performance", but you'll know this when you see this. Paging is the kiss of death for performance.

apoorvgarg · on Oct 9, 2018

Paging has nothing to do with this. When I say 1 MB thread stacks, it means the maximum size of that thread's stack in the virtual address space [0]. Each of these million threads could be using only a few KBs for its stack (out of that 1 MB of stack space). That would imply a few gigs of physical memory => no paging.

[0] http://man7.org/linux/man-pages/man3/pthread_attr_setstacksi...