So if I'm reading it right, the quote from the original article that started this thread was ballpark correct?
> we are still stuck with 2 GB/s per SSD
Versus the ~2.7 GiB/s your benchmark shows (a bit hard to know where to look on mobile with all that line-wrapped output, and when not familiar with the fio tool; not your fault, but that's why I'm double-checking my conclusion).
If you still have this machine, I wonder if you can get this bandwidth in parallel across all SSDs? There could be some hypervisor-level or host-level bottleneck that means while any SSD in isolation will give you the observed bandwidth, you can't actually reach that if you try to access them all in parallel?
It'd be great if you could throw together a quick blog post about i4g IO performance. There's obviously something funny going on, and I imagine you could figure it out much more easily than anybody else, especially since you already have some figures in the marketing material.
Last I checked, Linux splits up massive IO requests like that before sending them to the disk. But there's no benefit to splitting a sequential IO request all the way down to 4kB.
These settings control writing back modified pages. The experiments in the paper are read-only. With writes the situation is even worse than shown in the paper (though tuning these settings may help a bit).
This would be exactly the kind of innovation we need in computer science. Instead we often get stuck in local minima (in this case, a 40-year-old POSIX interface) without realizing how much pain this causes.
Even after watching these videos and reading lots of articles on the topic, I still find the full C++ memory model extremely hard to understand. However, on x86 there are actually only a couple of things that one needs to understand to write correct lock-free code. This is laid out in a blog post: https://databasearchitects.blogspot.com/2020/10/c-concurrenc...
C++'s "seq_cst" model is simple. If you're having any issues understanding anything at all, just stick with seq_cst.
If you want slightly better performance on some processors, you need to dip down into acquire-release. This 2nd memory model is faster because of the concept of half-barriers.
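Let's say the code, in program order, looks like this:
a();
b();
acquire_barrier();
c();
d();
e();
release_barrier();
f();
g();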
The compiler, CPU, and cache are ALLOWED to rearrange that code into the following:
acquire_barrier(); // Optimizer moved a() and b() from outside the barrier to inside the barrier
a();
b();
d();
c();
e();
g();
f();
release_barrier(); // Optimizer moved g() and f() from outside the barrier to inside the barrier
You're allowed to move code "inside", towards the barriers, but you are not allowed to move code from inside the half-barrier region back "outside" of it. Because more reorderings are available (to the compiler, the CPU, or the caches), half-barriers execute slightly faster than full sequential consistency.
----------
Now that we've talked about things in the abstract, let's think about "actual" code. Let's say we have:
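int i = 0; // a()
i += 1; // b()
full_barrier();
i += 2; // c()
i += 3; // d()
i += 4; // e()
full_barrier();
i += 5; // f()
i += 6; // g()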
As the optimizer, you're only allowed to optimize to...
int i = 1; // a() and b() rearranged to the same line
full_barrier(); // Not allowed to optimize past this line
i += 9; // c(), d(), and e() rearranged
full_barrier();
i += 11; // f() and g() rearranged
With half-barriers instead (acquire at the top, release at the bottom), all of the code can be rearranged to the "inside", so the optimizer can simply write:
i = 21;
Therefore, half-barriers are faster.
----------
Now, instead of the compiler rearranging code, imagine the L1 cache rearranging writes to memory. With full barriers, the L1 cache has to write:
i = 1;
full_barrier(); // Ensure all other cores see that i is now = 1;
i = 10; // L1 cache allows CPU to do +2, +3, and +4 operations, but L1 "merges them together" and other cores do NOT see the +2, +3, or +4 operations
full_barrier(); // L1 cache communicates to other cores that i = 10 now;
i = 21; // L1 cache allows CPU to do +5 and +6 operations
// Without a barrier, L1 cache doesn't need to tell anyone that i is 21 now. No communication is guaranteed.
----------
Similarly, with half-barriers instead, the L1 cache's communication to other cores only has to be:
i = 21; // L1 cache can "lazily" inform other cores, allowing the CPU to perform i+=1, i+=2... i+=6.
So for CPUs that implement half-barriers (like ARM), the L1 cache can communicate ever so slightly more efficiently, if the programmer specifies these barriers.
----------
Finally, you have "relaxed" (weakly ordered) atomics, which involve no barriers at all. The operations are guaranteed to execute atomically, but their ordering relative to other memory operations is completely unspecified.
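In C++ these are the memory_order_relaxed operations. A minimal sketch (the event counter here is just an illustration): you keep atomicity, but give up any ordering promises.
#include <atomic>
#include <cstdint>
std::atomic<std::uint64_t> events{0}; // illustrative counter
void record_event() {
    // Atomic (no lost increments), but imposes no ordering on surrounding loads/stores.
    events.fetch_add(1, std::memory_order_relaxed);
}
std::uint64_t approximate_event_count() {
    // May observe increments "late" relative to other memory traffic.
    return events.load(std::memory_order_relaxed);
}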
There are also consume/release barriers, which no one understands and no compiler implements. So ignore those. :-) They're trying to make consume/release easier to understand in a future standard... and I don't think they've gotten all the "bugs" out of the consume/release part of the standard yet.
-------
EDIT: Now that I think of it, acquire_barriers / release_barriers are often baked into a load/store operation and are "relative" to a variable. So the above discussion is still inaccurate. Nonetheless, I think it's a simplified discussion to kinda explain why these barriers exist and why programmers were driven to make a "more efficient barrier" mechanic.
To the edit: right. I like the description using half-barriers, but I have trouble reconciling that with the Linux kernel's READ_ONCE/WRITE_ONCE macros, which guarantee no tearing/alignment issues, but boil down to reads/writes through casts to volatile-qualified pointer dereferences. I guess those don't have the same notion of memory ordering that the C++11 API has... Maybe rmb()/wmb()...
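For what it's worth, the gist of those macros (ignoring the kernel's typeof tricks, size checks, and Alpha-specific details) is roughly the following, written here as C++ templates purely for illustration:
template <typename T>
T read_once(const T& x) {
    // The volatile access forces the compiler to emit one real load (for
    // machine-word-sized T) instead of caching, re-fetching, or splitting it.
    return *static_cast<const volatile T*>(&x);
}
template <typename T>
void write_once(T& x, T val) {
    // Likewise forces a single real store instead of letting the compiler
    // split or duplicate it.
    *static_cast<volatile T*>(&x) = val;
}
That gives you well-defined single accesses at the compiler level, but no inter-CPU ordering; ordering still has to come from barriers like rmb()/wmb() or the acquire/release flavored accessors.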
Well... I didn't describe the C++11 memory model above. What I gave was a gross simplification, because I didn't account for how the C++11 memory model acts "relative to a variable". (And this "variable" is typically the mutex itself.)
My understanding is that WRITE_ONCE / READ_ONCE are meant for this "relative to a variable" issue. It's _precisely_ the issue I ignored in my post above.
All C++11 atomics are "relative to a variable". There are typically no memory-barriers floating around by themselves (there can be, but, you probably don't need the free-floating memory barriers to get the job done).
So you wouldn't write "acquire_barrier()" in C++11. You'd write "atomic_var.store(value, memory_order_release)", saying that the half-barrier is relative to atomic_var itself.
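A minimal sketch of that idiom (payload and ready are names made up for the example):
#include <atomic>
int payload = 0;                // ordinary, non-atomic data
std::atomic<bool> ready{false}; // the variable the half-barriers are "relative to"
void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // release: the payload write can't sink below this
}
void consumer() {
    while (!ready.load(std::memory_order_acquire)) // acquire: later reads can't hoist above this
        ;                                          // spin until published
    // If we got here, we're guaranteed to observe payload == 42.
}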
----------
a();
b();
while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value
c();
d();
e();
atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock
f();
g();
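In real C++11 syntax that pseudocode would look roughly like this (just a sketch; a() through g() are the placeholder functions from above):
#include <atomic>
#include <thread>
// Empty stand-ins for the placeholder functions in the pseudocode above.
void a() {}  void b() {}  void c() {}  void d() {}  void e() {}  void f() {}  void g() {}
std::atomic<int> spinlock{0};
void example() {
    a();
    b();
    // Acquire half-barrier attached to the exchange on `spinlock`:
    // write 1 into the lock while atomically reading its previous value.
    while (spinlock.exchange(1, std::memory_order_acquire) != 0)
        std::this_thread::yield();
    c();
    d();
    e();
    // Release half-barrier: storing 0 publishes everything above and frees the lock.
    spinlock.store(0, std::memory_order_release);
    f();
    g();
}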
So the C++ acquire/release model is always relative to a variable, commonly the spinlock.
This means that "c, d, and e" are protected by the spinlock (or whatever synchronization variable you're working with). Moving a or b "inside the lock" is fine, because that's the "unlocked region", and the higher-level programmer is fine with "any order" outside of the locked region.
Note: this means that c(), d(), and e() are free to be rearranged as necessary. For example:
while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value
for(int i=0; i<100; i++){
value+=i;
}
atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock
The optimizer is allowed to reorder the code inside into:
while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value
for(int i=99; i>=0; i--){ // decrement-and-test form is faster on many processors
value+=i;
}
atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock
It's the ordering "relative" to the spinlock that needs to be kept, not the order of any of the other loads or stores that happen. As long as all the value+=i stores are done "before" the atomic_store(spinlock) command, and "after" the atomic_swap(spinlock) command, all reorderings are valid.
So reordering from "value+=0, value+=1, ... value+=99" into "value+=99, value+=98... value+=0" is an allowable optimization.
----------
It seems like WRITE_ONCE / READ_ONCE were written for DEC_Alpha, which is far weaker (fewer guarantees about ordering) than even ARM. DEC_Alpha was the first popular multicore system, but its memory model allowed a huge number of reorderings.
WRITE_ONCE / READ_ONCE probably compile down to plain loads/stores (no extra barrier instructions) on ARM or x86. I'm not 100% sure, but that'd be my guess. I think the last 20 years of CPU design have overall said that the DEC_Alpha's reorderings were just too confusing to handle in the general case, so CPU designers / low-level programmers just avoid that situation entirely.
"dependent memory accesses" is very similar to the confusing language of memory_order_consume. Which is again: a model almost no one understands, and almost no C++ compiler implements. :-) So we can probably ignore that.
Wrt Alpha, I think a large part of the weirdness was that some early variants had split caches which weren't coherent with each other, or something like that. So if a pointer and the value it pointed to were in different cache banks, you could get funny effects.
Everyone interested in the history of computing should read The Dream Machine by M. Mitchell Waldrop. The book pretends to be a biography of a little-known but highly influential guy named Licklider, but is in fact maybe the best general history of computing. It covers Turing, von Neumann, ARPA, Multics, DARPA (the internet), and Xerox PARC. Alan Kay recommends it as the best history of PARC.
Well, the biggest Ivy Bridge EX has 15 cores! There's speculation that the 18-core Haswell-EPs are actually Haswell-EX dies that Intel wants to get rid of as fast as possible because these chips have buggy TSX (transactional memory).
But cold caches are an unrealistic assumption. The top-most levels of a tree will always be in cache, unless you almost never access them -- in which case there's no problem either. Additionally, a radix tree is ordered, whereas a hash table is not.
Yes, that's correct. However, that's still over a thousand cycles for a tree of depth 5 below the cached part (five dependent cache misses at a couple hundred cycles each). That's a modestly sized tree (or several smaller trees). Don't forget lots of things compete for cache; it's usually safer to assume a cold cache unless you know your data structure is very high traffic.
I wonder how long it will take until there is a market for IP addresses. I suspect once such a market is in place IPv6 will not see widespread adoption, since most IPv4 addresses are not really used.
I suspect you're joking, but the old timers really don't like IP speculation. It took them years to agree on the current pseudo-market because they had to figure out how to keep out speculators.
At which point the routing properties of IP will be destroyed (all addresses in a given /16 or /24 routing through the same link helps router performance a lot). I think at that point you'll get a real ISP push to switch to IPv6.
What has happened so far: A curfew was put in place, but it has done absolutely nothing; the streets are full of people everywhere in the country. There are reports of dozens of deaths. The headquarters of President Mubarak's party has been on fire for hours, and no firefighters are there. The headquarters is next to one of the most important Egyptian museums. The police have no control over the streets, so the army was ordered in to enforce the curfew. The people are actually cheering as the military moves in. It is still unclear what the military will do. Hillary Clinton has issued a statement calling on the Egyptian government to restrain its security forces and avoid violence.