So if I'm reading it right, the quote from the original article that started this thread was ballpark correct?
> we are still stuck with 2 GB/s per SSD
Versus the ~2.7 GiB/s your benchmark shows (a bit hard to know where to look on mobile with all that line-wrapped output, and when not familiar with the fio tool; not your fault, but that's why I'm double-checking my conclusion).
If you still have this machine, I wonder if you can get this bandwidth in parallel across all SSDs? There could be some hypervisor-level or host-level bottleneck that means while any SSD in isolation will give you the observed bandwidth, you can't actually reach that if you try to access them all in parallel?
It'd be great if you could throw together a quick blog post about i4g IO performance. There's obviously something funny going on, and I imagine you could figure it out much more easily than anybody else, especially since you already have some figures in the marketing material.
Last I checked, Linux splits up massive IO requests like that before sending them to the disk. But there's no benefit to splitting a sequential IO request all the way down to 4kB.
These settings control writing back modified pages. The experiments in the paper are read-only. With writes the situation is even worse than shown in the paper (though tuning these settings may help a bit).
This would be exactly the kind of innovation we need in computer science. Instead we often get stuck in local minima (in this case, a 40-year-old POSIX interface) without realizing how much pain this causes.
Even after watching these videos and reading lots of articles on the topic, I still find the full C++ memory model extremely hard to understand. However, on x86 there are actually only a couple of things that one needs to understand to write correct lock-free code. This is laid out in a blog post: https://databasearchitects.blogspot.com/2020/10/c-concurrenc...
C++'s "seq_cst" model is simple. If you're having any issues understanding anything at all, just stick with seq_cst.
If you want slightly better performance on some processors, you need to dip down into acquire-release. This 2nd memory model is faster because of the concept of half-barriers.
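Let's say the code, in program order, looks like this:
a();
b();
acquire_barrier();
c();
d();
e();
release_barrier();
f();
g();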
The compiler, CPU, and cache are ALLOWED to rearrange that code into the following:
acquire_barrier(); // Optimizer moved a() and b() from outside the barrier to inside the barrier
a();
b();
d();
c();
e();
g();
f();
release_barrier(); // Optimizer moved g() and f() from outside the barrier to inside the barrier
You're allowed to move code "inside", towards the barriers, but you are not allowed to move code from inside the half-barrier region back "outside" of it. Because more reorderings are available (to the compiler, the CPU, or the caches), half-barriers execute slightly faster than full sequential consistency.
----------
Now that we've talked about things in the abstract, let's think about "actual" code. Let's say we have:
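int i = 0; // a()
i += 1; // b()
full_barrier();
i += 2; // c()
i += 3; // d()
i += 4; // e()
full_barrier();
i += 5; // f()
i += 6; // g()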
As the optimizer, you're only allowed to optimize to...
int i = 1; // a() and b() rearranged to the same line
full_barrier(); // Not allowed to optimize past this line
i += 9; // c(), d(), and e() rearranged
full_barrier();
i += 11; // f() and g() rearranged
With half-barriers instead (acquire at the top, release at the bottom), all of the code can be rearranged to the "inside", so the optimizer can simply write:
i = 21;
Therefore, half-barriers are faster.
----------
Now, instead of the compiler rearranging code, imagine the L1 cache rearranging writes to memory. With full barriers, the L1 cache has to write:
i = 1;
full_barrier(); // Ensure all other cores see that i is now = 1;
i = 10; // L1 cache allows CPU to do +2, +3, and +4 operations, but L1 "merges them together" and other cores do NOT see the +2, +3, or +4 operations
full_barrier(); // L1 cache communicates to other cores that i = 10 now;
i = 21; // L1 cache allows CPU to do +5 and +6 operations
// Without a barrier, L1 cache doesn't need to tell anyone that i is 21 now. No communication is guaranteed.
----------
Similarly, with half-barriers instead, the L1 cache's communication to other cores only has to be:
i = 21; // L1 cache can "lazily" inform other cores, allowing the CPU to perform i+=1, i+=2... i+=6.
So for CPUs that implement half-barriers (like ARM), the L1 cache can communicate ever so slightly more efficiently, if the programmer specifies these barriers.
----------
Finally, you have "relaxed" (weakly ordered) atomics, which involve no barriers at all. The operations are guaranteed to execute atomically, but their ordering relative to other memory operations is completely unspecified.
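In C++ these are the memory_order_relaxed operations. A minimal sketch (the event counter here is just an illustration): you keep atomicity, but give up any ordering promises.
#include <atomic>
#include <cstdint>
std::atomic<std::uint64_t> events{0}; // illustrative counter
void record_event() {
    // Atomic (no lost increments), but imposes no ordering on surrounding loads/stores.
    events.fetch_add(1, std::memory_order_relaxed);
}
std::uint64_t approximate_event_count() {
    // May observe increments "late" relative to other memory traffic.
    return events.load(std::memory_order_relaxed);
}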
There are also consume/release barriers, which no one understands and no compiler implements. So ignore those. :-) They're trying to make consume/release easier to understand in a future standard... and I don't think they've gotten all the "bugs" out of the consume/release part of the standard yet.
-------
EDIT: Now that I think of it, acquire_barriers / release_barriers are often baked into a load/store operation and are "relative" to a variable. So the above discussion is still inaccurate. Nonetheless, I think it's a simplified discussion to kinda explain why these barriers exist and why programmers were driven to make a "more efficient barrier" mechanic.
To the edit: right. I like the description using half-barriers, but I have trouble reconciling that with the Linux kernel's READ_ONCE/WRITE_ONCE macros, which guarantee no tearing/alignment issues, but boil down to reads/writes through casts to volatile-qualified pointer dereferences. I guess those don't have the same notion of memory ordering that the C++11 API has... Maybe rmb()/wmb()...
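For what it's worth, the gist of those macros (ignoring the kernel's typeof tricks, size checks, and Alpha-specific details) is roughly the following, written here as C++ templates purely for illustration:
template <typename T>
T read_once(const T& x) {
    // The volatile access forces the compiler to emit one real load (for
    // machine-word-sized T) instead of caching, re-fetching, or splitting it.
    return *static_cast<const volatile T*>(&x);
}
template <typename T>
void write_once(T& x, T val) {
    // Likewise forces a single real store instead of letting the compiler
    // split or duplicate it.
    *static_cast<volatile T*>(&x) = val;
}
That gives you well-defined single accesses at the compiler level, but no inter-CPU ordering; ordering still has to come from barriers like rmb()/wmb() or the acquire/release flavored accessors.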
Well... I didn't describe the C++11 memory model above. What I gave was a gross simplification, because I didn't account for how the C++11 memory model acts "relative to a variable". (And this "variable" is typically the mutex itself.)
My understanding is that WRITE_ONCE / READ_ONCE are meant for this "relative to a variable" issue. It's _precisely_ the issue I ignored in my post above.
All C++11 atomics are "relative to a variable". There are typically no memory-barriers floating around by themselves (there can be, but, you probably don't need the free-floating memory barriers to get the job done).
So you wouldn't write "acquire_barrier()" in C++11. You'd write "atomic_var.store(value, memory_order_release)", saying that the half-barrier is relative to atomic_var itself.
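A minimal sketch of that idiom (payload and ready are names made up for the example):
#include <atomic>
int payload = 0;                // ordinary, non-atomic data
std::atomic<bool> ready{false}; // the variable the half-barriers are "relative to"
void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // release: the payload write can't sink below this
}
void consumer() {
    while (!ready.load(std::memory_order_acquire)) // acquire: later reads can't hoist above this
        ;                                          // spin until published
    // If we got here, we're guaranteed to observe payload == 42.
}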
----------
a();
b();
while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value
c();
d();
e();
atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock
f();
g();
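In real C++11 syntax that pseudocode would look roughly like this (just a sketch; a() through g() are the placeholder functions from above):
#include <atomic>
#include <thread>
// Empty stand-ins for the placeholder functions in the pseudocode above.
void a() {}  void b() {}  void c() {}  void d() {}  void e() {}  void f() {}  void g() {}
std::atomic<int> spinlock{0};
void example() {
    a();
    b();
    // Acquire half-barrier attached to the exchange on `spinlock`:
    // write 1 into the lock while atomically reading its previous value.
    while (spinlock.exchange(1, std::memory_order_acquire) != 0)
        std::this_thread::yield();
    c();
    d();
    e();
    // Release half-barrier: storing 0 publishes everything above and frees the lock.
    spinlock.store(0, std::memory_order_release);
    f();
    g();
}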
So the C++ acquire/release model is always relative to a variable, commonly the spinlock.
This means that "c, d, and e" are protected by the spinlock (or whatever synchronization variable you're working with). Moving a or b "inside the lock" is fine, because that's the "unlocked region", and the higher-level programmer is fine with "any order" outside of the locked region.
Note: this means that c(), d(), and e() are free to be rearranged as necessary. For example:
while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value
for(int i=0; i<100; i++){
value+=i;
}
atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock
The optimizer is allowed to reorder the code inside into:
while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value
for(int i=99; i>=0; i--){ // decrement-and-test form is faster on many processors
value+=i;
}
atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock
It's the ordering "relative" to the spinlock that needs to be kept, not the order of any of the other loads or stores that happen. As long as all the value+=i stores are done "before" the atomic_store(spinlock) command, and "after" the atomic_swap(spinlock) command, all reorderings are valid.
So reordering from "value+=0, value+=1, ... value+=99" into "value+=99, value+=98... value+=0" is an allowable optimization.
----------
It seems like WRITE_ONCE / READ_ONCE were written for DEC_Alpha, which is far weaker (fewer guarantees about ordering) than even ARM. DEC_Alpha was the first popular multicore system, but its memory model allowed a huge number of reorderings.
WRITE_ONCE / READ_ONCE probably compile down to plain loads/stores (no extra barrier instructions) on ARM or x86. I'm not 100% sure, but that'd be my guess. I think the last 20 years of CPU design have overall said that the DEC_Alpha's reorderings were just too confusing to handle in the general case, so CPU designers / low-level programmers just avoid that situation entirely.
"dependent memory accesses" is very similar to the confusing language of memory_order_consume. Which is again: a model almost no one understands, and almost no C++ compiler implements. :-) So we can probably ignore that.
Wrt Alpha, I think a large part of the weirdness was that some early variants had split caches which weren't coherent with each other, or something like that. So if a pointer and the value it pointed to were in different cache banks, you could get funny effects.
Everyone interested in the history of computing should read The Dream Machine by M. Mitchell Waldrop. The book pretends to be a biography of a little-known but highly influential guy named Licklider, but is in fact maybe the best general history of computing. It covers Turing, von Neumann, ARPA, Multics, DARPA (the internet), and Xerox PARC. Alan Kay recommends it as the best history of PARC.
Well, the biggest Ivy Bridge EX has 15 cores! There's speculation that the 18-core Haswell-EPs are actually Haswell-EX dies that Intel wants to get rid of as fast as possible because these chips have buggy TSX (transactional memory).
But cold caches are an unrealistic assumption. The top-most levels of a tree will always be in cache, unless you almost never access them -- in which case there's no problem either. Additionally, a radix tree is ordered, whereas a hash table is not.
Yes, that's correct. However, that's still over a thousand cycles for a tree of depth 5 below the cached part (five dependent cache misses at a couple hundred cycles each). That's a modestly sized tree (or several smaller trees). Don't forget lots of things compete for cache; it's usually safer to assume a cold cache unless you know your data structure is very high traffic.
I wonder how long it will take until there is a market for IP addresses. I suspect once such a market is in place IPv6 will not see widespread adoption, since most IPv4 addresses are not really used.
I suspect you're joking, but the old timers really don't like IP speculation. It took them years to agree on the current pseudo-market because they had to figure out how to keep out speculators.
At which point the routing properties of IP will be destroyed (all addresses in a given /16 or /24 routing through the same link helps router performance a lot). I think at that point you'll get a real ISP push to switch to IPv6.
What has happened so far: A curfew was put in place, but it has done absolutely nothing; the streets are full of people everywhere in the country. There are reports of dozens of deaths. The headquarters of President Mubarak's party has been on fire for hours, and no firefighters are there. The headquarters is next to one of the most important Egyptian museums. The police have no control over the streets, so the army was ordered in to enforce the curfew. The people are actually cheering as the military moves in. It is still unclear what the military will do. Hillary Clinton has issued a statement calling on the Egyptian government to restrain its security forces and avoid violence.