System design implications of declining network latency vs. disk latency (danluu.com)
155 points by liotier on Nov 2, 2015 | hide | past | favorite | 39 comments


> Since 1995, we’ve seen an increase in datacenter NIC speeds go from 10Mb to 40Gb, with 50Gb and 100Gb just around the corner.

This has been a huge factor for our firm.

Trading systems used to be, by necessity, monolithic in design. Your risk system, feed handlers, portfolio management, strategies, execution management, and drop copy/logging systems were all built into one process due to the poor speed of interprocess communication, even on the same VLAN. You just couldn't get the required performance and certainty with a distributed system. I mean, it's nice that your distributed database is eventually consistent over the course of 500 milliseconds, but that's a couple of orders of magnitude too slow.

Now, networks are sooooooo fast and bandwidth is so plentiful that we can afford to split many of the major subsystems onto different machines.

It's important to note that we don't split these subsystems up for redundancy. We aren't Netflix; we can't pull out major subsystems and keep trading. If one goes down, we intentionally take down the entire system. You can laugh at me here if you'd like...

Rather, we split the systems up so that we can test and upgrade them more easily. Before the split, each upgrade was a major change that might have happened twice a month; now we can upgrade individual systems 2-3 times a week. Again, you may laugh now, but I put a very big premium on stability and certainty.

I remember talking to someone at Google who told me that their low latency RPC framework was considered to be one of their key technological advantages. Having made the move over the course of a couple of years, I'd be inclined to agree, a robust and fast RPC framework is an incredibly freeing piece of technology.


If a single scaled-up server can actually handle the load, why introduce a network? Obviously cross-network comms can't be faster than cross-process comms. And with current NUMA machines, you already have a system that's really more like a cluster of computers anyway (with a fast interconnect that's still slower than intra-socket access). What's the real benefit of using RDMA or something to write to a foreign machine's memory versus nontemporal writes to a foreign socket's memory?

BTW any suggested reading on the technical implementation design of trading systems? I read that Island's system was distributed (horizontally scalable) even in the 90s. Is that doing things like running separate trading instances per symbol? (Which should work unless there's atomicity required across symbols, like in spread orders?)


I have no idea about the OP's system, but I can see some advantages to breaking components up across a network, at least for vertical scaling. First, server prices do not scale linearly: four 10-core machines will generally be cheaper than a single 40-core machine. Second, some programs can hog memory/CPU time; forcing programs to run on their own machine can ease some of that contention. These gains can seem rather small, but can make a big difference at large scale. If you don't mind the added latency, it's a perfectly acceptable architecture.


A third advantage is that it forces proper separation; it can be very tempting to do things you shouldn't with local resources for "speed".


> You can laugh at me here if you'd like

Why laugh? Different businesses, different requirements, different solutions. Horses for courses, as they say.


> I remember talking to someone at Google who told me that their low latency RPC framework was considered to be one of their key technological advantages. Having made the move over the course of a couple of years, I'd be inclined to agree, a robust and fast RPC framework is an incredibly freeing piece of technology.

Have you found Google's RPC framework to be robust and fast, or something else? And if so, would you mind sharing what it is? Thanks!


We've experimented with Protocol Buffers, Thrift, and Cap'n Proto.

We settled on protobuf, but Cap'n Proto was super fast. I just didn't have the time to play around with it, or the fortitude to put it into production with it being so new.

I should point out that's my own hang-up and not meant to disparage Cap'n Proto at all; call it the web 2.0 version of "Nobody ever got fired for buying IBM".
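A toy illustration of why Cap'n Proto-style encodings benchmark so well: when the wire format is the in-memory format, "parsing" is just reinterpreting bytes. The fixed little-endian layout below is made up for the example; it is not either library's actual wire format.

```python
import struct

# Hypothetical "schema": a quote message with a 32-bit id
# and a 64-bit signed price in ticks.
QUOTE = struct.Struct('<Iq')

def encode(quote_id, price_ticks):
    return QUOTE.pack(quote_id, price_ticks)

def decode_copy(buf):
    # Protobuf-style: walk the bytes and materialize new objects up front.
    return QUOTE.unpack(buf)

def decode_zero_copy(buf):
    # Cap'n Proto-style: keep the buffer; fields are read on demand.
    return memoryview(buf)

msg = encode(42, 1_000_050)
qid, price = decode_copy(msg)
view = decode_zero_copy(msg)
lazy_qid = struct.unpack_from('<I', view, 0)[0]  # slice a field lazily
print(qid, price, lazy_qid)
```

The zero-copy decoder does no work until a field is touched, which is where the "infinitely fast" benchmark numbers come from.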


I really like what I've read of CapnProto and would try it for my next distributed system project.


CapnProto has good competition from Google's Flatbuffers. Which one is better comes down to the specifics of your data.

https://google.github.io/flatbuffers/

https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-...

https://news.ycombinator.com/item?id=7901991


The newest version was actually open sourced earlier this year: http://www.grpc.io/


Protobuf, probably.


Problem: most regular users have horrible asymmetric internet connections, and they certainly haven't improved at 50x/decade as stated. As I recall, I had 2Mbit/512kbit cable in 2000, and now I have 10Mbit/1Mbit.

"Youtube, Netflix, and a lot of other services put a very large number of boxes close to consumers to provide high-bandwidth low-latency connections. A side effect of this is that any company that owns one of these services has the capability of providing consumers with infinite disk that’s only slightly slower than normal disk.[...]The most common counter argument to disaggregated disk, both inside and outside of the datacenter, is bandwidth costs. But bandwidth costs have been declining exponentially for decades and continue to do so."

I have, at best, 1 Mbit upstream at my house. The author states:

"A typical consumer 3TB HD has an average throughput of 155MB/s, making the time to read the entire drive 3e12 / 155e6 seconds = 1.9e4 seconds = 5 hours and 22 minutes."

but my upstream connection is three orders of magnitude slower (in spite of costing the same amount). How am I supposed to get my data to the provider? Just emailing an attached photo is already painful...
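For concreteness, here is the arithmetic behind both figures; the drive size and throughput come from the article, and the 1 Mbit/s upstream is the commenter's own link:

```python
# Time to read a 3 TB drive sequentially at 155 MB/s (article's figures).
disk_bytes = 3e12
disk_read_s = disk_bytes / 155e6                  # ~19,355 s, about 5 h 22 min

# Time to push the same 3 TB through a 1 Mbit/s upstream link.
upstream_bits_per_s = 1e6
upload_s = disk_bytes * 8 / upstream_bits_per_s   # 2.4e7 s, about 278 days

print(disk_read_s / 3600)       # ~5.4 hours
print(upload_s / 86400)         # ~278 days
print(upload_s / disk_read_s)   # ~1240x: three orders of magnitude
```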

These are really interesting ideas for businesses and the data center, but I think the author is out of touch with how bad consumer internet service is for us regular chumps.


"Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway." - Andy Tanenbaum

Amazon has formalized their Import/Export service with their own hardware, the "Snowball": https://aws.amazon.com/importexport/

$200 for 10 days in your possession, up to 50 TB, no extra charge if you're importing to Amazon. Useful to get your media collection backed up in the cloud, if you don't want to take a few years as I'm currently planning for my AT&T DSL soda straw. But otherwise, latency slower than a round trip to the New Horizons space probe, which is currently ~ 9 hours.
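As a back-of-the-envelope check on the station-wagon quote (the 2-day transit time below is an assumption for illustration, not an AWS figure):

```python
# Effective bandwidth of shipping a full 50 TB Snowball,
# assuming a hypothetical 2-day door-to-door transit.
payload_bits = 50e12 * 8
transit_s = 2 * 86400
effective_gbps = payload_bits / transit_s / 1e9
print(effective_gbps)  # ~2.3 Gbit/s
```

Latency is measured in days, but the bandwidth beats most consumer WAN links by orders of magnitude.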


It's correct for me (I started with 300 baud in the 80s and have 200/10 MBit/s now), but I'm not so sure it'll scale up like this in the future. We'll see!


You have to be careful with articles like this. Most system design people work at large scale companies because they're the ones who can afford them. But it's a mistake to think that because something makes sense for Google or Amazon, it also makes sense for your small/medium business.

For example, cloud providers pay less for disk than you. They may even charge you less than it would cost for you to buy it yourself. But how much storage do you actually use? For most companies your entire storage budget is probably a rounding error in your cost of doing business. Amdahl's law says if you're looking to save money there you're wasting your time, particularly if it comes at the cost of anything important.

Which it kind of does. Even putting aside the entire privacy debate, cloud storage creates third-party dependencies. Put the only copy of your data on cloud storage and now you're dependent not only on the cloud provider, but also their ISP and your ISP. If any of them has downtime, so do you. If the cloud provider loses all your data, they say "oops" and have bad PR for a while; you go out of business.

And as the article mentions, these numbers are idealized. It is possible to build a network that has a 2ms latency to cloud storage, but you have little control over that. I currently have a 28ms ping to youtube.com. And it's not as if you can measure the latency and then assume it will never change. If your ISP gets into a peering dispute you may go from 5ms today to 50ms tomorrow. If your ISP oversubscribes its network you may start seeing three-digit round-trip latencies when the network exceeds capacity. And it's not as if you have much choice in ISPs if this happens; if you have any at all, the alternative could be as bad or worse.

But probably the most important thing to understand is that the question is not even relevant to most people. With the amount of memory supported by even fairly old servers, your entire working set may fit in memory and make disk access latency quite irrelevant.


I'd look at this the other way round: everyone should now be aware that a computer is a series of processing units connected by network links whose latency is many times that of a register or cache access.

Some of those network links are labelled "PCIe" or "SATA", and some are labelled "Ethernet" or "DisplayPort", but they're all high-speed serial links. A PCIe lane is only 8Gbit, comparable to 10Gbit ethernet.

This means that your computer "backplane" design can be de-constrained; you don't have to draw a hard distinction between elements in the same case and those in the case of the machine next to you in the rack.

A rotating disk is roughly able to saturate a 1Gb ethernet link. Maybe we should just go full SAN and put Ethernet connectors on the drives.


Very true. Most servers are multiple processors (sockets) interconnected via a high speed network (QPI, HT), and each processor has links to its own memory and some PCI lanes. You could almost see that as multiple distinct computers on one board and in one case.

Hard disks with direct Ethernet connections exist (e.g. Seagate Kinetic).


Seagate Kinetic looks like an excellent idea, although horribly expensive.


It goes deeper than that. Some of those links are called QPI. Accessing memory on a multi-socket system can be much slower (40%) if the memory isn't connected to the socket you're on. In that case the request has to go to the other socket first, then to RAM. In a quad-socket system, this might mean making two hops.

Even accessing cache in a multi-socket system sucks. Taking ownership of a cache line that's in another socket can cost 300 cycles. Ouch.

So in a way, programming current Intel NUMA systems is a lot more like writing distributed systems in the first place, if you care about performance. Yes, it's more reliable (your cross-socket read isn't gonna sporadically fail), but if you treat it as a single unit, things get slow.


To be fair, PCIe/SATA are a hell of a lot more reliable as "network links" than 10Gb Ethernet. SANs, however, have the same failure modes as typical LANs.


>A rotating disk is roughly able to saturate a 1Gb ethernet link.

Locality is key there. Read randomly and you won't saturate a 100Mb link.
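Rough numbers make the point; the seek rate and 4 KiB read size below are typical 7200 RPM assumptions, not measurements:

```python
# Sequential: a modern rotating disk streams ~155 MB/s,
# enough to saturate gigabit Ethernet (~125 MB/s of payload).
seq_mb_s = 155.0
gige_mb_s = 1e9 / 8 / 1e6          # 125 MB/s

# Random: ~100 IOPS at 4 KiB per read.
iops = 100
random_mb_s = iops * 4096 / 1e6    # ~0.4 MB/s

fast_e_mb_s = 100e6 / 8 / 1e6      # a 100 Mb link carries 12.5 MB/s
print(seq_mb_s > gige_mb_s)        # sequential reads fill the gigabit pipe
print(random_mb_s < fast_e_mb_s)   # random reads can't even fill 100 Mb
```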


Back in 2002 when I was with NetApp, we would show customers that NFS (network-attached storage) was faster than direct-attached storage. Oddly, it took a while to convince people of that. But it was very successful running Oracle over NFS in their "cloud" data center in Texas.

That said, as network speeds surpass what were previously internal system interconnect speeds, new design configurations become possible. However, the lack of predictability of those interconnects makes things a bit more complex. If you're running stuff in AWS, for example, and you are disk bound, it's probably worth your while to benchmark a multidisk system running ZFS as a "local" NAS point for your disk storage. You might find that you can run a dozen or more instances off that single storage box for less money than a dozen machines with chunks of disk.


Is that another way of saying that the Netapp boxes had battery backed RAM that'd promise to really, for sure, absolutely, persist data written to said RAM? (And, also, that two boxes with twice the RAM has, well, twice the RAM)?


"But bandwidth costs have been declining exponentially for decades and continue to do so. "

I wonder how pronounced this effect will be in places where bandwidth costs are not following the same downward trend as in North America. e.g. Australia:

https://blog.cloudflare.com/the-relative-cost-of-bandwidth-a...

e.g. Will data centers be designed differently?


There is plenty of dark fibre around in Australia (driven by a need to bypass monopoly telcos), so the bandwidth costs cited by Cloudflare have no real effect on datacenter design.

(The 'true' cost of transit in AU, ignoring Telstra is closer to the Asia region average of $30 IIRC)

I think higher energy and real estate costs in Australia have more of a bearing - it is quite possibly cheaper to 'import' bits into Australia than to store/process/etc locally (any data sovereignty issues notwithstanding)


If it's dark fibre, how is it having any effect at all on transit cost?


They will be happy to light it for you, at $10K a month.

https://www.youtube.com/watch?v=HeEAVj2Szbg


I didn't say it did. It was a side comment that the situation was not as dire as Cloudflare said - if you can avoid buying transit directly from one specific provider.


#include <pedantic.h>

Yes, you did:

A: There is plenty of dark fibre around in Australia (driven by a need to bypass monopoloy Telcos)

B: the bandwidth costs cited by Cloudflare have no real effect on datacenter design

and

C: (The 'true' cost of transit in AU, ignoring Telstra is closer to the Asia region average of $30 IIRC)

Based on simple sentential logic, you are saying A->B, and we're provided with C as additional information. Unless your "that" isn't being used as an implicative connective, which would be silly.

How does fallow, unused, dark fibre have any effect on transit cost? Is there a 'transit cost' options market or something?


No, because WAN and LAN speeds are quite different; you're not going to get InfiniBand to the home anytime soon.


I understand where the author is coming from - high speed serial transceivers really have undermined our traditional concept of a "machine". And in the uniform administrative domain of a data center, hardware refactorings are straightforward. So it's pretty nifty to think about these changes to fundamental constants (although the continual rewriting of software systems because they're not abstracted enough to treat it all as one cache hierarchy is disheartening).

But when moving across administrative boundaries, especially to a more ad-hoc environment, serious problems crop up. Some assumptions the author is handwaving away:

1. That 2ms to Youtube is to a single specific Youtube cache. If you're about to get on a plane, are you supposed to inform Google so that they can move your files across the country?

2. "The network is reliable" - what do you do while you're on that plane? DSL is still our national infrastructure policy so in many places that is the best that is available, especially if you want a choice of providers to avoid anti-consumer behavior.

2a. Or wifi. What is the latency of modern best-of-class consumer gear? I'm getting 16ms with a simple TL-WR703N. Never mind mobile...

3. Access patterns being so long-tail points to the continuing utility of end-user SSDs as non-volatile cache. The devices that would be most likely to drop them (cost, space, power) are generally on the worst network links.

4. The datacenter containing the notion of what data is authoritative (otherwise you wouldn't need to hit it for every access as outlined). What is the need for this?

5. The overhead of implementing oblivious transfer (or the alternative of being surveilled and having decreased economic power and freedom).

When you think in terms of datacenters, it's tempting to think that the next advancement is end-user computers becoming part of that system, but that is a skewed perspective. The pendulum swings back and forth, and I think we're pretty close to Peak Cloud. The next deluge will be user-centric software that utilizes the cloud for its strengths, but respects privacy and tracks the authoritative copy on user-owned machines.


When I internalized this trend ~3 years ago, it's what spurred my interest in distributed storage systems across machines, racks, and datacenters, that are only memory backed.


Any examples of such a system?


All the major webscale systems - Google, Facebook, Microsoft's cloud services.

At a lower scale, there are also services like Hacker News, Plenty of Fish, and Mailinator that serve out of RAM on a single box.


Besides speed, there is another difference between local and remote storage: remote storage is on a different failure domain. A machine is still a single failure domain since the storage/RAM/CPU share a power supply / battery backup.

If you change an application to use remote rather than local storage, you may not see much performance degradation in some cases, but your application will experience more failure modes than it did before.


This all seems to be driven by latency figures, predicated on the application accessing data in a random pattern.

What if you are doing some data-intensive computations where the data access pattern is fixed and computation parameters change?

This particular application wouldn't care about latency, since you can linearize or otherwise arrange your data on disk so it is in the optimal position for your application. Instead, throughput would be the limiting factor.

Has anybody done a similar analysis on this basis?
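A quick sanity check of that intuition; the numbers are assumed for the example (a 1 TB dataset, 155 MB/s sequential disk throughput, 0.5 ms average access latency, 4 KiB random reads):

```python
dataset = 1e12    # bytes
seq_bw = 155e6    # bytes/s sequential

# Throughput-bound: stream the whole linearized dataset once.
stream_s = dataset / seq_bw    # ~6,452 s, under 2 hours

# Latency-bound: touch the same data in random 4 KiB reads
# at 0.5 ms per access.
reads = dataset / 4096
random_s = reads * 0.5e-3      # ~1.2e5 s, well over a day

print(stream_s / 3600)
print(random_s / stream_s)     # random access is ~19x slower here
```

With a fixed access pattern, arranging the data for streaming makes throughput, not latency, the limiting factor.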


Somewhat related: Is there any solid open source (or commercial?) software that can just join several servers into a redundant, SPOF-less, RAM-based block storage device? Perhaps exposed via iSCSI or something so it'd work for any client?

Idea would be that I could setup two separate racks with UPSes, specify a safety factor, then have super-fast shared disk/block available. (DB clustering or whatever.)


I could be wrong and haven't explored the idea myself, but couldn't you accomplish this with something like ceph or scaleio and tmpfs/ramdisk for your block devices?


The problem with network latency is that it is bound by the speed of light.
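For a sense of scale, here is the floor physics sets; the distance is the great-circle NY-London route (~5,570 km), and 1.47 is the usual refractive index quoted for optical fiber:

```python
C = 299_792_458        # speed of light in vacuum, m/s
FIBER_INDEX = 1.47     # light in glass travels ~1/1.47 as fast

def min_rtt_ms(distance_m, in_fiber=True):
    """Lower bound on round-trip time over a given one-way distance."""
    v = C / FIBER_INDEX if in_fiber else C
    return 2 * distance_m / v * 1000

ny_london_m = 5_570_000
print(min_rtt_ms(ny_london_m, in_fiber=False))  # ~37 ms in vacuum
print(min_rtt_ms(ny_london_m))                  # ~55 ms in fiber, before queuing
```

No amount of NIC or switch improvement gets a transatlantic round trip under that bound.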



