The other replies are assuming networking in a big network is inherently slower than in a small network. I used to work at Google in Tech Infra, so I'll offer an alternate perspective while trying not to spill secrets.
First, Google has enough money that they can build their entire network out of custom hardware and custom firmware, and patch the kernel + userspace to match. A datacenter at Google scale is architecturally similar to a supercomputer cluster running on InfiniBand. You will never be able to replicate the performance of Google's network by buying some rackmounted servers from Dell and plumbing them together with Cisco switches.
Second, assuming a reasonably competent design, adding more machines to a network doesn't significantly increase the latency of that network. You'll see better latencies between machines in the same rack than between two racks, but this is a matter of single microseconds rather than milliseconds. Additional latency from intermediate switches is measured in nanoseconds.
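To put rough numbers on the same-rack vs. cross-rack point, here is a back-of-the-envelope sketch of propagation delay alone. The cable lengths and the 0.67 velocity factor for fiber are illustrative assumptions, not measurements from any real datacenter, and the sketch ignores switching and host overhead entirely.

```python
# Rough one-way propagation delay over fiber.
# All distances below are illustrative assumptions, not real datacenter figures.
SPEED_OF_LIGHT_M_PER_S = 299_792_458
FIBER_VELOCITY_FACTOR = 0.67  # assumed; light in glass travels at roughly 2/3 of c


def propagation_delay_ns(cable_metres: float) -> float:
    """Time for a signal to travel the given cable length, in nanoseconds."""
    speed = SPEED_OF_LIGHT_M_PER_S * FIBER_VELOCITY_FACTOR
    return cable_metres / speed * 1e9


if __name__ == "__main__":
    # Hypothetical cable runs: a patch cable inside a rack vs. a run across a hall.
    for label, metres in [("same rack", 2), ("across the hall", 200)]:
        print(f"{label}: {propagation_delay_ns(metres):.0f} ns one way")
```

Under these assumptions, even a couple of hundred metres of fiber costs on the order of a microsecond, which is why physical distance inside one building doesn't push latency into milliseconds.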
Third, Google publishes an SLA on round-trip network latency between customer VMs at <https://cloud.google.com/vpc/docs/vpc>. Their "tail latencies less than 80μs at the 99th percentile" translates to ~40μs for one way, and honestly for customer VMs a lot of that happens in the customer kernel + virtualization layer. A process running on bare-metal, such as a kernel reading a remote network block device, can (IIRC) expect single-microsecond latencies to get one packet onto a nearby machine.
> The other replies are assuming networking in a big network is inherently slower than in a small network.
well yes, not true with modern switches that support cut-through forwarding
it's super-common in our space to bypass the kernel entirely, writing into the NIC buffers directly with prepared packet headers, and the card has pushed part of the packet out onto the wire, through switches and into the target machine's NIC buffers before it's even finished being written
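A rough sketch of why cut-through matters: a store-and-forward switch has to receive the whole frame before sending it on, so each hop costs at least one full serialization delay, while a cut-through switch can start forwarding once it has read enough of the header. The frame size, header size, and link speed below are assumptions for illustration, and real switches add their own fixed pipeline latency on top, which this ignores.

```python
def serialization_delay_ns(num_bytes: int, link_gbps: float) -> float:
    """Time to clock num_bytes onto a link, in nanoseconds.

    bits / (Gbit/s) comes out directly in nanoseconds.
    """
    return num_bytes * 8 / link_gbps


# Illustrative assumptions, not figures from any particular switch:
FRAME_BYTES = 1500   # a full-size Ethernet payload
HEADER_BYTES = 64    # bytes a cut-through switch reads before it starts forwarding
LINK_GBPS = 10.0     # link speed

store_and_forward_ns = serialization_delay_ns(FRAME_BYTES, LINK_GBPS)
cut_through_ns = serialization_delay_ns(HEADER_BYTES, LINK_GBPS)

if __name__ == "__main__":
    print(f"store-and-forward, per hop: {store_and_forward_ns:.1f} ns")
    print(f"cut-through, per hop:       {cut_through_ns:.1f} ns")
```

With these numbers the per-hop minimum drops from about a microsecond to tens of nanoseconds, and the gap grows with frame size.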
typical "SLA"s are 0 packets dropped during a session, where a single drop raises an alert that is then investigated
> You will never be able to replicate the performance of Google's network by buying some rackmounted servers from Dell and plumbing them together with Cisco switches.
and yet, somehow we do quite a bit better (admittedly they are very, very expensive switches)
I get that people that work at Google like to think they're working on problems more advanced than those of mere mortals, but with the latencies you've described we'd be out of business several times over
(not to mention none of the clouds support multicast)
That's just a different niche. I assume that you work in trading based on the word "multicast".
What Google needs is "Big+Cheaper" datacenters, and it has to work with code written by 100,000 different mere mortals. What you described is in the "Small+Expensive" field, but with extreme worst-case performance demands.
"Big+Expensive" = Supercomputer
"Small+Cheaper" = ??? (Note that the Big+Cheaper solutions don't necessarily work here, as you can't amortize and ignore one-time R&D/ops costs anymore)
The other replies are accepting multi-millisecond latencies as a given, and think that Google's network must be slower than even a basic copper-wired LAN because it's bigger.
My response is something like "just because the network's bigger doesn't mean it's slower".
>> You will never be able to replicate the performance of Google's network
>> by buying some rackmounted servers from Dell and plumbing them together
>> with Cisco switches.
>
> and yet, somehow we do quite a bit better (admittedly they are very,
> very expensive switches)
With respect, if you're in the trading business, your network almost certainly contains custom hardware. I bet it looks a lot closer to Google's than it does to the guy plugging cat5e into a Dell.