Vector Packet Processing (netgate.com)
66 points by teleforce on Nov 21, 2023 | hide | past | favorite | 39 comments


100x faster than Linux is certainly fast. IIRC, Linux packet processing is considered rather slow (though full featured, well behaved and configurable).

VPP here seems to be a "user-mode network stack", as far as I can tell. I was kind of attracted to the title because I was hoping for SIMD / Vector compute maybe even GPUs, but that doesn't seem to be the case.

Still, a usermode network stack is apparently a must-have for any very-high performance network application. I've never needed it, but a lot of optimizers talk about how "slow" Linux networking is when you actually benchmark it.


Actually, at its root it is based on SIMD and prefetching. In short, each part of the packet-processing graph is a node. It receives a vector of packets (represented as a vector of packet indexes); the output is one or more vectors, each of which goes as input to the next node in the processing graph. This architecture maximizes cache hits and warms the branch predictor (since we run the same small piece of code over many packets instead of the whole graph for each packet).

You can read more about it here: https://s3-docs.fd.io/vpp/24.02/aboutvpp/scalar-vs-vector-pa...


I can certainly imagine some SIMD concepts in that, particularly stream compaction (or, in the AVX512 case, the VPCOMPRESSD and VPEXPANDD instructions).

EDIT: I guess from a SIMD-perspective, I'd have expected an interleaved set of packets, a-la struct-of-arrays rather than array-of-structs. But maybe that doesn't make sense for packet formats.


The NIC gives you an array (ring buffer) of pointers to structs (packets). Interleaving them into SOA format would probably cost more than any speedup from SIMD.


Yeah, but it's difficult to write a SIMD / AVX512 routine if things aren't in SOA format.

I can see how this approach described is "vector-like", even if the vector is this... imaginary unit that's parallelizing over the branch predictor instead of an explicit SIMD-code.

This "vector" organization probably has 99.999%+ branch prediction or something, effectively parallelizing the concept. But not in the SIMD-way. So still useful, but not what I thought originally based on the title.


A ring buffer of pointers to structs is friendly to gather instructions. That said, the documentation shows a graph of operations applied to each packet. I'd expect that to lead to a lot of "divergence", and therefore to be non-SIMD-friendly.

(also, x86-64 CPUs with good gather instructions are rare, and sibling comments show that this is aimed at lower end CPUs. That makes SIMD even less relevant.)


Most packets follow the same nodes in the graph. You have some divergence (e.g. ARP packets vs. IP packets to forward), but the bulk of the traffic does not. So typically the initial batch of packets might be split in two: a small "control-plane traffic" batch (e.g. ARP) and a big "data-plane traffic" batch (IP packets to forward). You won't do much SIMD on the small control-plane batch, which is branchy anyway, but you do on the big data-plane batch, which is the bulk of the traffic.

And VPP targets high-end systems and uses plenty of AVX512 (we demonstrated 1 Tbps of IPsec traffic on Intel Ice Lake, for example). It's just very scalable, to both small and big systems.


> "user-mode network stack"

Kernels typically cannot use vector instructions because if they did they would need to save and restore the vector register state when servicing interrupts. There is a very large performance cost to doing that.

Moving packet processing into userspace means adding latency, including TLB pressure, in order to do the context switch.

I imagine that we might get some innovation by allowing the system to be configured so that the kernel owns the vector registers and userspace is not allowed to use them. If your primary interest in vector registers/instructions is packet processing, and you're doing that in kernelspace, you might not mind it if userspace can't use those registers.


The vector in vector packet processing has little to nothing to do with vector instruction sets (SSE/AVX, VMX, RVV, etc.). Only the very latest CPUs (and historical supercomputers) have scatter/gather instructions capable of efficiently extracting packet header fields from multiple packets in parallel, and with a large enough batch of packets, switching the vector register file is worth it even to the kernel. It's just that most kernels don't spill the userspace vector registers on every context switch, because it's more common to switch back to the same thread (or another userspace thread) than to use the vector registers inside the kernel. Both Linux and *BSD can and do make limited use of vector registers inside the kernel, e.g. for fast encryption/decryption, because there it's worth the startup cost.

If I understood the VPP design and implementation details correctly, they reduce the amortized cache misses for a batch of packets by running all packets in the batch through each software pipeline stage before processing the next stage. This should result in very good average instruction cache hit rates, and should also help with data cache hit rates, because packet headers are small and can be prefetched, while the forwarding data structures (e.g. 1 million IPv4 prefixes and their next hops) are hard to fit into L2 data caches and won't fit into L1 data caches.

I assume a carefully tuned implementation can make further gains by dedicating cores to specific pipeline stages to keep the data caches hotter, at the cost of copying processed packet headers to their next stage in new (sub-)batches. The actual packet content is only relevant for a few operations like encryption/decryption, and modern high-end NICs have line-rate crypto engines to help with IPsec or TLS.


Batching packets brings several benefits:

  - amortizing cache misses, as you mentioned

  - better use of out-of-order, superscalar processors: by processing multiple independent packets in parallel, the processor can fill more execution units

  - enabling the use of vector instructions (SSE/AVX, VMX, etc.): again, processing multiple independent packets in parallel means you can leverage SIMD. SIMD instructions are used pervasively in VPP


> Moving packet processing into userspace means adding latency, including TLB pressure, in order to do the context switch.

This isn't the case because VPP polls the NIC from userspace and never enters the kernel. There are no context switches.


Is there something like an IOMMU to provide secure access to the NIC?


Yes, DPDK (which VPP is built on) makes heavy use of the IOMMU to provide host protection.

https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html discusses it a bit.


> Kernels typically cannot use vector instructions because if they did they would need to save and restore the vector register state when servicing interrupts. There is a very large performance cost to doing that.

While this is true, it definitely doesn't seem to stop people like Apple [1] from using SIMD extensions in the kernel anyways. On ARMv8, it's an extra 512 bytes (32 qword registers) or eight (and change, depending on alignment) dirty cache lines. Whether or not this causes a serious performance impact will depend on how well the kernel can actually make use of SIMD (saving and restoring might cost 1% perf but if speeding up the kernel wins 5%, who cares!). It could be interesting to play with this on other kernels to verify these assumptions, it might be worth enabling in the kernel on some devices! Kernels do tend to do a lot of moving data around, which these extensions (on ARM, anyways) excel at.

[1] https://github.com/apple-oss-distributions/xnu/blob/main/osf...


Calico CNI notably has beta support for VPP, including the userland memif interface if you really really really need speed. https://www.tigera.io/blog/high-throughput-kubernetes-cluste... https://docs.tigera.io/calico/latest/getting-started/kuberne...

With memif especially, it's fast as heck. But you need to rebuild your apps to target memif. There are some pretty good drop-in stdlib replacements for languages like Go, but it's still some work to use the DMA-accelerated, shared-memory, high-speed userland packet processing that VPP is capable of. Ex: https://github.com/KusakabeShi/wireguard-go-vpp


I have been developing a product that uses VPP in production for a few years now. It is very cool to see how much you can squeeze out of cheap, low-power CPUs. You can easily handle tens of gbits of iMIX traffic with a few ARM Cortex-A72s.

VPP has very good documentation: https://s3-docs.fd.io/vpp/24.02/ A very cool, unique capability is the graph representation of packet processing, and the ability to dynamically insert processing nodes into the graph, per interface, at some point in the processing, using features (https://s3-docs.fd.io/vpp/24.02/developer/corearchitecture/f...)


VPP has been shown to run at 22.1 Mpps on a single core of Gracemont (the efficient / Atom core in Alder Lake), and 42.3 Mpps on 2 cores (Intel E810 4x25 NIC, DPDK 22.0, VPP 22.06, GCC 9.4.0, RFC 2544 test with packet loss <= 0.1%).

The same core will do 14.99 Gbps of IPsec (aes-128-gcm, 1480-byte packets) using VPP, largely because it supports (VEX-encoded) VAES.

While these aren't ARM Cortex-A72s, they're quite close (cheap, low power) for Intel.


Reminds me of: "PacketShader - GPU-accelerated Software Router". https://shader.kaist.edu/packetshader/ , http://keonjang.github.io/papers/sigcomm10ps.pdf


You trade a lot of latency to make GPU parallelism work for packet processing/classification. There's some massively clever work around hiding it, but there's simply no way to avoid it. Thus it's a niche solution.


With GPUDirect and active-wait kernels, you can get a tight controlled latency and saturate PCIe bandwidth without touching main memory. StorageDirect if you need to write to (or read from) disk.


Packet sojourn time is bounded by the latency of the GPU memory architecture, which as I understand it has the design dial cranked to ten for parallelism and not so much for expediency.


People have been using GPU + DMA for low-latency / real-time / high-compute-intensity applications for some time (using them for adaptive optics, of all things). My PhD student's been cranking it to 100/200/400G with "just" DPDK, gpudev, and persistent CUDA kernels.

Depends on the application, batching policy, compute intensity, etc. But you can put 8 NICs and 8 GPUs in one node (and have them communicate through NVLink, so huge inter-GPU bandwidth!), which I can't do with CPUs. You can maybe also get some unobtainium A100X or CX7+H100 to skimp on PCIe if you're well funded...


They disclose their latency in the paper in section 6.4, pages 10 and 11: http://keonjang.github.io/papers/sigcomm10ps.pdf#page=10 . It's something like 0.4 ms.


We were running our core routers with BGP and VPP for several years pushing around 40-50 Gbps on a software stack without the need of those expensive ASICs. Worked great and stable. VPP is a great piece of technology.


Noob question, but doesn't this potentially add per-packet latency and processing variance?

My thinking is that one needs to have a set of packets before being able to start processing. The first packets to arrive must wait until enough packets have arrived to fill the minimum size of the vector. And if the last packet comes "late", its arrival time adds to the wait for the other packets, thus adding something that looks like variance.

I assume there are parameters setting the minimum number of packets in a vector, and timeouts for when to accept packets into a given vector.


This definitely adds per-packet latency and processing; that's why the recommended way to run VPP with high performance is to use some kind of network-acceleration library. VPP-based solutions like TNSR by Netgate work best with DPDK for kernel bypass; a similar technique is XDP with eBPF, but XDP does not bypass the kernel, hence eBPF is needed.

For Linux user-space solutions without kernel bypass, if the above-mentioned network accelerators are not installed (devices on customer premises, etc.), the recommended way is to use netmap, since it enables direct access to the network interface card (NIC) buffers from user space; otherwise you are at the mercy of Linux's own notorious sk_buff [1].

Another alternative, perhaps, for more efficient buffering in Linux is PF_RING or the new-kid-on-the-block io_uring, but I'm not sure whether they are currently utilized in VPP.

For a good introduction to Linux networking-acceleration technology, this presentation is a good start [2].

[1] VPP docs: Create netmap:

https://docs.fd.io/vpp/17.04/clicmd_src_vnet_devices_netmap....

[2] Linux Networking: The meaning of acronyms eBPF, DPDK, XDP, VPP [video]:

https://news.ycombinator.com/item?id=38376380


The standard solution to this is to trigger the batch process when N packets are queued OR M amount of time passes. As long as you set M to below your latency threshold, you should be good. If you don't want your CPU to burn up cycles polling a usually empty queue, you can add some logic to switch between polling-based and interrupt-based rx depending on throughput. The Linux networking stack already does this for drivers that support NAPI, and I'm sure that DPDK has an equivalent.


(Answering myself now, oh my.)

... And doesn't this also create a relationship between otherwise independent packets, potentially creating a way to tag packets through a network? Basically, if I can control the arrival time of my packets at a router (I send them at a baseline fixed rate, but delay the transmit time with a pattern), packets that are then bunched together to be vector-processed in the router will also be affected by this delay. I could possibly then observe this pattern at other places in the network, thus tracing packets.

Possibly.


It is a latency/throughput tradeoff. I haven't really looked at how VPP works, but I don't expect it actually waits for a vector of packets to be completed. Likely it buffers batch 2 while it is still processing batch 1; as soon as it is done, it starts processing batch 2 and begins buffering batch 3.

I wouldn't be surprised if this actually reduces variance.


VPP is really neat tech. I recently worked on a product that employed it, and it was impressive to see a commodity low-power CPU pushing tens of gigabits of traffic.


In short: batch processing at the multi-packet level. Increases throughput, at the cost of latency.


What kept this from being available til now? Seems like Cisco had it for ages.

> With experimental technologies, Linux has been shown to make some gains in artificial benchmarks, such as dropping all received packets

Is this a joke? A jab?


VPP has been publicly available since 2017 or before. It's incredibly fast and feature-full.

Dropping a packet after processing overhead and/or explicit classification work is a useful benchmark, yes.


Dropping packets efficiently is really important for handling DDoS. AFAIK that was the original motivation for Cloudflare to adopt XDP.


Oh man, on first reading I definitely thought it was saying "with experimental technologies, Linux can drop all traffic", which would have been a hilarious dig... but I think "drop all incoming packets" is a useful benchmark for evaluating overhead, like doing conntrack for every packet that hits and then immediately discarding it.


Yeah, that was my interpretation as well. Like the fastest Java garbage collector: the one that actually doesn't collect any garbage and lets it pile up till the JVM crashes.


>What kept this from being available til now? Seems like Cisco had it for ages.

Cisco is the one who wrote it and open sourced it. Netgate is just putting a wrapper around other people's (Cisco's) code.

>Is this a joke? A jab?

No, dropping packets is step 1 to proving you can get past some of the current CPU bottlenecks. Actually doing something useful is obviously significantly more work, but no point bothering with that work if the CPU is still the bottleneck.


> Netgate is just putting a wrapper around other people's (Cisco's) code.

Cisco open sourced VPP in 2016 and we've been busy working on it ever since.

https://www.stackalytics.io/unaffiliated?module=github.com/f...


This is the same project that was developed at Cisco originally



