PacketMill: Toward per-Core 100-Gbps networking (2021) (acm.org)
87 points by teleforce on Nov 24, 2023 | 26 comments


Only somewhat related: I've been wondering if you could use zero-copy networking with io_uring to send the same bytes over and over. Say you have some data that needs to be sent to many different clients (imagine a video game sending updates to players). From what I can tell, you should be able to create a "fixed" buffer with those contents and re-use it between calls in the ring. Not only would this be zero copy, but you also wouldn't ever need to touch the data at all to send it after the first time. (Normally, "zero copy" is more like one copy, since you need to copy the data into a kernel-allocated buffer even with zero copy.)

My understanding is that there's some contention from the locking involved, though, which can make this not scale well, but maybe that could be avoided (e.g. by trying to have per-core buffers used by the kernel, although I'm not sure you'd have any control over the kernel threads).


> although I'm not sure you'd have any control over the kernel threads

You don't have control from the application, but if you're really trying to get the most performance from a networking-heavy application, you want to set up your server so that the NIC rx and tx queues are CPU-pinned, and the application thread that processes any given socket is pinned to the same CPU the kernel is using for that socket. Then there's no cross-CPU traffic for that socket (if all else goes well).

Hopefully all the memory used is NUMA local for the CPU and the NIC as well. That gets trickier with multiple socket systems, although some NICs do have provisions to connect to two PCIe slots so they can be NUMA local to two sockets.
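A minimal sketch of the application side of this, assuming Linux and using Python's stdlib as a stand-in for the C calls. Note the queue-to-CPU mapping itself is configured separately (e.g. via IRQ affinity in /proc), and `target_cpu = 0` is just an assumed example:

```python
import os
import socket

# Assumption for illustration: CPU 0 is the core servicing the NIC queue
# we care about (set elsewhere, e.g. /proc/irq/<n>/smp_affinity).
target_cpu = 0

# Pin the current process/thread to that core.
old_mask = os.sched_getaffinity(0)
os.sched_setaffinity(0, {target_cpu})
assert os.sched_getaffinity(0) == {target_cpu}

# On Linux, SO_INCOMING_CPU reports which CPU last processed packets for
# a socket, so you can verify (or steer) the pairing. Guarded because the
# constant is only exposed on some Python/OS combinations.
a, b = socket.socketpair()  # stand-in for a real TCP connection
if hasattr(socket, "SO_INCOMING_CPU"):
    cpu = a.getsockopt(socket.SOL_SOCKET, socket.SO_INCOMING_CPU)
    print("kernel processed this socket on CPU", cpu)
a.close(); b.close()

os.sched_setaffinity(0, old_mask)  # restore the original affinity
```

The same idea in C would use `pthread_setaffinity_np()` plus `getsockopt(SO_INCOMING_CPU)`.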


I imagine that wouldn't be very useful, since most production traffic is encrypted (e.g. TLS) and key negotiations necessitate a per-client cipher stream.


Advanced NICs can do bulk TLS encryption, and probably other encryption if you work with the manufacturer. See the work the Netflix CDN has done: at high bitrates, memory bandwidth is a major bottleneck for them, and eliminating the memory reads and writes needed to do TLS in software (user level or kernel level) allowed them to hit much higher speeds.


Multicast (and broadcast) was supposed to do things like this, but I have rarely seen it used for any sort of "visible" client application.

(I mean visible like a videoconferencing application as opposed to something like invisible multicast dns)


> Multicast was supposed to do things like this, but I have rarely seen it used for any sort of "visible" client application.

Well, multicast is not possible on the internet, so I guess that's why you don't see it often.

In private network centric protocols though (e.g. finance), multicast is ubiquitous.


Multicast does get used for IPTV doesn’t it?


Interesting. But wouldn't this mean some clients would need to wait a long time for their download to start? (I'm imagining a multi-gigabyte game update - but maybe you meant a continuous stream of in-game positional updates of players.) They basically idle until the next "train" shows up at their station, and then they hop on, right? Also, if they have a network hiccup, they can't resume their download until one of the trains reaches the point where their download failed.

(That is, if you get my weird train analogy, and if I'm understanding your idea correctly.)


I was thinking of a game like minecraft or similar (essentially, a ~large world with lots of entities updating in a grid of chunks). In that case, you generally have people fairly clustered in different grid tiles, and they're all likely to get similar information about each chunk, so it seems like re-using the buffers between calls would be beneficial.

So instead of:

  for each player:
      for each block player needs:
          # Write block to player's connection
You could do:

  for each player:
      for each block player needs:
          # Check if the block is already mapped in a kernel buffer. If it is, just append a write for it to the io ring.
          # Otherwise, copy the block into a new buffer and send it.
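A hypothetical sketch of that bookkeeping (the io_uring calls are mocked out as plain lists; `register_buffer`, `queue_block_send`, and the data structures are all invented names):

```python
# Dedup bookkeeping: copy each block into a "registered" buffer once,
# then queue further sends by referencing the existing buffer index.

registered = {}        # block id -> index of the kernel-registered buffer
buffers = []           # stand-in for the table of registered fixed buffers
submission_queue = []  # stand-in for the io_uring submission ring

def register_buffer(data: bytes) -> int:
    """Pretend to register `data` as a fixed buffer; return its index."""
    buffers.append(data)
    return len(buffers) - 1

def queue_block_send(player, block_id, block_data: bytes):
    idx = registered.get(block_id)
    if idx is None:
        # First send of this block: one copy into a registered buffer.
        idx = register_buffer(block_data)
        registered[block_id] = idx
    # Every later send just references the existing buffer; no copy.
    submission_queue.append((player, idx))

# Two players needing overlapping blocks share the registered buffers.
for player in ("alice", "bob"):
    for block_id, data in ((1, b"stone"), (2, b"dirt")):
        queue_block_send(player, block_id, data)

assert len(buffers) == 2           # each block copied only once
assert len(submission_queue) == 4  # but four sends were queued
```

In the real thing, `register_buffer` would be `io_uring_register_buffers()` and each queued send an SQE referencing the fixed-buffer index.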


This sounds pretty straightforward using sendfile(); you shouldn't even need io_uring.


I guess you would have some kind of in-memory fd that you would write to and then use with sendfile calls? I'm not sure if you'd run into similar contention on the data structures involved there. I guess it would depend on specifically what mechanism you're using to store the data (something like tmpfs?), but it probably doesn't have to involve contention (as there wouldn't have to be any refcounting if you control the lifetime of the data).

I was thinking of a situation where you also want to limit the system-call overhead per send (think: some number of 64K-ish chunks that you want to send to different clients, so each connection would involve multiple writes per frame, etc.).


Reminds me of a Kafka topic, where different consumers each have their own committed offset indicating how far along the stream they've caught up.
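A toy model of that analogy (not the real Kafka API): one shared append-only log, with each consumer committing its own offset:

```python
# Shared log; each consumer only tracks how far into it they've read.
log = []
offsets = {"alice": 0, "bob": 0}

def produce(record):
    log.append(record)

def poll(consumer, max_records=10):
    """Return records past the consumer's offset and commit the new position."""
    start = offsets[consumer]
    batch = log[start:start + max_records]
    offsets[consumer] = start + len(batch)
    return batch

produce("update-1"); produce("update-2"); produce("update-3")
assert poll("alice") == ["update-1", "update-2", "update-3"]
assert poll("bob", max_records=2) == ["update-1", "update-2"]  # bob lags
assert offsets == {"alice": 3, "bob": 2}
```

The data is written once; slow consumers just hold a smaller offset rather than forcing a per-consumer copy.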


There is nothing that forces at least a single copy. You could do some processing and write directly into a buffer registered to the kernel.


I mean, if we want to be pedantic, the buffer has to be copied from main memory into whatever the HW uses for staging to generate the PHY signal, but that's typically a DMA operation, not a CPU memcpy. Even with io_uring, though, AFAIK there's inevitably a memcpy to create the sk_buf to hand off to the network driver, no? I'm more fuzzy on how that stuff works, but I don't think the sk_buf is implemented as a list of iovecs.

You could go the DPDK route, but that has the downside of your application needing exclusive access to the network interface. AFAIK any network stack that supports multiple simultaneous applications typically involves at least one copy, but I could be off.


“Even with io_uring though, afaik there’s inevitably a memcpy to create the sk_buf to hand off to the network driver no? I’m more fuzzy on how that stuff works but I don’t think the sk_buf is implemented as a list of iovecs.”

Having worked with sk_bufs a bit I hope a copy isn’t needed for a network send. All the tooling I worked with around those was zero copy.


> You could go the DPDK route but that has downsides of your application needing exclusive access to the network interface.

Never did it, but I guess you could reroute traffic you don't care about to the Linux network stack for regular routing.

https://doc.dpdk.org/guides-16.07/howto/flow_bifurcation.htm...

Not sure that's a realistic scenario though. You rarely expose a DPDK app directly on a regular / open subnet. You often have a dedicated app, with a dedicated NIC, on a dedicated subnet, for dedicated traffic.


Project home page:

* https://packetmill.io


Too bad they aren't sharing source code. Ah well, one can dream of 100gbps per core.


What are you talking about? It's the second link on the page, called "try now":

https://github.com/aliireza/packetmill


You can do 100G per core with DPDK with bigger frame sizes.


I thought I saw a comment about PacketShader here last time I looked, but it's gone now. GPU acceleration has downsides, but it was fun to see. With Vulkan or WebGPU, there's much more power/flexibility on tap!

And we are getting ever better at P2P DMA: the future where the network card sends directly to the GPU, which replies directly to the host, without ever bouncing through the CPU or its caches, is in reach!


I feel like a lot of these could be contributed back to DPDK instead of built as a layer on top of it.

Very good article though. I'm an avid user of DPDK, but more for latency than throughput, and it's interesting to see the challenges involved there.


One wonders why they saddled VPP with DPDK rather than a VPP native driver that will do 35 Mpps or more, yet allowed themselves to modify a DPDK driver to show their results.


The approach is interesting, but I'm afraid the benchmarking is bogus, as is usually the case in this kind of paper.

Part of the problem, I think, is that they have to compare to other existing solutions (Click, VPP, etc.), otherwise everybody will ask, but at the same time it is very difficult to do a fair comparison.

I'm a VPP developer, hence I'm both biased and a knowledge-domain expert, but focusing on what I know, which is VPP: in figure 11b, they compare VPP to PacketMill for an L2 patch workload. They claim their approach is fair because they use automated tooling to benchmark; good, and we also do for VPP, but our results don't necessarily match theirs. Surprisingly, our results are higher than what they claim for VPP.

PacketMill paper figure 11b for VPP, L2 patch, 64-byte packets at 1.2GHz using the DPDK MLX5 driver [1]: ~5Gbps

VPP for similar configuration [2]: ~7.5Gbps (already a 50% error margin?)

VPP using a more optimized NIC driver (native AVF vs DPDK MLX5) [3]: ~18Gbps (almost 2x what they seem to claim for PacketMill...)

All this to say: comparing different solutions is terribly hard, and I'm not sure of the value of this kind of benchmark.

For VPP we built CSIT [4], which is open source under Linux Foundation Networking, to automate tests in the same environments and to compare VPP and DPDK between releases and platforms.

[1] https://packetmill.io/docs/packetmill-asplos21.pdf

[2] http://csit.fd.io/report/#eNp1kd0OwiAMhZ8Gb0zN6MRdeaHuPQxidS...

[3] http://csit.fd.io/report/#eNp1kd0OgjAMhZ9m3pgaVkS88ULlPcwcVU...

[4] http://csit.fd.io/


Also, the claims they make in the paper are inaccurate relative to VPP: they claim VPP uses the "overlay/convert" method, which is true for DPDK drivers, but we also maintain native drivers for NICs we care about (e.g. Intel's), especially because going through the "overlay/convert" method is costly at high packet rates.

IOW, one of their strong claims is that PacketMill is innovative because it avoids copying/converting unneeded metadata, but VPP has already been doing that for years.

Finally, their claim to break 100Gbps on a single core @2.3GHz is cute, but again I'm afraid they're late to the party. They claim 12-13Mpps per core for 64-byte packets, for example, but VPP can already achieve 20+Mpps per core for L3 forwarding (routing).

Again, benchmarking is hard, but I keep reading these claims over and over in academic papers when they're factually wrong for areas I know about. I can only imagine what is happening in areas I don't :(


“Your IP has been blocked”. I’m using Apple’s private relay



