The core functionality required is a "zero copy" networking library such as dpdk or netmap.
Normally, when the kernel receives data from the network, it allocates a block in kernel space and copies the data into it. Then your read operation copies that data into your user space buffer.
The "zero copy" networking stacks avoid that data copy. The way it works, as it was explained to me, is that they use a shared memory mapped zone. This zone is organized as a pool of blocks managed with non-blocking (lock-free) lists. Blocks have a fixed size, big enough to hold a ~1500-byte IP packet. I never used it myself, so I don't know the details.
When data arrives, it is written directly in place into a block of the memory mapped zone. In user space you use select/epoll/kqueue, or polling if the expected waiting time is very small. Once you have a block you can process it. The block contains the raw network data received from the network card, so it's up to you to encode and decode the TCP/IP headers yourself, or use an existing library like mTCP that does it for you. I was told mTCP can work with both dpdk and netmap.
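To make "raw network data" concrete: each block starts with the Ethernet header, then IP, then TCP/UDP. A hedged sketch of decoding just the outer layers by hand (IPv4 only, no VLAN tags, fixed wire-format offsets):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Decode the outer headers of one raw frame sitting in a block.
       Offsets follow the wire format: 14-byte Ethernet header, then IPv4.
       A real decoder also needs VLAN tags, IPv6, options, etc. */
    static void decode_frame(const uint8_t *f, size_t len) {
        if (len < 34) return;                     /* Ethernet + minimal IPv4 */
        unsigned ethertype = (f[12] << 8) | f[13];
        if (ethertype != 0x0800) return;          /* not IPv4 */

        const uint8_t *ip = f + 14;
        unsigned ihl   = (ip[0] & 0x0f) * 4;      /* IP header length, bytes */
        unsigned proto = ip[9];                   /* 6 = TCP, 17 = UDP */
        printf("proto=%u src=%u.%u.%u.%u dst=%u.%u.%u.%u payload@%u\n",
               proto, ip[12], ip[13], ip[14], ip[15],
               ip[16], ip[17], ip[18], ip[19], 14 + ihl);
    }

    int main(void) {
        /* A fabricated frame, just to exercise the decoder. */
        uint8_t frame[64] = {0};
        frame[12] = 0x08; frame[13] = 0x00;       /* EtherType IPv4 */
        frame[14] = 0x45;                         /* version 4, IHL 5 */
        frame[23] = 6;                            /* protocol TCP */
        frame[26] = 10; frame[29] = 1;            /* src 10.0.0.1 */
        frame[30] = 10; frame[33] = 2;            /* dst 10.0.0.2 */
        decode_frame(frame, sizeof frame);
        return 0;
    }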
My colleague is currently using netmap for a high-performance data acquisition application on a LAN and plans to test dpdk with mTCP this summer. mTCP should simplify the programming. At CERN they are now testing data acquisition setups using dpdk, to be able to use commodity hardware.
My colleague told me that netmap is available in the BSD kernel, so you can use it right away. It is not included in the Linux kernel, where you need to patch it in. Zero copy is the future of network programming, and Linux is late on this one. Then there is dpdk, on which I don't have much info yet, except that it is made by Intel, it is open source, and it is compatible with AMD processors. It is apparently not easy to install.
Since dpdk and netmap communicate directly with the network card, they only work with supported network cards.
The gain in performance is significant, but I have no numbers at hand to give.
You can do zero copy I/O on Linux using vmsplice+extensions, and on FreeBSD using regular read/write calls. Both can do DMA with user space buffers with much less disruption to traditional design patterns for implementing network daemons.
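For reference, the Linux side looks roughly like this: vmsplice() hands user pages to a pipe, and splice() can then move them onward without a userspace bounce buffer. A minimal sketch, error handling trimmed; a real server would splice() the pipe into a socket instead of reading it back:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void) {
        int p[2];
        if (pipe(p) < 0) return 1;

        /* Page-aligned buffer: SPLICE_F_GIFT only pays off on full pages,
           and the pages must not be touched again after gifting them. */
        static char buf[4096] __attribute__((aligned(4096)));
        strcpy(buf, "hello, zero copy\n");

        struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };

        /* Move the user pages into the pipe without copying them into
           kernel buffers first. */
        if (vmsplice(p[1], &iov, 1, SPLICE_F_GIFT) < 0) {
            perror("vmsplice");
            return 1;
        }

        /* Reading it back just demonstrates the data made the trip. */
        char out[64];
        ssize_t n = read(p[0], out, sizeof out - 1);
        if (n > 0) { out[n] = '\0'; fputs(out, stdout); }
        return 0;
    }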
I think the real benefit of DPDK and netmap is that you're avoiding all the logic of the existing IP stacks, not to mention firewall rules, etc. At the same time you're now responsible for all of that. And IMO the amazing throughput most people claim to see with DPDK is a result of simply neglecting to implement all the hard logic which makes the Internet actually work. All the weirdness, head scratching, and hair pulling these solutions cause is an externality engineers will never care about and, in most cases, are probably oblivious to. The exception to this state of affairs is when they're doing read-only packet sniffing and filtering, simply passing the packets back out another interface.
If most of what you care about is avoiding the cost of the kernel/userspace split, then you can just use NetBSD Rump or similar unikernel frameworks.
Do you know if the most optimal FreeBSD zero copy support is only available for Tigon NICs? I couldn't quite figure it out from reading the following pages:
I don't actually know. I never went down the rabbit hole of zero-copy, only watched various teams crash and burn.
I've had success simply reducing the number of data copies in userspace. I use tools like Ragel, lean fifo buffers in C with moving read and write windows (i.e. slices), etc.
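A hedged sketch of the kind of lean FIFO I mean (my illustration, not production code): a flat byte buffer where read and write positions slide forward, the parser works on a slice view, and compaction only happens when the write window runs out of room.

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    struct fifo {
        char   buf[8192];
        size_t rpos, wpos;          /* unread bytes live in buf[rpos..wpos) */
    };

    /* Slice of unread bytes; the parser consumes from here without copying. */
    static char  *fifo_data(struct fifo *f)  { return f->buf + f->rpos; }
    static size_t fifo_avail(struct fifo *f) { return f->wpos - f->rpos; }

    /* Advance the read window after the parser consumed n bytes. */
    static void fifo_consume(struct fifo *f, size_t n) { f->rpos += n; }

    /* Append new input, sliding both windows back to the start only when
       the tail space runs out; copies stay rare and the data stays local. */
    static size_t fifo_write(struct fifo *f, const char *p, size_t n) {
        if (sizeof f->buf - f->wpos < n) {               /* out of tail space */
            memmove(f->buf, f->buf + f->rpos, fifo_avail(f));   /* compact */
            f->wpos -= f->rpos;
            f->rpos = 0;
        }
        if (n > sizeof f->buf - f->wpos)
            n = sizeof f->buf - f->wpos;                 /* partial write */
        memcpy(f->buf + f->wpos, p, n);
        f->wpos += n;
        return n;
    }

    int main(void) {
        struct fifo f = {0};
        fifo_write(&f, "GET /index\n", 11);
        printf("%.*s", (int)fifo_avail(&f), fifo_data(&f));
        fifo_consume(&f, fifo_avail(&f));
        return 0;
    }

A protocol parser (Ragel-generated or hand-rolled) then works directly on fifo_data()/fifo_avail() and calls fifo_consume() as frames complete.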
For example, for a [now defunct] startup, I implemented a real-time streaming radio transcoder which inserted targeted, per-listener, dynamically selected ad spots into live streams. So, it would take an existing radio source (Flash, ICY, MMS, etc), transcode the codec and format to suit the listener, and when it detected ad spots select and insert a targeted ad.
On a _single_ E3 Haswell core (core, not chip) I was able to transcode 5,000 streams in real-time, each with dynamically selected and inserted ad spots, cycling between 30 seconds of the stream and a new 30 second ad for the stress test. All software; no hardware or even SIMD (other than what GCC could squeeze out). The Linux kernel was spending more CPU time handling interrupts and pushing out the packets than my daemon. At that point I knew I was already at the 80% solution and moved on to more feature development. I knew that after fiddling with the kernel I'd have more than enough performance.
FFmpeg, GStreamer, etc couldn't even come close to that. I had written all my own stream parsers and writers, both so that I had control over buffer management and because I needed frame-level control for splicing in ad spots. The only libraries I used were for the low-level codecs and for resampling. Notably, there was actually more copying than you'd think; enough to keep the interfaces relatively clean. The key, apparently, was maintaining data locality. Some libraries go to extraordinary lengths for zero-copy, but they end up with so much pointer indirection that it's a net loss.
Avoiding the logic of TCP stack processing does indeed come into play. My colleague processes data sent through a TCP connection on a LAN, so he basically ignores the TCP headers of the incoming data and simply checks the source IP.
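Concretely, that kind of filter can be as small as a fixed-offset compare on the raw frame. A sketch of my own (assuming IPv4 with no VLAN tag, where the source address sits at bytes 26..29 of the frame):

    #include <stdint.h>
    #include <string.h>

    /* Accept a raw Ethernet frame only if its IPv4 source matches.
       14-byte Ethernet header + source address at offset 12 within the
       IPv4 header = frame bytes 26..29 (no VLAN tag assumed). */
    static int from_expected_source(const uint8_t *frame, size_t len,
                                    const uint8_t want[4]) {
        return len >= 34 &&
               frame[12] == 0x08 && frame[13] == 0x00 &&  /* EtherType IPv4 */
               memcmp(frame + 26, want, 4) == 0;
    }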
He will test mTCP to get a full-fledged TCP/IP stack, so he will be able to measure its overhead.
FreeBSD netmap user here. You actually have to recompile the kernel with "device netmap" added to your kernconf. Piece of cake; after 20 minutes you are good to go. But you need a real network card, and the FreeBSD driver must be netmap-ready. Using Intel 10Gbps adapters (~200 euros) is a safe avenue (FreeBSD ixgbe driver). Even in VMware, you can pass through the PCI address of the adapter port to your virtual machine and have it talk to the card directly. Everything works very well.
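For reference, the rebuild is the standard FreeBSD procedure (MYKERNEL being whatever your config file is called):

    # In your kernel config (e.g. /usr/src/sys/amd64/conf/MYKERNEL):
    device netmap

    # Then rebuild and install:
    cd /usr/src
    make buildkernel KERNCONF=MYKERNEL
    make installkernel KERNCONF=MYKERNEL
    reboot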
The gain in performance is mind-boggling! Trying to sniff approx. 2+Gbps of traffic with Suricata through the "normal" avenue of libpcap ends up dropping a small percentage of the packets, and the machine wastes an incredible amount of CPU. Using Suricata with netmap (no need to recompile; the Suricata pkgng binary build from FreeBSD comes ready) uses exactly one capture thread and drops ZERO packets. This behavior is stable for days!
I was looking at Chelsio NICs and there were mentions of netmap support. Do you know what it means for a NIC to support netmap vs one that doesn't? Is it an extra optimization/fast path?
Can't give a good technical answer to that, but I suspect it is mostly a matter of the driver. When you mmap /dev/netmap from userland, the OS TCP/IP stack is disconnected and you get access to the card's tx/rx rings. Obviously the driver has to facilitate this.
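A sketch of what that looks like through netmap's userspace helpers (the API names are the real netmap ones, but this is from the documentation, not code I've run; "ix0" is just an example interface):

    #define NETMAP_WITH_LIBS      /* enables the nm_open()/nm_nextpkt() helpers */
    #include <net/netmap_user.h>
    #include <poll.h>
    #include <stdio.h>

    int main(void) {
        /* Attach to the interface: the OS stack is detached from ix0 and
           we get the card's tx/rx rings via memory-mapped buffers. */
        struct nm_desc *d = nm_open("netmap:ix0", NULL, 0, NULL);
        if (d == NULL) { perror("nm_open"); return 1; }

        struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
        struct nm_pkthdr h;

        for (;;) {
            poll(&pfd, 1, -1);                 /* wait for the rx ring */
            unsigned char *buf;
            while ((buf = nm_nextpkt(d, &h)) != NULL) {
                /* buf points straight into the shared ring:
                   the raw frame, no copy. */
                printf("got %u-byte frame\n", h.len);
            }
        }
        /* nm_close(d); unreachable in this sketch */
    }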
Do you know if any DPDK or netmap drivers can be made to ignore the layer 2 Ethernet CRC/checksum of incoming packets? There are some interesting applications for using Wireshark to diagnose protocols that use Ethernet for layer 1, and sort of for layer 2, but slightly vary the layer 2 protocol (e.g. a malformed packet CRC, unusual packet sizes, or no inter-frame gap), but I don't know of hardware that will pass packets that fail the CRC or arrive closer together than 96 ns.
I think in that case it would make sense to build your own NIC so that it can operate out of spec. Ethernet PHY ICs that speak MII or RMII are usually only a few dollars each. One of those might give you the signal you are looking for.
The CRC is often checked directly in hardware, so you would have to configure the NIC appropriately, and I have not seen many high-performance NICs that let you change that setting.
As the article says, it is significantly a matter of lock-free structures and of not crossing the kernel/userland barrier. Cache locality (combined with DirectIO, which somebody else mentioned) also helps a great deal.
With a userspace stack you also get a lot of tuning capability that is not available with the kernel. One cool tunable is how long to busy-wait before sleeping for an interrupt.
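The pattern behind that tunable is simple. A sketch of my own; SPIN_BUDGET is the knob you'd expose, and ring_nonempty() stands in for your stack's real ring check:

    #include <poll.h>
    #include <stdbool.h>
    #include <unistd.h>

    #define SPIN_BUDGET 10000     /* the tunable: how long to busy-wait */

    /* Stand-in for the real ring check; always empty in this sketch. */
    static bool ring_nonempty(void) { return false; }

    /* Poll the ring for a while before paying for a sleep. */
    static void wait_for_packets(int fd) {
        for (long i = 0; i < SPIN_BUDGET; i++) {
            if (ring_nonempty())
                return;           /* data arrived while spinning: no syscall */
        }
        /* Nothing in the busy window: block until the next interrupt. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        poll(&pfd, 1, -1);
    }

    int main(void) {
        int p[2];
        if (pipe(p) < 0) return 1;    /* any pollable fd works for the demo */
        write(p[1], "x", 1);          /* pre-arm so the demo returns at once */
        wait_for_packets(p[0]);
        return 0;
    }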
I'm not sure. My colleague explained to me that the main gain came from avoiding a malloc in kernel space and a data copy. But it's also true that making many small reads instead of reading a single block implies many system calls. I don't know the real relative processing cost of these operations.
From what I have seen, most applications using DPDK do not require TCP/IP handling. For example, packet monitoring, or data transfers between nodes that are connected by a single cable (thus no need for routing or TCP delivery guarantees)...
But even if you reimplement the TCP/IP stack, it would probably still be faster: you don't make system calls (which are expensive), you don't copy every packet to and from socket buffers, and you don't have to decide which application should receive each packet... It's hard, but it can yield performance improvements.
On the other hand, I'd first try to enable other features for improved performance, such as running applications on the same cores the NIC queues are mapped to, configuring the NIC RX queues, and using jumbo frames (>1514 bytes) if you control the network path. All of these can yield noticeable improvements without much effort.
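Pinning is a couple of lines on Linux. A sketch; core 2 is purely illustrative, in practice you'd use whichever core services the RX queue's interrupt:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Pin the calling thread to one core, e.g. the core handling an RX queue. */
    static int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    int main(void) {
        int err = pin_to_core(2);         /* 2 is illustrative */
        if (err != 0)
            fprintf(stderr, "pin_to_core: %s\n", strerror(err));
        /* ... run the packet-processing loop here ... */
        return 0;
    }

Jumbo frames are just an MTU setting on both ends (e.g. ifconfig ix0 mtu 9000), provided every hop on the path supports it.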
I haven't yet run into a case where it makes sense to do it, but you can gain some efficiency because you avoid context switching between the kernel and the program.