The core functionality required is a "zero copy" networking library such as dpdk or netmap.
Normally, when the kernel receives data from the network, it allocates a block in kernel space and copies the data into it. Then your read operation copies that data into your user space buffer.
The "zero copy" networking stacks avoid that data copy. The way it works, as it was explained to me, is that they use a shared memory mapped zone. This zone is organized as a pool of blocks managed with non-blocking (lock-free) lists. Blocks have a fixed size, big enough to hold a ~1500-byte IP packet. I never used it myself, so I don't know the details.
When data arrives, it is written directly in place into a block of the memory mapped zone. In user space you use select/epoll/kqueue, or polling if the expected waiting time is very small. Once you have a block you can process it. The block contains the raw network data received from the network card, so it's up to you to encode and decode the TCP/IP headers yourself, or use an existing library like mTCP that does it for you. I was told mTCP can work with both dpdk and netmap.
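To make "raw network data" concrete: each block starts with the Ethernet header, then IP, then TCP/UDP. A hedged sketch of decoding just the outer layers by hand (IPv4 only, no VLAN tags, fixed wire-format offsets):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Decode the outer headers of one raw frame sitting in a block.
       Offsets follow the wire format: 14-byte Ethernet header, then IPv4.
       A real decoder also needs VLAN tags, IPv6, options, etc. */
    static void decode_frame(const uint8_t *f, size_t len) {
        if (len < 34) return;                     /* Ethernet + minimal IPv4 */
        unsigned ethertype = (f[12] << 8) | f[13];
        if (ethertype != 0x0800) return;          /* not IPv4 */

        const uint8_t *ip = f + 14;
        unsigned ihl   = (ip[0] & 0x0f) * 4;      /* IP header length, bytes */
        unsigned proto = ip[9];                   /* 6 = TCP, 17 = UDP */
        printf("proto=%u src=%u.%u.%u.%u dst=%u.%u.%u.%u payload@%u\n",
               proto, ip[12], ip[13], ip[14], ip[15],
               ip[16], ip[17], ip[18], ip[19], 14 + ihl);
    }

    int main(void) {
        /* A fabricated frame, just to exercise the decoder. */
        uint8_t frame[64] = {0};
        frame[12] = 0x08; frame[13] = 0x00;       /* EtherType IPv4 */
        frame[14] = 0x45;                         /* version 4, IHL 5 */
        frame[23] = 6;                            /* protocol TCP */
        frame[26] = 10; frame[29] = 1;            /* src 10.0.0.1 */
        frame[30] = 10; frame[33] = 2;            /* dst 10.0.0.2 */
        decode_frame(frame, sizeof frame);
        return 0;
    }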
My colleague is currently using netmap for a high-performance data acquisition application on a LAN and plans to test dpdk with mTCP this summer. mTCP should simplify the programming. At CERN they are now testing data acquisition setups using dpdk, to be able to use commodity hardware.
My colleague told me that netmap is available in the BSD kernel, so you can use it right away. It is not included in the Linux kernel, where you need to patch it in. Zero copy is the future of network programming, and Linux is late on this one. Then there is dpdk, on which I don't have much info yet, except that it is made by Intel, it is open source, and it is compatible with AMD processors. It is apparently not easy to install.
Since dpdk and netmap communicate directly with the network card, they only work with supported network cards.
The gain in performance is significant, but I have no numbers at hand to give.
You can do zero copy I/O on Linux using vmsplice+extensions, and on FreeBSD using regular read/write calls. Both can do DMA with user space buffers with much less disruption to traditional design patterns for implementing network daemons.
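For reference, the Linux side looks roughly like this: vmsplice() hands user pages to a pipe, and splice() can then move them onward without a userspace bounce buffer. A minimal sketch, error handling trimmed; a real server would splice() the pipe into a socket instead of reading it back:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void) {
        int p[2];
        if (pipe(p) < 0) return 1;

        /* Page-aligned buffer: SPLICE_F_GIFT only pays off on full pages,
           and the pages must not be touched again after gifting them. */
        static char buf[4096] __attribute__((aligned(4096)));
        strcpy(buf, "hello, zero copy\n");

        struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };

        /* Move the user pages into the pipe without copying them into
           kernel buffers first. */
        if (vmsplice(p[1], &iov, 1, SPLICE_F_GIFT) < 0) {
            perror("vmsplice");
            return 1;
        }

        /* Reading it back just demonstrates the data made the trip. */
        char out[64];
        ssize_t n = read(p[0], out, sizeof out - 1);
        if (n > 0) { out[n] = '\0'; fputs(out, stdout); }
        return 0;
    }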
I think the real benefit of DPDK and netmap is that you're avoiding all the logic of the existing IP stacks, not to mention firewall rules, etc. At the same time you're now responsible for all of that. And IMO the amazing throughput most people claim to see with DPDK is a result of simply neglecting to implement all the hard logic which makes the Internet actually work. All the weirdness, head scratching, and hair pulling these solutions cause is an externality engineers will never care about and, in most cases, are probably oblivious to. The exception to this state of affairs is when they're doing read-only packet sniffing and filtering, simply passing the packets back out another interface.
If most of what you care about is avoiding the cost of the kernel/userspace split, then you can just use NetBSD Rump or similar unikernel frameworks.
Do you know if the most optimal FreeBSD zero copy support is only available for Tigon NICs? I couldn't quite figure it out from reading the following pages:
I don't actually know. I never went down the rabbit hole of zero-copy, only watched various teams crash and burn.
I've had success simply reducing the number of data copies in userspace. I use tools like Ragel, lean fifo buffers in C with moving read and write windows (i.e. slices), etc.
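A hedged sketch of the kind of lean FIFO I mean (my illustration, not production code): a flat byte buffer where read and write positions slide forward, the parser works on a slice view, and compaction only happens when the write window runs out of room.

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    struct fifo {
        char   buf[8192];
        size_t rpos, wpos;          /* unread bytes live in buf[rpos..wpos) */
    };

    /* Slice of unread bytes; the parser consumes from here without copying. */
    static char  *fifo_data(struct fifo *f)  { return f->buf + f->rpos; }
    static size_t fifo_avail(struct fifo *f) { return f->wpos - f->rpos; }

    /* Advance the read window after the parser consumed n bytes. */
    static void fifo_consume(struct fifo *f, size_t n) { f->rpos += n; }

    /* Append new input, sliding both windows back to the start only when
       the tail space runs out; copies stay rare and the data stays local. */
    static size_t fifo_write(struct fifo *f, const char *p, size_t n) {
        if (sizeof f->buf - f->wpos < n) {               /* out of tail space */
            memmove(f->buf, f->buf + f->rpos, fifo_avail(f));   /* compact */
            f->wpos -= f->rpos;
            f->rpos = 0;
        }
        if (n > sizeof f->buf - f->wpos)
            n = sizeof f->buf - f->wpos;                 /* partial write */
        memcpy(f->buf + f->wpos, p, n);
        f->wpos += n;
        return n;
    }

    int main(void) {
        struct fifo f = {0};
        fifo_write(&f, "GET /index\n", 11);
        printf("%.*s", (int)fifo_avail(&f), fifo_data(&f));
        fifo_consume(&f, fifo_avail(&f));
        return 0;
    }

A protocol parser (Ragel-generated or hand-rolled) then works directly on fifo_data()/fifo_avail() and calls fifo_consume() as frames complete.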
For example, for a [now defunct] startup, I implemented a real-time streaming radio transcoder which inserted targeted, per-listener, dynamically selected ad spots into live streams. So, it would take an existing radio source (Flash, ICY, MMS, etc), transcode the codec and format to suit the listener, and when it detected ad spots select and insert a targeted ad.
On a _single_ E3 Haswell core (core, not chip) I was able to transcode 5,000 streams in real-time, each with dynamically selected and inserted ad spots, cycling between 30 seconds of the stream and a new 30 second ad for the stress test. All software; no hardware or even SIMD (other than what GCC could squeeze out). The Linux kernel was spending more CPU time handling interrupts and pushing out the packets than my daemon. At that point I knew I was already at the 80% solution and moved on to more feature development. I knew that after fiddling with the kernel I'd have more than enough performance.
FFmpeg, GStreamer, etc couldn't even come close to that. I had written all my own stream parsers and writers, both so that I had control over buffer management and because I needed frame-level control for splicing in ad spots. The only libraries I used were for the low-level codecs and for resampling. Notably, there was actually more copying than you'd think; enough to keep the interfaces relatively clean. The key, apparently, was maintaining data locality. Some libraries go to extraordinary lengths for zero-copy, but they end up with so much pointer indirection that it's a net loss.
Avoiding the logic of TCP stack processing does indeed come into play. My colleague processes data sent through a TCP connection on a LAN, so he basically ignores the TCP headers of the incoming data and simply checks the source IP.
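Concretely, that kind of filter can be as small as a fixed-offset compare on the raw frame. A sketch of my own (assuming IPv4 with no VLAN tag, where the source address sits at bytes 26..29 of the frame):

    #include <stdint.h>
    #include <string.h>

    /* Accept a raw Ethernet frame only if its IPv4 source matches.
       14-byte Ethernet header + source address at offset 12 within the
       IPv4 header = frame bytes 26..29 (no VLAN tag assumed). */
    static int from_expected_source(const uint8_t *frame, size_t len,
                                    const uint8_t want[4]) {
        return len >= 34 &&
               frame[12] == 0x08 && frame[13] == 0x00 &&  /* EtherType IPv4 */
               memcmp(frame + 26, want, 4) == 0;
    }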
He will test mTCP to get a full-fledged TCP/IP stack, so he will be able to measure its overhead.
FreeBSD netmap user here. You actually have to recompile the kernel with "device netmap" added to your kernconf. Piece of cake; after 20 minutes you are good to go. But you need a real network card, and the FreeBSD driver must be netmap-ready. Using Intel 10Gbps adapters (~200 euros) is a safe avenue (FreeBSD ixgbe driver). Even in VMware, you can pass through the PCI address of the adapter port to your virtual machine and have it talk to the card directly. Everything works very well.
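For reference, the rebuild is the standard FreeBSD procedure (MYKERNEL being whatever your config file is called):

    # In your kernel config (e.g. /usr/src/sys/amd64/conf/MYKERNEL):
    device netmap

    # Then rebuild and install:
    cd /usr/src
    make buildkernel KERNCONF=MYKERNEL
    make installkernel KERNCONF=MYKERNEL
    reboot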
The gain in performance is mind-boggling! Trying to sniff approx. 2+Gbps of traffic with Suricata through the "normal" avenue of libpcap ends up dropping a small percentage of the packets, and the machine wastes an incredible amount of CPU. Using Suricata with netmap (no need to recompile; the Suricata pkgng binary build from FreeBSD comes ready) uses exactly one capture thread and drops ZERO packets. This behavior is stable for days!
I was looking at Chelsio NICs and there were mentions of netmap support. Do you know what it means for a NIC to support netmap vs one that doesn't? Is it an extra optimization/fast path?
Can't give a good technical answer to that, but I suspect it is mostly a matter of the driver. When you mmap /dev/netmap from userland, the OS TCP/IP stack is disconnected and you get access to the card's tx/rx rings. Obviously the driver has to facilitate this.
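A sketch of what that looks like through netmap's userspace helpers (the API names are the real netmap ones, but this is from the documentation, not code I've run; "ix0" is just an example interface):

    #define NETMAP_WITH_LIBS      /* enables the nm_open()/nm_nextpkt() helpers */
    #include <net/netmap_user.h>
    #include <poll.h>
    #include <stdio.h>

    int main(void) {
        /* Attach to the interface: the OS stack is detached from ix0 and
           we get the card's tx/rx rings via memory-mapped buffers. */
        struct nm_desc *d = nm_open("netmap:ix0", NULL, 0, NULL);
        if (d == NULL) { perror("nm_open"); return 1; }

        struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
        struct nm_pkthdr h;

        for (;;) {
            poll(&pfd, 1, -1);                 /* wait for the rx ring */
            unsigned char *buf;
            while ((buf = nm_nextpkt(d, &h)) != NULL) {
                /* buf points straight into the shared ring:
                   the raw frame, no copy. */
                printf("got %u-byte frame\n", h.len);
            }
        }
        /* nm_close(d); unreachable in this sketch */
    }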
Do you know if any DPDK or netmap drivers can be made to ignore the layer 2 Ethernet CRC/checksum of incoming packets? There are some interesting applications for using Wireshark to diagnose protocols that use Ethernet for layer 1, and sort of for layer 2, but slightly vary the layer 2 protocol (e.g. a malformed packet CRC, unusual packet sizes, or no inter-frame gap), but I don't know of hardware that will pass packets that fail the CRC or arrive closer together than 96 ns.
I think in that case it would make sense to build your own NIC so that it can operate out of spec. Ethernet PHY ICs that speak MII or RMII are usually only a few dollars each. One of those might give you the signal you are looking for.
The CRC is often checked directly in hardware, so you would have to configure the NIC appropriately, and I have not seen many high-performance NICs that let you change that setting.
As the article says, it is significantly a matter of lock-free structures and of not crossing the kernel/userland barrier. Cache locality (combined with DirectIO, which somebody else mentioned) also helps a great deal.
With a userspace stack you also get a lot of tuning capability that is not available with the kernel. One cool tunable is how long to busy-wait before sleeping for an interrupt.
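The pattern behind that tunable is simple. A sketch of my own; SPIN_BUDGET is the knob you'd expose, and ring_nonempty() stands in for your stack's real ring check:

    #include <poll.h>
    #include <stdbool.h>
    #include <unistd.h>

    #define SPIN_BUDGET 10000     /* the tunable: how long to busy-wait */

    /* Stand-in for the real ring check; always empty in this sketch. */
    static bool ring_nonempty(void) { return false; }

    /* Poll the ring for a while before paying for a sleep. */
    static void wait_for_packets(int fd) {
        for (long i = 0; i < SPIN_BUDGET; i++) {
            if (ring_nonempty())
                return;           /* data arrived while spinning: no syscall */
        }
        /* Nothing in the busy window: block until the next interrupt. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        poll(&pfd, 1, -1);
    }

    int main(void) {
        int p[2];
        if (pipe(p) < 0) return 1;    /* any pollable fd works for the demo */
        write(p[1], "x", 1);          /* pre-arm so the demo returns at once */
        wait_for_packets(p[0]);
        return 0;
    }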
I'm not sure. My colleague explained to me that the main gain came from avoiding a malloc in kernel space and a data copy. But it's also true that making many small reads instead of reading a single block implies many system calls. I don't know the real relative processing cost of these operations.
From what I have seen, most applications using DPDK do not require TCP/IP handling. For example, packet monitoring, or data transfers between nodes that are connected by a single cable (thus no need for routing or TCP delivery guarantees)...
But even if you reimplement the TCP/IP stack, it would probably still be faster: you don't make system calls (which are expensive), you don't copy every packet to and from socket buffers, and you don't have to decide which application should receive each packet... It's hard, but it can yield performance improvements.
On the other hand, I'd first try to enable other features for improved performance, such as running applications on the same cores the NIC queues are mapped to, configuring the NIC RX queues, and using jumbo frames (>1514 bytes) if you control the network path. All of these can yield noticeable improvements without much effort.
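Pinning is a couple of lines on Linux. A sketch; core 2 is purely illustrative, in practice you'd use whichever core services the RX queue's interrupt:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Pin the calling thread to one core, e.g. the core handling an RX queue. */
    static int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    int main(void) {
        int err = pin_to_core(2);         /* 2 is illustrative */
        if (err != 0)
            fprintf(stderr, "pin_to_core: %s\n", strerror(err));
        /* ... run the packet-processing loop here ... */
        return 0;
    }

Jumbo frames are just an MTU setting on both ends (e.g. ifconfig ix0 mtu 9000), provided every hop on the path supports it.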
I haven't yet run into a case where it makes sense to do it, but you can gain some efficiency because you avoid context switching between the kernel and the program.