The original article claims that having the TCP stack in the kernel causes performance problems because it needs to do excessive locking.
I can't judge, but if that is really true, then in principle a user-space library could be written to take care of all those corner cases you mention, and still be faster than the kernel stack.
Of course, that wouldn't be everyone rolling their own stack.
I've only poked at the FreeBSD TCP stack and not the Linux stack, but it seems like, if the problem is locking, you should be able to get good results from working on the locking (finer-grained locks / tweaking parameters) in less time than building a full TCP stack.
What kind of limitations are people seeing with the Linux kernel? If I'm interpreting Netflix's paper[1] correctly, they could push at least 20 Gbps of unencrypted content with a single-socket E5-2650L (the document isn't super clear, though; it says the servers were designed for 40 Gbps). My servers usually run out of application CPU before they run out of network -- but I've run some of them up to 10 Gbps without a lot of tuning.
Not all Gbps are created equal. Traffic with many small packets takes a lot more resources than traffic with fewer, bigger packets. Netflix's packets would be as big as they come.
Yes, in fact for things like Juniper/Cisco firewalls they will always quote PPS with full-MTU packets. If you want to bring that shiny new firewall to its knees, try sending it traffic at the minimum MTU of 68 bytes at line rate for the NIC.
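To put rough numbers on that, here's a back-of-the-envelope calculation (my own, not from the article) of how many packets per second it takes to fill a 10 Gbps link at a few frame sizes, counting the 8-byte preamble and 12-byte inter-frame gap each frame costs on the wire:

  /* Back-of-the-envelope: packets per second needed to fill a 10 Gbps link
   * at different frame sizes.  Each frame also costs 8 bytes of preamble
   * plus a 12-byte inter-frame gap on the wire. */
  #include <stdio.h>

  int main(void)
  {
      const double link_bps = 10e9;                 /* 10 Gbps line rate */
      const double overhead = 8 + 12;               /* preamble + inter-frame gap, bytes */
      const int frame_sizes[] = { 64, 512, 1518 };  /* bytes, incl. Ethernet header/FCS */

      for (unsigned i = 0; i < sizeof frame_sizes / sizeof frame_sizes[0]; i++) {
          double bits_on_wire = (frame_sizes[i] + overhead) * 8.0;
          printf("%4d-byte frames: %10.0f packets/s\n",
                 frame_sizes[i], link_bps / bits_on_wire);
      }
      return 0;
  }

That works out to roughly 14.9 Mpps at 64-byte frames versus about 813 Kpps at full-size 1518-byte frames: the same 10 Gbps, but around 18x the per-packet work, which is why PPS quoted at full MTU flatters the hardware.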
The problem isn't locking so much; it's that you have to dispatch to a kernel thread when you're requesting and sending data, paying the cost of that context switch every time. In userspace you can spin a polling thread on its own core and DMA data up and down to the hardware all day long without yielding your thread to another one.
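As a toy sketch of that polling pattern (my own illustration, not anything from the article): a consumer thread pinned to its own core spins on a single-producer/single-consumer ring with no syscalls in the hot loop. The ring here is just shared memory standing in for the DMA'd NIC descriptor ring a real userspace stack such as DPDK or netmap would map and poll:

  /* Toy illustration of a busy-poll receive loop: the consumer spins on a
   * shared ring instead of blocking in a syscall.  The ring is plain shared
   * memory standing in for a NIC descriptor ring. */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  #define RING_SIZE 1024              /* must be a power of two */
  #define N_PACKETS 1000000

  struct ring {
      _Atomic uint32_t head;          /* written only by the producer */
      _Atomic uint32_t tail;          /* written only by the consumer */
      uint64_t slots[RING_SIZE];      /* stand-in for packet descriptors */
  };

  static struct ring rx_ring;

  static void *producer(void *arg)    /* stands in for the NIC filling descriptors */
  {
      (void)arg;
      for (uint64_t i = 0; i < N_PACKETS; i++) {
          uint32_t head = atomic_load_explicit(&rx_ring.head, memory_order_relaxed);
          /* wait until there is a free slot */
          while (head - atomic_load_explicit(&rx_ring.tail, memory_order_acquire) == RING_SIZE)
              ;
          rx_ring.slots[head & (RING_SIZE - 1)] = i;
          atomic_store_explicit(&rx_ring.head, head + 1, memory_order_release);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t prod;
      pthread_create(&prod, NULL, producer, NULL);

  #ifdef __linux__
      /* Pin the polling thread to core 1 so it never migrates (Linux-specific). */
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(1, &set);
      pthread_setaffinity_np(pthread_self(), sizeof set, &set);
  #endif

      uint64_t received = 0, checksum = 0;
      while (received < N_PACKETS) {
          uint32_t tail = atomic_load_explicit(&rx_ring.tail, memory_order_relaxed);
          uint32_t head = atomic_load_explicit(&rx_ring.head, memory_order_acquire);
          while (tail != head) {      /* drain whatever arrived: no syscall, no sleep */
              checksum += rx_ring.slots[tail & (RING_SIZE - 1)];
              tail++;
              received++;
          }
          atomic_store_explicit(&rx_ring.tail, tail, memory_order_release);
          /* a real stack would check a quit flag or do other work here */
      }

      pthread_join(prod, NULL);
      printf("received %llu packets, checksum %llu\n",
             (unsigned long long)received, (unsigned long long)checksum);
      return 0;
  }

Compile with cc -O2 -pthread. The point is only the shape of the loop: the receive path never blocks and never enters the kernel, at the cost of burning a whole core on polling.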
Sure, that saves you from dumping TLB state, but you still need to save register state and copy data from a user-supplied buffer into a kernel-owned, device-mapped buffer, wiping the L1 data and instruction caches in the process.
For 99% of use cases this isn't a problem, but if you're trying to save every possible microsecond, it definitely is.
Sure, I was more commenting on the parent post that suggested the cost was due to a "context switch", when it's not a context switch at all; it's a mode switch, to "kernel mode."
If you are trying to save microseconds you are probably running special hardware like the SolarFlare network cards which also run the drivers in user space. These are generally hedge funds or high frequency trading shops. I can't imagine anyone else could justify the price.
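If you want to put a number on the mode-switch cost being debated here, a crude microbenchmark (my own sketch, nothing from the thread) is to time a cheap syscall against a plain function call; the gap is the per-call price of crossing into the kernel, and it varies a lot with the CPU and with speculative-execution mitigations:

  /* Rough measurement of per-syscall overhead: time 1M one-byte writes to
   * /dev/null against 1M calls to a trivial function. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  #define ITERS 1000000

  static volatile long sink;                  /* keeps the compiler from optimizing loops away */

  static void noop(void) { sink++; }          /* baseline: plain function call */

  static double elapsed_ns(struct timespec a, struct timespec b)
  {
      return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
  }

  int main(void)
  {
      int fd = open("/dev/null", O_WRONLY);
      if (fd < 0) { perror("open"); return 1; }

      char byte = 0;
      struct timespec t0, t1, t2;

      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < ITERS; i++)
          noop();
      clock_gettime(CLOCK_MONOTONIC, &t1);
      for (int i = 0; i < ITERS; i++)
          sink += write(fd, &byte, 1);        /* one user->kernel->user round trip each */
      clock_gettime(CLOCK_MONOTONIC, &t2);

      printf("function call: %6.1f ns/iter\n", elapsed_ns(t0, t1) / ITERS);
      printf("write syscall: %6.1f ns/iter\n", elapsed_ns(t1, t2) / ITERS);
      close(fd);
      return 0;
  }

On typical recent x86 hardware the syscall side usually comes out somewhere in the hundreds of nanoseconds per call, which is exactly the budget the sub-microsecond crowd is trying to claw back.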
I guess most locking is in place in order to allow multiplexing: different applications need access to different sockets, which nevertheless all send and receive data through the same NIC.
If you implement the whole stack in userspace and have only a single thread which processes all data, you might get away with less locking. However, as soon as there are multiple threads that want to send/receive on multiple endpoints, there is the same need for synchronization, and it would have to be implemented in userspace.
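A toy sketch of that synchronization point (again my own illustration, with made-up names): several application threads sharing one NIC end up serializing on the transmit path, here with a plain mutex; real userspace stacks tend to replace this with per-core queues or lock-free rings, but the coordination problem itself doesn't go away:

  /* Toy illustration of the synchronization a userspace stack still needs
   * once several app threads share one NIC: every sender serializes on the
   * transmit queue (here with a mutex). */
  #include <pthread.h>
  #include <stdio.h>

  #define N_SENDERS 4
  #define PER_SENDER 100000

  static pthread_mutex_t tx_lock = PTHREAD_MUTEX_INITIALIZER;
  static unsigned long tx_queued;             /* stands in for the shared TX ring */

  static void *sender(void *arg)
  {
      (void)arg;
      for (int i = 0; i < PER_SENDER; i++) {
          pthread_mutex_lock(&tx_lock);       /* the contention point described above */
          tx_queued++;                        /* "enqueue" a packet for the one NIC */
          pthread_mutex_unlock(&tx_lock);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t threads[N_SENDERS];
      for (int i = 0; i < N_SENDERS; i++)
          pthread_create(&threads[i], NULL, sender, NULL);
      for (int i = 0; i < N_SENDERS; i++)
          pthread_join(threads[i], NULL);
      printf("queued %lu packets from %d threads\n", tx_queued, N_SENDERS);
      return 0;
  }

Swapping the mutex for a multi-producer ring or per-core queues moves the cost around but doesn't remove it, which is the point above.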
The problem here is mostly not throughput, which I guess most high-end hardware could handle, but how many connections per second it can handle; that is where the bottleneck is in the Linux kernel. I can't imagine the Linux kernel supporting 1 million CPS right now, but the same is possible with a user-space TCP stack.