The original article claims that having the TCP stack in the kernel causes performance problems because it needs to do excessive locking.
I can't judge, but if that is really true, then in principle a user-space library could be written to take care of all those corner cases you mention, and still be faster than the kernel stack.
Of course, that wouldn't be everyone rolling their own stack.
I've only poked at the FreeBSD TCP stack and not the Linux stack, but it seems like, if the problem is locking, you should be able to get good results from working on the locking (finer-grained locks / tweaking parameters) in less time than building a full TCP stack.
What kind of limitations are people seeing with the Linux kernel? If I'm interpreting Netflix's paper[1] correctly, they could push at least 20 Gbps of unencrypted content with a single-socket E5-2650L (the document isn't super clear, though; it says the servers were designed for 40 Gbps). My servers usually run out of application CPU before they run out of network -- but I've run some of them up to 10 Gbps without a lot of tuning.
Not all Gbps are created equal. Traffic with many small packets takes a lot more resources than traffic with fewer, bigger packets. Netflix's packets would be as big as they come.
Yes, in fact for things like Juniper/Cisco firewalls they will always quote PPS with full-MTU packets. If you want to bring that shiny new firewall to its knees, try sending it traffic at the minimum MTU of 68 bytes at line rate for the NIC.
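To put rough numbers on that, here's a back-of-the-envelope calculation (my own, not from the article) of how many packets per second it takes to fill a 10 Gbps link at a few frame sizes, counting the 8-byte preamble and 12-byte inter-frame gap each frame costs on the wire:

  /* Back-of-the-envelope: packets per second needed to fill a 10 Gbps link
   * at different frame sizes.  Each frame also costs 8 bytes of preamble
   * plus a 12-byte inter-frame gap on the wire. */
  #include <stdio.h>

  int main(void)
  {
      const double link_bps = 10e9;                 /* 10 Gbps line rate */
      const double overhead = 8 + 12;               /* preamble + inter-frame gap, bytes */
      const int frame_sizes[] = { 64, 512, 1518 };  /* bytes, incl. Ethernet header/FCS */

      for (unsigned i = 0; i < sizeof frame_sizes / sizeof frame_sizes[0]; i++) {
          double bits_on_wire = (frame_sizes[i] + overhead) * 8.0;
          printf("%4d-byte frames: %10.0f packets/s\n",
                 frame_sizes[i], link_bps / bits_on_wire);
      }
      return 0;
  }

That works out to roughly 14.9 Mpps at 64-byte frames versus about 813 Kpps at full-size 1518-byte frames: the same 10 Gbps, but around 18x the per-packet work, which is why PPS quoted at full MTU flatters the hardware.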
The problem isn't locking so much; it's that you have to dispatch to a kernel thread when you're requesting and sending data, paying the cost of that context switch every time. In userspace you can spin a polling thread on its own core and DMA data up and down to the hardware all day long without yielding your thread to another one.
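As a toy sketch of that polling pattern (my own illustration, not anything from the article): a consumer thread pinned to its own core spins on a single-producer/single-consumer ring with no syscalls in the hot loop. The ring here is just shared memory standing in for the DMA'd NIC descriptor ring a real userspace stack such as DPDK or netmap would map and poll:

  /* Toy illustration of a busy-poll receive loop: the consumer spins on a
   * shared ring instead of blocking in a syscall.  The ring is plain shared
   * memory standing in for a NIC descriptor ring. */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  #define RING_SIZE 1024              /* must be a power of two */
  #define N_PACKETS 1000000

  struct ring {
      _Atomic uint32_t head;          /* written only by the producer */
      _Atomic uint32_t tail;          /* written only by the consumer */
      uint64_t slots[RING_SIZE];      /* stand-in for packet descriptors */
  };

  static struct ring rx_ring;

  static void *producer(void *arg)    /* stands in for the NIC filling descriptors */
  {
      (void)arg;
      for (uint64_t i = 0; i < N_PACKETS; i++) {
          uint32_t head = atomic_load_explicit(&rx_ring.head, memory_order_relaxed);
          /* wait until there is a free slot */
          while (head - atomic_load_explicit(&rx_ring.tail, memory_order_acquire) == RING_SIZE)
              ;
          rx_ring.slots[head & (RING_SIZE - 1)] = i;
          atomic_store_explicit(&rx_ring.head, head + 1, memory_order_release);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t prod;
      pthread_create(&prod, NULL, producer, NULL);

  #ifdef __linux__
      /* Pin the polling thread to core 1 so it never migrates (Linux-specific). */
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(1, &set);
      pthread_setaffinity_np(pthread_self(), sizeof set, &set);
  #endif

      uint64_t received = 0, checksum = 0;
      while (received < N_PACKETS) {
          uint32_t tail = atomic_load_explicit(&rx_ring.tail, memory_order_relaxed);
          uint32_t head = atomic_load_explicit(&rx_ring.head, memory_order_acquire);
          while (tail != head) {      /* drain whatever arrived: no syscall, no sleep */
              checksum += rx_ring.slots[tail & (RING_SIZE - 1)];
              tail++;
              received++;
          }
          atomic_store_explicit(&rx_ring.tail, tail, memory_order_release);
          /* a real stack would check a quit flag or do other work here */
      }

      pthread_join(prod, NULL);
      printf("received %llu packets, checksum %llu\n",
             (unsigned long long)received, (unsigned long long)checksum);
      return 0;
  }

Compile with cc -O2 -pthread. The point is only the shape of the loop: the receive path never blocks and never enters the kernel, at the cost of burning a whole core on polling.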
Sure, that saves you from dumping TLB state, but you still need to save register state and copy data from a user-supplied buffer into a kernel-owned, device-mapped buffer, wiping the L1 data and instruction caches in the process.
For 99% of use cases this isn't a problem, but if you're trying to save every possible microsecond, it definitely is.
Sure, I was more commenting on the parent post that suggested the cost was due to a "context switch", when it's not a context switch at all; it's a mode switch, to "kernel mode."
If you are trying to save microseconds you are probably running special hardware like the SolarFlare network cards which also run the drivers in user space. These are generally hedge funds or high frequency trading shops. I can't imagine anyone else could justify the price.
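If you want to put a number on the mode-switch cost being debated here, a crude microbenchmark (my own sketch, nothing from the thread) is to time a cheap syscall against a plain function call; the gap is the per-call price of crossing into the kernel, and it varies a lot with the CPU and with speculative-execution mitigations:

  /* Rough measurement of per-syscall overhead: time 1M one-byte writes to
   * /dev/null against 1M calls to a trivial function. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  #define ITERS 1000000

  static volatile long sink;                  /* keeps the compiler from optimizing loops away */

  static void noop(void) { sink++; }          /* baseline: plain function call */

  static double elapsed_ns(struct timespec a, struct timespec b)
  {
      return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
  }

  int main(void)
  {
      int fd = open("/dev/null", O_WRONLY);
      if (fd < 0) { perror("open"); return 1; }

      char byte = 0;
      struct timespec t0, t1, t2;

      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < ITERS; i++)
          noop();
      clock_gettime(CLOCK_MONOTONIC, &t1);
      for (int i = 0; i < ITERS; i++)
          sink += write(fd, &byte, 1);        /* one user->kernel->user round trip each */
      clock_gettime(CLOCK_MONOTONIC, &t2);

      printf("function call: %6.1f ns/iter\n", elapsed_ns(t0, t1) / ITERS);
      printf("write syscall: %6.1f ns/iter\n", elapsed_ns(t1, t2) / ITERS);
      close(fd);
      return 0;
  }

On typical recent x86 hardware the syscall side usually comes out somewhere in the hundreds of nanoseconds per call, which is exactly the budget the sub-microsecond crowd is trying to claw back.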
I guess most locking is in place in order to allow multiplexing: different applications need access to different sockets, which nevertheless all send and receive data through the same NIC.
If you implement the whole stack in userspace and have only a single thread which processes all data, you might get away with less locking. However, as soon as there are multiple threads that want to send/receive on multiple endpoints, there is the same need for synchronization, and it would have to be implemented in userspace.
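A toy sketch of that synchronization point (again my own illustration, with made-up names): several application threads sharing one NIC end up serializing on the transmit path, here with a plain mutex; real userspace stacks tend to replace this with per-core queues or lock-free rings, but the coordination problem itself doesn't go away:

  /* Toy illustration of the synchronization a userspace stack still needs
   * once several app threads share one NIC: every sender serializes on the
   * transmit queue (here with a mutex). */
  #include <pthread.h>
  #include <stdio.h>

  #define N_SENDERS 4
  #define PER_SENDER 100000

  static pthread_mutex_t tx_lock = PTHREAD_MUTEX_INITIALIZER;
  static unsigned long tx_queued;             /* stands in for the shared TX ring */

  static void *sender(void *arg)
  {
      (void)arg;
      for (int i = 0; i < PER_SENDER; i++) {
          pthread_mutex_lock(&tx_lock);       /* the contention point described above */
          tx_queued++;                        /* "enqueue" a packet for the one NIC */
          pthread_mutex_unlock(&tx_lock);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t threads[N_SENDERS];
      for (int i = 0; i < N_SENDERS; i++)
          pthread_create(&threads[i], NULL, sender, NULL);
      for (int i = 0; i < N_SENDERS; i++)
          pthread_join(threads[i], NULL);
      printf("queued %lu packets from %d threads\n", tx_queued, N_SENDERS);
      return 0;
  }

Swapping the mutex for a multi-producer ring or per-core queues moves the cost around but doesn't remove it, which is the point above.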
The problem here is mostly not throughput, which I guess most high-end hardware could handle, but how many connections per second it can handle; that is where the bottleneck is in the Linux kernel. I can't imagine the Linux kernel supporting 1 million CPS right now, but the same is possible with a user-space TCP stack.