Please don't rewrite your network stack unless you can afford to dedicate a team to support it full time.
Twice in my career I have been on teams where we decided to rewrite IP or TCP stacks. The justifications were different each time, though never perf.
The projects were filled with lots of early confidence and successes. "So much faster" and "wow, my code is a lot simpler than the kernel equivalent, I am smart!" We shipped versions that worked, with high confidence and enthusiasm. It was fun. We were smart. We could rewrite core Internet protocol implementations and be better!
Then the bug reports started to roll in. Our clean implementations started to get cluttered with nuances in the spec we didn't appreciate. We wasted weeks chasing implementation bugs in other network stacks that were de facto but undocumented parts of the internet's "real" spec. Accommodating them cluttered that pretty code further. Performance decreased.
In both cases, after about a year, we found ourselves wishing we had not rewritten the network stack. We started making plans to eliminate the dependency, now much more complicated because we had to transition active deployments away.
I have not made that mistake a third time.
If you are Google, Facebook or another internet behemoth that is optimizing for efficiency at scale and can afford to dedicate a team to the problem, do it. But if you are a startup trying to get a product off the ground, this is premature optimization. Stay far, far away.
Agreed. Also, it likely makes you and your company a bad Internet citizen to roll your own, for the reasons you've mentioned. My first company sold proxy servers, and I can't count the number of times a buggy router stack or other embedded thing broke the web for some users some of the time. If you make an off-by-one mistake in your PMTU discovery code, you're gonna waste somebody's day. If you don't deal with encapsulation right, you're going to waste someone's day. If you respond incorrectly to ICMP messages, you're going to waste someone's day. The list of things that can go wrong is endless.
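If you want a sense of how fiddly even the "simple" parts are, here is a minimal sketch of reacting to an ICMPv4 "fragmentation needed" message during PMTU discovery, per RFC 1191 (handle_frag_needed is a hypothetical helper, not anyone's real code); the byte offsets, byte order, and clamping are exactly where the day-wasting off-by-ones live:

    #include <stddef.h>
    #include <stdint.h>

    #define ICMP_DEST_UNREACH 3
    #define ICMP_FRAG_NEEDED  4
    #define IPV4_MIN_MTU      68    /* RFC 791 floor */

    /* icmp points at the start of an ICMPv4 header of `len` bytes.
     * Returns the path MTU to record for the destination, or `current_pmtu`
     * unchanged if the message is not a fragmentation-needed report. */
    static uint16_t handle_frag_needed(const uint8_t *icmp, size_t len,
                                       uint16_t current_pmtu)
    {
        if (len < 8 || icmp[0] != ICMP_DEST_UNREACH || icmp[1] != ICMP_FRAG_NEEDED)
            return current_pmtu;

        /* RFC 1191: the next-hop MTU is in bytes 6-7, network byte order. */
        uint16_t next_hop_mtu = (uint16_t)((icmp[6] << 8) | icmp[7]);

        /* Pre-RFC-1191 routers send 0 here; don't trust it blindly, and
         * never go below the IPv4 minimum, or you blackhole the connection. */
        if (next_hop_mtu < IPV4_MIN_MTU)
            next_hop_mtu = IPV4_MIN_MTU;

        /* Only ever shrink the recorded path MTU in response to this message. */
        return next_hop_mtu < current_pmtu ? next_hop_mtu : current_pmtu;
    }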
I kinda feel like it's comparable to rolling your own encryption. Yes, most of the standards are well-written and well-defined, and you can spend a couple weeks really grokking Stevens' book (or whatever the modern equivalent is, I dunno, as I don't implement anything at that level anymore), but you're gonna spend years becoming bug-compatible with the rest of the Internet (or coming to realize your interpretation and the rest of the world's interpretation of the spec differ).
2 million requests a second sounds amazing. But, the price is high. As you note, if you have a team dedicated to it, cool. And, if you want to do it for fun and experience, that's also cool. But, Linux has a lot of wisdom built-in. There's like a gazillion commits over its life that have fixed various networking bugs and quirks.
Why does it make someone a bad internet citizen to implement internet protocols? Is there something official about the Linux kernel (the insecure, buggy, written-in-C Linux kernel)? I would start clean-sheet for a lot of reasons, mostly security, but I don't know what that has to do with my internet virtues.
No, there's nothing official about the Linux kernel implementation. Though you could make the case for BSD...
That aside, the reason it makes you a bad citizen is the lack of follow through. If you're a small organization writing a network implementation, you know from the start that your implementation is unlikely to get the decade+ of follow-through that is necessary to fix your interoperability problems. Instead you're putting that cost on every other internet host that is going to have to interact with your out-of-spec implementation until the day the last instance of your code is shut down.
I wrote an IP stack for my little toy OS a few years back, and even that experience was miserable and hard. It never works quite right, because either your implementation sucks or the people you're talking to are fast and loose with standards. More often both. It taught me, though, that the papers don't matter and the real standard is what's in the wild, and that what's happening in the wild is an order of magnitude more insane than you can imagine.
The specification, as written in the RFC, is pretty useless... But luckily people sat down and formalised TCP/IP as spoken on the Internet (10 years ago, Linux-2.4.20/FreeBSD-4.6/Windows XP): https://www.cl.cam.ac.uk/~pes20/Netsem/index.html
What is missing in 10 years of TCP? Apart from SACK and delayed ACK, not much, it seems (in FreeBSD, congestion control algorithms are pluggable nowadays).
> "Please don't rewrite your network stack unless you can afford to dedicate a team to support it full time."
Generally agreed but there's a caveat to this statement.
> "Twice in my career I have been on teams where we decided to rewrite IP or TCP stacks."
This is the caveat. Don't try to rewrite a protocol that's already running billions of devices on the internet which you have to interoperate with. It's a much more tractable problem when you control both ends.
Twice in my career, I've worked on protocols (with success... the first company went public, the second has its protocol working well in millions of devices). Both times the protocols were key to the business, of course, and both times the "custom" protocol only had to interoperate against itself on the other end.
I've always thought custom implementations of this flavor are either smart or dumb; it's never just an OK choice, there's no middle ground. I think in your case it's a big win. In the OP's case, it was a bad choice.
I'm not saying the OP's team aren't smart and capable people. They just made a bad choice.
Fundamentally, it's a hard thing to get perfect. There are lots of pointy edges in that code. If it is to be undertaken, it needs to be a huge win. The potential for a huge win is worth the effort.
Did you work at Onlive or something? I was always curious what their protocol was doing. I never bothered to look, because I figured it would take a while to figure out what I was looking at.
No it wasn't Onlive. The company that went public was Riverbed and they used their custom TCP between their appliances. My current company (which I founded) is PacketZoom and our custom protocol (built on top of UDP) communicates between our SDK in the mobile app and our servers distributed around the world.
We use Riverbeds on either side of our sat shots here in Antarctica. They work fairly well for the most part. I didn't know they were running a custom TCP implementation between the end devices.
Veteran of a similar affair. Well, two stacks, and once, an HTTP proxy.
You may be smarter than the average bear, but you have to deal with other people's um . . . questionable decisions (and bugs).
Customers won't care that the FuppedUckTron-9000 web server they bought on eBay is non-compliant and that its Content-Length needs to have special casing to work around some spectacular drain-bamage, they only care about their valuable business data ^H^H^H porn.
The original article claims that having the TCP stack in the kernel causes performance problems because it needs to do excessive locking.
I can't judge, but if that is really true, then in principle a user-space library could be written to take care of all those corner cases you mention and still be faster than the kernel stack.
Of course that wouldn't be everyone rolling their own.
I've only poked at the FreeBSD TCP stack and not the Linux stack, but it seems like if the problem is locking, you should be able to get good results from working on the locking (finer-grained locks / tweaking parameters) in less time than building a full TCP stack.
What kind of limitations are people seeing with the Linux kernel? If I'm interpreting Netflix's paper[1] correctly, they could push at least 20 Gbps of unencrypted content with a single-socket E5-2650L (the document isn't super clear though; it says they were designed for 40 Gbps). My servers usually run out of application CPU before they run out of network -- but I've run some of them up to 10 Gbps without a lot of tuning.
Gbps are not created equal. Traffic with many small packets takes a lot more resources than traffic with fewer but bigger ones. Netflix's packets would be as big as they come.
Yes, in fact for things like Juniper/Cisco firewalls they will always quote PPS in full-MTU packets. If you want to bring that shiny new firewall to its knees, try sending it traffic at the minimum IPv4 MTU of 68 bytes at line rate for the NIC.
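To put rough numbers on the packet-size point (a back-of-the-envelope sketch, assuming standard Ethernet framing where every frame pays about 20 extra bytes on the wire for the preamble and inter-frame gap; line_rate_pps is just an illustrative helper, not from any real stack):

    /* Packets per second at line rate for a given frame size.
     * Each frame carries ~20 extra bytes on the wire (preamble + IFG). */
    static double line_rate_pps(double bits_per_sec, unsigned frame_bytes)
    {
        return bits_per_sec / ((frame_bytes + 20) * 8.0);
    }

    /* At 10 Gb/s:
     *   line_rate_pps(10e9,   64) ~= 14.88 million packets/s
     *   line_rate_pps(10e9, 1518) ~=  0.81 million packets/s
     * Same bit rate, roughly 18x the per-packet work for the stack. */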
The problem isn't locking so much, it's that you have to dispatch to a kernel thread when you're requesting and sending data, paying the cost of that context switch every time. In userspace you can spin a polling thread on its own core and DMA data up and down to the hardware all day long without yielding your thread to another one.
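The receive side of that pattern looks roughly like this (a sketch only; nic_rx_burst/nic_tx_burst are hypothetical stand-ins for whatever kernel-bypass driver API is in use, e.g. DPDK or netmap, and process_packet is the application's logic):

    #include <stdbool.h>
    #include <stddef.h>

    struct pkt;                                            /* opaque packet descriptor */
    size_t nic_rx_burst(struct pkt **pkts, size_t max);    /* hypothetical driver call */
    void   nic_tx_burst(struct pkt **pkts, size_t n);      /* hypothetical driver call */
    void   process_packet(struct pkt *p);                  /* application logic */

    /* Pinned to its own core; spins at 100% CPU by design, trading a core
     * for the per-packet syscall/mode-switch cost. */
    void poll_loop(volatile bool *running)
    {
        struct pkt *burst[32];

        while (*running) {
            size_t n = nic_rx_burst(burst, 32);
            for (size_t i = 0; i < n; i++)
                process_packet(burst[i]);
            if (n)
                nic_tx_burst(burst, n);
            /* No blocking call anywhere: on an idle link this just spins. */
        }
    }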
Sure, that saves you from dumping TLB state, but you still need to save register state and copy data from a user-supplied buffer into a kernel-owned, device-mapped buffer, wiping L1 data and instruction caches in the process.
For 99% of use cases this isn't a problem, but if you're trying to save every possible microsecond, then it definitely is.
Sure, I was more commenting on the parent post that suggested the cost was due to a "context switch", when it's not a context switch at all but a mode switch to "kernel mode".
If you are trying to save microseconds you are probably running special hardware like the SolarFlare network cards which also run the drivers in user space. These are generally hedge funds or high frequency trading shops. I can't imagine anyone else could justify the price.
I guess most locking is in place in order to allow multiplexing: different applications need access to different sockets, which nevertheless send and receive data through the same NIC.
If you implement the whole stack in userspace and have only a single thread which processes all data, you might get away with less locking. However, as soon as there are multiple threads that want to send/receive on multiple endpoints, there is the same need for synchronization, and it would need to be implemented in userspace.
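The usual way to dodge that is to keep every queue strictly single-producer/single-consumer, for example a lock-free SPSC ring between the polling thread and one worker. A minimal sketch using C11 atomics (not taken from any particular stack); the moment a second producer or consumer shows up, you are back to the synchronization problem the kernel already solves:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024                        /* power of two */

    struct spsc_ring {
        void           *slots[RING_SIZE];
        _Atomic size_t  head;                     /* written only by the producer */
        _Atomic size_t  tail;                     /* written only by the consumer */
    };

    static bool ring_push(struct spsc_ring *r, void *p)    /* producer thread */
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return false;                         /* full */
        r->slots[head & (RING_SIZE - 1)] = p;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    static void *ring_pop(struct spsc_ring *r)              /* consumer thread */
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return NULL;                          /* empty */
        void *p = r->slots[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return p;
    }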
The problem here is mostly not throughput, which I guess most of the high-end hardware can help with, but how many connections per second can be handled; that is where the bottleneck is in the Linux kernel. I can't imagine the Linux kernel supporting 1 million CPS now, but the same is possible with a userspace TCP stack.
I suspect this is not limited to networking stacks.
I see a whole lot of fretting over technical debt here on HN, and the more I see it, the more I find myself thinking that said debt does not just happen. There will always be a reason, and it can likely never be fully avoided.
As such, whenever I see someone joyfully ripping out old code or rewriting something from scratch, I can't help thinking that at best it has reset the clock by a few years. We will be right back where we started, and then some, soon enough.
For example, the company I currently work for has 15 years of unaddressed technical debt, and it has enormous impact on everything we're doing today. Fixing it is absolutely necessary, and where it's been possible to do so it's already made a significant impact.
However, in a few years things will likely still be awful and even the shiny new improvements will have been dirtied up, because the cultural and managerial decisions that led to the current state are still in full effect. You're correct in that technical debt is ironically usually as much a people problem as a technical one and solely-technical solutions are usually inadequate.
That doesn't mean that I'm going to stop fixing things where I can and pushing for change though, because at the very least it means my life there will be a little saner (until I finally burn out on fighting the tide and find somewhere different to go).
I have seen variations on this so, so many times over the years. "Why use that generic version, when we can roll our own?" Without a dedicated team, you're then stuck implementing and maintaining something on your own, while dozens, hundreds, or even thousands of people participate in the development of open-source libraries.
This is similar to making a private fork of an open-source project. At first, it seems fantastic. But pretty soon, you discover that no one, including the original authors, can provide you with advice and support.
I can't even imagine how painful that would be for part of the network stack. Yukko.
Agree that for most situations, especially public internet facing, this is the right advice.
However, in addition to big companies using it internally, it can also work when the environment is otherwise controlled. For example, scylladb is built on top of seastar and has its own userspace networking stack based on dpdk that it can use, and it works just fine since only other database instances and well-behaved clients will be interacting with those servers. No dedicated team required, just realistic isolation of the scope of this "custom" protocol.
So essentially, you are saying that the Internet is doomed to eternal cruft. It seems like the Left-pad incident shows that a Chinese dolls level of interdependency results in inherent instability.
But there is an army of programmers, sysadmins and so forth to fix all this. Moreover, this is what people have gotten to work at a really large scale. It seems like programming in two hundred years will become 90% "listening to the mythology" and 10% actual logic. But that's how it happens.
> "So essentially, you are saying that the Internet is doomed to eternal cruft."
No. This is a classic disruption story. While the establishment is smug and comfortable about the accumulated cruft of decades, others are working on the problems they're completely ignoring. Check out my other comments in this thread.
If there is to be disruptive change, it will be a new protocol that solves global-scale problems the old one did not. The automobile did not depend on the horse. An incrementally better TCP implementation will not disrupt TCP.
Unfortunately a large portion of protocol development seems to be occurring at the application layer even if it doesn't belong there, which is how we ended up with HTTP/2 and WebSockets.
The pragmatics are all against rewriting the network stack, as you have thoroughly explained. Though I have a feeling of unease. Network stack implementations are lacking in diversity. They improve more slowly than they otherwise could. They have plenty of undocumented obscure corner cases. Developing an implementation of a [de-facto] standard requires a solid open test suite. The Web platform has one, https://github.com/w3c/web-platform-tests. Is there an equivalent suite for TCP/IP?
Certainly there are high-quality commercial testing appliances for precise performance and correctness figures. For example, https://www.ixiacom.com/products/ixanvl