Please don't rewrite your network stack unless you can afford to dedicate a team to support it full time.
Twice in my career I have been on teams where we decided to rewrite IP or TCP stacks. The justifications were different each time, though never perf.
The projects were filled with lots of early confidence and successes. "So much faster" and "wow, my code is a lot simpler than the kernel equivalent, I am smart!" We shipped versions that worked, with high confidence and enthusiasm. It was fun. We were smart. We could rewrite core Internet protocol implementations and be better!
Then the bug reports started to roll in. Our clean implementations started to get cluttered with nuances in the spec we didn't appreciate. We wasted weeks chasing implementation bugs in other network stacks that were de facto but undocumented parts of the internet's "real" spec. Accommodating them cluttered that pretty code further. Performance decreased.
In both cases, after about a year, we found ourselves wishing we had not rewritten the network stack. We started making plans to eliminate the dependency, now much more complicated because we had to transition active deployments away.
I have not made that mistake a third time.
If you are Google, Facebook or another internet behemoth that is optimizing for efficiency at scale and can afford to dedicate a team to the problem, do it. But if you are a startup trying to get a product off the ground, this is premature optimization. Stay far, far away.
Agreed. Also, it likely makes you and your company a bad Internet citizen to roll your own, for the reasons you've mentioned. My first company sold proxy servers, and I can't count the number of times a buggy router stack or other embedded thing broke the web for some users some of the time. If you make an off-by-one mistake in your PMTU discovery code, you're gonna waste somebody's day. If you don't deal with encapsulation right, you're going to waste someone's day. If you respond incorrectly to ICMP messages, you're going to waste someone's day. The list of things that can go wrong is endless.
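If you want a sense of how fiddly even the "simple" parts are, here is a minimal sketch of reacting to an ICMPv4 "fragmentation needed" message during PMTU discovery, per RFC 1191 (handle_frag_needed is a hypothetical helper, not anyone's real code); the byte offsets, byte order, and clamping are exactly where the day-wasting off-by-ones live:

    #include <stddef.h>
    #include <stdint.h>

    #define ICMP_DEST_UNREACH 3
    #define ICMP_FRAG_NEEDED  4
    #define IPV4_MIN_MTU      68    /* RFC 791 floor */

    /* icmp points at the start of an ICMPv4 header of `len` bytes.
     * Returns the path MTU to record for the destination, or `current_pmtu`
     * unchanged if the message is not a fragmentation-needed report. */
    static uint16_t handle_frag_needed(const uint8_t *icmp, size_t len,
                                       uint16_t current_pmtu)
    {
        if (len < 8 || icmp[0] != ICMP_DEST_UNREACH || icmp[1] != ICMP_FRAG_NEEDED)
            return current_pmtu;

        /* RFC 1191: the next-hop MTU is in bytes 6-7, network byte order. */
        uint16_t next_hop_mtu = (uint16_t)((icmp[6] << 8) | icmp[7]);

        /* Pre-RFC-1191 routers send 0 here; don't trust it blindly, and
         * never go below the IPv4 minimum, or you blackhole the connection. */
        if (next_hop_mtu < IPV4_MIN_MTU)
            next_hop_mtu = IPV4_MIN_MTU;

        /* Only ever shrink the recorded path MTU in response to this message. */
        return next_hop_mtu < current_pmtu ? next_hop_mtu : current_pmtu;
    }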
I kinda feel like it's comparable to rolling your own encryption. Yes, most of the standards are well-written and well-defined, and you can spend a couple weeks really grokking Stevens' book (or whatever the modern equivalent is, I dunno, as I don't implement anything at that level anymore), but you're gonna spend years becoming bug-compatible with the rest of the Internet (or coming to realize your interpretation and the rest of the world's interpretation of the spec differ).
2 million requests a second sounds amazing. But, the price is high. As you note, if you have a team dedicated to it, cool. And, if you want to do it for fun and experience, that's also cool. But, Linux has a lot of wisdom built-in. There's like a gazillion commits over its life that have fixed various networking bugs and quirks.
Why does it make someone a bad internet citizen to implement internet protocols? Is there something official about the Linux kernel (the insecure, buggy, written-in-C Linux kernel)? I would start clean-sheet for a lot of reasons, mostly security, but I don't know what that has to do with my internet virtues.
No, there's nothing official about the Linux kernel implementation. Though you could make the case for BSD...
That aside, the reason it makes you a bad citizen is the lack of follow through. If you're a small organization writing a network implementation, you know from the start that your implementation is unlikely to get the decade+ of follow-through that is necessary to fix your interoperability problems. Instead you're putting that cost on every other internet host that is going to have to interact with your out-of-spec implementation until the day the last instance of your code is shut down.
I wrote an IP stack for my little toy OS a few years back, and even that experience was miserable and hard. It never works quite right, because either your implementation sucks or the people you're talking to are fast and loose with standards. More often both. It taught me, though, that the papers don't matter and the real standard is what's in the wild, and that what's happening in the wild is an order of magnitude more insane than you can imagine.
The specification, as written in the RFC, is pretty useless... But luckily people sat down and formalised TCP/IP as spoken on the Internet (10 years ago, Linux-2.4.20/FreeBSD-4.6/Windows XP): https://www.cl.cam.ac.uk/~pes20/Netsem/index.html
What is missing in 10 years of TCP? Apart from SACK and delayed ACK, not much, it seems (in FreeBSD, congestion control algorithms are pluggable nowadays).
> "Please don't rewrite your network stack unless you can afford to dedicate a team to support it full time."
Generally agreed but there's a caveat to this statement.
> "Twice in my career I have been on teams where we decided to rewrite IP or TCP stacks."
This is the caveat. Don't try to rewrite a protocol that's already running billions of devices on the internet which you have to interoperate with. It's a much more tractable problem when you control both ends.
Twice in my career, I've worked on protocols (with success... the first company went public, the second has its protocol working well in millions of devices). Both times the protocols were key to the business, of course, and both times the "custom" protocol only had to interoperate against itself on the other end.
I've always thought custom implementations of this flavor are either smart or dumb; it's never just an OK choice, there's no middle ground. I think in your case it's a big win. In the OP's case, it was a bad choice.
I'm not saying the OP's team aren't smart and capable people. They just made a bad choice.
Fundamentally, it's a hard thing to get perfect. There are lots of pointy edges in that code. If it is to be undertaken, it needs to be a huge win. The potential for a huge win is worth the effort.
Did you work at Onlive or something? I was always curious what their protocol was doing. I never bothered to look, because I figured it would take a while to figure out what I was looking at.
No it wasn't Onlive. The company that went public was Riverbed and they used their custom TCP between their appliances. My current company (which I founded) is PacketZoom and our custom protocol (built on top of UDP) communicates between our SDK in the mobile app and our servers distributed around the world.
We use Riverbeds on either side of our sat shots here in Antarctica. They work fairly well for the most part. I didn't know they were running a custom TCP implementation between the end devices.
Veteran of a similar affair. Well, two stacks, and once, an HTTP proxy.
You may be smarter than the average bear, but you have to deal with other people's um . . . questionable decisions (and bugs).
Customers won't care that the FuppedUckTron-9000 web server they bought on eBay is non-compliant and that its Content-Length needs to have special casing to work around some spectacular drain-bamage, they only care about their valuable business data ^H^H^H porn.
The original article claims that having the TCP stack in the kernel causes performance problems because it needs to do excessive locking.
I can't judge, but if that is really true, then in principle a user-space library could be written to take care of all those corner cases you mention and still be faster than the kernel stack.
Of course that wouldn't be everyone rolling their own.
I've only poked at the FreeBSD TCP stack and not the Linux stack, but it seems like if the problem is locking, you should be able to get good results from working on the locking (finer-grained locks / tweaking parameters) in less time than building a full TCP stack.
What kind of limitations are people seeing with the Linux kernel? If I'm interpreting Netflix's paper[1] correctly, they could push at least 20 Gbps of unencrypted content with a single-socket E5-2650L (the document isn't super clear though; it says they were designed for 40 Gbps). My servers usually run out of application CPU before they run out of network -- but I've run some of them up to 10 Gbps without a lot of tuning.
Gbps are not created equal. Traffic with many small packets takes a lot more resources than traffic with fewer but bigger ones. Netflix's packets would be as big as they come.
Yes, in fact for things like Juniper/Cisco firewalls they will always quote PPS in full-MTU packets. If you want to bring that shiny new firewall to its knees, try sending it traffic at the minimum IPv4 MTU of 68 bytes at line rate for the NIC.
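To put rough numbers on the packet-size point (a back-of-the-envelope sketch, assuming standard Ethernet framing where every frame pays about 20 extra bytes on the wire for the preamble and inter-frame gap; line_rate_pps is just an illustrative helper, not from any real stack):

    /* Packets per second at line rate for a given frame size.
     * Each frame carries ~20 extra bytes on the wire (preamble + IFG). */
    static double line_rate_pps(double bits_per_sec, unsigned frame_bytes)
    {
        return bits_per_sec / ((frame_bytes + 20) * 8.0);
    }

    /* At 10 Gb/s:
     *   line_rate_pps(10e9,   64) ~= 14.88 million packets/s
     *   line_rate_pps(10e9, 1518) ~=  0.81 million packets/s
     * Same bit rate, roughly 18x the per-packet work for the stack. */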
The problem isn't locking so much, it's that you have to dispatch to a kernel thread when you're requesting and sending data, paying the cost of that context switch every time. In userspace you can spin a polling thread on its own core and DMA data up and down to the hardware all day long without yielding your thread to another one.
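The receive side of that pattern looks roughly like this (a sketch only; nic_rx_burst/nic_tx_burst are hypothetical stand-ins for whatever kernel-bypass driver API is in use, e.g. DPDK or netmap, and process_packet is the application's logic):

    #include <stdbool.h>
    #include <stddef.h>

    struct pkt;                                            /* opaque packet descriptor */
    size_t nic_rx_burst(struct pkt **pkts, size_t max);    /* hypothetical driver call */
    void   nic_tx_burst(struct pkt **pkts, size_t n);      /* hypothetical driver call */
    void   process_packet(struct pkt *p);                  /* application logic */

    /* Pinned to its own core; spins at 100% CPU by design, trading a core
     * for the per-packet syscall/mode-switch cost. */
    void poll_loop(volatile bool *running)
    {
        struct pkt *burst[32];

        while (*running) {
            size_t n = nic_rx_burst(burst, 32);
            for (size_t i = 0; i < n; i++)
                process_packet(burst[i]);
            if (n)
                nic_tx_burst(burst, n);
            /* No blocking call anywhere: on an idle link this just spins. */
        }
    }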
Sure, that saves you from dumping TLB state, but you still need to save register state and copy data from a user-supplied buffer into a kernel-owned, device-mapped buffer, wiping L1 data and instruction caches in the process.
For 99% of use cases this isn't a problem, but if you're trying to save every possible microsecond, then it definitely is.
Sure, I was more commenting on the parent post that suggested the cost was due to a "context switch", when it's not a context switch at all but a mode switch to "kernel mode".
If you are trying to save microseconds you are probably running special hardware like the SolarFlare network cards which also run the drivers in user space. These are generally hedge funds or high frequency trading shops. I can't imagine anyone else could justify the price.
I guess most locking is in place in order to allow multiplexing: different applications need access to different sockets, which nevertheless send and receive data through the same NIC.
If you implement the whole stack in userspace and have only a single thread which processes all data, you might get away with less locking. However, as soon as there are multiple threads that want to send/receive on multiple endpoints, there is the same need for synchronization, and it would need to be implemented in userspace.
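The usual way to dodge that is to keep every queue strictly single-producer/single-consumer, for example a lock-free SPSC ring between the polling thread and one worker. A minimal sketch using C11 atomics (not taken from any particular stack); the moment a second producer or consumer shows up, you are back to the synchronization problem the kernel already solves:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024                        /* power of two */

    struct spsc_ring {
        void           *slots[RING_SIZE];
        _Atomic size_t  head;                     /* written only by the producer */
        _Atomic size_t  tail;                     /* written only by the consumer */
    };

    static bool ring_push(struct spsc_ring *r, void *p)    /* producer thread */
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return false;                         /* full */
        r->slots[head & (RING_SIZE - 1)] = p;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    static void *ring_pop(struct spsc_ring *r)              /* consumer thread */
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return NULL;                          /* empty */
        void *p = r->slots[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return p;
    }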
The problem here is mostly not throughput, which I guess most of the high-end hardware can help with, but how many connections per second can be handled; that is where the bottleneck is in the Linux kernel. I can't imagine the Linux kernel supporting 1 million CPS now, but the same is possible with a userspace TCP stack.
I suspect this is not limited to networking stacks.
I see a whole lot of fretting over technical debt here on HN, and the more I see it, the more I find myself thinking that said debt does not just happen. There will always be a reason, and it can likely never be fully avoided.
As such, whenever I see someone joyfully ripping out old code or rewriting something from scratch, I can't help thinking that at best it has reset the clock by a few years. We will be right back where we started, and then some, soon enough.
For example, the company I currently work for has 15 years of unaddressed technical debt, and it has enormous impact on everything we're doing today. Fixing it is absolutely necessary, and where it's been possible to do so it's already made a significant impact.
However, in a few years things will likely still be awful and even the shiny new improvements will have been dirtied up, because the cultural and managerial decisions that led to the current state are still in full effect. You're correct in that technical debt is ironically usually as much a people problem as a technical one and solely-technical solutions are usually inadequate.
That doesn't mean that I'm going to stop fixing things where I can and pushing for change though, because at the very least it means my life there will be a little saner (until I finally burn out on fighting the tide and find somewhere different to go).
I have seen variations on this so, so many times over the years. "Why use that generic version, when we can roll our own?" Without a dedicated team, you're then stuck implementing and maintaining something on your own, while dozens, hundreds, or even thousands of people participate in the development of open-source libraries.
This is similar to making a private fork of an open-source project. At first, it seems fantastic. But pretty soon, you discover that no one, including the original authors, can provide you with advice and support.
I can't even imagine how painful that would be for part of the network stack. Yukko.
Agree that for most situations, especially public internet facing, this is the right advice.
However, in addition to big companies using it internally, it can also work when the environment is otherwise controlled. For example, scylladb is built on top of seastar and has its own userspace networking stack based on dpdk that it can use, and it works just fine since only other database instances and well-behaved clients will be interacting with those servers. No dedicated team required, just realistic isolation of the scope of this "custom" protocol.
So essentially, you are saying that the Internet is doomed to eternal cruft. It seems like the Left-pad incident shows that a Chinese dolls level of interdependency results in inherent instability.
But there is an army of programmers, sysadmins and so forth to fix all this. Moreover, this is what people have gotten to work at a really large scale. It seems like programming in two hundred years will become 90% "listening to the mythology" and 10% actual logic. But that's how it happens.
> "So essentially, you are saying that the Internet is doomed to eternal cruft."
No. This is a classic disruption story. While the establishment is smug and comfortable about the accumulated cruft of decades, others are working on the problems they're completely ignoring. Check out my other comments in this thread.
If there is to be disruptive change, it will be a new protocol that solves global-scale problems the old one did not. The automobile did not depend on the horse. An incrementally better TCP implementation will not disrupt TCP.
Unfortunately a large portion of protocol development seems to be occurring at the application layer even if it doesn't belong there, which is how we ended up with HTTP/2 and WebSockets.
The pragmatics are all against rewriting the network stack, as you have thoroughly explained. Though I have a feeling of unease. Network stack implementations are lacking in diversity. They improve more slowly than they otherwise could. They have plenty of undocumented obscure corner cases. Developing an implementation of a [de-facto] standard requires a solid open test suite. The Web platform has one, https://github.com/w3c/web-platform-tests. Is there an equivalent suite for TCP/IP?
Certainly there are high-quality commercial testing appliances for precise performance and correctness figures. For example, https://www.ixiacom.com/products/ixanvl