Some amount of TCP handling needs to be in the kernel to arbitrate the shared resource (namely, connection tuples). Once you've handled IP defragmentation [1], looked at the structures to get port numbers, and associated the packet with a userspace file descriptor/process, you've already got all your cachelines primed, and you may as well finish processing the packet before handing it to userspace.
[1] boo hiss; I wish the spec was simply to truncate overlong packets at the MTU and indicate that with a flag; the peers could then figure out what to do when a packet arrived that was shorter than its original length. Handling it in-band would mean the signal was more likely to arrive. Instead we can fragment it, which is icky, because defragmentation sucks; or we can drop it and send an out-of-band message to the sender, but that message may not make it (and often doesn't). TCP could very easily adapt to 'I sent 1480 bytes of payload, but my peer is only acking 1472 each time, maybe I should send 1472'; that's much easier and quicker than 'I keep sending packets and they don't get acked, maybe I should try sending smaller packets 15 seconds later'.
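To make the footnote's idea concrete, here's a rough sketch of how a sender might adapt under the truncate-and-flag scheme I'm wishing for. To be clear, that scheme doesn't exist; `next_payload_size`, `simulate`, and the "truncating hop" are all made up for illustration, and real TCP has a lot more going on (windows, retransmits, and so on).

```python
def next_payload_size(sent_bytes, acked_bytes, current_size):
    """Pick the payload size for the next segment (hypothetical scheme)."""
    if 0 < acked_bytes < sent_bytes:
        # The peer acked only a prefix of what we sent, so something on the
        # path truncated the segment. Shrink to what actually got through.
        return min(current_size, acked_bytes)
    return current_size


def simulate(path_limit, initial_size, total_bytes):
    """Send total_bytes over a path that truncates payloads to path_limit.

    Returns the list of payload sizes the sender tried, in order.
    """
    size = initial_size
    delivered = 0
    sizes_used = []
    while delivered < total_bytes:
        to_send = min(size, total_bytes - delivered)
        sizes_used.append(to_send)
        # Assumption: a hop truncates instead of fragmenting or dropping,
        # and the peer acks exactly the bytes that arrived.
        acked = min(to_send, path_limit)
        delivered += acked
        size = next_payload_size(to_send, acked, size)
    return sizes_used


if __name__ == "__main__":
    print(simulate(path_limit=1472, initial_size=1480, total_bytes=5000))
```

The sender loses 8 bytes on the first segment, sees the short ack, and is at the right size one round trip later; no timeout and no out-of-band message needed.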