In my experience, kernel threads cost enough more than, for example, Erlang processes (on the order of a few hundred machine words each) that it changes the way you program with them.
In Erlang you don't think twice about spawning a million processes if that models your problem nicely. The same isn't true when you're dealing with kernel threads. (Right?) So I'm interested in green threads because once I have to start thinking about the cost of the threads I'm using, I'm thinking less about how to solve the problem most elegantly and more about how to satisfy the architecture I'm programming on.
Now, if kernel threads are massively cheap these days and there's no problem spawning a million of them, then I need to take another look at that model.
Since you're talking about scalability into the millions of threads, I think what you actually want is stackless coroutines rather than M:N threading with separate user-level stacks. If you have 1M threads, even one 4 KB page of stack for each adds up to about 4 GB of memory. That's assuming no fragmentation or delayed reclamation from GC. Stacks, even relocatable ones, are too heavyweight for that kind of extreme concurrent load. With a stackless coroutine model, it's easier to reason about how much memory you're using per request; with a stack model, usage is extremely dynamic, and compilers will readily sacrifice stack space for optimization behind your back (consider e.g. LICM, which can lengthen live ranges and cause spills to the stack).
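To make the arithmetic concrete, here is a rough sketch in Python. The 4 KiB page size is an assumption (it is the common default on x86-64 Linux, but not universal), and `stack_memory_bytes` is just a helper name made up for this example.

```python
# Back-of-the-envelope stack cost for N threads.
# Assumption: 4 KiB pages and at least one resident page of stack per thread.
PAGE_SIZE = 4096  # bytes


def stack_memory_bytes(n_threads, pages_per_stack=1, page_size=PAGE_SIZE):
    """Lower bound on stack memory: ignores fragmentation and GC delays."""
    return n_threads * pages_per_stack * page_size


total = stack_memory_bytes(1_000_000)
print(f"{total:,} bytes, about {total / 10**9:.1f} GB ({total / 2**30:.2f} GiB)")
```

This is only a floor: any thread that touches a second page doubles its share, which is why the per-thread figure is so hard to pin down with a stack model.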
Stackless coroutines are great (you can get to nginx levels of performance with them), but they aren't M:N threading as seen in Go. Once you have a stack per thread, as Erlang and Go do, you've already paid a large portion of the cost of 1:1 threading.
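For anyone who hasn't written one by hand, this is roughly what "stackless" means in practice: the request is an explicit state machine, and everything it needs between events lives in a fixed set of fields. This is a sketch only. The `Request` class, its states, and the toy "length header, then body" protocol are invented for illustration and are not taken from nginx or any other server.

```python
from enum import Enum, auto


class State(Enum):
    READ_HEADER = auto()
    READ_BODY = auto()
    DONE = auto()


class Request:
    """All per-request state is in these fields; no stack survives between events."""

    __slots__ = ("state", "body_len", "body")

    def __init__(self):
        self.state = State.READ_HEADER
        self.body_len = 0
        self.body = b""

    def on_data(self, chunk):
        """Advance by one chunk, then return to the event loop right away."""
        if self.state is State.READ_HEADER:
            # Hypothetical protocol: the first chunk is the body length in ASCII.
            self.body_len = int(chunk.decode("ascii"))
            self.state = State.READ_BODY if self.body_len else State.DONE
        elif self.state is State.READ_BODY:
            self.body += chunk
            if len(self.body) >= self.body_len:
                self.state = State.DONE


req = Request()
for chunk in (b"5", b"he", b"llo"):
    req.on_data(chunk)
print(req.state, req.body)
```

The point is that you can read the memory cost per request straight off the field list, which is not something you can do with a stack that the compiler manages for you.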
Coroutines are preemptible only at I/O boundaries or manual synchronization points. Those synchronization points could be inserted by the compiler, but if you do that you're back in goroutine land, which typically isn't better than 1:1. In particular, it seems quite difficult to achieve scalability to millions of threads with "true" preemption, which requires either stacks or aggressive CPS transformation.
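A small sketch of what "preemptible only at synchronization points" looks like, using Python generators as a stand-in for stackless coroutines (an approximation, since each generator still carries its own frame). The `worker` and `run_round_robin` names are made up for this example, and the bare `yield` plays the role of an I/O boundary or manual sync point.

```python
from collections import deque


def worker(name, steps, log):
    for i in range(steps):
        log.append((name, i))  # runs uninterrupted up to the next yield
        yield                  # the only place the scheduler can switch tasks


def run_round_robin(tasks):
    """Cooperative scheduler: resume each task in turn until all have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, drop it
        ready.append(task)


log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
print(log)
```

A worker that never reaches a `yield` would starve everything else, which is exactly the gap that compiler-inserted yield points or real stacks are meant to close.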