Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Fix for Golang runtime memory corruption bug (googlesource.com)
115 points by rargulati on Jan 7, 2017 | hide | past | favorite | 49 comments


One thing I like about rsc (and Austin) is that they have these well explained commit logs that let's people follow along at home. The changes seem like they are well justified and made for posterity. This change in particular is a story more than it is a 3 LOC fix.


I like going through Postgres's commit logs once in a while for exactly that reason, non-trivial commits (logic wise, they may be a 1LOC commit) get short-story-scale messages, often with references to ML thread(s) where the issue or change was discussed. It's a nice change of pace from $dayjob where a commit changing half the codebase in ways unspecified gets logged as "improvements".


> It's a nice change of pace from $dayjob where a commit changing half the codebase in ways unspecified gets logged as "improvements".

The best (worst?) one is where you'll see a series of major changes with the following commit messages:

- Add feature X (no explanation of what it does)

- Fix feature X

- Really fix feature X

- Fix feature X for real

- Finally fixed X

- Fix test environment


I know you don't work at $dayjob because the last commit would be "fix X test which is dumb", which removes the test entirely.


You weren't joking, it is a story. Isn't it kind of unfortunate that all this information is in a commit message, that 1) needs to be discovered through git blame, 2) at risk of requiring code archeology if enough lines change, e.g. by moving code around, and 3) is immutable so parts of it cannot be edited in the future?

I see this habit in a lot of commit messages and while it is important to convey intent, there are better ways to document your code (either inline, or in this particular case, a wiki page).


Much of it is reproduced in the comment in the patch.


Indeed, the only actual code change is:

    -	sellock(scases, lockorder)
    +	systemstack(func() {
    +		sellock(scases, lockorder)
    +	})
The rest of the patch is adding/changing comments explaining everything.


I feel this kind of thing should be remembered when people say they're going to reinvent decades-old technology.

This bug exists because they wanted to reinvent threading so you could have millions of threads, except when you actually use millions of threads for anything non-trivial it doesn't actually work out so well. To do that they couldn't use the standard since-1970 fixed stacks so they needed movable/growable stacks. They reinvented exceptions in a way that broke selinux. They reinvented the memory allocator in a way that breaks ulimit. They reinvented a javadoc/doxygen that's less useful.

It's great to try to improve things, but a lot of times there's actually very good reasons for the old ways.


A lot of the choices for Go become clearer when you realize it was primarily built to solve problems of servers and code base at scale.

It supports millions of co-routines, not threads. There's typically one thread per CPU or so. The key insight is a lot of server type workloads primarily are constrained on I/O and multiplexing virtually unlimited co-routines on a thread automatically by the runtime makes for much simpler concurrent server code.

Implementing high concurrency server code in almost any other mainstream language is non-trivial. Indeed I originally switched from Python to Go primarily because of that.


Are you kidding? How is `go doc` anything like Javadoc aside from being automatically generated by machine from your sources. Might as well compare it to Jekyll.



Note the backticks, I was referring to the `go doc` command built into the standard tooling. There are many generated HTML documentation generation tools for every language. It's Golang's general approach to documentation which many find appealing, not the fact they can output to HTML.


My tooling does this too and I don't even need to remember a terminal command: http://i.imgur.com/YmAkBPG.png

I get what your saying but it's apples to oranges. They are in effect similar though.


I (and I guess many others) enjoy that Golang's documentation strategy is coupled far tighter to the language (almost as a component of the language), in contrast to JavaDoc's existence as more of a standard for writing comments, `javadoc` being the first of many implementations and variations on it.

I'm not saying this is in any way empirically superior, in fact I think Go's approach owes a lot to Sun's original vision with JavaDoc. Some just prefer it and feel it's approach benefits them, and it's a preference that is often factored in when making the choice to use it.

My original point was that Sun's approach is now so commonplace it's an 'apples and oranges' argument. Golang has made a legitimate step forward (I think), in how a language treats and encourages it's documentation: As a first class citizen from day 0, rather than a convenient afterthought. The parity of experience between godoc.org & `go doc` demonstrates this with ample effect.


I'm not sure what you are implying here. Should we give up on creating a language that can have millions of threads because we might break selinux? Should we give up on programs with millions of threads because someone in the 1970's never foresaw that use case?


> Should we give up on programs with millions of threads because someone in the 1970's never foresaw that use case?

It's not that simple. I think there's a reasonable first-principles argument to be made that scheduling belongs in the kernel, which has a global view of the system and is better equipped to make those decisions.


I use erlang, not go, but it does similar things. On most of our production machines, the only OS process dong real work is the erlang VM, other processes are just things like crond, ntpd, sshd, syslogd; so all the kernel scheduler has to do is run the erlang OS threads (and they're pined to individual CPUs).

Maybe the kernel scheduler would do a better job, but is it realistic to run 1M OS threads? It feels like the overhead per green thread is much lower, and the potential for being interrupted in a critical section is also much lower: An OS thread may be paused while holding a (vm internal) lock, but the vm scheduler would not preempt a green thread in the middle of doing something that requires a lock (ex: Delivering a message to another green thread)


> but is it realistic to run 1M OS threads? It feels like the overhead per green thread is much lower

What kind of overhead do OS thread have that green thread doesn't ? I'm aware of the stack size, but is there something else ?

If the stack size is the only culprit here, what prevents the admin from manually setting a small stack size that will be enough for most operation (in practice, I don't know how useful the ability to grow dynamically the stack size in languages with green threads).

> An OS thread may be paused while holding a (vm internal) lock, but the vm scheduler would not preempt a green thread in the middle of doing something that requires a lock (ex: Delivering a message to another green thread)

That's an interesting point.


Besides the stack size there's also the switching overhead involved when you go through the kernel.

And yes, the dynamic stack growth is an extremely important part of it. Without it, you'd just have small stacks. Take an arbitrary pthreads program and restrict its stacks to a few kb. You won't have much fun. The code really needs to be written such that it can cope with small stacks. That can be tedious, so you don't want to do it for run-of-the-mill userspace code. Dynamically-sized stacks give you the best of both worlds.


You can use dynamically-sized stacks with pthreads. Nothing says you can't.

Dynamically sized stacks are a property of GC, not userland threads. This is a common misconception.


> What kind of overhead do OS thread have that green thread doesn't ? I'm aware of the stack size, but is there something else ?

Real context switches (with cache invalidation) to switch between, system call (with context switch to kernel) to create/dispose, probably kernel bits that will explode with huge thread lists: I wouldn't be surprised if listing all threads with ps or top also locks the thread lists: not a big deal if you have a reasonable number of threads, but could block thread creation in a 1M thread case.


Context switches are really fast on modern x86.


> Context switches are really fast on modern x86.

Not context switching is even faster.

How fast is really fast, how modern is modern? some results from 2010 [1] show at least 1000 nanoseconds per context switch (even with optimizations for the kernel switching between threads of the same process). That's not that slow, but it is a lot of time you could be doing something else.

[1] http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-ma...

[1] http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-ma...


Plus there are specialized hardware instructions for them, and the expensive parts (saving weird registers etc.) are fast-pathed in Linux if they're not being used.


The kernel is necessarily so general purpose that it's hardly equipped at all to make correct, performant scheduling decisions at the app level.


Some Go users can probably make the argument that their one Go application is the single process using appreciable amounts of CPU time on their systems.


Well, or some Erlang users, I guess, fine


> Should we give up on programs with millions of threads

We gave up on this already, as multithreading is not even a little bit decent model for concurrency.


Okay. What is a decent model for concurrency that incorporates shared mutable memory (which, like it or not, is [1] a thing that exists on real machines and is impossible to avoid, and [2] is very much alive and well in Go) and isn't basically threads redux? I seriously have no clue what people mean when they say things like this.


Go programs have roughly one active thread per CPU in the computer. I don't think they claimed they reinvented threading, they used a (known) model, where many green threads are mapped to hardware threads and added the primitives to the language. To make millions of goroutines on one machine feasible, they need small default stacks - 2kb currently.


Do you have a pointer to that SELinux issue? Can't remember anything related to exceptions.

I think there was some issue with runtime code generation (used with closures IIRC) very early on but that's ancient history by now.


doesn't it mean some kind of global lock ?

i'd be curious to see the impact on benchmarks for parallelization.



Does this mean every single application binary compiled with go in the wild up until today has a memory corruption bug?

If so, it's going to take some effort for this bug to be weeded out in every single deployment around the world...


From the commit message:

"This bug has existed since Go 1.4. Until then the select code was implemented in C, 'done uint32' was a C stack variable 'uint32 done', and C stacks never moved."

It'll affect probably most running jobs out there, as Go 1.4 was released on Dec 10, 2014.


I don't understand this comment, are you implying that it's the first time you see a runtime with a bug?


Well usually you can just update whatever library has the bug and restart your applications, but as I understand this bug will be statically linked in all application binaries so they will all have to be recompiled and redeployed?


Yes, that's what the grandparent is saying, this is a bug in the runtime, which you can't fix by updating libraries, you have to update the compiler/interpreter (depending on the language).

This is not a unique occurrence.


I think the original question was querying if the runtime is statically or dynamically linked.

It's also possibly a subtle dig at go, because unlike other languages go does not use dynamic linking, meaning bugs in libraries require recompilation (has positive effects too).


Yes it does, since Go 1.6.

It just isn't the default compilation mode, you have to explicitly enable it.


Runtimes have bugs but normally you just download a new one and swap it out.


The bug was introduced in Go 1.5 (when the compiler was rewritten to Go)


No, it was introduced in Go 1.4, when the runtime was rewritten in Go.


Some of the points are just ridiculous. Because you can't compare java and go or whatever. If you want complain about a language you should first checking the history of java or whatever. Maybe this language has runtime bugs in old versions as well.


I'm surprised there was no test added to ensure this doesn't regress. Any reason why not?


Looks like it's a concurrency bug that is only triggered non-deterministically under certain circumstances. It is probably hard to reproduce reliably in a test, if this is possible at all.


I wonder if it'll get backported to at least Go 1.7?


Not sure, but I'd expect it to get fixed in 1.4, since that's the version needed to generate later versions if you're bootstrapping.


Somewhat OT: Just having recently gotten into Go: How about changing the official name from Go to Golang? It's not ideal, but so wasn't the naming in the first place :/ Go isn't particularly search friendly. Quite ironic for a language coming out of the world's largest search company.

Just try searching for go in the otherwise excellent Algolia HN search engine:

Without quotes:

http://i.imgur.com/8ztH48u.png

With quotes:

http://i.imgur.com/5a229fC.png




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: