The Broken Promises of MRI/REE/YARV

iam · on July 5, 2011

I think this is a problem that exists across any VM that implements a GC, not just Ruby.

.NET CLR has the exact same problem (perhaps a harder one, since CLR has a moving GC), so anytime they touch GC references (pointers to objects that are collectible) it's always wrapped in an explicit GC stack frame (think GC struct that lives on the stack). Furthermore, all reads/writes are carefully done with macros (which of course expands to volatile + some other stuff) to make sure the compiler doesn't optimize it away.

On the one hand, this is nice because they don't need to scan the C-stack (it scans the VM stack and the fake GC frame stacks -- well it's one stack but you skip the native C frames), on the other hand this means that any time a GC object is used in C code (ok, actually it's C++) they have to be real careful to guard it.

Of course bugs crop up all the time where an object gets collected where it shouldn't have, it happens so often that there is a name for it -- "GC Hole".

Astute readers and users of p/invoke may remark that they don't have to set up any "GC frames" -- that is because this complicated scheme is not exposed outside of the CLR source. Regular users of .NET who want to marshal pointers between native/managed can simply request that a GC reference gets pinned, at which point I'm mostly sure it won't get collected until it's unpinned.

The bad news is I'm almost positive there is nothing you can do with just C here to make this problem go away. You'd want stuff to magically just happen under the hood, and C++ is the right way to go for that.

It's probably possible to create an RAII style C++ GC smart pointer that would be 99% foolproof at the expense of some performance. It gets a little bit trickier if we are doing a moving collector. I am thinking it could ref/unref at creation/destruction, and disallow any direct raw pointer usage not to shoot yourself in the foot.

Of course the people writing the GC still need to worry about this..

tsuyoshi · on July 6, 2011

Anyone who has written an extension to a garbage-collected language in C will have run into this issue. Personally I've written extensions for Guile, OCaml, Ruby, MLton, and Java, and all of them have tricky rules for making your C code safe for garbage collection. Using volatile is the wrong way to do this though... this tells me that the people figuring this stuff out for Ruby don't really know C that well.

tptacek · on July 5, 2011

A very similar pattern bit me in the ass with the ObjC GC and libevent.

fanf2 · on July 7, 2011

There are other ways to structure the VM's API so that all VM objects are connected to VM data structures at all times. A good example is Lua, where you manipulate Lua objects on the Lua stack - they are never referred to by a raw C pointer.

thibaut_barrere · on July 5, 2011

I do appreciate the technicality of the article, but I'm not sure to agree with the first point of conclusion: how does it makes MRI (and related) 'fatally flawed' though? (real question).

What makes it 1/ irreversible and 2/ bad for today's users?

EDIT: as well, I wouldn't stop using Ruby because of that; I would use JRuby or Rubinius or IronRuby (if I understand well, these ones are not affected?)

jjore · on July 5, 2011

A fairly commonsensical approach is to just require all extension authors to annotate their code properly. At some basic level, this happens with Perl with its oft-maligned DSL for generating C code that happens to do all the right declarations. You might then end up writing your code using more macros. It's certainly not pretty but it is sound.

A plausible rewrite of that function in an XS for ruby would leave the function declaration and wrapper code up to your equivalent of xsubpp to execute your DSL and transform the wrapped code to fully functional C. If you build a C using extension from Perl, you'll find an XS file like http://cpansearch.perl.org/src/SIMON/Devel-Pointer-1.00/Poin... which during the `perl Makefile.PL && make` step is transformed via `xsubpp Pointer.xs > Pointer.c` and then compiled as normal C.

phillmv · on July 5, 2011

It's a bit hysterical.

Shit! MRI/YARV/REE are inherently fatally flawed! All that code I have running in production must be a FIGMENT OF MY IMAGINATION! SAVE YOURSELVES

benblack · on July 5, 2011

I am running this code in production, hence it cannot have bugs. QED.

Yours in perpetual bogglement,

Lil' B

msbarnett · on July 5, 2011

That is an interesting strawman you've constructed, as accepting it requires the reader to conflate the idea of bugs in general and "fatal flaws".

Obviously all non-trivial code working in production not only can have bugs, but will have bugs. Just as obviously, no reasonable person would consider those "fatal flaws" for any reasonable definition of the word fatal.

MRI/YARV's Conservative GC opens up some bedevilling classes of bugs for gem writers, obviously. Calling that a "fatal flaw" when millions of lines of production code continue to function despite its presence is nothing but over-the-top hyperbole.

pshc · on July 5, 2011

I think the author's definition of "fatally flawed" in this article is more along the lines of "this is an evolutionary dead end and I won't have anything to do with it in the long run" rather than "cannot work under any circumstance."

rbranson · on July 5, 2011

Erlang is 100% bug free. Use that instead.

strmpnk · on July 5, 2011

You should see Joe's rants on Erlang too. Not as bad as MRI but there are plenty of things to gripe about in beam.

strmpnk · on July 6, 2011

How does this deserve a down vote? Erlangs beam VM is pretty amazing but it's not without some pretty weird artifacts in the source, many of which I've discovered via Joe's twitter ranting.

(EDIT: I guess people don't like unpopular views at all, that's fine, long live jokes, forget the facts.)

msbarnett · on July 6, 2011

I'd rather see a technical analysis of Erlang that didn't read like it was written by Kanye West and sponsored by Axe Body Spray.

jamesgolick · on July 6, 2011

This would be funny if it made any sense and if it wasn't a ripoff of an old Merlin Mann tweet about DHH.

nwjsmith · on July 6, 2011

A _why tweet? http://favstar.fm/users/_why/status/1640180235

codahale · on July 6, 2011

Perhaps you should write that, then.

rbranson · on July 6, 2011

"Yo Ruby, I'm really happy for you, but Erlang has one of the best VMs of all time."

jcapote · on July 6, 2011

That's a pretty cool story you got there.

koudelka · on July 5, 2011

The point was clearly not that it has no bugs, but that if something is working to spec, it's working.

dlikhten · on July 5, 2011

I fail to see how this is an all hands abandon ship issue. If its a critical issue in all 3 interpreters they should be fixed asap if possible. At worst with a flag.

If rubinius/ironruby/jruby have no issues, this may become moot eventually as rubinius is gaining lots of traction recently and is becoming faster by the release outperforming standard ruby vms in many cases.

evanphx · on July 5, 2011

Neither Rubinius nor JRuby (and probably IronRuby too) have this issue because they all use accurate garbage collection rather than conservative. Accurate requires much more bookkeeping since all pointers must always be properly identified, but if you start writing a system with accurate GC, it's pretty easy. Bugs like this are a direct result of a conservative GC strategy (and these bugs, as I'm sure you got reading Joe's post, really really suck to find).

pmjordan · on July 5, 2011

This class of subtle bugs exists whether or not your GC is accurate as soon as you take the red pill and leave the VM environment. If you forget to add your C pointer to the accurate GC's root set, you're just as dead. Related story: http://news.ycombinator.com/item?id=217189

evanphx · on July 5, 2011

But that is by definition a tractable problem because the source will show that the root set isn't being used properly. (additionally, in practice this proves to be a rare and easy to fix bug)

jleader · on July 5, 2011

I think the author has a valid point that the "conservative" garbage collection approach has a flaw in its assumptions about the behavior of C compiler optimizations, and it doesn't sound like something easy to fix without a rewrite (i.e. switching to "accurate" GC). This sort of flaw will continue producing new surprising bugs, potentially any time the code is changed, or any time the compiler's optimizations change. These sorts of bugs are frustrating to track down, because they depend both on details of code optimization, and on details of memory allocation/deallocation history. If you compile with debugging options, you may change what optimizations are used; if you insert debug prints for some old-school log-based analysis, you may change the allocation/deallocation history, so the GC gets triggered in a different place.

jjore · on July 5, 2011

Right now, doesn't the GC traverse the entire heap and keep all objects where the memory's value looks like it might possibly be a pointer to some other object in memory?

This certainly isn't an awesome solution but couldn't the GC backtrace(3) the current process and look at %eax at all C stack frames to additionally include that value in the "pointers currently plausibly in flight" list?

pmjordan · on July 5, 2011

The problem is this[1]: strings are compound objects, which use 2 memory allocations. One for the object representation, the other for the memory holding the character array. The problem arises when you access the character array but technically no longer need the string object itself anymore. The C compiler notices that you don't use the pointer to the string object anymore, so it doesn't bother storing that on the stack. It is allowed to do this. The GC's mark phase now runs; it inspects all the stack frames and the global roots. It detects that no references to the string object exist and decides to collect it. There happens to be a destructor function associated with that memory object, which frees the character array, as the character array is manually memory managed. It blows up when you then try to access that character array directly.[2]

The correct way to handle this is to add the object reference to the GC's "root" set while you're using its guts, and removing it again when you're done.

Another possible solution is to allocate the string object and its character representation in one chunk of memory. This only works for immutable strings which never share substructure, though. The reason this works is that most conservative GCs will consider objects live as long as there is a pointer pointing to somewhere within a chunk of memory, not necessarily at the beginning.

[1] note: I'm not a Ruby coder but I fixed a very similar problem in a Lua implementation about 4 years ago. That one wasn't even conservative GC. EDIT: I told the story of that bug on HN 3 years (!) ago http://news.ycombinator.com/item?id=217189

[2] worse, it probably doesn't blow up immediately and instead causes memory corruption.

davesims · on July 5, 2011

This post is a weird mix of careful technical analysis and douchey, Zed Shaw-style hysterical overstatement.

However, I would like to see Matz' response to the recommended steps for a fix at the end. Sounds like a reasonable goal to add for Ruby 2.0.

Note to self: Listening to Papoose while writing a technical blog post turns your otherwise important observations into a Chicken Littleish, end-of-the world rant.

Nelson69 · on July 6, 2011

I kind of branded it a bit "douchey" at first too but then as I thought about it, it seemed remarkably restrained considering he debugged this issue. It's not like this happened all the time, had to get kind of lucky and build and calibrate a system just right to capture it.

I don't intend this to be an inflammatory question, I'm sort of a perpetual ruby novice, it's never been my day job and I've never managed to sort of catch up with the community, as soon as I feel pretty good with something I find it's been obsoleted a couple times. I like it but how does the community at large deal with stuff like this? This guy found a real bug and invested some time in it, do other rubyists just deal with crashes and restart their stuff? Do they just consider it part of "being on the cutting edge?" Or do they not even notice?

msbarnett · on July 6, 2011

In practice crashes due to this issue simply do not occur very often. I think I've had the VM segfault twice in the last two or three years.

That's what makes the hyperbolic tone of this article so douchey; he wrote up an interesting dissection of an edge case issue as though it were an ongoing catastrophe, mostly just to inject a bunch of chest-thumping rock-star bravado that added nothing of value to the discussion.

knewter · on July 6, 2011

I've actually had an ungodly metric ton of ruby segfaults in the past month or so, and almost never before that. At least one of them has definitely been GC-related - see "therubyracer is not thread safe" for one problem I've been running into. You also have to use PassengerSpawnMethod conservative to avoid GC-related failures in passenger with rails 3.1.

I'm not sure if those are both related to this or not, but I've had drastically more segfaults lately than in my past 6 years of ruby programming. It's getting pretty bad imo.

riffraff · on July 6, 2011

but how much of that is the interpreter's fault?

I know I can't run typhoeus + thin on 1.9.2 on OSX as it reliably crashes every ten minutes and I have no clue on how to debug it, but it is not a problem with the interpreter, it's a problem with external libraries.

davesims · on July 6, 2011

Agreed. As pointed out by others, this is a perennial problem with GCs in general. Obviously it doesn't happen very often, otherwise the business risk would have forced Matz or the Ruby community to fix it, else no one would have ever deployed on Rails, because, you know, there's that Fatal Ruby VM Bug that crashes your applications constantly and costs people lots of time and money.

The analysis was good, but the tone was ludicrous. It sounds far too much like: "Hey! everyone in the world should abandon MRI because of a bug I found!! That's right, me!!"

Nelson69 · on July 6, 2011

I get that. It's probably related to how many libraries you use and a lot of other things? There might be pathological ways to make it happen more frequently. It all depends on how and when it happens though.

msbarnett · on July 6, 2011

Well, we're using a large number of gems, a number of which rely on native code.

This issue is no doubt a pain in the ass for gem authors to debug, but it's definitely not something that library users are running into with any sort of frequency.

ice799 · on July 6, 2011

You are totally missing the point, bro.

The question really is: how much data corruption is occurring that _does not_ cause world ending segfaults? THAT is what you need to worry about. Check yourself before you wreck yourself.

msbarnett · on July 6, 2011

That is some amazingly selective memory corruption, homeboy. How whack is it that it is causing neither widespread segfaults nor widespread reports of creeping data corruption despite these VMs having logged billions of hours of CPU time in production all over the planet, dawg?

And yo check this: maybe this "fatal flaw" is actually just an edge case bug that isn't cropping up much in practice. Fo'shizzle!

And maybe we can drop the ridiculously asinine slang and douchey bravado, "bro".

ice799 · on July 6, 2011

the gzip bug mentioned did not segfault. it simply corrupted the gzip file in memory. not that selective son.

imma talk the way i talk and dont give a fuck if you like it or not.

kingkilr · on July 5, 2011

I think this goes to a pretty simple point: anything you have to do by hand you will eventually get wrong. Thus, to a first approximation anything that can be automated, probably ought to. To show off this principle I'm going to show off some of the PyPy source code: https://bitbucket.org/pypy/pypy/src/default/pypy/module/sele...

This is the implementation of `select.epoll`. Somethings you'll notice there's no GC details (allocations outside the GC of C level structs are handled nicely with a context manager), and we have a declarative (rather than imperative) mechanism for specifying argument parsing to Python level methods, this ensures consistency in readability as well error handling, etc.

on July 5, 2011

[deleted]

kingkilr · on July 5, 2011

Nope, wrapped values are interpreter level objects, they're the kind of things that exist at the Python level, in PyPy they're called things like W_IntObject, on CPython they're PyIntObject, I'm sure Ruby has the same. Then there are unwrapped ints which are machine level integers.

wingo · on July 5, 2011

Cute. The Boehm-Demers-Weiser collector has GC_reachable_here for this reason. Guile has scm_remember_upto_here since before it switched to libgc. I'm sure other systems have their things too.

That said, I like Handle, the RAII thing that V8 uses. It also allows for compacting collection. Too bad C doesn't do RAII.

thibaut_barrere · on July 5, 2011

.Net has GCHandle [1] and I believe the JVM calls to JNI have a similar mechanism (GetXXCritical [2])

[1] http://www.shafqatahmed.com/2008/05/memory-control.html

[2] http://publib.boulder.ibm.com/infocenter/javasdk/v5r0/index....

onedognight · on July 6, 2011

While C doesn't support RAII, gcc does: https://secure.wikimedia.org/wikipedia/en/wiki/Resource_Acqu...

wonnage · on July 5, 2011

Can someone dissect this a little more? My understanding is the pointer to str never gets written to the stack, and so str on the heap might get freed before zstream_append_input makes use of it. But how could the GC see this/what is the faulty assumption?

eonwe · on July 5, 2011

My understanding is that Ruby GC just runs through its heap of Ruby objects and sees which of them are reachable based on other objects in the Ruby heap and C-stack/registers.

Faulty assumption seems to be that counting references only to RVALUEs (Ruby objects in heap) is enough to determine if a part of memory can be freed. This breaks down in C-extensions where macros extract some part of the object or something pointed by it for use. In this case RSTRING_PTR extracts the C char-array used by str for zstream_append_input to use (lets call it arr).

If zstream_append_input or any calls underneath it tries to allocate a new Ruby object, GC may get called and str (and thus arr) may get freed because there are no references left to it anymore (no heap/stack/register because the register value was overwritten).

And this seems to require all Ruby C-extension writers to lock the objects they're using through macros with RB_GC_GUARD.

Edit: note that there are no references left to str

fhars · on July 5, 2011

The point is that the GC cannot see that and so assumes that the object is no longer referenced and can be freed. A conservative collector works by scanning the live memory of the process for things that look like pointers into the same live memory and then assumes that all objects that are not the target of any of these pointers are garbage. Tough luck if the only reference to a live object lives in a register.

ice799 · on July 6, 2011

registers are scanned, too. the bug is not that the ref is in a register. the bug is that there are no refs anywhere. not on the stack and not in any register.

yellowredblack · on July 6, 2011

This statement confused the heck out of me (wow! magic free memory) but of course, the pointers are being held to the contents of the memory, just not to the start of the object, which is what the GC cares about.

Perhaps the GC could be modified to track pointers not just to the head of object but to any address within it. Alternatively, C-coders working with Ruby could just say "I'm using this gc object" before calling C code.

I don't see this is a fatal flaw at all. Sounds like its just a bug. Now if, as many here assert, this bug is present all over the Ruby VM, then that's pretty unfortunate. Is that the case, or just hyperbole?

anaisbetts · on July 5, 2011

So, what this really seems to boil down to, is:

The Ruby C API is returning objects that are not correctly reference-counted for a short period of time and are incorrectly subject to GC.

This doesn't seem fatal to me, just not reasonably fixable from the GC side. It might be true, that a new API is needed to hold refs in the C side.

benblack · on July 5, 2011

I am apparently in that foolish minority that believes language runtimes should not segfault/corrupt themselves while running correct code. That this problem requires significant effort just to hack around, while actually fixing it would take a major architectural change, is what elevates this from mere "lolwut?" to fatally flawed. There are good alternative runtimes for Ruby, such as the JVM and the CLR, that do not suffer from this problem. Y'all should use them.

Funktacularly yours,

Lil' B

pmjordan · on July 5, 2011

I can crash a JVM or CLR program instantly by calling out to some careless C code. This bug is exactly such an instance: the C code for one of the library functions is flawed. The only way you can stay safe is by (a) having a flawless VM and (b) never calling out of it. The former is extremely unlikely, the latter extremely impractical as it inhibits any kind of I/O.

davesims · on July 5, 2011

If edge case segfaults were fatal flaws Windows should never have shipped. I say 'edge case' because obviously there are millions of lines of Ruby code running for years on MRI/YARV/REE that have not encountered this error often enough to cause the kind of breathless panic you seem to think is appropriate.

BTW the CLR is not a good alternative runtime for Ruby, might not ever be: http://www.zdnet.com/blog/microsoft/whats-next-for-microsoft...

You did good work here -- don't hurt your credibility with overstatement.

jjore · on July 5, 2011

Well, the problem here is that C using gems are going to often be memory corruptingly buggy until and unless either the gem source is updated to declare the proper parts volatile or Ruby's own C API is reworked to evolve this bug out of existence and then gems would have to be updated to use the API anyway.

Both problems are hard and the current state of affairs is apparently some random amount of the time we'll get memory corruption bugs.

KirinDave · on July 5, 2011

It's worse than that. We don't actually know where it occurs. There are clearly some gems where it does, but it could also be occurring elsewhere in the VM.

Just figuring this out is a non-trivial project.

on July 5, 2011

[deleted]

epochwolf · on July 5, 2011

MRI = Matz's Ruby Interpreter = Ruby 1.8

YARV = Yet Another Ruby VM = Ruby 1.9

REE = Ruby Enterprise Edition = Ruby 1.8 with a modified garbage collection system to make it friendly to vforking.

These are all versions of ruby written in C.

CPlatypus · on July 6, 2011

Damn, your reply and the other four pointing out that they're Ruby VMs reflects very poorly on the intelligence of HN commenters. Did you guys even RTFA? I know these are Ruby VMs, but the article is about the C code that's used within them. That's the only place where "volatile" has any meaning at all. Maybe "volatile" really is unknown among Ruby programmers, but among the people who implement interpreters for Ruby or any other language I can assure you it's pretty common knowledge.

I confess, I just don't know how to deal with such epic stupidity more gracefully than this. Sheesh.

lusis · on July 5, 2011

The article covers it but those are different C-based Ruby VM implementations:

- 1.8 is MRI

- 1.9 is YARV

- 1.8 plus performance patches is REE

nkassis · on July 5, 2011

Ruby virtual machines. MRI is the VM for ruby 1.8 (Matz ruby implementation and Matz is the creator of ruby), YARV is the newer incanation that was developed for ruby 1.9 these are all written in C. There plenty of other ruby VMs like rubinius or jruby etc.

judofyr · on July 5, 2011

MRI - Matz' Ruby Interpreter (the default Ruby)

REE - Ruby Enterprise Edition (unofficial branch of Ruby 1.8)

YARV - Yet Another Ruby VM (the VM used in Ruby 1.8)

(Although in this case I think he means MRI = 1.8 and YARV = 1.9)

ssmoot · on July 5, 2011

Minor correction, YARV, written by ko1, has never been a part of the Ruby 1.8.x line to my knowledge. Some language features, (mostly stdlib I think) were back-ported in 1.8.7, but 1.8.x is and always has been an interpreter where 1.9.x has always been YARV (a VM).

Useless trivia: Once upon a time Ruby2 was going to be called "Rete" IIRC. Or maybe "Rite"? I doubt it's in any shape to be called a "formal" plan at this point, and who knows if it'll ever actually see the light of day. It was supposed to drop optional parens IIRC, it's even in the original Pickaxe I think, but I doubt that's still on the board. Don't remember what else.

knewter · on July 6, 2011

It was "Rite." Now Matz is calling the embedded ruby he's working on with some japanese electronics manufacturer by the name "Rite." RubyConf 2010 keynote covered it in detail.

judofyr · on July 10, 2011

Damn, that's a pretty big typo. You're absolute right. Can't change it now though :(

grifaton · on July 5, 2011

They're Ruby interpreters -- Matz's Ruby Interpreter, Ruby Enterprise Edition, and Yet Another Ruby VM respectively.

CPlatypus · on July 6, 2011

"Very few people out there know that the volatile type qualifier exists"? Only if there are "very few" kernel programmers, embedded programmers, and others who have used C for anything low-level and/or multi-threaded. Otherwise, no. Sorry, but knowing about it doesn't make you special.

"Volatile" is the wrong fix, by the way. That's just depending on yet another non-required behavior. There is in fact no further reference to "str" between the function call and the reassignment at the start of the next iteration, so there's nothing for "volatile" to chew on. This particular version of this particular compiler just happens to add an extra pair of stack operations in this case, but it's not truly required to. A real fix would not only mark the variable as volatile but also add a reference after the function call. The same "(void)str;" type of statement that's often used to suppress "unused argument/variable" warnings should count as a reference to force correct behavior here.

softbuilder · on July 5, 2011

Well plus one for a blog post with a theme song, anyway.