This is nitpicking, but it always bothers me: "mb/s" is an abbreviation for "millibits-per-second". You meant "MB/s", although this wasn't obvious; if you hadn't mentioned maxing out a GigE connection, it could have been interpreted as "Mb/s", which is short for "megabits-per-second".
Actually, that's not a very good argument. The answer to that is "to you, maybe". The better argument is this one: "Must worry about I/O occurring in all function calls. (They might call wait().) The user needs to make their functions coroutine safe!" I think this is the reason why coroutines are more popular in functional programming languages where side effects are limited by style or by enforcement of the language itself.
That is not a good argument either. If you are going to write your entire IO library in asynchronous style, like node.js does, then you could easily make all IO routines coroutine safe. In fact, you have exactly the same problem whether you use coroutines or not. If you have a nice asynchronous program and I call wait() in the middle, that's going to hurt you in the same way.
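For example, the same hazard shows up in plain callback-style node (a minimal sketch using node's http and synchronous fs APIs):
var http = require('http');
var fs = require('fs');
http.createServer(function(req, res) {
  // a synchronous "wait" in the middle of an async program: this blocks
  // the event loop and stalls every other connection, coroutines or not
  var data = fs.readFileSync('/tmp/big.file');
  res.end(data);
}).listen(8000);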
You don't really need to know how to use the full power of coroutines if you just want asynchronous operations. You just need to know that read() may block the current coroutine.
You could even be more explicit about the whole thing: use futures, and only allow blocking on them. The above example would go something like this if the reads can't be executed concurrently:
var future_x = read();
var x = future_x.wait();
var y = read().wait(); // shortcut
write(x + y);
Or something like this if the reads are independent:
var future_x = read();
var future_y = read();
// one of a handful of functions that can "block", all operating on futures:
waitForAll(future_x, future_y);
write(future_x.get() + future_y.get());
Using futures rather than implicit suspension has the added advantage of being able to pipeline independent reads just as you can with callback-style asynchronous I/O.
You can already implement[1] an approximation of this in terms of callbacks, but it doesn't look quite as nice, e.g.:
var handler = new AsyncHandler();
// independent, pipelined reads
readAsync(handler.cb());
readAsync(handler.cb());
handler.whenDone(function(x, y) {
  write(x + y);
});
It gets substantially uglier than that if the dependencies aren't so straightforward, e.g. A, B & C are independent, D depends on A & B having completed, and the last part of the code requires the results from C & D. Futures do much better in that sort of situation.
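To illustrate, here is roughly how that A/B/C/D case could look with the hypothetical future API sketched above (readA()/readB()/readC() and computeD() are made-up stand-ins):
var future_a = readA(); // A, B and C kick off independently
var future_b = readB();
var future_c = readC();
waitForAll(future_a, future_b); // D needs A and B
var future_d = computeD(future_a.get(), future_b.get());
waitForAll(future_c, future_d); // the final step needs C and D
write(future_c.get() + future_d.get());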
I was just throwing ideas out there, not really thinking it through. :) Although you're absolutely right about waitForAll() in that example, there is a point to that sort of function - say, if you wanted to add a timeout. Or you could have a waitForAny() function, useful if you only need one of the futures to finish before proceeding.
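A timeout could be sketched with the same hypothetical API (timerFuture() is made up here - just a future that completes after the given delay):
var future_x = read();
var timeout = timerFuture(5000);
waitForAny(future_x, timeout); // resumes as soon as either one finishes
if (future_x.isReady()) {
  write(future_x.get());
} else {
  // timed out before the read completed; give up or retry
}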
get() vs. wait() is admittedly a question of preference. Personally, I'd keep them separate, and even go so far as to have it warn you if you called get() on a future without a prior wait() or a successful isReady(). It keeps the suspensions explicit and the intentions clear. It's a bit like explicit vs. implicit transactional systems.
Looks like it's essentially the same type of programming model, although I have a feeling WaitHandle.WaitAll() just blocks the current thread. In the ideal case, it would internally call the event loop coroutine and process other events instead of sending the thread to sleep. Thread scheduling involves system (kernel) calls; coroutines are purely userspace, just like node.js's async callback mechanism.
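That ideal case might look something like this sketch, where waitForAll() pumps the event loop in userspace instead of sleeping (runEventLoopOnce() is an assumed primitive, not a real node API):
function waitForAll() {
  var futures = Array.prototype.slice.call(arguments);
  function allReady() {
    for (var i = 0; i < futures.length; i++) {
      if (!futures[i].isReady()) return false;
    }
    return true;
  }
  while (!allReady()) {
    runEventLoopOnce(); // process other pending events; no kernel-level sleep
  }
}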
Yes, other languages might be more suitable for coroutines.
But then again, Erlang is probably a language more "suitable" for high concurrency programming, but node's goal is to make writing scalable network programs possible for everyone. One might call it the PHP of concurrency : )
The statements made about co-routines/cooperative threading and the stated reasoning for removing them make me wonder whether the decision to remove them was based on benchmarking and experimentation or just on taste. In my experience they simplify many forms of asynchronous logic tremendously and can have performance benefits as well, as long as you're not forced to use them for everything.
I suppose someone who wants them in node.js can just reimplement them as an extension.
> Basically callbacks and coroutines don't play well together.
Can you elaborate on this? I thought the opposite: coroutines play very well with callbacks. Callback libraries force you to write programs in continuation-passing style. Coroutines let you write asynchronous programs in normal style. Let's assume we have an asynchronous readAsync(callback) function that reads a number and calls the callback with the input when it's ready. Now you're writing your program like this:
var x = read(); // may block this coroutine, but *not* the entire program
var y = read();
write(x + y);
This can be achieved with something like:
function read() {
  var coro = getCurrentCoroutine();
  var result;
  var callback = function(x) { result = x; coro.resume(); };
  readAsync(callback);
  yield; // suspends the current coroutine until it is resumed
  return result;
}
It seems to me that this is a general way to adapt any callback-style library to coroutines.
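For instance, the pattern generalizes into a single adapter (a sketch only - getCurrentCoroutine() and the bare yield are the same assumed coroutine primitives as above):
function awaitCallback(asyncFn) {
  var coro = getCurrentCoroutine();
  var result;
  asyncFn(function(x) { result = x; coro.resume(); });
  yield; // suspend until the callback fires
  return result;
}
// usage: var x = awaitCallback(readAsync); var y = awaitCallback(readAsync); write(x + y);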
If you implement coroutines as a state machine, they integrate just fine with callback-oriented APIs. But perhaps that was the problem, since the PDF mentions stack swapping.
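To make that concrete, here is the read/read/write example from above compiled by hand into a state machine, with no stack swapping required (a sketch reusing the assumed readAsync() from earlier):
function readAddWrite() {
  var state = 0, x, y;
  function step(value) {
    if (state === 0) { state = 1; readAsync(step); } // kick off the first read
    else if (state === 1) { x = value; state = 2; readAsync(step); } // got x, start the second read
    else { y = value; write(x + y); } // got y, finish
  }
  step();
}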
I think the reason multipart works that way is so you can stream data that you don't know the full length of beforehand.
But afaik, that's pretty much never the case with file uploads, unless you are uploading a file that is still growing in size - so yeah, it's annoying : ).
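For reference, the wire format shows why streaming works: each part ends at the next boundary line rather than at a declared per-part length:
--AaB03x
Content-Disposition: form-data; name="file"; filename="upload.txt"
Content-Type: text/plain

...file bytes, length unknown until the next boundary line...
--AaB03x--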
('felixge2' because my other account is in noprocrast mode: )
Yeah, you could -- I wonder what would happen if the client gets it wrong or is deliberately dishonest? Not trusting the client is a big part of writing an open server and this seems like you would have to trust the client in a big way.
Well, it's not a big problem - you should have a timeout on incoming connections, and node is pretty well-suited for having lots of "hanging" connections (I ran some tests with 56k active connections).
If a connection is closed, either by a timeout or by EOF, you simply check whether the promised content length matches the count of received bytes - if not, you should probably discard the whole thing (unless you're specifically supporting broken clients).
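In node-ish terms the check could look something like this sketch (the 'data'/'end' event names follow node's stream convention; discardUpload() is hypothetical):
var received = 0;
var expected = parseInt(request.headers['content-length'], 10);
request.on('data', function(chunk) { received += chunk.length; });
request.on('end', function() {
  if (received !== expected) {
    discardUpload(); // the promised length wasn't met: drop the partial data
  }
});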
When parsing boundaries, you know the first character is going to be a hyphen (-) and the last character is going to be a newline. Wouldn't it be easier to search for hyphens, then read until you see a newline, and then compare to the boundary? Boundary characters are typically random printable characters, so you might be doing more work than you need to.
The rub is "search for hyphens": you have to look at every character until you find a hyphen; the point of Boyer-Moore is that you don't even have to look at most of the characters at all.
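A rough sketch of the idea (the Boyer-Moore-Horspool bad-character rule, operating on strings for simplicity): compare each alignment from its last position backwards, and on a mismatch skip ahead by up to the whole pattern length based on the single character you inspected.
function indexOfBoundary(haystack, boundary) {
  var m = boundary.length;
  var skip = {};
  for (var i = 0; i < m - 1; i++) {
    skip[boundary.charAt(i)] = m - 1 - i; // smaller shifts for characters inside the pattern
  }
  var pos = 0;
  while (pos + m <= haystack.length) {
    var j = m - 1;
    while (j >= 0 && haystack.charAt(pos + j) === boundary.charAt(j)) j--;
    if (j < 0) return pos; // full match
    pos += skip[haystack.charAt(pos + m - 1)] || m; // unseen characters skip the entire boundary length
  }
  return -1;
}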
If I can get a revenue stream of approximately $1000 a month, my entire website will literally be a call to a few different web services and some fancy-shmancy CSS files. God I love hacker news! Ok fine, there will be some of my code in there, hopefully not for long, because soon there will be a tool to solve any other problem as well :P As long as these don't get too pricey.
</pedantry>