It’s always crazy to me that we have picked encodings for data that are human-fi...

p_l · on June 13, 2024

Considerable portion of that was early ARPANET work often involving somewhat lacking access to hardware, so I'm it's formative ages internet had a lot of directly attaching teletypes to network ports.

Also one can't forget about TIPs, which provided "phone modem to telnet-like" bridge service in ARPANET.

Another part was how text was the one thing that people had standardise enough between different computers. FTP's ASCII and binary modes aren't about love conversion for Unix, but because "binary" meant totally different things on different hosts (could be 8bit octets, could be 28bit words, could be 36bit words, could be 60 bit, before internet fully settled down there were 40 bit hosts too).

Also people are scared of compilers.

All of that led to cultural bias towards textual protocols

btown · on June 13, 2024

There's a parallel universe where the right person on the Google Chrome Network Tab team (or, earlier, the Firebug team) foresaw this a decade ago, and resolved: "we will make it possible for any developer to plug in a custom binary parser in the Network Tab, able to access setup files from the project at hand, and do so in a safe and sandboxed way that's easy to deploy to colleagues." Then a billion binary formats would have bloomed. But that's not the world we ended up in.

klysm · on June 13, 2024

I think that’s really the difference though: tooling. That’s all you need to make it easy

nasretdinov · on June 14, 2024

> That latency is experienced everywhere too.

Don't confuse latency and bandwidth though. Most of those messages are relatively small, so they don't contribute to (network) latency almost at all. Plus gzip exists, further reducing the amount of data transmitted, thus both reducing latency and improving bandwidth utilisation.

Also usually when it comes to cases where text would be actually a bottleneck (e.g. images, videos, audio, etc), binary formats are preferred and work very well, and generally you can tolerate e.g. images or audio being slightly broken, so you don't need to debug those formats too frequently. It's a nightmare do debug them though.

klysm · on June 23, 2024

I’m not confusing latency and bandwidth. Serialization does contribute to latency everywhere, you have to wait for the decoder or encoder. It’s another step

hinkley · on June 13, 2024

It's about agility (feedback loops), and enough of the problems with it as a transport mechanism can be addressed by transport encodings that we put up with the rest.

Is there a similar format that is more amenable to SIMD but has similar ergonomics? That remains to be seen. But if someone makes a compelling case then I'm sure that not only could I be convinced but I could convince others.

Code is meant to be read by humans, and only incidentally by computers. Transport formats are the same thing. HTTP is an even worse format than JSON, and we have really done very little to change its line format in 35 years. It's adequate to the task.

klysm · on June 13, 2024

It not changing is not an indication of its adequacy. It’s merely an indication of backwards compatibility and lock-in effects. It’s not practical to change it even if we did have something better.

AgentOrange1234 · on June 14, 2024

Aren’t “all” the servers and browsers speaking in gzipped html to one another anyway?

Couldn’t we do a similar kind of invisible conversion if we had some format better enough to be worthwhile?

klysm · on June 23, 2024

It would be way better to use a binary representation of the structure

jvanderbot · on June 13, 2024

From the moment I understood the weakness of my encodings, it disgusted me.

pquki4 · on June 13, 2024

JSON is usually used for front end-back end communication or public API endpoints, otherwise protobuf/Thrift/Avro is commonly used in the backend for internal services (that is controlled by one organization), for very good reasons. Same for HTML -- you need to thank HTML for being able to read hacker news frontpage on a 10 year old kindle with a barely usable browser. I suggest you look all these up before complaining about nothing.

klysm · on June 13, 2024

I don’t think it’s very good reasons. We could definitely have binary first serialization protocols and good tooling built into the browser. But no, we encode everything as text, even binary stuff as base64, and whack that into strings

pquki4 · on June 15, 2024

There is nothing preventing anyone to build a whole new set of infrastructure for such "binary first serialization" 20 years ago, 10 years ago or today. We don't even need to do that much. Instead of "text/html" or "application/json", let's just use some binary format in the request headers everywhere and make both the client and server support it. Why hasn't that happened?

It's for the same set of reasons, and people aren't dumb.

geraldwhen · on June 13, 2024

All of those are slower than json in many contexts. JSON parsing and serialization is very fast!

klysm · on June 13, 2024

In what context is json slower than protobuf?

vitus · on June 14, 2024

Did you mean to ask the reverse question (in what context is protobuf slower than json)? Because that's definitely the question on my mind, since GP's assertion runs counter to my expectations and experience.

JSON is a heavy-weight format that requires significantly more memory for both serialization and deserialization, and the representations of values in JSON are optimized for human consumption rather than machine consumption.

Just listing a few examples:

- Strings in JSON require you to scan for the terminating quotation mark that isn't escaped. Meanwhile, in protobuf, the length of the string is given to you; you can just grab the bytes directly.

- Parsing an integer in JSON requires multiplication by 10 and addition / subtraction for each digit. Meanwhile, in protobuf, fixed64 and related types are either in host order or are just a ntohl away; int64 and other varint types only require bit twiddling (masks, shifts, etc). Do you think it's easier to parse "4294967296" from a string, or 5 bytes along the lines of {0x88, 0x80, 0x80, 0x80, 0x00}?

- Having a format agreed-upon ahead of time (protobuf) means that your keys don't require more string parsing (JSON).

anonymoushn · on June 14, 2024

The benchmarks available for protobuf generally have it parsing like 5x slower than json (and I suspect the payloads are smaller, but not 5x smaller). I don't think that the code generators shipped with protobuf generate parsers of comparable quality to simdjson, so it's a bit unfair in that sense.

vitus · on June 14, 2024

Can you point to some of these benchmarks? https://news.ycombinator.com/item?id=26934854 suggests that in at least one synthetic benchmark (with a 7.5KB protobuf message which expands to a 17KB JSON payload), protobuf parsing at 2GB/s would be comparable to JSON parsing at 5GB/s.

Meanwhile, simdjson's numbers (https://github.com/simdjson/simdjson/blob/master/doc/gbps.pn...) show a peak parsing speed of about 3GB/s depending on the workload. Of course, it's not clear you can compare these directly, since they were probably not run on systems with comparable specs. But it's not clear to me that there's a 5x difference.

Perhaps my experience differs because I'm used to seeing very large messages being passed around, but I'd be happy to reconsider. (Or maybe I should go all-in on Cap'n Proto.)

anonymoushn · on June 14, 2024

> - Parsing an integer in JSON requires multiplication by 10 and addition / subtraction for each digit. Meanwhile, in protobuf, fixed64 and related types are either in host order or are just a ntohl away; int64 and other varint types only require bit twiddling (masks, shifts, etc). Do you think it's easier to parse "4294967296" from a string, or 5 bytes along the lines of {0x88, 0x80, 0x80, 0x80, 0x00}?

For this one, actually I think the varint may be harder because you have to parse it before you know which byte the next value starts on, but recently there has been some progress in the area of fast varint parsers. For parsing decimal numbers, a good start is here http://0x80.pl/articles/simd-parsing-int-sequences.html. Some users at https://highload.fun/tasks/1/leaderboard are calculating the sum of parsed base-10 integers at about the speed of reading the string from RAM, but this task is subtly easier than parsing each number individually, which may only be doable at half or a quarter of the speed of reading the string from RAM, and then you'd have to pay a bit more to also write out the parsed values to another buffer.

vitus · on June 14, 2024

From the intro of your first link:

> While conversion from a string into an integer value is feasible with SIMD instructions, this application is unpractical. For typical cases, when a single value is parsed, scalar procedures — like the standard atoi or strtol — are faster than any fancy SSE code.

> However, SIMD procedures can be really fast and convert in parallel several numbers. There is only one "but": the input data has to be regular and valid, i.e. the input string must contain only ASCII digits.

There definitely are some benefits and speedups available with SIMD, but that intro doesn't inspire a whole lot of confidence in its relevance to JSON parsing, where the only case where you might have this regularity is if you definitely have an array of integers. (JSON strings are not restricted to ASCII, as they can and do include Unicode.)

anonymoushn · on June 14, 2024

I think you'd have to pay some additional copies to perform batch processing of integers in json documents in the general case. Last I checked simdjson included the typical scalar code for parsing base-10 integers and a fairly expensive procedure for parsing base-10 doubles (where most of the runtime is paid in exchange for getting the final bit of the mantissa right, which was not reasonable for our use case but is reasonable for a general-purpose library).

That said, it's not clear to me that the scalar integer parsing code should win even if you're only parsing integers individually. For inputs that have the length of the number vary unpredictably, it pays a significant amount of time for branch misses, while the vector code can replace this with a data dependency.

Edit: After writing the above, I thought that probably most documents have a regular pattern of number lengths. I don't know if this works well with branch predictors if number of branches in the pattern is pretty long (in terms of the sum of the lengths), but probably the branches cost ~nothing for a lot of real-world inputs.

geraldwhen · on June 14, 2024

Run benchmarks. It doesn’t matter what I think is easier to parse.

Protobuff is slower than JSON.parse in node by 4-8x for my data sets: large reference data needed.

I was only measuring decode time, since that’s I can recompute my data well in advance.

698969 · on June 14, 2024

In browser Javascript, because there you have to load the decoder which is already slower to parse than JSON, then run it in the JS vm to do the actual decoding whereas JSON is built in and has a native highly-optimized parser.

bigbuppo · on June 13, 2024

So... what you're saying is that we should do what cobol did in 1960?

dingdingdang · on June 13, 2024

I tend to agree here, judiciously used json (these days often running under zstd compression) is seldomly the bottleneck on anything and allows immediate human debugging if need be.

hinkley · on June 13, 2024

What we actually implement is often more constrained by what we can prototype and experiment on than how fast or how well we can define formal requirements and implement them.

So much good stuff in software is down to a mix of serendipity and 'what if' and anything that reduces friction (improves ergonomics) has my vote.

klysm · on June 13, 2024

Bottleneck suggests optimization for throughput. I care about latency more

intelVISA · on June 14, 2024

Most (good) orgs use zero copy binary serde in their backends, JSON is only for end users.

3836293648 · on June 14, 2024

Are you using serde as a general term here or are you specifically referring to the rust library? The former would make sense but I've never heard it used that way before.

intelVISA · on June 16, 2024

Former, I'm too lazy to write serialization/deserialization :P

akira2501 · on June 14, 2024

We did that. You're welcome to use ASN.1 or some restricted subset of it if you want, and people did for quite some time, but it's brittle and inflexible nature and it's inability to be quickly edited or observed during edit-test-debug cycles deprecated it the minute we had something human readable that could reasonably replace it.

In any case.. computing is entirely about serving human needs.. early computer science sort of missed the boat on that point.