> Understanding we are memory-bound is crucial, as it directs our efforts towards reducing memory consumption in order to accommodate more users on the server.
While this is true, it bears mentioning that in a service handling many requests, they all smear together memory-wise, so _any_ latency savings can also ease memory pressure (a memory allocation has an area: bytes × time). So, for instance, it can make sense to minimize memory held across an RPC boundary, or even to perform CPU optimizations if they would improve latency.
> e.g. examining a memdump rather than just measuring overall memory utilization
A flame graph would also have highlighted this (without the need to look at what might be user data), and is usually where I start looking for memory optimization. I mean, if you don't know what code is allocating lots of memory, where would you even start anyway?
Additionally, I've found that looking at outliers, while a good idea, can sometimes lead to wild goose chases if you're not careful. There can be pathological edge cases which are rare enough to not actually matter in practice.
This service keeps around a lot of long-lived per-session state, which leads to a very different memory profile from a typical web service, like what you're describing. Your point is valid in general, but doesn't seem to be relevant here.
Ironically, not long ago there was an article about how hard string handling is in C, so programs tend to be written to avoid manipulating strings as much as possible. In other languages, where strings are easy to use, it encourages inefficiency like this. The term "stringly typed" also comes to mind.
JSON truly is the new XML, both in terms of advantages and disadvantages. I wish people would stop using it for everything and realise that using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.
That's a massive exaggeration. XML is both much more complex (and bug-prone) to parse, and wastes more space.
And the problem here wasn't JSON in the first place anyway, it was
> The “back” stack is unlimited, meaning we’re saving more and more state until we explode
And storing the previous state as a serialized string inside the current state, instead of just storing it natively. You'd have the same problem if you tried to serialize <format> into a field in <format>
> And the problem here wasn't JSON in the first place anyway
Yes it was. They stored state in a huge JSON dump which ended up working like a zip bomb by the way they had to escape a string. This alone resulted in an exponential increase in memory that quickly became a problem.
Think about it for a second: would they experience this issue if they stored an in-memory stack per user? They wouldn't.
Heck, it seems that they could live with their JSON choice if they replaced '/' with some other character that doesn't need escaping, and converted it back to '/' when reading.
XML would also have a problem with this: you'd replace < with &lt;, then with &amp;lt;, then with &amp;amp;lt;, … It's not exponential, but still inefficient.
Inefficient is not the same as problematic. They turned what otherwise would be a linear increase in memory usage into an exponential increase, which expectedly resulted in memory exhaustion problems.
There are plenty of problems to go around in this approach and the overall system design, but the weakest link is undoubtedly how JSON was used.
I'd argue that JSON is one of the least Human Readable serialization formats.
It lacks comments. Has no way to offer context. Cannot impose limitations, and normalizing needs additional, often poorly supported RFCs (JSON-LD or pointers).
It also is a poor non-human-readable format, as it lacks crucial types (datetime, decimal), is rather verbose (just imagine the terabytes of bandwidth wasted globally on all those commas and curly braces), and cannot be parsed streaming.
Agreed. Lack of comments and lack of proper schema enforcement makes JSON a junk pile of a data format.
Definitely not an improvement over XML. Hopefully the next popular standard data format will learn from the past and take the best of previous attempts.
JSON itself (RFC 8259) has no schema, nor any mention of one. There are several additional RFCs that allow adding a schema or other semantics or limitations, though. But JSON itself has no schemas.
Decimal is there, isn’t it, as part of ‘number’, which lumps together fixed-size integers, IEEE floats and doubles, bigints and bigfloats, and leaves it to the parser to make the right guess as to how to interpret them (often assisted by having the programmer specify a target data type)?
Lumping all those together as ‘number’ makes for a much simpler grammar, but I think that’s not worth it for its negative effects on interoperability. How many json parsers can (easily) read a value of 1234567890123456890E0000000001 and write back the same string to a json file?
Apart from datetime, I’d add duration as a basic type, too, even though that has the problem of being imprecise. Users will want to specify durations such as “1m” or “1y” (is that 365 days? 365.2425 to correct for leap years? Does it depend on interpretation by the user? is it a lunar year of 354/355 days as in the Islamic calendar? Etc)
The lack of comments isn't really a big deal for serialisation. Nobody is really going to bother adding comments when data is serialised. You could do that with XML if you wanted and I have never seen a single occurrence.
The issue is that JSON ended up being used as a configuration format too, and comments are critical there.
EDN at least provides date literals, UUID literals, and decimals. It also does not require commas to be littered everywhere. Unfortunately EDN is only really well-supported within Clojure. I'm not sure what is meant by "context problem" for JSON, but keywords and symbols in EDN can be namespaced, so that one can distinguish e.g. :human/name and :computer/name. However, it still requires you to write a schema. And due to lack of support in other languages, protocol buffers seem to be a much better choice.
2. I’m still not convinced it’s the file format’s job to encode every possible data type. We used to have deserialization layers for that, but for some reason, we don’t do that anymore. Nowadays we expect every file format to be self-describing, which I think is wasteful and limiting. But that’s a matter of opinion, obviously.
3. What do you mean by “cannot be parsed streaming”? Or to be perhaps more precise, what’s missing in existing streaming JSON parsers? I haven’t used any of them yet.
Many JSON parsers require the whole JSON file to be in memory. Parsers that allow parsing in a streaming manner are the exception.
I only found jsoncons to work reliably, but at the cost of speed compared to something fast like rapidjson. And even jsoncons might have the problem that large strings have to be fully in memory, i.e., if someone has the fun idea to store a 1 GB string inside a JSON, then you still have a potential out-of-memory event on your hands.
There are other attempts to fix this problem with the jsonl format but that format feels so ad-hoc that I'm not convinced yet. It stores one JSON object per line. Of course, this comes with its own problems such as each JSON object potentially still being unbounded and, now that lines are used to separate JSON objects, you cannot format those JSON objects in a human-readable way anymore.
Also, the first time I encountered a file like this, it simply had a .json extension and I was wondering why every JSON parser failed to parse that file. Because it isn't valid JSON and should have a separate extension.
Streaming parsers may be the exception, but are likely the correct tool to use when handling large inputs becomes a problem. Similarly, XML parsers evolved the SAX model (as opposed to DOM, where the document had to fit in memory.) It is often sufficient to rip through the file in an event-based fashion to extract what you want, and throw the rest away.
For cases where a single string is truly massive, I would lean towards defining a maximum size to prevent denial of service attacks.
3. Like the sibling comment said: most parsers need the entire file (in memory) before being able to start parsing. XML is just as bad here: both, technically, need the parser to see the closing part before being able to parse the chunk.
That one wasn't even down to JSON, but rather some C stdlib functions being linear time because of C strings (which then compounded into quadratic), right?
Developing and debugging systems with JSON (vs., say, protocol buffers) is much easier… configuration is rarely interfaced with by humans, vs. debug logs or dumps etc., which are always being looked at.
With something like protocol buffers, you have an explicit schema, parsing is well-defined, and you rarely (if ever) need to manually inspect the serialized representation.
It's not hard to read what isn't "human readable" anyway; it just takes a little bit of practice and learning. For example, I can "sight-read" x86 and Z80 Asm from a hexdump, and bits of ASN.1 and a few other proprietary protocols too.
Did I strike a nerve? It's unfortunate that so many seem to be under the impression that "human readable" is really a euphemism for "like English" and are narrow/closed-minded enough not to consider the possibility that people can learn just about any language with enough exposure and a bit of effort.
4f 6e 63 65 20 79 6f 75 20 73 74 6f 70 20 6c 65 61 72 6e 69 6e 67 2c 20 79 6f 75 20 73 74 61 72 74 20 64 79 69 6e 67 2e
You didn’t strike a nerve, but your comment is misguided.
Of course people can read binary formats. It is just that they read them more slowly, and it requires quite a bit of exposure/practice, so that people usually invest in tooling to simplify the process.
I suggest that when you see a comment like this, you ask yourself “if I change ‘human readable’ to ‘easily human readable’ would I need to offer my correction?”
>using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.
In the encode/decode library we use, the generated toString output was accidentally so close to JSON that we made it exactly so (mostly had to add quotes). Now we can use JSON tooling on top of efficient messaging.
How does the number of backslashes grow as the string is repeatedly escaped?
1 \
2 \\
4 \\\\
8 \\\\\\\\
We see it grows exponentially: to escape each backslash, you need two. So I guess I'm surprised that this is a post about a 1.5GB string and not about the exponentially increasing work on each navigation causing performance problems, or the session crashing from running out of memory when a user navigates too far. Maybe the service wasn't being used much yet, or there's some other reason that strings only grew to ridiculous sizes and not OOM sizes. I'm curious why it didn't show up sooner. (I saw a somewhat similar problem long ago, but then it was more like doing 'this.json = this.messages_array.to_json()' on every new message, so merely quadratic. I think it was noticed not long after we first had a lot of messages added to one of the objects.)
The solution to this problem has always been known: don't use the same escape character in different formats. With non-overlapping escape characters, you don't need to escape more than once.
So we see that in C, the escape character for strings delivered to the C compiler is \, while the escape character for strings delivered to printf is %.
It always boggles my mind that someone looked at this example of a working design and concluded that the escape character for a regular expression engine should be \, overlapping with the escape character for a literal string.
Computers are fast. I once worked on a service sitting at the top of a deep RPC stack. Stack traces were often folded into RPC error messages to help diagnostics. Well, if you have a failure at one layer, propagated to the next layer, and the next layer, and sometimes replicated for each looked up item (to support partial failure semantics), then you can end up with gigabytes of stack traces in memory for some fraction of time. Very hard to figure out until tasks started dying and leaving behind heap dumps during a wide-spread incident at a lower layer of the stack.
My back-of-the-envelope estimate for how long it would take to construct the 1.5GB string is about 200ms to 300ms (1.5GB written, 0.75GB read, plus extra from growing a buffer for the result; sequential R/W at 10GB/s, copying to resize the buffer at 20GB/s). Which is somewhat fast, but I'm still surprised that the latency increase wasn't noticed first.
More importantly, the time and memory doubles with each further navigation which means a user shouldn’t have had to navigate much farther to see much worse performance. Maybe this was a known issue that no one had debugged by then.
My best guess is that screens were arranged in a tree and users would navigate ‘back’ instead of ‘up’ so it was hard to get a long history.
I joined the team long after this was in production, but I can testify that the service has a lot of moving parts and attributing the reliability / efficiency costs for a specific issue was challenging.
A better fix could have been switching to some other (binary) serialization protocol (BSON or similar) - the problem here doesn't seem to have been the size of the history itself, but the exponential growth of the escaping.
This is reminiscent of some APIs I've had the great pleasure of needing to integrate with, where a SOAP API would take XML containing Username, Password, Method, Data... where Data was a string, itself containing barely-documented XML, but serialized as a string before being placed in the XML. You'd think that if they knew how to stand up a SOAP API, they'd know that XML is literally designed to be nested. Alas, such was not the case. I shudder to think about their security under the hood...
I remember dealing with a small ISV that was told to make an RPC API, so they used SOAP (via WCF). That would have been fine, except that every function took a single "string" parameter that was invalid XML that a normal parser couldn't read. You had to use their hand-rolled parser.
And then the whole thing was "encrypted"... with a static RSA key. I don't mean they used a stream cipher based on a hard-coded key, which would be bad enough. No, they used the RSA keys directly for bulk data encryption.
I flat rejected their solution, then was overruled as being "too picky", and this monstrosity went into production.
A recursive data structure is fine, if you implement the serialization in a sane way. The escaping of your escaped strings containing other escaped strings seems like a code smell.
Wow! So this is actually a drama I had a secondary role in.
A few remarks:
1. Team: Nitzan (the blog author, and an awesome dude!) was at the time the Production Engineer monitoring the top-line capacity metrics. The Java heapdump tooling was hacked together by E.A., with some tough bits by A.S., who's a force of nature. The bulk of the mitigation was carried out by Y.B. over several months. I was the person who analyzed a couple of memory dumps and framed URL strings as a worthwhile 80/20 goal.
2. The main difficulty in the mitigation project was that the pathological strings were accessed over hundreds of callsites using some semantics like
2.1. A lengthy build-up very carefully constructing an API to represent URLs in as compact a way as possible, and plugging it in where convenient,
2.2. Carrying out a massive automated rewrite at the source level ("codemod"), which is possible in Java but not for the faint of heart. I think he used some JetBrains tooling to get that going. I consider his work a tour de force.
I vaguely recall there being some modest CPU improvements in the process.
3. Organizational dynamics: the codebase was originally written by very competent people who did not work through that specific detail because it was not important for their original use case. However subsequently the code underwent several years of almost solely rewarding improvement in end-user metrics, leading to an overall inadequate state that was hurting the organization as it was hitting hard scaling limits. In fact, "engineering excellence" was not even a category for recognition at that time. In this climate I could not find my own voice as a SW professional and chose to quit after a little bit more than a year.
I have a pretty good memory, I'd be happy to give more context if appropriate, just ping me.
A.S --> E.A, no? :)
I decided to abstract away the URL part as it wasn't easily explainable and not important to the story. I figured the JSON-in-JSON part is interesting enough.
Anyway, the resulting memdump with many \\s remains my most audience-engaging slide to date, and I figured the non-company audience deserves to know.
Good times.
You're fully right! I added them both - E.A. was the original toolmaker, and A.S. worked more down the chain. He definitely was the person who walked me through the finer points of retrieving the data.
I wrote some stress-test cases for a Java library recently which involved 1-GB and 2-GB strings.
On the surface, a java.lang.String should be a wrapper around a char[], whose maximum length is Integer.MAX_VALUE = 2 147 483 647. This is roughly how older implementations of JDK did things.
But as of JDK 9 and JEP 254, String's private field has type byte[], and each character uses either 1 byte if it's encodable in ISO 8859-1 or 2 bytes if it requires UTF-16.
This means that strings containing only ASCII characters can be up to about 2 billion in length, whereas strings with real Unicode content can only be up to about 1 billion in length. This is a bit of an unfortunate regression in functionality.
I'm currently consulting with a company in the aviation industry that is struggling to adapt a legacy-bound culture to modern software engineering techniques. Among other things, they have no sense of systems-level concerns. I recently identified that one of the most common user activities ends up executing the same expensive end-to-end query at least 4 times per interaction, never memoizing the results. While there is a rudimentary cache in place, it has not yet struck them as a problem that the same interaction involves at least 3 network requests per query even when the cache is warm. As this system is currently only handling a fraction of the total traffic it will be expected to manage when it's fully live, it's clearly a slow-burning fuse that will explode in their faces when they try to make the switchover complete.
> I read a post about someone who found that their system had something like 1.2 GB strings full of backslashes because they were using JSON for internal state, and it kept escaping the " characters, so it turned into \\\\\\\\\\\\\\\\\\\\\\\\\" type of crap. That part was new to me, but the description of the rest of it seemed far too familiar.
> And I went... hey, I think I know that particular circus!
Ha, this is a nice coincidence. I was looking for a cool domain name a billion years ago and came up with Backslasher since I was just learning text processing in Bash/Perl.
My lifelong goal is beating the same-named horror film in search results.
Why not just restrict the history to a configurable number of entries? Having like 20-50 is probably enough. It might not be 100% correct, but the effort:use ratio seems good.
Meanwhile, Firefox has big problems when you give it a 6MB Data URL, you get crashes or graphical glitches throughout the browser while it tries to draw the URL bar.
That seems like a weird oversight, since Firefox is normally quite good at rendering large assets. For example, I prefer viewing large SVG images (like those generated by dependency graphing tools) in Firefox, because unlike Chrome, it allows me to pan and zoom within the image.
The previous session data probably needs to be deserialized from JSON into objects first, before adding the new session object. The whole container can then be serialized to JSON in one shot. This avoids the recursive escape encoding.
Wonder if it's worth putting special logic into json serialization libraries that politely raises some alarms when writing like a few hundred \ in a row.
> So each screen has the previous screen the user visited, to allow the user to go “back” and get the exact screen they were in before (state, scrolling position, validation notices etc).
this is the actual problem, a clear design error if the entire screen "stack" is part of per-request or per-session state while also being unbounded
It seems like just using the browser to store forward/backwards state with window.history would be easier than worrying about this server side. Maybe it was a legacy site that didn't work well with forward/back or a native app?
They were serialising the previous object as a string (which may also contain a key called previous which is a string) so quoted values, backslashes and any other escape characters would be escaped. Over multiple screens this grows quickly, e.g. this simple struct which has only an id column grows to 2KB in only 9 steps:
let obj = { id: 1 }
for (let id = 2; id <= 10; id++) {
  obj = { id, previous: JSON.stringify(obj) }
}
The problem as stated is more complicated than it needs to be, due to forcing the serialized structure to match the in memory structure. Their in memory structure is a singly linked list. With JSON as a serialization format, that’s probably better modeled as an array with the “previous” link stored as the head (history[0]), and its “previous” stored as the tail (history[1]).
If this sounds like a cons cell, that’s what it is! It’s also a linked list. The laziness they want to achieve (only de/serialize one screen at a time per navigation forward or back) is also straightforward.
- forward: serialize the current screen, trim the leading/trailing quote off the previous serialization, insert with a comma before the new trailing bracket
- back: parse history, head is your desired “previous” screen, tail is your now-“previous” screen’s history
There’s a tiny amount of overhead to this approach, but not nearly as much as repeatedly reserializing the same string to shoehorn it into a serialized structure that doesn’t let you cheat a little bit with well known start/end characters.
The growth is exponential - each backslash becomes two in the next iteration. Thus after n iterations we have 2^n - 1 backslashes, and we only need 30 iterations to hit a gigabyte (and that's assuming only one quotation mark in the original JSON).
The formula for number of needed servers seems like a fine approximation; I'm very confused how you could use the pigeonhole principle. Perhaps to say that if you have N servers, one of them has at least total_users/N users? But assuming you have a decent methodology for balancing users across servers, and can rebalance to use a new server when added, this has basically the same effects as the formula given so I'm clearly missing why you pointed this out.
As others have pointed out, for the author's discussion this fact doesn't really matter.
However, what does matter is that the author is probably not estimating the average correctly, partially due to this bug. If they scaled without fixing their excessive string storage, they would probably find their estimate to be off.
Even with the fix I’d be surprised if the underlying distribution of data per user is normally distributed.
That can obviously be rounded. Nobody will really try to purchase 27% of a server. A different issue is that the formula should include "1+Safety Buffer". Otherwise it needs to be run twice, once with the safety buffer and once without. Even so, I didn't take the formulae on this page to be accurate so much as representational.
The real formula we used to order servers was more complex than this, and included regional distribution, requirement peaks etc. I can expand more on that process if this is interesting.
However, for the context of this post, this crude approximation suffices IMO
Leaving aside the need to send a massive set of screen images for the moment, a severe issue with that protocol is the lack of compression. It doesn't need to use a quadtree[1], although that will significantly speed up the interface when sliding the tree depth to the screen action being taken. Even RLE[2], Huffman[3], or any other standard string compression[4][5] would solve this issue shown in the blog.
It's also good to properly manage transmission of that list of screens, even after you've fixed the backslashes. The article stated that they degraded the user experience by artificially limiting the string length instead of fixing the issue.
Yes. In what way does that mean that the transmission size should not be compressed? Bitmapped images are not the only data structure whose size will shrink by removing redundancy.
I think you're being downvoted because you've claimed "That's a case of not solving the problem.", but I think that actually better describes this answer. It's clever, certainly, but it misses the fact that the stack of screens was never intended to be recursively escaped; changing the form that it took was the real fix, rather than rubbing some compression sauce on what was never intended to be lots of backslashes in the first place. And indeed, that's what the author did: they shipped a bandaid fix while working on a more comprehensive one, which didn't require RLE or a quadtree (!).
You're describing the desire to avoid protocol compression when transmitting data that grows without bound. That has nothing to do with fixing the backslashes. The problem isn't backslashes. Those are just a symptom of data that can have any length.