> Understanding we are memory-bound is crucial, as it directs our efforts towards reducing memory consumption in order to accommodate more users on the server.
While this is true, it bears mentioning that in a service handling many requests, they all smear together memory-wise, so _any_ latency savings can also ease memory pressure (a memory allocation has an area: bytes × time). So, for instance, it can make sense to minimize memory held across an RPC boundary, or even to perform CPU optimizations if they would improve latency.
> e.g. examining a memdump rather than just measuring overall memory utilization
A flame graph would also have highlighted this (without the need to look at what might be user data), and is usually where I start looking for memory optimization. I mean, if you don't know what code is allocating lots of memory, where would you even start anyway?
Additionally, I've found that looking at outliers, while a good idea, can sometimes lead to wild goose chases if you're not careful. There can be pathological edge cases which are rare enough to not actually matter in practice.
This service keeps around a lot of long-lived per-session state, which leads to a very different memory profile from a typical web service, like what you're describing. Your point is valid in general, but doesn't seem to be relevant here.
Ironically, not long ago there was an article about how hard string handling is in C, so programs tend to be written to avoid manipulating strings as much as possible. In other languages, where strings are easy to use, it encourages inefficiency like this. The term "stringly typed" also comes to mind.
JSON truly is the new XML, both in terms of advantages and disadvantages. I wish people would stop using it for everything and realise that using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.
That's a massive exaggeration. XML is both much more complex (and bug-prone) to parse, and wastes more space.
And the problem here wasn't JSON in the first place anyway, it was
> The “back” stack is unlimited, meaning we’re saving more and more state until we explode
And storing the previous state as a serialized string inside the current state, instead of just storing it natively. You'd have the same problem if you tried to serialize <format> into a field in <format>
> And the problem here wasn't JSON in the first place anyway
Yes it was. They stored state in a huge JSON dump which ended up working like a zip bomb by the way they had to escape a string. This alone resulted in an exponential increase in memory that quickly became a problem.
Think about it for a second: would they experience this issue if they stored an in-memory stack per user? They wouldn't.
Heck, it seems that they could live with their JSON choice if they replaced '/' with some other character that doesn't need escaping, and converted it back to '/' when reading.
XML would also have a problem with this: you'd replace < with &lt;, then with &amp;lt;, then with &amp;amp;lt;, … It's not exponential, but still inefficient.
Inefficient is not the same as problematic. They turned what otherwise would be a linear increase in memory usage into an exponential increase, which expectedly resulted in memory exhaustion problems.
There are plenty of problems to go around in this approach and the overall system design, but the weakest link is undoubtedly how JSON was used.
I'd argue that JSON is one of the least Human Readable serialization formats.
It lacks comments. Has no way to offer context. Cannot impose limitations, and normalizing needs additional, often poorly supported RFCs (JSON-LD or pointers).
It also is a poor non-human-readable format, as it lacks crucial types (datetime, decimal), is rather verbose (just imagine the terabytes of bandwidth wasted globally on all those commas and curly braces), and cannot be parsed streaming.
Agreed. Lack of comments and lack of proper schema enforcement makes JSON a junk pile of a data format.
Definitely not an improvement over XML. Hopefully the next popular standard data format will learn from the past and take the best of previous attempts.
JSON itself (RFC 8259) has no schema, nor any mention of one. There are several additional RFCs that allow adding a schema or other semantics or limitations, though. But JSON itself has no schemas.
Decimal is there, isn’t it, as part of ‘number’, which lumps together fixed-size integers, IEEE floats and doubles, bigints and bigfloats, and leaves it to the parser to make the right guess as to how to interpret them (often assisted by having the programmer specify a target data type)?
Lumping all those together as ‘number’ makes for a much simpler grammar, but I think that’s not worth it for its negative effects on interoperability. How many json parsers can (easily) read a value of 1234567890123456890E0000000001 and write back the same string to a json file?
Apart from datetime, I’d add duration as a basic type, too, even though that has the problem of being imprecise. Users will want to specify durations such as “1m” or “1y” (is that 365 days? 365.2425 to correct for leap years? Does it depend on interpretation by the user? is it a lunar year of 354/355 days as in the Islamic calendar? Etc)
The lack of comments isn't really a big deal for serialisation. Nobody is really going to bother adding comments when data is serialised. You could do that with XML if you wanted and I have never seen a single occurrence.
The issue is that JSON ended up being used as a configuration format too, and comments are critical there.
EDN at least provides date literals, UUID literals, and decimals. It also does not require commas to be littered everywhere. Unfortunately EDN is only really well-supported within Clojure. I'm not sure what is meant by "context problem" for JSON, but keywords and symbols in EDN can be namespaced, so that one can distinguish e.g. :human/name and :computer/name. However, it still requires you to write a schema. And due to lack of support in other languages, protocol buffers seem to be a much better choice.
2. I’m still not convinced it’s the file format’s job to encode every possible data type. We used to have deserialization layers for that, but for some reason, we don’t do that anymore. Nowadays we expect every file format to be self-describing, which I think is wasteful and limiting. But that’s a matter of opinion, obviously.
3. What do you mean by “cannot be parsed streaming”? Or to be perhaps more precise, what’s missing in existing streaming JSON parsers? I haven’t used any of them yet.
Many JSON parsers require the whole JSON file to be in memory. Parsers that allow parsing in a streaming manner are the exception.
I only found jsoncons to work reliably, but at the cost of speed compared to something fast like rapidjson. And even jsoncons might have the problem that large strings have to be fully in memory, i.e., if someone has the fun idea to store a 1 GB string inside a JSON, then you still have a potential out-of-memory event on your hands.
There are other attempts to fix this problem with the jsonl format but that format feels so ad-hoc that I'm not convinced yet. It stores one JSON object per line. Of course, this comes with its own problems such as each JSON object potentially still being unbounded and, now that lines are used to separate JSON objects, you cannot format those JSON objects in a human-readable way anymore.
Also, the first time I encountered a file like this, it simply had a .json extension and I was wondering why every JSON parser failed to parse that file. Because it isn't valid JSON and should have a separate extension.
Streaming parsers may be the exception, but are likely the correct tool to use when handling large inputs becomes a problem. Similarly, XML parsers evolved the SAX model (as opposed to DOM, where the document had to fit in memory.) It is often sufficient to rip through the file in an event-based fashion to extract what you want, and throw the rest away.
For cases where a single string is truly massive, I would lean towards defining a maximum size to prevent denial of service attacks.
3. Like the sibling comment said: most parsers need the entire file (in memory) before being able to start parsing. XML is just as bad here: both, technically, need the parser to see the closing part before being able to parse the chunk.
That one wasn't even down to JSON, but rather some C stdlib functions being linear time because of C strings (which then compounded into quadratic), right?
Developing and debugging systems with JSON (vs., say, protocol buffers) is much easier… configuration is rarely interfaced with by humans, vs. debug logs or dumps etc., which are always being looked at.
With something like protocol buffers, you have an explicit schema, parsing is well-defined, and you rarely (if ever) need to manually inspect the serialized representation.
It's not hard to read what isn't "human readable" anyway; it just takes a little bit of practice and learning. For example, I can "sight-read" x86 and Z80 Asm from a hexdump, and bits of ASN.1 and a few other proprietary protocols too.
Did I strike a nerve? It's unfortunate that so many seem to be under the impression that "human readable" is really a euphemism for "like English" and are narrow/closed-minded enough not to consider the possibility that people can learn just about any language with enough exposure and a bit of effort.
4f 6e 63 65 20 79 6f 75 20 73 74 6f 70 20 6c 65 61 72 6e 69 6e 67 2c 20 79 6f 75 20 73 74 61 72 74 20 64 79 69 6e 67 2e
You didn’t strike a nerve, but your comment is misguided.
Of course people can read binary formats. It is just that they read them more slowly, and it requires quite a bit of exposure/practice, so that people usually invest in tooling to simplify the process.
I suggest that when you see a comment like this, you ask yourself “if I change ‘human readable’ to ‘easily human readable’ would I need to offer my correction?”
>using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.
In the encode/decode library we use, the generated toString output was accidentally so close to JSON that we made it exactly so (mostly had to add quotes). Now we can use JSON tooling on top of efficient messaging.
How does the number of backslashes grow as the string is repeatedly escaped?
1 \
2 \\
4 \\\\
8 \\\\\\\\
We see it grows exponentially: to escape each backslash, you need two. So I guess I'm surprised that this is a post about a 1.5GB string and not about the exponentially increasing work on each navigation causing performance problems, or the session crashing from running out of memory when a user navigates too far. Maybe the service wasn't being used much yet, or there's some other reason that strings only grew to ridiculous sizes and not OOM sizes. I'm curious why it didn't show up sooner. (I saw a somewhat similar problem long ago, but then it was more like doing 'this.json = this.messages_array.to_json()' on every new message, so merely quadratic. I think it was noticed not long after we first had a lot of messages added to one of the objects.)
The solution to this problem has always been known: don't use the same escape character in different formats. With non-overlapping escape characters, you don't need to escape more than once.
So we see that in C, the escape character for strings delivered to the C compiler is \, while the escape character for strings delivered to printf is %.
It always boggles my mind that someone looked at this example of a working design and concluded that the escape character for a regular expression engine should be \, overlapping with the escape character for a literal string.
Computers are fast. I once worked on a service sitting at the top of a deep RPC stack. Stack traces were often folded into RPC error messages to help diagnostics. Well, if you have a failure at one layer, propagated to the next layer, and the next layer, and sometimes replicated for each looked up item (to support partial failure semantics), then you can end up with gigabytes of stack traces in memory for some fraction of time. Very hard to figure out until tasks started dying and leaving behind heap dumps during a wide-spread incident at a lower layer of the stack.
My back-of-the-envelope estimate for how long it would take to construct the 1.5GB string is about 200ms to 300ms (1.5GB written, 0.75GB read, plus extra from growing a buffer for the result; sequential R/W at 10GB/s, copying to resize the buffer at 20GB/s). Which is somewhat fast, but I'm still surprised that the latency increase wasn't noticed first.
More importantly, the time and memory doubles with each further navigation which means a user shouldn’t have had to navigate much farther to see much worse performance. Maybe this was a known issue that no one had debugged by then.
My best guess is that screens were arranged in a tree and users would navigate ‘back’ instead of ‘up’ so it was hard to get a long history.
I joined the team long after this was in production, but I can testify that the service has a lot of moving parts and attributing the reliability / efficiency costs for a specific issue was challenging.
A better fix could have been switching to some other (binary) serialization protocol (BSON or similar) - the problem here doesn't seem to have been the size of the history itself, but the exponential growth of the escaping.
This is reminiscent of some APIs I've had the great pleasure of needing to integrate with, where a SOAP API would take XML containing Username, Password, Method, Data... where Data was a string, itself containing barely-documented XML, but serialized as a string before being placed in the XML. You'd think that if they knew how to stand up a SOAP API, they'd know that XML is literally designed to be nested. Alas, such was not the case. I shudder to think about their security under the hood...
I remember dealing with a small ISV that was told to make an RPC API, so they used SOAP (via WCF). That would have been fine, except that every function took a single "string" parameter that was invalid XML that a normal parser couldn't read. You had to use their hand-rolled parser.
And then the whole thing was "encrypted"... with a static RSA key. I don't mean they used a stream cipher based on a hard-coded key, which would be bad enough. No, they used the RSA keys directly for bulk data encryption.
I flat rejected their solution, then was overruled as being "too picky", and this monstrosity went into production.
A recursive data structure is fine, if you implement the serialization in a sane way. The escaping of your escaped strings containing other escaped strings seems like a code smell.
Wow! So this is actually a drama I had a secondary role in.
A few remarks:
1. Team: Nitzan (the blog author, and an awesome dude!) was at the time the Production Engineer monitoring the top-line capacity metrics. The Java heapdump tooling was hacked together by E.A., with some tough bits by A.S., who's a force of nature. The bulk of the mitigation was carried out by Y.B. over several months. I was the person who analyzed a couple of memory dumps and framed URL strings as a worthwhile 80/20 goal.
2. The main difficulty in the mitigation project was that the pathological strings were accessed over hundreds of callsites using some semantics like
2.1. A lengthy build-up very carefully constructing an API to represent URLs in as compact a way as possible, and plugging it in where convenient,
2.2. Carrying out a massive automated rewrite at the source level ("codemod"), which is possible in Java but not for the faint of heart. I think he used some JetBrains tooling to get that going. I consider his work a tour de force.
I vaguely recall there being some modest CPU improvements in the process.
3. Organizational dynamics: the codebase was originally written by very competent people who did not work through that specific detail because it was not important for their original use case. However subsequently the code underwent several years of almost solely rewarding improvement in end-user metrics, leading to an overall inadequate state that was hurting the organization as it was hitting hard scaling limits. In fact, "engineering excellence" was not even a category for recognition at that time. In this climate I could not find my own voice as a SW professional and chose to quit after a little bit more than a year.
I have a pretty good memory, I'd be happy to give more context if appropriate, just ping me.
A.S --> E.A, no? :)
I decided to abstract away the URL part as it wasn't easily explainable and not important to the story. I figured the JSON-in-JSON part is interesting enough.
Anyway, the resulting memdump with many \\s remains my most audience-engaging slide to date, and I figured the non-company audience deserves to know.
Good times.
You're fully right! I added them both - E.A. was the original toolmaker, and A.S. worked more down the chain. He definitely was the person who walked me through the finer points of retrieving the data.
I wrote some stress-test cases for a Java library recently which involved 1-GB and 2-GB strings.
On the surface, a java.lang.String should be a wrapper around a char[], whose maximum length is Integer.MAX_VALUE = 2 147 483 647. This is roughly how older implementations of JDK did things.
But as of JDK 9 and JEP 254, String's private field has type byte[], and each character uses either 1 byte if it's encodable in ISO 8859-1 or 2 bytes if it requires UTF-16.
This means that strings containing only ASCII characters can be up to about 2 billion in length, whereas strings with real Unicode content can only be up to about 1 billion in length. This is a bit of an unfortunate regression in functionality.
I'm currently consulting with a company in the aviation industry that is struggling to adapt a legacy-bound culture to modern software engineering techniques. Among other things, they have no sense of systems-level concerns. I recently identified that one of the most common user activities ends up executing the same expensive end-to-end query at least 4 times per interaction, never memoizing the results. While there is a rudimentary cache in place, it has not yet struck them as a problem that the same interaction involves at least 3 network requests per query even when the cache is warm. As this system is currently only handling a fraction of the total traffic it will be expected to manage when it's fully live, it's clearly a slow-burning fuse that will explode in their faces when they try to make the switchover complete.
> I read a post about someone who found that their system had something like 1.2 GB strings full of backslashes because they were using JSON for internal state, and it kept escaping the " characters, so it turned into \\\\\\\\\\\\\\\\\\\\\\\\\" type of crap. That part was new to me, but the description of the rest of it seemed far too familiar.
> And I went... hey, I think I know that particular circus!
Ha, this is a nice coincidence. I was looking for a cool domain name a billion years ago and came up with Backslasher since I was just learning text processing in Bash/Perl.
My lifelong goal is beating the same-named horror film in search results.
Why not just restrict the history to a configurable number of entries? Having like 20-50 is probably enough. It might not be 100% correct, but the effort:use ratio seems good.
Meanwhile, Firefox has big problems when you give it a 6MB Data URL, you get crashes or graphical glitches throughout the browser while it tries to draw the URL bar.
That seems like a weird oversight, since Firefox is normally quite good at rendering large assets. For example, I prefer viewing large SVG images (like those generated by dependency graphing tools) in Firefox, because unlike Chrome, it allows me to pan and zoom within the image.
The previous session data probably needs to be deserialized from JSON into objects first, before adding the new session object. The whole container can then be serialized to JSON in one shot. This avoids the recursive escape encoding.
Wonder if it's worth putting special logic into json serialization libraries that politely raises some alarms when writing like a few hundred \ in a row.
> So each screen has the previous screen the user visited, to allow the user to go “back” and get the exact screen they were in before (state, scrolling position, validation notices etc).
this is the actual problem, a clear design error if the entire screen "stack" is part of per-request or per-session state while also being unbounded
It seems like just using the browser to store forward/backwards state with window.history would be easier than worrying about this server side. Maybe it was a legacy site that didn't work well with forward/back or a native app?
They were serialising the previous object as a string (which may also contain a key called previous which is a string) so quoted values, backslashes and any other escape characters would be escaped. Over multiple screens this grows quickly, e.g. this simple struct which has only an id column grows to 2KB in only 9 steps:
let obj = { id: 1 }
for (let id = 2; id <= 10; id++) {
  obj = { id, previous: JSON.stringify(obj) }
}
The problem as stated is more complicated than it needs to be, due to forcing the serialized structure to match the in memory structure. Their in memory structure is a singly linked list. With JSON as a serialization format, that’s probably better modeled as an array with the “previous” link stored as the head (history[0]), and its “previous” stored as the tail (history[1]).
If this sounds like a cons cell, that’s what it is! It’s also a linked list. The laziness they want to achieve (only de/serialize one screen at a time per navigation forward or back) is also straightforward.
- forward: serialize the current screen, trim the leading/trailing quote off the previous serialization, insert with a comma before the new trailing bracket
- back: parse history, head is your desired “previous” screen, tail is your now-“previous” screen’s history
There’s a tiny amount of overhead to this approach, but not nearly as much as repeatedly reserializing the same string to shoehorn it into a serialized structure that doesn’t let you cheat a little bit with well known start/end characters.
The growth is exponential - each backslash becomes two in the next iteration. Thus after n iterations we have 2^n - 1 backslashes, and we only need 30 iterations to hit a gigabyte (and that's assuming only one quotation mark in the original JSON).
The formula for number of needed servers seems like a fine approximation; I'm very confused how you could use the pigeonhole principle. Perhaps to say that if you have N servers, one of them has at least total_users/N users? But assuming you have a decent methodology for balancing users across servers, and can rebalance to use a new server when added, this has basically the same effects as the formula given so I'm clearly missing why you pointed this out.
As others have pointed out, for the author's discussion this fact doesn't really matter.
However, what does matter is that the author is probably not estimating the average correctly, partially due to this bug. If they scaled without fixing their excessive string storage, they would probably find their estimate to be off.
Even with the fix I’d be surprised if the underlying distribution of data per user is normally distributed.
That can obviously be rounded. Nobody will really try to purchase 27% of a server. A different issue is that the formula should include "1+Safety Buffer". Otherwise it needs to be run twice, once with the safety buffer and once without. Even so, I didn't take the formulae on this page to be accurate so much as representational.
The real formula we used to order servers was more complex than this, and included regional distribution, requirement peaks etc. I can expand more on that process if this is interesting.
However, for the context of this post, this crude approximation suffices IMO
Leaving aside the need to send a massive set of screen images for the moment, a severe issue with that protocol is the lack of compression. It doesn't need to use a quadtree[1], although that will significantly speed up the interface when sliding the tree depth to the screen action being taken. Even RLE[2], Huffman[3], or any other standard string compression[4][5] would solve this issue shown in the blog.
It's also good to properly manage transmission of that list of screens, even after you've fixed the backslashes. The article stated that they degraded the user experience by artificially limiting the string length instead of fixing the issue.
Yes. In what way does that mean that the transmission size should not be compressed? Bitmapped images are not the only data structure whose size will shrink by removing redundancy.
I think you're being downvoted because you've claimed "That's a case of not solving the problem.", but I think that actually better describes this answer. It's clever, certainly, but it misses the fact that the stack of screens was never intended to be recursively escaped; changing the form that it took was the real fix, rather than rubbing some compression sauce on what was never intended to be lots of backslashes in the first place. And indeed, that's what the author did: they shipped a bandaid fix while working on a more comprehensive one, which didn't require RLE or a quadtree (!).
You're describing the desire to avoid protocol compression when transmitting data that grows without bound. That has nothing to do with fixing the backslashes. The problem isn't backslashes. Those are just a symptom of data that can have any length.