A 1.5GB string (backslasher.net)
308 points by Backslasher on April 8, 2023 | hide | past | favorite | 138 comments


> Understanding we are memory-bound is crucial, as it directs our efforts towards reducing memory consumption in order to accommodate more users on the server.

While this is true, it bears mentioning that in a service handling many requests they all smear together memory-wise, so _any_ latency savings can also ease memory pressure (a memory allocation has an area: bytes × time). So, for instance, it can make sense to minimize memory held across an RPC boundary, or even to perform CPU optimizations if they improve latency.

> e.g. examining a memdump rather than just measuring overall memory utilization

A flame graph would also have highlighted this (without the need to look at what might be user data), and is usually where I start when looking for memory optimizations. I mean, if you don't know what code is allocating lots of memory, where would you even start anyway?

Additionally, I've found that looking at outliers, while a good idea, can sometimes lead to wild goose chases if you're not careful. There can be pathological edge cases which are rare enough to not actually matter in practice.


This service keeps around a lot of long-lived per-session state, which leads to a very different memory profile from a typical web service, like what you're describing. Your point is valid in general, but doesn't seem to be relevant here.


Yeah, but any excuse to talk about memory profiling! It's one of my favorite pastimes.


Ironically, not long ago there was an article about how hard string handling is in C, so programs tend to be written to avoid manipulating strings as much as possible. In other languages, where strings are easy to use, it encourages inefficiency like this. The term "stringly typed" also comes to mind.

JSON truly is the new XML, both in terms of advantages and disadvantages. I wish people would stop using it for everything and realise that using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.


That's a massive exaggeration. XML is both much more complex (and bug-prone) to parse, and wastes more space.

And the problem here wasn't JSON in the first place anyway, it was

> The “back” stack is unlimited, meaning we’re saving more and more state until we explode

And storing the previous state as a serialized string inside the current state, instead of just storing it natively. You'd have the same problem if you tried to serialize <format> into a string field in <format>


> You'd have same problem if you tried to serialize <format> into a field in <format>

Length-delimited formats don't need (exponentially increasing) escaping.


> And the problem here wasn't JSON in the first place anyway

Yes it was. They stored state in a huge JSON dump which ended up working like a zip bomb because of the way they had to escape strings. This alone resulted in an exponential increase in memory that quickly became a problem.

Think about it for a second: would they experience this issue if they stored an in-memory stack per user? They wouldn't.

Heck, it seems that they could have lived with their JSON choice if they replaced '/' with any other character that doesn't need escaping and converted it back to '/' when reading.


XML would also have a problem with this: you'd replace < with &lt;, then with &amp;lt;, then with &amp;amp;lt;, … It's not exponential, but still inefficient.
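This is easy to verify; a small sketch (the `escapeXml` helper is hypothetical, not from any real XML library) showing that repeated XML-style escaping grows linearly:

```javascript
// Hypothetical escape helper: each pass only rewrites "&" and "<",
// so an already-escaped string grows by a constant amount per level.
const escapeXml = (s) => s.replace(/&/g, "&amp;").replace(/</g, "&lt;");

let s = "<";
const lengths = [];
for (let i = 0; i < 4; i++) {
  s = escapeXml(s);
  lengths.push(s.length); // 4, 8, 12, 16 — linear, not exponential
}
```

Each level adds a fixed 4 characters per original `<`, because only the `&` of the previous escape gets re-escaped.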


Inefficient is not the same as problematic. They turned what would otherwise be a linear increase in memory usage into an exponential one, which predictably resulted in memory exhaustion problems.

There are plenty of problems to go around in this approach and the overall system design, but the weakest link is undoubtedly how JSON was used.


Can you imagine arguing in a tech meeting that a reason not to use JSON is that it doesn't support unbounded recursive levels of escaping very well?


That is just entirely wrong; not serializing JSON into JSON would make it linear again.


But why do you bring up XML? The parent is against human-readable formats, and XML is one of them.


I'd argue that JSON is one of the least Human Readable serialization formats.

It lacks comments, has no way to offer context, and cannot impose constraints; normalizing needs additional, often poorly supported specs (JSON-LD or JSON Pointer).

It is also a poor machine-readable format, as it lacks crucial types (datetime, decimal), is rather verbose (just imagine the terabytes of bandwidth wasted globally on all those commas and curly braces), and cannot be parsed in a streaming manner.


Agreed. Lack of comments and lack of proper schema enforcement makes JSON a junk pile of a data format.

Definitely not an improvement over XML. Hopefully the next popular standard data format will learn from the past and take the best of previous attempts.


The problem is, that despite JSON's clear and numerous flaws, people still overwhelmingly prefer it to XML. XML is that unpleasant.


JSON has schemas though. I use them all the time.


JSON itself (RFC 8259) has no schema, nor any mention of one. There are several additional specs that allow adding schemas or other semantics or limitations, though. But JSON itself has no schemas.


I think you mean "the JSON standard doesn't define any schema".


Indeed. Nor does it mention any place, extension, or addition where a schema could be defined.


> as it lacks crucial types (datetime, decimal)

Decimal is there, isn’t it, as part of ‘number’, which lumps together fixed-size integers, IEEE floats and doubles, bigints and bigfloats, and leaves it to the parser to make the right guess as to how to interpret them (often assisted by having the programmer specify a target data type)?

Lumping all those together as ‘number’ makes for a much simpler grammar, but I think that’s not worth it for its negative effects on interoperability. How many json parsers can (easily) read a value of 1234567890123456890E0000000001 and write back the same string to a json file?
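For illustration: JavaScript's built-in `JSON.parse` maps every JSON number to an IEEE double, so even a much smaller integer than the example above doesn't survive a round trip:

```javascript
// JSON.parse maps every JSON number to an IEEE double, so a large
// integer silently loses precision on a parse/serialize round trip.
const input = '{"n":1234567890123456789}';
const roundTripped = JSON.stringify(JSON.parse(input));
console.log(roundTripped === input); // false — the digits no longer match
```

The grammar happily admits the value; it's the parser's number model that loses it, which is exactly the interoperability problem described above.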

Apart from datetime, I’d add duration as a basic type, too, even though that has the problem of being imprecise. Users will want to specify durations such as “1m” or “1y” (is that 365 days? 365.2425 to correct for leap years? Does it depend on interpretation by the user? is it a lunar year of 354/355 days as in the Islamic calendar? Etc)


The lack of comments isn't really a big deal for serialisation. Nobody is really going to bother adding comments when data is serialised. You could do that with XML if you wanted and I have never seen a single occurrence.

The issue is that JSON ended up being used as a configuration format too, and comments are critical there.

JSON5 is the best answer I have found.


EDN at least provides date literals, UUID literals, and decimals. It also does not require commas to be littered everywhere. Unfortunately EDN is only really well-supported within Clojure. I'm not sure what is meant by "context problem" for JSON, but keywords and symbols in EDN can be namespaced, so that one can distinguish e.g. :human/name and :computer/name. However, it still requires you to write a schema. And due to lack of support in other languages, protocol buffers seem to be a much better choice.


Context being "this branch is a link" or "this string is a creditcard-number" or "this employee here is-a Title-over there, at Company over there".


1. I 100% agree about comments.

2. I’m still not convinced it’s the file format’s job to encode every possible data type. We used to have deserialization layers for that, but for some reason, we don’t do that anymore. Nowadays we expect every file format to be self-describing, which I think is wasteful and limiting. But that’s a matter of opinion, obviously.

3. What do you mean by “cannot be parsed streaming”? Or to be perhaps more precise, what’s missing in existing streaming JSON parsers? I haven’t used any of them yet.


Many JSON parsers require the whole JSON document to be in memory. Parsers that allow parsing in a streaming manner are the exception.

I only found jsoncons to work reliably, but at the cost of speed compared to something fast like rapidjson. And even jsoncons might have the problem that large strings have to be fully in memory, i.e., if someone has the fun idea to store a 1 GB string inside a JSON document, then you still have a potential out-of-memory event on your hands.

There are other attempts to fix this problem with the jsonl format but that format feels so ad-hoc that I'm not convinced yet. It stores one JSON object per line. Of course, this comes with its own problems such as each JSON object potentially still being unbounded and, now that lines are used to separate JSON objects, you cannot format those JSON objects in a human-readable way anymore. Also, the first time I encountered a file like this, it simply had a .json extension and I was wondering why every JSON parser failed to parse that file. Because it isn't valid JSON and should have a separate extension.


Streaming parsers may be the exception, but are likely the correct tool to use when handling large inputs becomes a problem. Similarly, XML parsers evolved the SAX model (as opposed to DOM, where the document had to fit in memory.) It is often sufficient to rip through the file in an event-based fashion to extract what you want, and throw the rest away.

For cases where a single string is truly massive, I would lean towards defining a maximum size to prevent denial of service attacks.


3. Like the sibling comment said: most parsers need the entire file (in memory) before being able to start parsing. XML is just as bad here: both technically need the parser to see the closing part before being able to parse the chunk.


Then I guess the same is true for all these dicts I use in Python.


But in systems where JSON is used inefficiently, it's rarely (with some notable exceptions [0]) the performance-limiting factor.

[0] https://news.ycombinator.com/item?id=26296339


That one wasn't even down to JSON, but rather some C stdlib functions being linear time because of C strings (which then compounded into quadratic), right?


> using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.

Database dumps say hello. :)


If you want human-readable dumps, there's no reason you can't convert the more-efficient format into something human-readable.


I just drag my Windows XML script driver into SYSTEM32 folder (for my HP indigo 7000 series ink press) then I press the rest.


If you want human readable db output, I might introduce you to what I’ve spent most of my career doing: etl. :)


Human readable formats should exist for more or less one purpose, interfacing with a human, eg configuration.


Developing and debugging systems with JSON (vs, say, protocol buffers) is much easier… configuration is rarely interfaced with by humans vs. debug logs or dumps etc, which are always being looked at.


With something like protocol buffers, you have an explicit schema, parsing is well-defined, and you rarely (if ever) need to manually inspect the serialized representation.


It's not hard to read what isn't "human readable" anyway; it just takes a little bit of practice and learning. For example, I can "sight-read" x86 and Z80 Asm from a hexdump, and bits of ASN.1 and a few other proprietary protocols too.


Did I strike a nerve? It's unfortunate that so many seem to be under the impression that "human readable" is really a euphemism for "like English" and are narrow/closed-minded enough to not consider the possibility that people can learn just about any language with enough exposure and a bit of effort.

4f 6e 63 65 20 79 6f 75 20 73 74 6f 70 20 6c 65 61 72 6e 69 6e 67 2c 20 79 6f 75 20 73 74 61 72 74 20 64 79 69 6e 67 2e


You didn’t strike a nerve, but your comment is misguided.

Of course people can read binary formats. It is just that they read them more slowly, and it requires quite a bit of exposure/practice, so that people usually invest in tooling to simplify the process.

I suggest that when you see a comment like this, you ask yourself “if I change ‘human readable’ to ‘easily human readable’ would I need to offer my correction?”


Your comment comes across the same way as that long ago "Show HN" where somebody said that Dropbox is pretty much just rsync.


But json is terrible for configuration, e.g. comments are absolutely required. Toml and Hocon are for configuration.


json isn't human readable, it's a serialization format that happens to use ascii.


Developers are the humans (at least for now) that interface with JSON.

And devs don't care about just config files but also API requests and responses and any other kind of structured/semi-structured communication and data.


>using a "human-readable" format for data that is 99.999999% not going to be seen by a human is absolute insanity in terms of inefficiency.

In the encode/decode library we use, the generated toString were accidentally so close to json, that we made them exactly so (mostly had to add quotes). Now we can use json tooling on top of efficient messaging.


How does the number of backslashes grow as the string is repeatedly escaped?

  1 \
  2 \\
  4 \\\\
  8 \\\\\\\\
We see it grows exponentially: to escape each backslash, you need two. So I guess I’m surprised that this is a post about a 1.5GB string and not the exponentially increasing work on each navigation causing performance problems, or the session crashing from running out of memory when a user navigates too far. Maybe the service wasn’t being used much yet or there’s some other reason that strings only grew to ridiculous sizes and not oom sizes. I’m curious why it didn’t exhibit sooner. (I saw a somewhat similar problem long ago but then it was more like doing ‘this.json = this.messages_array.to_json()’ on every new message, so merely quadratic. I think it was noticed not long after we first had a lot of messages added to one of the objects)
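The doubling is easy to reproduce; a small JavaScript sketch (not the service's actual code) that re-escapes a string with `JSON.stringify` and records the lengths:

```javascript
// Each escape pass turns every backslash into two (and every quote
// into \"), so the string roughly doubles per level of nesting.
let s = '"';
const lengths = [];
for (let i = 0; i < 5; i++) {
  s = JSON.stringify(s).slice(1, -1); // escape once, drop the outer quotes
  lengths.push(s.length); // 2, 4, 8, 16, 32 — exponential
}
// about 30 doublings of this kind are enough to reach gigabyte scale
```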


The solution to this problem has always been known: don't use the same escape character in different formats. With non-overlapping escape characters, you don't need to escape more than once.

So we see that in C, the escape character for strings delivered to the C compiler is \, while the escape character for strings delivered to printf is %.

It always boggles my mind that someone looked at this example of a working design and concluded that the escape character for a regular expression engine should be \, overlapping with the escape character for a literal string.


This is also why I like the sed syntax, that allows you to pick any delimiter, though escapes are still \ for regexes.


It only takes about 30 doublings to get to 1.5 billion from an initial \\ if I have my math right.

I guess the history wasn’t that deep.


I don't understand how this made it into production without anyone noticing the exponential if it's really a straight doubling each time.


Computers are fast. I once worked on a service sitting at the top of a deep RPC stack. Stack traces were often folded into RPC error messages to help diagnostics. Well, if you have a failure at one layer, propagated to the next layer, and the next layer, and sometimes replicated for each looked up item (to support partial failure semantics), then you can end up with gigabytes of stack traces in memory for some fraction of time. Very hard to figure out until tasks started dying and leaving behind heap dumps during a wide-spread incident at a lower layer of the stack.


My back-of-the-envelope estimate for how long it would take to construct the 1.5GB string is about 200ms to 300ms (1.5GB written, 0.75GB read, plus extra from growing a buffer for the result; sequential R/W at 10GB/s, copying to resize the buffer at 20). Which is somewhat fast, but I'm still surprised that the latency increase wasn't noticed first.

More importantly, the time and memory doubles with each further navigation which means a user shouldn’t have had to navigate much farther to see much worse performance. Maybe this was a known issue that no one had debugged by then.

My best guess is that screens were arranged in a tree and users would navigate ‘back’ instead of ‘up’ so it was hard to get a long history.


I joined the team long after this was in production, but I can testify that the service has a lot of moving parts and attributing the reliability / efficiency costs for a specific issue was challenging.


\ C-x ( C-a C-k C-y C-y C-x ) C-u 30 C-x e


i mean i don't know what i was expecting but you successfully fork bombed me


Very sorry about that, I didn't consider that somebody would actually type it in.


A better fix could have been switching to some other (binary) serialization protocol (BSON or similar) - the problem here doesn't seem to have been the size of the history itself, but the exponential growth of the escaping.


Or, y'know, just

    jo.put("previous", previous);
Rather than

    jo.put("previous", previous.toJson());
Presumably the reason they didn't do this is because they couldn't handle a recursive schema?


This is reminiscent of some APIs I've had the great pleasure of needing to integrate with, where a SOAP API would take XML containing Username, Password, Method, Data... where Data was a string, itself containing barely-documented XML, but serialized as a string before being placed in the XML. You'd think that if they knew how to stand up a SOAP API, they'd know that XML is literally designed to be nested. Alas, such was not the case. I shudder to think about their security under the hood...


I remember dealing with a small ISV that was told to make an RPC API, so they used SOAP (via WCF). That would have been fine, except that every function took a single "string" parameter that was invalid XML that a normal parser couldn't read. You had to use their hand-rolled parser.

And then the whole thing was "encrypted"... with a static RSA key. I don't mean they used a stream cipher based on a hard-coded key, which would be bad enough. No, they used the RSA keys directly for bulk data encryption.

I flat rejected their solution, then was overruled as being "too picky", and this monstrosity went into production.


In the post they say they didn't do this because they'd just be trading long strings for deeply nested objects.


Deeply nested objects grow linearly with the size of history, not exponentially.


Agreed, I'd take deeply nested objects, over deeply nested strings (of strings...) of objects.


For the record I do agree that the schema

    data Screen = Screen {
        previous :: Maybe Screen,
        ...
    }
is inferior to what they eventually had,

    data ScreenHistory = ScreenHistory {
        previous10 :: [Screen],
        current :: Screen
    }
    -- Screen itself is nonrecursive
But I do think that storing JSON as a string was a weird antipattern in the first place, smacking of premature optimization.


That would have also fixed it, though IDK what is necessarily "better."

Recursively serialized JSON, XML, whatever is a code smell in general.

(Note that outside certain examples like this, in general BSON is not smaller than JSON. You'd need something like protobufs.)


Recursively serialized data isn't a smell. How would you even serialize a tree structure without recursion?


A recursive data structure is fine, if you implement the serialization in a sane way. The escaping of your escaped strings containing other escaped strings seems like a code smell.



You’d still be doing quadratic work over a user’s session but perhaps navigation is infrequent enough for it to not matter.


sighs in bencode


Wow! So this is actually a drama I had a secondary role in.

A few remarks:

1. Team: Nitzan (the blog author, and an awesome dude!) was at the time the Production Engineer monitoring the top-line capacity metrics. The Java heapdump tooling was hacked together by E.A., with some tough bits by A.S., who's a force of nature. The bulk of the mitigation was carried out by Y.B. over several months. I was the person who analyzed a couple of memory dumps and framed URL strings as a worthwhile 80/20 goal.

2. The main difficulty in the mitigation project was that the pathological strings were accessed over hundreds of callsites using some semantics like

semanticallyUsefulURL = decodeURLString(urlStoredInStringForm)

What Y.B. ended up doing was

2.1. A lengthy build-up very carefully constructing an API to represent URLs in as compact a way as possible, and plugging it in where convenient,

2.2. Carrying out a massive automated rewrite at the source level ("codemod"), which is possible in Java but not for the faint of heart. I think he used some JetBrains tooling to get that going. I consider his work a tour de force.

I vaguely recall there being some modest CPU improvements in the process.

3. Organizational dynamics: the codebase was originally written by very competent people who did not work through that specific detail because it was not important for their original use case. However subsequently the code underwent several years of almost solely rewarding improvement in end-user metrics, leading to an overall inadequate state that was hurting the organization as it was hitting hard scaling limits. In fact, "engineering excellence" was not even a category for recognition at that time. In this climate I could not find my own voice as a SW professional and chose to quit after a little bit more than a year.

I have a pretty good memory, I'd be happy to give more context if appropriate, just ping me.

(Edited - added one more due prop!)


A.S --> E.A, no? :) I decided to abstract away the URL part as it wasn't easily explainable and not important to the story. I figured the JSON-in-JSON part is interesting enough.

Anyway, the resulting memdump with many \\s remains my most audience-engaging slide to date, and figured out the non-company audience deserves to know. Good times.


You're fully right! I added them both - E.A. was the original toolmaker, and A.S. worked more down the chain. He definitely was the person who walked me through the finer points of retrieving the data.


The brain and the brawn :)


I wrote some stress-test cases for a Java library recently which involved 1-GB and 2-GB strings.

On the surface, a java.lang.String should be a wrapper around a char[], whose maximum length is Integer.MAX_VALUE = 2 147 483 647. This is roughly how older implementations of JDK did things.

But as of JDK 9 and JEP 254, String's private field has type byte[], and each character uses either 1 byte if it's encodable in ISO 8859-1 or 2 bytes if it requires UTF-16.

This means that strings containing only ASCII characters can be up to about 2 billion in length, whereas strings with real Unicode content can only be up to about 1 billion in length. This is a bit of an unfortunate regression in functionality.


Naive question, why not just do the java equivalent of mmap and read the data from storage?

Seems like somewhere along the line reality outgrew the original requirements.



Was the link to show me that mmap existed, or did that doc show some reason it could not be used?


I'm currently consulting with a company in the aviation industry that is struggling to adapt a legacy-bound culture to modern software engineering techniques. Among other things they have no sense of systems-level concerns. I've recently identified one of the most common user activities ends up executing the same expensive end-to-end query at least 4 times for each interaction, never bothering to memoize the results. While there is a rudimentary cache in place, the lack of awareness that the same interaction involving at least 3 network requests per query even when the cache is warm has not yet struck them as a problem. As this system is currently only handling a fraction of the total traffic it will be expected to manage when it's fully live, it's clearly a slow-burning fuse that will explode in their faces when they try to make the switchover complete.


It seems like the memory used will grow only linearly if you escape " and \ as "\u0022" and "\u005c" respectively.
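A sketch of this idea (the `esc` helper is hypothetical; note that backslashes must be rewritten before quotes so the two passes don't interact):

```javascript
// Hypothetical escaper: '"' becomes \u0022 and '\' becomes \u005c.
// Each further level then only re-escapes the single backslash
// inside \u005c, so growth is linear instead of exponential.
const esc = (s) => s.replace(/\\/g, "\\u005c").replace(/"/g, "\\u0022");

let s = '"';
const lengths = [];
for (let i = 0; i < 4; i++) {
  s = esc(s);
  lengths.push(s.length); // 6, 11, 16, 21 — a constant 5 characters per level
}
```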


https://rachelbythebay.com/w/2023/04/09/note/

> I read a post about someone who found that their system had something like 1.2 GB strings full of backslashes because they were using JSON for internal state, and it kept escaping the " characters, so it turned into \\\\\\\\\\\\\\\\\\\\\\\\\" type of crap. That part was new to me, but the description of the rest of it seemed far too familiar.

> And I went... hey, I think I know that particular circus!


Did they buy this domain just for this blog post about backslashes? No, the first HN submission is from 2017. Coincidence, or an agenda?


Coincidence :) I know him in real life and he has used that nickname for a decade+.


Ha, this is a nice coincidence. I was looking for a cool domain name a billion years ago and came up with Backslasher since I was just learning text processing in Bash/Perl. My lifelong goal is beating the same-named horror film in search results.


Maybe the story is not that recent? I will bet on a coincidence though, pretty interesting nonetheless


what?


The story is about backslashes. The domain is backslasher.net. Seems like a fun coincidence to mention.


OP’s username is also Backslasher.


That’s quite the origin story for a username!


Why not just restrict the history to a configurable number of entries? Having like 20-50 is probably enough. It might not be 100% correct, but the effort:use ratio seems good.


We ended up doing that. We decided that while we have to store the stack, it can't hold 20 years of history.

> creating a dedicated real stack with self-imposed size limits and reporting.


With this escaping of escaped strings, they double the number of escapes each time, which means they can reach 1GB in just 30 steps.


Meanwhile, Firefox has big problems when you give it a 6MB Data URL, you get crashes or graphical glitches throughout the browser while it tries to draw the URL bar.


That seems like a weird oversight, since Firefox is normally quite good at rendering large assets. For example, I prefer viewing large SVG images (like those generated by dependency graphing tools) in Firefox, because unlike Chrome, it allows me to pan and zoom within the image.


My kneejerk rxn is, this seems more about URL length than asset rendering.


As it happens, Firefox 112 has some improvements in this area, although it's still true that handling a multi-megabyte URL will be a bit painful.


What is the CSS effect as you enter the page? Are they streaming the CSS itself, or just an animation in the browser? Very cool effect.


It's a CSS animation from the Jekyll theme. Pulled from the minified CSS:

    @keyframes intro {
        0%   { opacity: 0 }
        100% { opacity: 1 }
    }
Applied via:

    animation: intro 0.3s both;
    animation-delay: 0.15s;


Why are they saving the JSON as a string in the first place, and not as a data structure?


It's a networked application.


But they still have to serialize the JSON object containing the string.

I wonder if it was an "optimization" to avoid serializing full object every time it was transferred?


I do not know why this is a JSON object. It might have been for debuggability, although I don't agree with this choice.


Good analysis.

The previous session data probably needs to be deserialized from JSON into objects first, before adding in the new session object. The whole container can then be serialized to JSON in one shot. This avoids the recursive escape encoding.


Wonder if it's worth putting special logic into json serialization libraries that politely raises some alarms when writing like a few hundred \ in a row.


> So each screen has the previous screen the user visited, to allow the user to go “back” and get the exact screen they were in before (state, scrolling position, validation notices etc).

this is the actual problem, a clear design error if the entire screen "stack" is part of per-request or per-session state while also being unbounded

the backslashes encoding stuff is a symptom


downvotes? what?

nothing i wrote is in any way controversial?

obviously you can't embed unbounded history in session state directly?


You can store a heck of a lot more states if you don't double the storage required with each additional state. That's the actual problem.


| 1.5GB string

Genomics data has entered the chat.


It seems like just using the browser to store forward/backwards state with window.history would be easier than worrying about this server side. Maybe it was a legacy site that didn't work well with forward/back or a native app?


Nothing in the article indicates that the client here is a browser.

I read it as some service with a custom client application.


This is correct :)


> Since JSON fields have quotes, and those quotes need to be escaped when stored in a string, they are stored as \"

I don't think they do though?


They were serialising the previous object as a string (which may itself contain a key called previous, which is also a string), so quotes, backslashes and any other escape characters would be escaped again. Over multiple screens this grows quickly, e.g. this simple struct, which has only an id field, grows to 2KB in only 9 steps:

    let obj = { id: 1 }

    for (let id = 2; id <= 10; id++) {
      obj = { id, previous: JSON.stringify(obj) }
    }


The problem as stated is more complicated than it needs to be, due to forcing the serialized structure to match the in memory structure. Their in memory structure is a singly linked list. With JSON as a serialization format, that’s probably better modeled as an array with the “previous” link stored as the head (history[0]), and its “previous” stored as the tail (history[1]).

If this sounds like a cons cell, that’s what it is! It’s also a linked list. The laziness they want to achieve (only de/serialize one screen at a time per navigation forward or back) is also straightforward.

- forward: serialize the current screen, trim the leading/trailing quote off the previous serialization, insert with a comma before the new trailing bracket

- back: parse history, head is your desired “previous” screen, tail is your now-“previous” screen’s history

There’s a tiny amount of overhead to this approach, but not nearly as much as repeatedly reserializing the same string to shoehorn it into a serialized structure that doesn’t let you cheat a little bit with well known start/end characters.
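A minimal sketch of the array-headed history described above (hypothetical `forward`/`back` helpers, not the article's actual code):

```javascript
// Hypothetical helpers: the serialized history is a JSON array whose
// head is the most recent screen. "forward" splices the new screen in
// front without reparsing or re-escaping the rest of the string.
function forward(historyJson, screen) {
  const item = JSON.stringify(screen);
  return historyJson === "[]"
    ? `[${item}]`
    : `[${item},${historyJson.slice(1)}`;
}

// "back" parses once: head is the screen to restore, and the rest is
// handed back as the restored screen's already-serialized history.
function back(historyJson) {
  const history = JSON.parse(historyJson);
  return { previous: history[0], rest: JSON.stringify(history.slice(1)) };
}
```

Storage now grows linearly with history depth, since no screen is ever escaped more than once.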


Right, I was suggesting something like this (since we're already in JSON land, why leave it for string land):

    let obj = { id: 1 }

    for (let id = 2; id <= 10; id++) {
      obj = { id, previous: obj }
    }


Assuming 3 backslashes per iteration, one iteration per second - it would take 16 years to get to 1.5GB. This code problem looks deeper to me.


The growth is exponential - each backslash becomes two in the next iteration. Thus after n iterations we have 2^n - 1 backslashes, and we only need 30 iterations to hit a gigabyte (and that's assuming only one quotation mark in the original JSON).


Every quote in a JSON string requires backslashes. This is a geometric explosion. You're assuming very simple state objects.


The formula to calculate the number of needed servers is wrong, as you can't have fractional servers. You should use the pigeonhole principle[0]

[0]: https://en.wikipedia.org/wiki/Pigeonhole_principle


So round it, this is engineering, not math. The safety buffer factor accounts for any rounding errors anyways.


The formula for number of needed servers seems like a fine approximation; I'm very confused how you could use the pigeonhole principle. Perhaps to say that if you have N servers, one of them has at least total_users/N users? But assuming you have a decent methodology for balancing users across servers, and can rebalance to use a new server when added, this has basically the same effects as the formula given so I'm clearly missing why you pointed this out.


Use it how? The actual pigeonhole principle would conclude that you have multiple users on at least one server, and nothing else.

And anything that calculates individual users is the wrong way to do this math.


As others have pointed out, for the author's discussion this fact doesn't really matter.

However, what does matter is that the author is probably not estimating the average correctly, partly due to this bug. If they scaled without fixing their excessive string storage, they would probably find their estimate to be off.

Even with the fix I’d be surprised if the underlying distribution of data per user is normally distributed.


That can obviously be rounded. Nobody will really try to purchase 27% of a server. A different issue is that the formula should include "1+Safety Buffer". Otherwise it needs to be run twice, once with the safety buffer and once without. Even so, I didn't take the formulae on this page to be accurate so much as representational.


I assumed the "Safety Buffer" is really more of a "Safety Factor", and is equivalent to 1 + what you or I would call a Safety Buffer.
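Under that reading, the formula could be sketched like this (all numbers are invented for illustration; the article's actual figures aren't reproduced here):

```javascript
// Hypothetical sizing helper: safetyFactor is multiplicative (1 + buffer),
// and Math.ceil handles the "no fractional servers" objection.
function serversNeeded(users, bytesPerUser, bytesPerServer, safetyFactor) {
  return Math.ceil((users * bytesPerUser * safetyFactor) / bytesPerServer);
}

// e.g. 1M users at 1.5 MB each on 64 GB servers, with a 1.2 safety factor:
console.log(serversNeeded(1_000_000, 1.5e6, 64e9, 1.2)); // 29
```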


Yep. It's really a non-issue. It was pretty clear to me that the author knows how to use the formula.


The real formula we used to order servers was more complex than this, and included regional distribution, requirement peaks, etc. I can expand more on that process if it's interesting. However, for the context of this post, this crude approximation suffices IMO


The only difference in this case is to take the ceiling. The formula seems fine as an approximation.


The formula/metric is clearly an approximation?


It relies on common sense, something that you lack.


That's a case of not solving the problem.

Leaving aside the need to send a massive set of screen images for the moment, a severe issue with that protocol is the lack of compression. It doesn't need to be a quadtree[1], although that would significantly speed up the interface by letting the tree depth scale with the screen action being taken. Even RLE[2], Huffman[3], or any other standard string compression[4][5] would address the issue shown in the blog post.

[1] https://en.wikipedia.org/wiki/Quadtree#Compressed_quadtrees

[2] https://en.wikipedia.org/wiki/Run-length_encoding

[3] https://en.wikipedia.org/wiki/Huffman_coding

[4] https://en.wikipedia.org/wiki/Data_compression#Image

[5] https://en.wikipedia.org/wiki/Lossless_compression#General_p...
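For a string dominated by runs of a single character (like a wall of backslashes), even the simplest of these, RLE, collapses the redundancy. A minimal sketch:

```javascript
// Minimal run-length encoder: each maximal run of a repeated character
// becomes a [char, count] pair, so N identical backslashes become one pair.
function rle(s) {
  const out = [];
  let i = 0;
  while (i < s.length) {
    let j = i;
    while (j < s.length && s[j] === s[i]) j++;
    out.push([s[i], j - i]);
    i = j;
  }
  return out;
}

console.log(rle("aaab"));            // [["a", 3], ["b", 1]]
console.log(rle("\\".repeat(1000))); // [["\\", 1000]]
```

In practice any general-purpose compressor (gzip, zstd) handles this class of redundancy without custom code.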


Efficient data handling is better than spending cycles generating a lot of backslashes and then burning more cycles compressing them.


It's also good to properly manage transmission of that list of screens, even after you've fixed the backslashes. The article stated that they degraded the user experience by artificially limiting the string length instead of fixing the issue.


As an immediate stopgap before the proper fix was in place.


"Screens" is referring to the state of a multi-step UI (similar to the back/forward cache in a browser), not a bitmapped image.


Yes. In what way does that mean the transmitted data should not be compressed? Bitmapped images are not the only data structures whose size shrinks when redundancy is removed.


I think you're being downvoted because you've claimed "That's a case of not solving the problem.", but I think that actually better describes this answer. It's clever, certainly, but misses the fact that the stack of screens was never intended to be recursively escaped and changing the form that it took was the real fix rather than rubbing some compression sauce on what was never intended to be lots of backslashes in the first place. And indeed, that's what the author did: they shipped a bandaid fix while working on a more comprehensive fix, one which didn't require RLE or a quadtree (!).


You're describing the desire to avoid protocol compression when transmitting data that grows without bound. That has nothing to do with fixing the backslashes. The problem isn't backslashes. Those are just a symptom of data that can have any length.



