
I think you need to work at companies that are large enough to allow for people to dedicate their time to this level of analysis. Smaller companies are too busy to have staff focussing on things like this.


It’s not “too busy”. It’s that unless you have significant scale, the return on improvements like this is often not worth the engineering time. At companies with massive scale (read: large enough), even tiny improvements can pay for themselves.


Poppycock. It doesn't need to be at this deep a level. But most orgs- small, medium, large, or xl- have essentially zero appreciation for knowing what the heck is happening in their systems. There's a couple lay engineers about with some deep skills & knowledge, who the rest of the org has no idea what to do with & is unable to listen to. Apps get slow, cumbersome, crufty, shitty, tech debt piles higher, customers accept but grow increasingly doubtful, and the company just fails to recognize, forever & ever, how close to the "thermocline of trust" it's getting, where everyone internal & external just abandons the system & all hope. This is hugely prevalent, all over the place, & extremely few companies have the management that is able to see, recognize, believe in its people, witness the bad.

This is a case of eking more out, optimizing, but the general diagnostician view is widely & deeply needed. This industry has an incredibly difficult time accepting the view of those close to the machine, those who would push for better. Those who know are aliens to the exterior boring normal business, and it's too hard for the pressing mound of these alien concerns to bubble up & register & turn into corporate-political will to improve, even though this is often a critical limitation & constraint on a company's acceptance/trust.

There are a lot of comments here dismissively passing this off. No one seems to get it. Scaling up is linear. Complexity & slowness are geometric or worse. Slowness, as it takes root, becomes ever more unsolvable.


You seem to be bringing in piles of other things here: technical debt, complexity, cruft, etc. My post is about none of those things. You also note that the prevalence of those things is because no one will listen to engineers “with some deep skills & knowledge”. At some point, you also have to assign some responsibility for failure to communicate effectively, and not just a failure to listen. Or maybe they are not listening also, and not understanding the full context? I’m not doubting that poor decisions happen, but I’m not sure it’s useful to assign unilateral blame to management based on a failure to listen to the engineers you’ve deemed special here.


> You also note that the prevalence of those things is because no one will listen to engineers “with some deep skills & knowledge”. At some point, you also have to assign some responsibility for failure to communicate effectively, and not just a failure to listen. Or maybe they are not listening also, and not understanding the full context?

This rings immensely hollow to me, & borders on victim blaming. Oh sure; telling the lower rungs of the totem pole it's their fault for not convincing the business to care- for not being able to adequately tune the business in to the sea of deeper technical concerns- has some tiny kernel of truth to it. Maybe the every-person coder could do better, maybe, granted. But I see the structural issues & organizational issues as vastly, vastly more immense impediments to understanding ourselves.

There is such a strong anti-elitism bias in society. We don't like know-it-alls, whether their disposition is classically braggadocious, or humble as a dove. We are intimidated by those who have real, deep, sincere & obvious masteries. We cannot outrun the unpleasantness of the alien, the foreign concerns, steeped in assuredness & depth, that we can scarcely begin to grok. Techies face this regularly, are habitually ostracized & distanced from. Few, very few, are willing to sit at points of discomfort to trade with the alien, to work through what they are saying, to register their concerns.

> I’m not sure it’s useful to assign unilateral blame to management based on a failure to listen to the engineers

Again, granted! There absolutely are plenty of poor decisions all around. Engineers rarely have good sounding boards, good feedback, for a variety of reasons, but the above forced alienation is definitely a sizable factor in where engineers go wrong: being too alone in deciding, not knowing or not having people to turn to to figure shit out, to get not just superficial but critical review that strikes at the heart.

This again does not dissuade me from my core feelings on my core point. I think specifically most companies are hugely unable to assess the health of their own products & systems, unable to gauge the decay & rot within. Whether it's slog or real peril, there are few rangers tasked with scouting the terrain & identifying the features & failures. And the efforts at renewal/healing are all too often patchwork, haphazard, & random, done as emergency patches. These organisms of business are discorporated, lacking cohesion & understanding of themselves & what they are. Having real on-the-ground truthfinders, truthtellers, assessors, monitors- having people close to the machine who can speak for the machine, for the systems- that is rarely a role we embrace. All too often we simply rely on the same chain of management that is also responsible for self/group-promotion & tasking & reporting, which has far too many conflicted interests for us to expect it to deliver these more unvarnished, technically minded views.


For what it's worth, I strongly agree with this perspective. I'd invite you to join my gang of post-apocalyptic systems engineers [1].

[1] https://www.usenix.org/system/files/1311_05-08_mickens.pdf


This.

Cheap money and small but easily observable gains made everyone who didn't know better appreciate linear scaling improvements.


It's probably cheaper for a small company to just add another machine instead of hiring a whole team to do this.


I think you and the parent comment are both correct. Engineers who are skilled enough to do this type of analysis are expensive, and another VM/instance is a few hundred dollars a month at most.


At the very small end of the scale this is very true. It doesn't take a particularly huge amount of traffic, though, before it's worth spending a few weeks of engineering time to save $50K+ on annual hosting costs. It's just difficult to make anyone care.


I think you’re considering only the literal cost, and not the opportunity cost (which is typically much higher, and what you’re actually using to make investment decisions). Suppose you have an engineer and they’re paid $100k. Now you’d be tempted to say that anything that takes less than six months and saves more than $50k is worth it. But that’s not true at all.

For one, that engineer is actually worth much more than $100k/yr to the company; it costs a lot more to hire that person and keep them busy (recruiting, training, an office, a manager, product managers, project managers, etc.). But more importantly: what are the other things they could be doing? Small companies are rarely thinking about micro-efficiency because they are trying to change and grow their products. If this engineer is able to build a feature that will help them grow X% faster, that can have a massive multiplier effect on their prospects (and valuation, for which that $50k saving makes zero difference). Those are the things you’re comparing against, which is why the opportunity-cost bar for pure $$-saving improvements is often much higher than it seems.
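
As a back-of-the-envelope sketch of that trade-off (every figure here except the $50k saving is invented purely for illustration):

    # Rough opportunity-cost comparison for one engineer-quarter.
    # All numbers except the $50k hosting saving are hypothetical.

    FULLY_LOADED_COST = 200_000        # assumed yearly cost of the engineer (salary + overhead)
    QUARTER_COST = FULLY_LOADED_COST / 4

    hosting_savings_per_year = 50_000  # the direct saving from the optimization work

    # Alternative: spend the quarter on a growth feature instead.
    annual_revenue = 2_000_000         # hypothetical small-company revenue
    extra_growth = 0.02                # assume the feature adds 2% growth
    feature_value_per_year = annual_revenue * extra_growth

    print(f"A quarter of engineering time costs roughly ${QUARTER_COST:,.0f}")
    print(f"The optimization returns ${hosting_savings_per_year:,.0f}/yr")
    print(f"The growth feature returns roughly ${feature_value_per_year:,.0f}/yr,")
    print("before any compounding or valuation-multiple effects")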


Yup. One of my previous jobs was at a very tiny company where everyone was busy, yet my job was basically this type of performance optimization. The reason? We were building a real-time processing product and we were not hitting our frame time, meaning the product could not be shipped. In those situations, convincing the PM that you are doing important performance work is trivial.


And for most companies, the issue is not in the lower levels; they will have tons of suboptimal software that needs years of work to make more performant or to rewrite before you even get to the point where you need to do CPU level analysis.


I wonder what the difference would be vs. "just" running 3-4 copies of apps on a node.

We did that (ops, not developing the actual app) a few times where the app scaled badly.


You need more staff to manage a cluster than a single machine, though. Over the course of 6 months you'll probably spend more time maintaining the cluster (with all its overhead) than it would take to do an optimization analysis that nets you 3x performance.


I’ve been at the largest companies on earth for about 20 years. They are precisely the ones who can’t countenance such work.


Most of the blog posts I see that are this detailed are usually from larger companies.


I don’t think it’s related to size but rather to corporate culture, and to what behavior leads to profitability. Netflix is all about reducing cost in their technical infrastructure while providing a highly consistent, high-bandwidth, relatively low-latency experience. Exxon Mobil couldn’t care less. There are small pockets of Amazon that care, but most of Amazon is product-management driven towards a business goal where costs and performance are only relevant when they interfere with the product goal. It’s less about size and more about priority for achieving the core business objectives. Companies that need this behavior but don’t prioritize it will sooner or later be overtaken by a Netflix. Companies that do not need this behavior but do prioritize it will be overtaken by a product-focused competitor while their engineers are flipping bits to optimize marginal compute utilization.


Seems like poor analysis on their part then. If they're leaving 3.5x perf on the floor, they're not spending their money very well.


Leaving 3.5x perf of what on the floor though?

We're working on applications that are either waiting for the disk, waiting for the DB, waiting for some HTTP server, or waiting for the user.

None of our customers will notice the difference if my button click event handler takes 50ms instead of 10ms, or if the integration service that processes a few dozen files per day spends 5 seconds instead of 1 second.

I'll easily trade 5x performance in 99% of my code if it makes me produce 5x more features, because most of the time my code runs in just a few milliseconds at a time anyway.

Of course, I'm wary of big-O; that one will bite ya. But a mere 3.5x is almost never worth chasing for us.
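
As a toy sketch of why I worry about big-O but not about a mere constant factor (the functions and sizes are made up purely for illustration):

    import time

    # Toy illustration: a constant-factor slowdown vs. an algorithmic one.
    # Both functions are hypothetical stand-ins.

    def linear_sum(values):
        # O(n): even a 3.5x constant-factor slowdown would still scale linearly
        total = 0
        for v in values:
            total += v
        return total

    def quadratic_sum(values):
        # O(n^2): recomputes a prefix sum for every element
        return sum(sum(values[:i + 1]) for i in range(len(values)))

    for n in (1_000, 5_000):
        data = list(range(n))
        start = time.perf_counter()
        linear_sum(data)
        t_lin = time.perf_counter() - start
        start = time.perf_counter()
        quadratic_sum(data)
        t_quad = time.perf_counter() - start
        print(f"n={n}: linear {t_lin:.5f}s, quadratic {t_quad:.5f}s")

    # Growing n by 5x makes the quadratic version ~25x slower,
    # which quickly dwarfs any fixed 3.5x constant factor.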


Netflix is not doing this work to improve user latency; they are doing this work to minimize cost:

> At Netflix, we periodically reevaluate our workloads to optimize utilization of available capacity.

The idea being that if you pay for fewer servers, you spend less money.


And at their scale, this makes sense. No one is going to do this work if it means they run 2 servers instead of 6.


> We're working on applications that are either waiting for the disk, waiting for the DB, waiting for some HTTP server, or waiting for the user.

If you try to be good at waiting on many things, you can use one machine instead of a hundred
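
A minimal sketch of what "being good at waiting" can look like (asyncio and the request count are my own illustrative choices, not anything specific to this thread):

    import asyncio

    # One process waiting on many slow I/O operations concurrently.
    # The sleep below stands in for a slow disk/DB/HTTP call.

    async def fake_io_call(request_id: int) -> str:
        await asyncio.sleep(1.0)  # pretend this is a 1s network/DB wait
        return f"response {request_id}"

    async def main() -> None:
        # 1,000 concurrent waits still take ~1 second of wall time,
        # because the process spends almost all of it just waiting.
        results = await asyncio.gather(*(fake_io_call(i) for i in range(1000)))
        print(f"handled {len(results)} requests in one process")

    if __name__ == "__main__":
        asyncio.run(main())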

> None of our customers will notice the difference if my button click event handler takes 50ms instead of 10ms

They absolutely will, what


> If you try to be good at waiting on many things, you can use one machine instead of a hundred

We can't run on one, because customers run our application on-prem.

> They absolutely will, what

How can you be so sure?


Have you tested this? 40ms of additional interaction delay is extremely noticeable.


You're missing the point.

Sure, 10ms vs 40ms is measurable, and for the keen-eyed noticeable. But if you're only pressing the button once every 5 minutes, it doesn't matter. Similarly, if the button triggers an asynchronous call to a third-party webservice that takes seconds to respond, it doesn't matter. And so on.

Of course, for the things where users are affected by low latency, we try to take care. But overall that's a very, very small portion out of our full functionality.


A lot of businesses reasonably use Python and Ruby, for example.


If you can run 1 server instead of 3 though... that means no ansible, no kubernetes, no wild deployment strategies, 1/3 the likelihood of hardware failure, etc


This isn't true, at all. We are not "high availability" by any means, but running on one server has significant risks, _especially_ if you don't use some form of automated provisioning. A destructive app-logic bug is far more likely than a hardware failure, and in both cases the impact on your service is significant if you have one instance, but likely manageable if you have more than one.


Why wouldn't you use ansible to manage one server?

It's a great tool with a low learning curve that just requires SSH.



