
Makes me sad that running your own instances is now an "elephant in the room." No pride in old-school do-it-yourself nerditry these days :/


Getting HA right is hard. DIY-ing it incurs risk, possibly deliberately, out of Not-Invented-Here-ism.

Source: PostgreSQL DBA for over a decade; have built multiple HA environments; have seen many ways of "doing it wrong", and how those can end up biting their creators.


On the other hand:

With hosted Postgres, when a failure does happen, isn't it much harder to get at the log files? They seem extremely useful to diagnose the problem and make sure it doesn't happen again, as the article shows. What's your experience here, can you get at logs easily with hosted Postgres offerings?

And it seems the only way to get reliable Postgres HA for everyone, and to weed out the bugs, is if more people run Postgres HA themselves. For example, I find Stolon and Patroni great, but I would be more relaxed about them if they had 100x more users.


We aren't using hosted postgres (much, yet). We provision EC2 instances and self-manage Postgres on them. Failover is scripted, and manually invoked as needed.

None of us trust any of the automated failover solutions enough to use them. We want human judgement in that loop, even if it means being woken at 3AM to push the button. It's that hard to get right.
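For illustration, here's roughly what such a manually invoked failover helper can look like; the hostnames, data directory, and traffic-switch step are hypothetical and entirely site-specific, and it assumes ssh lands in an account that can run pg_ctl:

    #!/usr/bin/env python3
    """Sketch of a human-triggered failover script. Hostnames, paths,
    and the traffic-switch step are hypothetical placeholders."""
    import subprocess
    import sys

    OLD_PRIMARY = "db-primary.internal"      # hypothetical hostname
    NEW_PRIMARY = "db-standby.internal"      # hypothetical hostname
    PGDATA = "/var/lib/postgresql/data"      # hypothetical data directory

    def run(host, cmd, check=True):
        """Run a command on a remote host, echoing it first."""
        print(f"[{host}] {cmd}")
        return subprocess.run(["ssh", host, cmd], check=check)

    def main():
        # 1. Try to stop the old primary so there can't be two writable
        #    masters. If the box is unreachable, fence it another way
        #    (power it off, yank its security group) before promoting.
        run(OLD_PRIMARY, f"pg_ctl -D {PGDATA} stop -m fast", check=False)

        # 2. Promote the standby so it starts accepting writes.
        run(NEW_PRIMARY, f"pg_ctl -D {PGDATA} promote")

        # 3. Repoint application traffic at the new primary: DNS flip,
        #    haproxy reload, connection-string change... (site-specific).
        print("Now repoint clients at", NEW_PRIMARY)

    if __name__ == "__main__":
        if input("Really fail over? Type 'yes': ") != "yes":
            sys.exit("aborted")
        main()

The script only makes the steps repeatable; the decision to run it stays with a human.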

Just one incident like The Fine Article's would exceed our entire infrastructure's total downtime for the rolling year, and we have hundreds of postgres instances.

Done wrong, automated failover is a net increase in risk. And, in case my thesis is somehow unclear, it's hard to get right.


It's not hard. The problem is that operations teams rarely exercise failures. Configure a test HA cluster in the lab, and test it. If it works, push it to production. The production system should then be continuously tested with real failures to see whether the failover mechanisms actually take over or not.

OF COURSE everything is going to work in the lab, BUT MAYBE there is some other corner case in production that you haven't considered yet. --- Louis C.K., after a sleepless night of switching the primary DB to secondaries.
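For what it's worth, a drill like that can be largely automated too; here's a minimal sketch, assuming psycopg2, hypothetical hostnames, and a placeholder kill mechanism:

    #!/usr/bin/env python3
    """Failover drill sketch: stop the primary, then verify the standby
    actually takes over. All connection details are hypothetical."""
    import subprocess
    import time
    import psycopg2  # assumes psycopg2 is installed

    STANDBY_DSN = "host=db-standby.internal dbname=postgres user=check"

    def standby_promoted():
        """True once the standby has left recovery (i.e. accepts writes)."""
        try:
            with psycopg2.connect(STANDBY_DSN, connect_timeout=3) as conn:
                with conn.cursor() as cur:
                    cur.execute("SELECT pg_is_in_recovery()")
                    return cur.fetchone()[0] is False
        except psycopg2.OperationalError:
            return False

    # 1. Simulate the failure (here: hard-stop the primary's service).
    subprocess.run(["ssh", "db-primary.internal",
                    "sudo systemctl stop postgresql"], check=True)

    # 2. Measure how long the HA machinery takes to promote the standby.
    start = time.time()
    while not standby_promoted():
        if time.time() - start > 120:
            raise SystemExit("standby never promoted -- failover is broken")
        time.sleep(1)
    print(f"standby promoted after {time.time() - start:.0f}s")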


It's still "respectable" to run your own n=1 Postgres instance, maybe with WAL-E backup. It's sensible, as well, to create your own read-replicas to scale OLAP; and even to do your own shared-nothing sharding across regions. These are all "set and forget" enough that they can be the responsibility of your regular devops.

But, when you get to the point where you need multi-master replication, you're making a mistake if you aren't dedicating an ops person (i.e. a DBA) to managing your growing cluster. If you can't afford that ops person, much better to just pay a DBaaS provider to handle the ops for you, than to get hosed when your cluster falls apart and your week gets shot putting it back together.


Right!

A thing that scares me is anyone saying they're running their own HA cluster (not single instance) for cost reasons. Infra people are not cheaper than the hosted solutions (Amazon RDS, Google Cloud SQL, Heroku Postgres).


That’s a blanket statement that has very little basis in reality. Hosted Postgres is never going to give you the performance you need for low latency deployments.


RDS i3 on bare-metal is in preview now, so it's not too far off

https://aws.amazon.com/blogs/aws/new-amazon-ec2-bare-metal-i...


Oh for sure!

My claim is that you need to hire some expensive people if you want that performance, not that there aren't reasons to run your own database instances!


Then you are running it yourself for latency reasons, not (just) the cost reasons in the GP's scenario.


Cause AWS infra is managed by magic monkeys.


I love nerding out over this personally, but if you're a startup, given the plethora of well-managed offerings, you're frankly foolish to invest resources in this. Even if you eventually reach the point where it makes financial sense to hire a full-time ops person or DBA, the opportunity cost of having a smart engineer (and it does take a smart engineer to manage a multi-master database) work on infrastructure instead of your actual product is just stupid.

How many startups have failed because they spent too much money building "cool, nerdy, infrastructure" instead of just building a product?


The danger here is differentiating between "the infrastructure" and "the product". Useful database and/or infrastructure work _is_ "building the product".

There's nothing necessarily wrong with using pre-baked or hosted components when they fit the bill, but pretending like they're unrelated concerns is going down a bad road. A lot of recent fads are based on this self-centered, lazy fantasy from devs that $LANGUAGE_OF_THE_MONTH is the only thing that matters and it's a dark, sad situation.

There's a pretty consistent inverse relationship between technical quality and popularity because time and money spent on technical/engineering resources is time and money not spent on marketing and sales resources that bring cash in the door.

Ever wonder why, with a few exceptions, it never seems that the products everyone knows about are comparable to what you can find after a little bit of research online? This is why. The people who are building good stuff are spending the time and resources on building good stuff, whereas the people who aren't are spending the time and resources on making sure they're the path of least resistance.

So in that sense, yes, you are right. It is dumb to spend any time or money on anything other than the bare minimum skeleton needed to allow your sales people to start pimping your stuff.

Whether or not people recognize your product's superiority is more or less irrelevant, because first, they won't, and because second, the extra effort it takes to swim upstream and use your product instead of the mainstream solution won't really be worth the gains for most people no matter how much better it is. You can probably rattle 15 examples of software off the top of your head that is just like this. PostgreSQL is actually a great example of it.

Amazon has run amok feeding people who don't really deserve the title "developer" a load of crap about how you can click buttons in their wizards and be like a super-real grown-up coder-hero without having to learn any of that outdated command-line mumbo-jumbo. It's 2017 after all! Don't worry about that gobbledygook hocus-pocus that the smelly old man in the network closet keeps muttering under his breath. That's for smelly old people and third-party Amazon contractors. You have Very Important JavaScript to write, just as soon as you finish dragging Legos--err--"Mega-Elastic Dynamo-tastic Sumerian-Beanstalkinator Units" around on AWS.

How have other professions handled this issue? After all, most people wouldn't know the difference between a safe bridge and an unsafe one, and most people wouldn't know the difference between safely-prepared food and unsafely-prepared food (until they've already eaten it). The profit incentive is to put the bare minimum in place and then sell sell sell.

We may not like the heavy hand of regulation that will clamp down on the software industry, permanently and officially gate it behind the blood-sucking ivory tower of the academic priesthood, and strip it of all vitality and creativity, but with the attitudes that have become prevalent over the last few years, we have no one to blame but ourselves.


Your comment makes me feel warm and fuzzy; all I see these days is $UNNEEDED_ADDED_COMPLEXITY. People genuinely want to get their jobs done, but they don't pause for even a second to analyze whether that extra lib is gonna pull in a bunch of other dependencies which might break the whole thing in a million different ways down the road.

Next month a new super-magical, component-ized, gulp-ified, sass-y JS router library is gonna come out, and then I'm gonna be back in that position of the "old guy in the shop who is always convincing everyone not to upgrade to the bleeding-edgiest version available". And I feel it's an epidemic.


But seriously, what value does the infrastructure administration provide? If my SaaS app is built with JavaScript, why should I waste any time at all managing PostgreSQL WAL replication when I could be adding a new feature that $BigCo will pay me for?

AWS, Azure, and GCP give developers the ability to only worry about their code, and all the hard stuff like database replication, load balancing, security, secrets management, container orchestration and code deployment are handled for them. Linking lego blocks together doesn't make them a coder-hero, it just makes them more productive because the other 80% of the job is done for them, by a cloud provider that already knows how to do it the right way, at planet scale.


Like I said at the top, it's not that hosted or pre-baked solutions are bad, but it's the attitude that it's "frankly foolish" to expend any effort on it when we could blindly trust $cloudOverlord instead.

You have to know how things work at a reasonably detailed level to know whether $cloudOverlord's solution is appropriate or not. If you know that, then you can make an informed decision as to whether it's better to go with them, and the reality is that in many cases there's no real reason to prefer $cloudOverlord's solution. It is, quite often, very expensive, not to mention more complex, and it entrenches you further into dependence on a third party over which you have no influence or control, and whose business model is finding new ways to charge you [more] rent. It also frequently constricts the availability of patches, features, configs, and upgrades that would be available and useful in a self-administered setup.

As for planet scale, well, there was a deploy on our 100% AWS-backed infrastructure last week that went horrendously and everyone had to pull all-nighters through the weekend trying to troubleshoot the performance problems. "Planet scale" is not something automatically granted by paying through the nose for AWS. Like it or not, you have to have someone who knows what they're doing to get good performance at scale. (Our issues were caused primarily by management's refusal to accept this and preference to believe that just waving the money stick at Amazon would make all problems disappear, since AWS is a super-neato thing from a "planet scale" company.)

The issue that's become very blatant over the last few years is that you get a lot of people who assume that anything except the code they've personally written is a magical fairy box that does everything they want automatically, and then they get mad when they learn that in fact, you still have to understand the tools reasonably well just to use them properly, let alone to debug or troubleshoot issues that may be occurring within them.

Engineering time spent understanding, formulating, and composing the core building blocks of a product is likely to be more important to a project's lifecycle than the time spent writing easily-replaceable business logic in the top layer of the application.

That people are so eager to outsource these fundamental building blocks not out of simple technical expediency ("they're a better WAL wrangler than me and it will take less time for them to do it") but rather out of a sentiment that it's "stupid" to commit the time of "smart engineers" to infrastructure administration and/or design is extremely frustrating.

I understand that within the context of a startup, the VCs want the barebones version to sell ASAP so that they can "test the market" before they give their early-20s sucker-- uh, founders -- more money to burn. Some people may have extrapolated this impulse outside of any context in which it could be considered either responsible or reasonable.



