Hacker News

Do you know what blew my mind? When Azure performs maintenance on, for instance, PostgreSQL servers, there is no record of that activity in the activity logs, and nothing shows up in Service Health. The service was unavailable during the maintenance. And worse yet: when the database is unusable due to an incident, with the CPU maxed out and no connection succeeding, nothing is detected.

How can this be a premium IaaS/PaaS? Azure feels like the MS Teams of teleconferencing: companies buy in because they are already in the Microsoft world, not because Azure is better.



Postgres on Azure is terrible. Were you using Single Server, Flexible, or Hyperscale/Citus?


Single Server for now. We are considering moving to Flexible Server or just database-as-a-service (non-server).


Yeah, that's the one we've had a lot of problems with.

> And stronger yet when the database is unusable due to an incident the cpu is maxed out and it doesnt allow any successful connection, nothing is detected

Apparently the Azure storage system that backs this uses some sort of thread pool, and that pool can lock up or become exhausted, leading to I/O starvation. When this happens, connection attempts fail. The failed attempts can then trigger a connection storm, where all the new connections rolling in exhaust the CPU. The telltale indicator is Postgres checkpoints falling behind.
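None of this is Azure-specific, but the connection-storm dynamic described above can at least be dampened client-side by retrying with capped exponential backoff and jitter instead of hammering the server. A minimal sketch; the `connect` callable and timing constants here are hypothetical, not anything Azure provides:

```python
import random
import time

def connect_with_backoff(connect, max_attempts=6, base=0.5, cap=30.0):
    """Retry `connect` with capped exponential backoff plus full jitter,
    so failing clients don't all reconnect at once (a connection storm)."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # full jitter: sleep a random amount up to the capped exponential
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Hypothetical flaky "connection" that succeeds on the third try.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("storage stalled")
    return "connected"

print(connect_with_backoff(flaky))  # prints "connected" after two retries
```

In a real app you'd wrap your driver's connect call (or better, use a pooler like PgBouncer in front) so that a stalled backend doesn't get buried under a wave of simultaneous reconnects.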

All the while, the DB I/O metrics look completely fine, because it's not hitting an I/O limit; it's hitting thread pool exhaustion in the storage system under the instance, outside of Postgres.

You can also get clues that this is the problem by enabling Performance Insights and checking the Waits tab. If all the top waits are related to I/O activity, that's another dead giveaway that the storage system is locked up again. You can web search the names of the waits to see what causes them; AWS has some nice docs detailing Postgres waits.
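If you'd rather check from a SQL session than the portal, the same signal is visible in `pg_stat_activity`'s `wait_event_type`/`wait_event` columns (Postgres 9.6+). A sketch of the triage logic; the sample rows below are made up for illustration:

```python
# You can pull the real numbers with a query like:
#   SELECT wait_event_type, wait_event, count(*)
#   FROM pg_stat_activity
#   WHERE wait_event IS NOT NULL
#   GROUP BY 1, 2 ORDER BY 3 DESC;
# Sample (hypothetical) result rows: (wait_event_type, wait_event, count)
rows = [
    ("IO", "DataFileRead", 41),
    ("IO", "WALWrite", 17),
    ("Lock", "transactionid", 3),
    ("Client", "ClientRead", 2),
]

# If most sessions are stuck on IO-class waits, suspect the storage layer.
io_waits = sum(n for kind, _, n in rows if kind == "IO")
total = sum(n for _, _, n in rows)
if io_waits / total > 0.5:
    print(f"I/O waits dominate ({io_waits}/{total}) -- suspect the storage layer")
```

The 50% threshold is arbitrary; the point is that a healthy instance under CPU-bound load shows `Client`/`Lock`/CPU waits, while a wedged storage layer shows nearly everything parked on `IO` events like `DataFileRead` and `WALWrite`.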


Thanks for the detailed explanation! We haven't dug into it at this level of detail yet, but what you're describing sounds familiar.

Since we have premium support (P1?), we had an internal Azure PostgreSQL engineer look at the issue, and they pushed the problem back to us, blaming our app for not being built correctly. That has been ping-ponging for over a year now.

Finally, I saw this semi-acknowledgment in their health status yesterday.

Do you happen to know a proper solution? Are you waiting for them to fix this issue, or have you moved to a different DB service?

Perhaps the flexible server is better?


We've talked to the Postgres product engineers many times. The proper solution is to run away from Single Server as quickly as possible. Flexible or Hyperscale (Citus) may be good options. We're currently using Patroni to manage VM-based clusters (but still have a lot of data on Single Server).

Personally, I'd look into a third party if you want managed Postgres (assuming you don't have contractual obligations that might complicate third-party access). There are vendors like EnterpriseDB, ScaleGrid, etc. that provide various solutions (I don't have any particular recommendations here; Postgres keeps a list of managed providers by country: https://www.postgresql.org/support/professional_hosting/nort...)


The hard part for us is figuring out how to migrate away from Single Server while it's in production. It takes an eternity to migrate data off the thing; we are looking at ~24 hours just to get the data out, and then we still need to figure out how to do a live cutover or backfill.
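For planning the cutover window, a back-of-envelope throughput calculation helps set expectations. The size and dump rate below are hypothetical, not the poster's actual numbers:

```shell
# Hypothetical numbers: a ~1 TB database dumped at ~12 MB/s sustained
SIZE_GB=1000
RATE_MBPS=12
SECS=$(( SIZE_GB * 1024 / RATE_MBPS ))
echo "estimated dump time: $(( SECS / 3600 )) hours"   # ~23 hours at these rates
```

If the source supports logical replication, the usual pattern is to take that initial copy while writes continue, stream changes to the new instance until it catches up, and then do a short cutover, rather than freezing writes for the whole copy.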

Absolutely agree on a third party. Azure is just a letdown overall.


The gods are angry, I think. I woke up today and all our PG servers were unavailable. I checked Service Health, and Azure shows a global incident impacting PG servers.

And the funny thing? status.azure.com is all green. No events in the activity overview. No Service Health entry on the affected instance.

The workaround advised by Azure? Upgrade to the next plan. We already reached the maximum size. Maybe it's time for Citus. More $$$ for M$.


Premium is a price point, not a service level.


Indeed. It's the typical case where the purchasing people are not the same people who have to use it day to day.


That feels broken. Did you open a support ticket or a GitHub issue on the documentation page?



