This rhymes with my overall impression of Cosmos. It took us a while to see through the smokescreen because when talking to Microsoft support and representatives it is the Best Thing Ever and they sound so confident about it. But it really is a beta demo product sold with an alpha premium price tag.
If your traffic pattern is exactly right, and you always scale traffic up and never ever down and do not have spikes, I guess it is probably OK. The main problem is the docs are (or, at least were 2 years ago) not clear about all the caveats and restrictions but pretend it is a generic database that just works. So one has to discover all the caveats oneself.
Microsoft thinks the exact workings of the partitioning are something that should work so well you don't need to know them in detail. But if your use case is slightly off, you end up really needing to know. I know at least one team that routinely copies all their data from one Cosmos instance to another and switches traffic over to the copy just to get a partitioning reset; it is one thing to have to do it, another to discover in production, with no prior warning, that it has to be done.
Also: The ipython+portal+Cosmos security meltdown from 1 1/2 years ago alone should be reason to look elsewhere.
(No, not a competitor, just have spent way way way too much engineering time moving first on and then off Cosmos and yes I am bitter)
This matches my (fairly extensive) experience with CosmosDb. It’s truly an awfully engineered product.
I cannot comprehend what organizational process led not only to its creation, but also to its continued existence.
Its existence is a constant reminder that the road to hell is paved with good intentions, and that you'd better do your due diligence before adopting any piece of tech that's not easy to replace. Even if it comes with a pedigree.
After suffering through the AWS SimpleDB disaster 10 years ago, I will never use any of the cloud providers' harebrained databases ever again. I'll use bog-standard Postgres or MySQL if they host it, but nothing else.
This was both shocking and not surprising at all, all at the same time.
It’s worth noting how cloud vendors have fallen back on tried and true databases such as Postgres (mind you, often replacing the guts of those databases with new implementations.. but still).
> It’s worth noting how cloud vendors have fallen back on tried and true databases such as Postgres (mind you, often replacing the guts of those databases with new implementations.. but still).
Have they? Amazon created Aurora from scratch, with compatibility layers for different database engines (MySQL, Postgres), and GCP did the same with Spanner, which I wouldn't call "falling back".
Yes, that's true - those products are definitely more about trying to displace the open source tech by borrowing the interface. I still think it is a plain admission of the popularity of the original tech (e.g. MySQL and Postgres). For sure, credit where it is due for building entirely new engines - but it's such a common pattern to "borrow" the open source interfaces and reimplement the core engine.
That said, you didn't mention things like RDS or Google Cloud SQL, which are cloud-based versions of standard open source dbs.
I liked SimpleDB. They kept supporting it beyond my expectations, even after lots of other options were available and it was long deprecated. Curious what the disaster was.
For you young uns, back in the 1990s Microsoft was so convinced that NTFS made file fragmentation impossible that they didn’t provide a way to defrag for a very long time.
In fairness NTFS does usually make fragmentation a non-issue outside of pathological cases¹, much like ext2/3/4², which didn't have defrag tools³ for a long time either.
FAT12/FAT16/FAT32's earliest-free-block-no-matter-what allocation method⁴ trained people to believe that fragmentation is a universally rampant problem, but with a better designed allocation strategy it really isn't.
----
[1] for instance simultaneously growing files on near-full volumes
[2] more so ext4 with delayed allocation turned on
[3] at least not ones without attendant "are you really sure you want to use this on data you care about?" warnings
[4] I wonder if anyone retrofitted a brighter heuristic into an implementation of these or exFAT. Could be a (not massively useful but) interesting learning exercise for a budding OS developer.
> It could have been easy. We could have used Postgres.
And then all that would have been left is to turn that Postgres into a globally distributed DB and manage it (with something like Citus). Postgres' native scale-out capabilities (not even talking about globally distributed ones) are basically nonexistent, so you need third-party tooling.
> Microsoft thinks the exact workings of the partitioning are something that should work so well you don't need to know them in detail. But if your use case is slightly off, you end up really needing to know. I know at least one team that routinely copies all their data from one Cosmos instance to another and switches traffic over to the copy just to get a partitioning reset; it is one thing to have to do it, another to discover in production, with no prior warning, that it has to be done.
AWS DynamoDB used to do that. When I was working for a team in AWS, we discovered an unexpectedly bad hash choice, had a hot partition, and ended up with a crazy number of partitions and a need for really high r/w allocation on it. The only option at the time was to roll over to a new table (which we could do, fixing our hash choice, thankfully without too much hassle).
They fixed that stuff a few years ago now so most (all?) of my "here be dragons" concerns about DynamoDB have been addressed.
Well, even if WE do not have 64TB tables (where one needs low latency for the entire table), I believe there are plenty of other mature NoSQL offerings out there? Whether you go to the other cloud vendors or self-host.
Cosmos is at best unfinished.
I used Google Datastore for years and it was finished in the sense that it always did what it said on the tin and didn't cause lots of problems for you (although I think with higher latency than Cosmos, so it may not be for every use case -- there are also Bigtable and Spanner).
And I don't believe Google would ever have done something like deploying a shared multi-tenant Jupyter system with full access to all customers' DBs. Microsoft actually did that.
As I said, the problem is mainly how Cosmos is marketed: as a general-purpose NoSQL database. If it were marketed with all the caveats you learn about after you go live, and people only used it if they really had to, it would be a bit different.
Though realistically it would drive most people to other cloud providers.
I am proud of my team at AWS that took backwards compatibility very seriously. Even introducing a temporary backwards incompatibility was a no-go in design reviews.
We had a service that had a list API that was paginated. It returned a nextToken to specify the start of the next page of the results.
Internally we were doing a database migration to a completely different system, migrating one customer at a time. The problem: if a customer was in the middle of a list call, holding a nextToken generated by the previous database system, then after migration to the new database that older token could not resume from exactly where it would have had the customer not been migrated, because the new system did not have all the information needed to resume at the exact offset.
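To make the failure mode concrete, here is a minimal sketch of one way such a token scheme can work (this is not their actual implementation; all names are hypothetical): the nextToken encodes which backend minted it plus an opaque cursor, and a token minted by the old database cannot be mapped to an exact offset in the new one.

```python
import base64
import json

# Hypothetical sketch of an opaque pagination token that records which
# backend minted it, so the service can detect a stale token post-migration.

def make_next_token(backend: str, cursor: str) -> str:
    """Encode the source backend and its cursor into an opaque nextToken."""
    payload = json.dumps({"backend": backend, "cursor": cursor})
    return base64.urlsafe_b64encode(payload.encode()).decode()

def resume_listing(next_token: str, current_backend: str) -> dict:
    """Decide how to resume a paginated list call after a possible migration."""
    payload = json.loads(base64.urlsafe_b64decode(next_token))
    if payload["backend"] == current_backend:
        return {"action": "continue", "cursor": payload["cursor"]}
    # The token came from the old database: the new backend cannot map the
    # old cursor to an exact offset, so the naive choices are to error out
    # or to restart the page and risk returning duplicate items.
    return {"action": "restart_page_with_possible_duplicates"}
```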
One option was to throw an error and let the customer retry the request; another was to return some possibly duplicate items in the next page. Neither was good enough for the engineers or the PMs, so instead we decided to take on a bunch of additional work so that no customer would be impacted. This was 10+ weeks of additional work for the whole team, but we did it because culturally it felt like the right thing to do for the customer.
Note that the impact would have been tiny, if there was any at all. A customer would have had to be in the middle of a paginated request, our migration system would have had to migrate that particular customer at that exact time, and the impact would have been a few possibly duplicate items. But we didn't know the actual impact of those temporary duplicates for a single call, and we all agreed that breaking changes like this are unexpected and cause customers to lose trust in us.
I read stuff like this about AWS and I wonder what the guys over at Azure are smoking. I can deploy trivial architectures with ordinary resources and hit four or five “broken by design” issues and at least one or two Heisenbugs that aren’t reproducible in other identical setups due to fun things like “legacy” subscriptions — an invisible internal setting.
I previously worked at Microsoft in an unrelated area, but had a friend who worked on CosmosDB, and later a different part of Azure.
There are some Microsoft products I genuinely love, but some are terrible. To an extent it is a reflection of the inconsistency in internal teams. Culture, values, skill level, and quality bar are all over the place depending on who you talk to, even compared to other large companies.
From what I had heard, CosmosDB was not a healthy team, and I would not consider using it as a product.
With the funding MS has, how could this be turned around? Is it necessary to build a new team, or is it enough to replace the leadership and let them bring in new members?
It's a good question well above my pay grade :). Changing culture isn't an easy problem.
I suspect the way Microsoft does interviewing and performance management (very local to the specific team) contributes to the inconsistency.
MSFT has also been fairly open with its employees that it does not try to compete with the likes of Google, Meta, or even Amazon in terms of compensation. So it isn't really trying to get the best engineers, so long as it can continue to print money.
There are still folks there who are incredible, but the floor is shockingly low at times. Folks will self-select, so you will then get teams which are more homogeneously good or bad.
It’s interesting that Microsoft doesn’t take the same attitude with Azure, given that they’re famous for their extreme care around backwards compatibility for Windows.
Perhaps it’s because Microsoft is still catching up with Azure, and as such prefers moving fast and occasionally breaking things?
I work for Azure (a different product) and we do care about not breaking customers. We have to keep GA APIs around for years even as they’re being deprecated. The AzureRM module was deprecated over a year ago, I think, and it will keep working until 2024.
This really feels like a bug to me, and it probably didn’t trip monitors for two reasons:
1) Given that the portal does the right thing (and probably the ARM template samples too), this was a very small percentage of traffic.
2) The failures would look like client-side errors, making them less likely to trip monitors.
*PS my comment is not an official response (I don’t even remotely work on CosmosDB) but I’ll forward this internally
As someone whose job involves maintaining uptime of a critical system that's dependent on Cosmos DB, this sort of thing is scary. When there have been other reliability issues with Cosmos before, we've not had an understanding customer base, and it feels very out of my control.
I'm finding a lot of the reliability guarantees of Azure PaaS services are overblown or come with big caveats when you start to work with them in a serious way. For example I've had some bad reliability issues with Azure Functions not firing, or their premium function runtimes becoming unresponsive. And it seems like that's just the start of the outstanding issues with them https://github.com/Azure/azure-functions-host/issues
I think people need to look more carefully at these PaaS guarantees and look at what that 99.999% reliability Microsoft are claiming actually means.
After using AWS for 3+ years and GCP for about 6 months, I can say Azure significantly lags behind them. Their service reliability is astonishingly poor. I think our most recent issue was 67 VM failures in a single month in a VMSS (of 55 nodes) backing AKS (Azure Kubernetes Service). The health events said there was some kind of "remote storage error" making the VMs unhealthy.
That was a couple of months after the Ubuntu/systemd incident. Azure's "blessed" Linux image is Ubuntu, and it has unattended-upgrades enabled, including on managed infrastructure like AKS (where you can't turn it off without dirty hacks). A bad Ubuntu update caused hosts to lose their DNS settings from DHCP config, leaving massive numbers of machines in partially broken states.
Do you know what really got me? When Azure executes maintenance on, for instance, PostgreSQL servers, there is no record of that activity in the activity logs and nothing noted in Service Health. The service was unavailable during the maintenance. And worse yet, when the database is unusable due to an incident, with the CPU maxed out and no connection succeeding, nothing is detected.
How can this be a premium IaaS/PaaS? Azure feels like the MS Teams of teleconferencing: companies buy in because they are already in the MS world, not because Azure is better.
Yeah, that's the one we've had a lot of problems with.
> And worse yet, when the database is unusable due to an incident, with the CPU maxed out and no connection succeeding, nothing is detected
Apparently Azure's storage system that backs this uses some sort of thread pool and the thread pool can lock up/become exhausted leading to I/O starvation. When this happens, connection attempts fail. When the connection attempts fail, it can lead to a connection storm where all these new connections rolling in exhaust the CPU. The telltale indicator is Postgres checkpoints getting behind.
All the while, the DB I/O metrics look like they're completely fine, because it's not hitting an I/O limit; it's hitting thread pool exhaustion in some storage system under the instance, outside of Postgres.
You can also get some clues that this is the problem by enabling Performance Insights and checking the Waits tab. If all the top waits are related to I/O activity, that's another dead giveaway that the storage system is locked up again. You can just web search the name of the waits to see what causes them. AWS has some nice docs detailing Postgres waits.
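If you don't have a managed waits view handy, a rough equivalent is to sample the wait columns in pg_stat_activity yourself. A minimal sketch in Python (connection details are placeholders):

```python
import psycopg2

# Connection details are placeholders for an Azure Database for PostgreSQL server.
conn = psycopg2.connect(
    "host=myserver.postgres.database.azure.com dbname=postgres "
    "user=admin@myserver password=<secret> sslmode=require"
)

with conn, conn.cursor() as cur:
    # Sample wait events across active backends (pg_stat_activity, PG 9.6+).
    cur.execute("""
        SELECT wait_event_type, wait_event, count(*)
        FROM pg_stat_activity
        WHERE state <> 'idle' AND wait_event IS NOT NULL
        GROUP BY 1, 2
        ORDER BY 3 DESC;
    """)
    for wait_type, wait_event, n in cur.fetchall():
        # A pile-up of IO-type waits (e.g. DataFileRead) while the instance's
        # own I/O metrics look fine matches the pattern described above.
        print(f"{wait_type}/{wait_event}: {n} backends")
```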
Thanks for the detailed explanation! We haven't looked into it in this much detail yet, but what you are describing sounds familiar.
Since we have premium support (P1?), we had an internal Azure PostgreSQL engineer look at the issue, and they pushed the problem back to us, blaming our app for not being built correctly. That has been ping-ponging for over a year now.
Finally I saw this semi-acknowledgment in their health status yesterday.
Do you happen to know a proper solution? Are you waiting for them to fix this issue or moved to a different db service?
We've talked to the Postgres product engineers many times. The proper solution is to run away from Single Server as quickly as possible. Flexible Server or Citus Hyperscale may be good options. We're currently using Patroni to manage VM-based clusters (but still have a lot of data on Single Server).
Personally, I'd look into a 3rd party if you want managed Postgres (assuming you don't have contractual obligations that might complicate 3rd party access). There are vendors like EnterpriseDB, Scalegrid, etc. that provide various solutions (I don't have any recommendations here; Postgres has a list of managed providers by country: https://www.postgresql.org/support/professional_hosting/nort...)
The hard part for us is figuring out how to migrate away from Single Server while it's used in production. It takes an eternity to migrate data away from the thing; we are looking at ~24 hours just to get the data out, and then we need to figure out how to do a live cutover or backfill.
Absolutely agree on a third party. Azure is just a let down overall.
The gods are angry, I think. Woke up today and all our PG servers were unavailable. Checked Service Health and Azure shows a global incident impacting PG servers.
And the funny thing? Status.azure.com is all green. No events in the activity overview. No Service Health alerts within the affected instance.
The workaround advised by Azure? Upgrade to the next plan. We already reached the maximum size. Maybe it's time for Citus. More $$$ for M$.
> I think people need to look more carefully at these PaaS guarantees and look at what that 99.999% reliability Microsoft are claiming actually means.
Hypercloud managed-service SLAs: all the fun of novel, complex technical solutions in production + the transparency of cast iron + the pedantry of a contract lawyer.
Which leaves exactly zero people who are excited to be at that intersection.
Back around 2017-2018 unannounced breaking changes in Azure services were so common, my team coined a term "Cloud Monday" (echoing Patch Tuesday) because usually our integration tests would break between 8-10AM Pacific Time on Mondays. (They did eventually become far less frequent.)
Azure being a shade of blue, you should've called it "Blue Monday"[0]. Could've even rigged up something to play the song when integration tests mysteriously failed. How does it feel/ to treat me like you do?/ When you've laid your hands upon me/ and told me who you are?...
So, as someone who was in the midst of planning a migration of a multi-billion $ revenue platform to using CosmosDB...
Alternatives? LOL
Basically we're just looking for geo-redundancy and high read & write throughput. Our intention was to leverage Azure Event Grid/Kafka Connect for event streaming to coordinate writes between Redis (cache), Cosmos (transactional DB), and our systems of record (legacy). The majority of reads/writes would occur via our API, but some would occur via the systems of record, hence the use of a log-based architecture.
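For what it's worth, a rough sketch of the fan-out that kind of log-based coordination implies (topic name, event shape, endpoints, and keys are all hypothetical, and error handling is omitted): a consumer reads change events and applies them to both the cache and the transactional store before committing the offset.

```python
import json

import redis
from azure.cosmos import CosmosClient
from confluent_kafka import Consumer

# All names below (topic, event shape, endpoints, keys) are hypothetical.
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "write-coordinator",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["entity-changes"])

cache = redis.Redis(host="mycache.redis.cache.windows.net", port=6380,
                    ssl=True, password="<key>")
cosmos = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
container = cosmos.get_database_client("appdb").get_container_client("entities")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())              # e.g. {"id": ..., "pk": ..., ...}
    container.upsert_item(event)                 # transactional store (Cosmos)
    cache.set(f"entity:{event['id']}", msg.value(), ex=3600)  # cache copy
    consumer.commit(msg)                         # commit only after both writes
```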
Azure is our cloud provider. Interface is flexible, since our current implementation leverages Prisma ORM connected to Postgres & SQL Server. We're going to have to rebuild it anyway.
Got it, thank you! CockroachDB is the only one I know offhand that does what you're looking for. Another comment mentioned Vitess, which might also work.
It seems like there are a lot of options for large-scale analytics, but I don't know of many for high-throughput, geo-redundant transaction processing.
Vitess, maybe? From what I have demoed, and for being open source, that's the one I would choose. Unfortunately there isn't a Postgres-compatible option that is as mature yet, and certainly not a free one.
I believe it is easy for well-made software to immediately detect and report what goes wrong, with Sentry, ELK, or whatever else.
So, let's say I'm woken up in the middle of the night because my black box database as a service suddenly returns errors. If I'm not incompetent, I should have error messages and stacktraces available in a few seconds. If I'm a rich cloud customer, I can call the premium cloud support and ask for an explanation. If not, I would probably have to debug it myself.
With your service, I understand that I can blame the cloud provider faster. Maybe it can make the debugging session slightly faster when your monitoring also reports errors. End users don't care whether it's my code or the cloud provider's code crashing, so it's a developer tool for emergencies. Did I understand that correctly?
You got it right, it's a developer tool. It's not hard to get alerted about an issue, or to suspect a cloud dependency. Verifying it, which is typically required to take action, is what can take 10-30 minutes.
Funny, I was just last week having an argument with one of our team leads. I'd told him to create a specific container without a partition key (which I wouldn't recommend except in certain circumstances), and he said he couldn't. I assumed he was just doing it wrong.
In a document store, what does it even mean to create a container without a partition key? The document store has to partition the data somehow, and doing so implicitly sounds dangerous to me since all you're doing is creating a hotspot on one of the partitions...
It essentially means everything goes in the same partition, which for the scenario in question was exactly what we wanted. Not everything needs to be hyper-scalable, and usually there's an amount of overhead associated with making things so. In a demanding-SLA sort of environment, an extra hop to a new physical partition can mean the difference between getting paid and not.
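For context, a minimal sketch of what that looks like with the current azure-cosmos Python SDK (account details are placeholders): newer API versions make you declare a partition key path, but you can still put everything in one logical partition by giving every document the same value.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Account details are placeholders.
client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
db = client.create_database_if_not_exists("appdb")

# Newer API versions insist on a partition key path at container creation...
container = db.create_container_if_not_exists(
    id="small-lookup-data",
    partition_key=PartitionKey(path="/pk"),
)

# ...but you can still route every document to a single logical partition by
# always writing the same value, which is effectively the old "no partition
# key" behaviour (subject to the per-logical-partition storage limit).
container.create_item({"id": "item-1", "pk": "all", "payload": {"answer": 42}})
```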
The error messages from Azure are literally just "One of the specified inputs is invalid"? I get annoyed at AWS error messages because they aren't really machine-readable (unless you're okay with parsing a string that is subject to change), but at least they are almost always human-readable with all the relevant details...
Correct me if I'm wrong, but the article does not mention which "outdated SDK" version was used. In addition, every API call requires a version, which is also not mentioned in the article [1].
It is not clear to me whether the issue was an old SDK using the newest API version in its calls, or something else.
I have a similar opinion to some other comments here. Some Azure services - like Application Insights - I absolutely loved; some I hated, CosmosDB being among the latter.
They needed years to finally introduce PATCH in CosmosDB; Request Units feel obscure on purpose, to hide the insane cost of using this storage; Stored Procedures can only be used against a single partition key (while /id is the default...); and requests would often fail with 429 Too Many Requests even when the container was set to Autoscale with obscene limits that were never hit.
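For reference, this is roughly what the long-awaited partial update looks like with a recent azure-cosmos Python SDK (container and field names are placeholders; older SDK versions have no equivalent and require a full document replace):

```python
from azure.cosmos import CosmosClient

# Placeholders: account endpoint, key, database/container names, document id.
client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
container = client.get_database_client("appdb").get_container_client("orders")

# Partial document update: each operation targets a JSON path instead of
# requiring a full document replace (the pre-PATCH workaround).
container.patch_item(
    item="order-123",
    partition_key="order-123",  # /id used as the partition key here
    patch_operations=[
        {"op": "set", "path": "/status", "value": "shipped"},
        {"op": "incr", "path": "/revision", "value": 1},
    ],
)
```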
Just set up Marten with Postgres and get it over with, for a fraction of the cost.
Breaking changes are very common with the Azure SDK for Python.
The first versions were not PEP 8 compliant, so when they decided to respect it, everything broke.
Azure Service Bus in Python went from version 0.50.3 to 7.0.0 with almost everything renamed, classes moved, and so on.
I find it pretty interesting that this company/product (Metrist) has created a monitoring tool for other vendors' cloud products, because the vendors' own monitoring is so bad. Honestly a good idea, but a bit sad these companies can't do this themselves.
We used our own product to learn about and debug the issue. It's rather wild that they'd roll out this change so incrementally, which my colleague outlines here.
Gradual rollouts are pretty typical to give the team a chance to do a rollback before they cause a complete outage. This particular usage pattern probably just didn't appear as a significant enough spike in error rates.
Yeah, that makes sense; it really isn't a normal use case. I wish we had kept tracking the other regions to see whether this change has rolled out to them yet, or if it's still rolling out slowly.
What are you doing here if you're not going to RTFA? The fifth paragraph pretty clearly describes the issue before they go into depth on how they determined that Azure did indeed publish a backwards-incompatible change without notice.
I don't think it's appropriate to ask personal questions on HN. But since we're getting personal, I'd like to note that "RTFA" is not professional language.
The paragraph you mention states the issue, but it's pretty far from being a summary.
What an "executive summary" section would include in this case in a very short text on:
* What the problem is.
* How long it took to resolve, and for how many people.
* How it affected the company/project.
You could add more relevant points, depending on what the article says. There are some good resources on the internet on how to write summaries.
People who are looking for interesting stuff to read in their 10 minute coffee break might read the article after the summary instead of skipping it altogether.
Reading through your past comments, it's clear that you have a strong dislike of Google[0] and a history of reactionary comments lacking both substance and clarification when challenged[1,2,3,4,5,6,7]. If you're not going to post anything worthwhile, perhaps it's best for you to skip over posts about Google since it's clear you have an axe to grind and nothing more.
>HN used to be a place for interesting discussions. Now it's a grievance forum for entitled freeloaders.[8]
Thanks for pointing this out. As a self-admitted Google disliker, I would prefer to at least see more variation in the rhetoric. The same message spouted over and over makes for extremely dull reading.
No data was lost; it will be lost when Stadia shuts down unless game developers do something about it (and they're the only ones who could anyway; it's not like Google can port your Assassin's Creed Odyssey saves to Xbox).
What about Google Keep? Does this come with an enterprise licensing agreement?
I agree this is a good bar to test against, although I don't believe consumers are savvy enough to be aware of such matters. Nor should they be required to be, in order to have confidence that what Google offers them today will remain tomorrow. With that said, "should" is an extremely arbitrary word.
> The service will remain live for players until January 18th, 2023. Google will be refunding all Stadia hardware purchased through the Google Store as well as all the games and add-on content purchased from the Stadia store.
I know it's trendy to hate on Google but this doesn't look unannounced to me.
Cosmos was originally created for hosting massive datasets internally within Microsoft. For example, they use it for the OS telemetry sent in from customer machines, and raw data for threat intelligence. As part of Microsoft's move of everything hosted on-premise to their cloud, they decided to open up Cosmos to other users of Azure. But the primary customer is and will likely always be Microsoft themselves. Which is probably why we see these breaking changes; it'll most likely be in response to some internal ticket.
The two have nothing in common (and trust me, it sure is fun having to constantly make sure which of the two someone is actually referring to every time...).