This rhymes with my overall impression of Cosmos. It took us a while to see through the smokescreen because when talking to Microsoft support and representatives it is the Best Thing Ever and they sound so confident about it. But it really is a beta demo product sold with an alpha premium price tag.
If your traffic pattern is exactly right, and you always scale traffic up and never ever down and do not have spikes, I guess it is probably OK. The main problem is the docs are (or, at least were 2 years ago) not clear about all the caveats and restrictions but pretend it is a generic database that just works. So one has to discover all the caveats oneself.
Microsoft thinks the exact workings of the partitioning are something that should work so well you don't need to know them in detail. But if your use case is slightly off, you end up really needing to know. I know at least one team that routinely copies all their data from one Cosmos instance to another and switches traffic over to the copy just to get a partitioning reset; it is one thing to have to do it, another to discover in production, with no prior warning, that it has to be done.
Also: The ipython+portal+Cosmos security meltdown from 1 1/2 years ago alone should be reason to look elsewhere.
(No, not a competitor, just have spent way way way too much engineering time moving first on and then off Cosmos and yes I am bitter)
This matches my (fairly extensive) experience with CosmosDb. It’s truly an awfully engineered product.
I cannot comprehend what organizational process led not only to its creation, but also to its continued existence.
Its existence is a constant reminder that the road to hell is paved with good intentions, and that you'd better do your due diligence before adopting any piece of tech that's not easy to replace. Even if it comes with a pedigree.
After suffering through the AWS SimpleDB disaster 10 years ago, I will never use any of the cloud providers' harebrained databases ever again. I'll use bog-standard Postgres or MySQL if they host it, but nothing else.
This was both shocking and not surprising at all, all at the same time.
It’s worth noting how cloud vendors have fallen back on tried and true databases such as Postgres (mind you, often replacing the guts of those databases with new implementations.. but still).
> It’s worth noting how cloud vendors have fallen back on tried and true databases such as Postgres (mind you, often replacing the guts of those databases with new implementations.. but still).
Have they? Amazon created Aurora from scratch, with compatibility layers for different database engines (MySQL, Postgres), and GCP did the same with Spanner, which I wouldn't call "falling back".
Yes, that's true - those products are definitely more about trying to displace the open source tech by borrowing the interface. I still think it is a plain admission of the popularity of the original tech (e.g. MySQL and Postgres). For sure, credit where it is due for building entirely new engines - but it's such a common pattern to "borrow" the open source interfaces and reimplement the core engine.
That said, you didn't mention things like RDS or Google Cloud SQL, which are cloud-based versions of standard open source dbs.
I liked SimpleDB. They kept supporting it beyond my expectations, even after lots of other options were available and it was long deprecated. Curious what the disaster was.
For you young uns, back in the 1990s Microsoft was so convinced that NTFS made file fragmentation impossible that they didn’t provide a way to defrag for a very long time.
In fairness NTFS does usually make fragmentation a non-issue outside of pathological cases¹, much like ext2/3/4², which didn't have defrag tools³ for a long time either.
FAT12/FAT16/FAT32's earliest-free-block-no-matter-what allocation method⁴ trained people to believe that fragmentation is a universally rampant problem, but with a better designed allocation strategy it really isn't.
----
[1] for instance simultaneously growing files on near-full volumes
[2] more so ext4 with delayed allocation turned on
[3] at least not ones without attendant "are you really sure you want to use this on data you care about?" warnings
[4] I wonder if anyone retrofitted a brighter heuristic into an implementation of these or exFAT. Could be a (not massively useful but) interesting learning exercise for a budding OS developer.
> It could have been easy. We could have used Postgres.
And then all that would have been left is to turn that Postgres into a globally distributed DB and manage it (with something like Citus). Postgres' native scale-out capabilities (not even talking about globally distributed ones) are basically nonexistent, so you need third-party tooling.
> Microsoft thinks the exact workings of the partitioning are something that should work so well you don't need to know them in detail. But if your use case is slightly off, you end up really needing to know. I know at least one team that routinely copies all their data from one Cosmos instance to another and switches traffic over to the copy just to get a partitioning reset; it is one thing to have to do it, another to discover in production, with no prior warning, that it has to be done.
AWS DynamoDB used to do that. When I was working for a team in AWS, we discovered an unexpectedly bad hash choice, had a hot partition, and ended up with a crazy number of partitions and a need for really high r/w allocation on it. The only option at the time was to roll over to a new table (which we could do, fixing our hash choice, thankfully without too much hassle).
They fixed that stuff a few years ago now so most (all?) of my "here be dragons" concerns about DynamoDB have been addressed.
Well, even if WE do not have 64TB tables (where one needs low latency for the entire table), I believe there are plenty of other mature NoSQL offerings out there? Whether you go to the other cloud vendors or self-host.
Cosmos is at best unfinished.
I used Google Datastore for years and it was finished in the sense that it always did what it said on the tin and didn't cause lots of problems for you (although I think with higher latency than Cosmos, so it may not be for every use case -- there are also Bigtable and Spanner).
And I don't believe Google would ever have done something like deploying a shared multi-tenant Jupyter system with full access to all customers' DBs. Microsoft actually did that.
As I said, the problem is mainly how Cosmos is marketed: as a general-purpose NoSQL database. If it were marketed with all the caveats you learn about after you go live, and people only used it if they really had to, it would be a bit different.
Though realistically it would drive most people to other cloud providers.
I am proud of my team at AWS that took backwards compatibility very seriously. Even introducing a temporary backwards incompatibility was a no-go in design reviews.
We had a service that had a list API that was paginated. It returned a nextToken to specify the start of the next page of the results.
Internally we were doing a database migration to a completely different system, migrating one customer at a time. The problem: if a customer was in the middle of a list call, holding a nextToken generated by the previous database system, then after migration to the new database that older token could not resume from exactly where it would have had the customer not been migrated, because the new system did not have all the information needed to resume at the exact offset.
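To make the failure mode concrete, here is a minimal sketch of one way such a token scheme can work (this is not their actual implementation; all names are hypothetical): the nextToken encodes which backend minted it plus an opaque cursor, and a token minted by the old database cannot be mapped to an exact offset in the new one.

```python
import base64
import json

# Hypothetical sketch of an opaque pagination token that records which
# backend minted it, so the service can detect a stale token post-migration.

def make_next_token(backend: str, cursor: str) -> str:
    """Encode the source backend and its cursor into an opaque nextToken."""
    payload = json.dumps({"backend": backend, "cursor": cursor})
    return base64.urlsafe_b64encode(payload.encode()).decode()

def resume_listing(next_token: str, current_backend: str) -> dict:
    """Decide how to resume a paginated list call after a possible migration."""
    payload = json.loads(base64.urlsafe_b64decode(next_token))
    if payload["backend"] == current_backend:
        return {"action": "continue", "cursor": payload["cursor"]}
    # The token came from the old database: the new backend cannot map the
    # old cursor to an exact offset, so the naive choices are to error out
    # or to restart the page and risk returning duplicate items.
    return {"action": "restart_page_with_possible_duplicates"}
```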
One option was to throw an error and let the customer retry the request; another was to return some possibly duplicate items in the next page. Neither was good enough for the engineers or the PMs, so instead we decided to take on a bunch of additional work so that no customer would be impacted. This was 10+ weeks of additional work for the whole team, but we did it because culturally it felt like the right thing to do for the customer.
Note that the impact would have been tiny, if there was any at all. A customer would have had to be in the middle of a paginated request, our migration system would have had to migrate that particular customer at that exact time, and the impact would have been a few possibly duplicate items. But we didn't know the actual impact of those temporary duplicates for a single call, and we all agreed that breaking changes like this are unexpected and cause customers to lose trust in us.
I read stuff like this about AWS and I wonder what the guys over at Azure are smoking. I can deploy trivial architectures with ordinary resources and hit four or five “broken by design” issues and at least one or two Heisenbugs that aren’t reproducible in other identical setups due to fun things like “legacy” subscriptions — an invisible internal setting.
I previously worked at Microsoft in an unrelated area, but had a friend who worked on CosmosDB, and later a different part of Azure.
There are some Microsoft products I genuinely love, but some are terrible. To an extent it is a reflection of the inconsistency in internal teams. Culture, values, skill level, and quality bar are all over the place depending on who you talk to, even compared to other large companies.
From what I had heard, CosmosDB was not a healthy team, and I would not consider using it as a product.
With the funding MS has, how could this be turned around? Is it necessary to build a new team, or is it enough to replace the leadership and let them bring in new members?
It's a good question well above my pay grade :). Changing culture isn't an easy problem.
I suspect the way Microsoft does interviewing and performance management (very local to the specific team) contributes to the inconsistency.
MSFT has also been fairly open with its employees that it does not try to compete with the likes of Google, Meta, or even Amazon in terms of compensation. So it isn't really trying to get the best engineers, so long as it can continue to print money.
There are still folks there who are incredible, but the floor is shockingly low at times. Folks will self-select, so you will then get teams which are more homogeneously good or bad.
It’s interesting that Microsoft doesn’t take the same attitude with Azure, given that they’re famous for their extreme care around backwards compatibility for Windows.
Perhaps it’s because Microsoft is still catching up with Azure, and as such prefers moving fast and occasionally breaking things?
I work for Azure (a different product) and we do care about not breaking customers. We have to keep GA APIs around for years even as they’re being deprecated. The AzureRM module was deprecated over a year ago, I think, and it will keep working until 2024.
This really feels like a bug to me, and it probably didn’t trip monitors for two reasons:
1) Given that the portal does the right thing (and probably the ARM template samples too), this was a very small percentage of traffic.
2) The failures would look like client-side errors, making them less likely to trip monitors.
*PS my comment is not an official response (I don’t even remotely work on CosmosDB) but I’ll forward this internally
As someone whose job involves maintaining uptime of a critical system that's dependent on Cosmos DB, this sort of thing is scary. When there have been other reliability issues with Cosmos before, we've not had an understanding customer base, and it feels very out of my control.
I'm finding a lot of the reliability guarantees of Azure PaaS services are overblown or come with big caveats when you start to work with them in a serious way. For example I've had some bad reliability issues with Azure Functions not firing, or their premium function runtimes becoming unresponsive. And it seems like that's just the start of the outstanding issues with them https://github.com/Azure/azure-functions-host/issues
I think people need to look more carefully at these PaaS guarantees and look at what that 99.999% reliability Microsoft are claiming actually means.
After using AWS for 3+ years and GCP for about 6 months, I can say Azure significantly lags behind them. Their service reliability is astonishingly poor. I think our most recent issue was 67 VM failures in a single month in a VMSS (of 55 nodes) backing AKS (Azure Kubernetes Service). The health events said there was some kind of "remote storage error" making the VMs unhealthy.
That was a couple of months after the Ubuntu/systemd incident. Azure's "blessed" Linux image is Ubuntu, and it has unattended-upgrades enabled, including on managed infrastructure like AKS (where you can't turn it off without dirty hacks). A bad Ubuntu update caused hosts to lose their DNS settings from DHCP config, leaving massive numbers of machines in partially broken states.
Do you know what really got me? When Azure executes maintenance on, for instance, PostgreSQL servers, there is no record of that activity in the activity logs and nothing noted in Service Health. The service was unavailable during the maintenance. And worse yet, when the database is unusable due to an incident, with the CPU maxed out and no connection succeeding, nothing is detected.
How can this be a premium IaaS/PaaS? Azure feels like the MS Teams of teleconferencing: companies buy in because they are already in the MS world, not because Azure is better.
Yeah, that's the one we've had a lot of problems with.
> And worse yet, when the database is unusable due to an incident, with the CPU maxed out and no connection succeeding, nothing is detected
Apparently Azure's storage system that backs this uses some sort of thread pool and the thread pool can lock up/become exhausted leading to I/O starvation. When this happens, connection attempts fail. When the connection attempts fail, it can lead to a connection storm where all these new connections rolling in exhaust the CPU. The telltale indicator is Postgres checkpoints getting behind.
All the while, the DB I/O metrics look like they're completely fine, because it's not hitting an I/O limit; it's hitting thread pool exhaustion in some storage system under the instance, outside of Postgres.
You can also get some clues that this is the problem by enabling Performance Insights and checking the Waits tab. If all the top waits are related to I/O activity, that's another dead giveaway that the storage system is locked up again. You can just web search the name of the waits to see what causes them. AWS has some nice docs detailing Postgres waits.
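If you don't have a managed waits view handy, a rough equivalent is to sample the wait columns in pg_stat_activity yourself. A minimal sketch in Python (connection details are placeholders):

```python
import psycopg2

# Connection details are placeholders for an Azure Database for PostgreSQL server.
conn = psycopg2.connect(
    "host=myserver.postgres.database.azure.com dbname=postgres "
    "user=admin@myserver password=<secret> sslmode=require"
)

with conn, conn.cursor() as cur:
    # Sample wait events across active backends (pg_stat_activity, PG 9.6+).
    cur.execute("""
        SELECT wait_event_type, wait_event, count(*)
        FROM pg_stat_activity
        WHERE state <> 'idle' AND wait_event IS NOT NULL
        GROUP BY 1, 2
        ORDER BY 3 DESC;
    """)
    for wait_type, wait_event, n in cur.fetchall():
        # A pile-up of IO-type waits (e.g. DataFileRead) while the instance's
        # own I/O metrics look fine matches the pattern described above.
        print(f"{wait_type}/{wait_event}: {n} backends")
```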
Thanks for the detailed explanation! We haven't looked into it in this much detail yet, but what you are describing sounds familiar.
Since we have premium support (P1?), we had an internal Azure PostgreSQL engineer look at the issue, and they pushed the problem back to us, blaming our app for not being built correctly. That has been ping-ponging for over a year now.
Finally I saw this semi-acknowledgment in their health status yesterday.
Do you happen to know a proper solution? Are you waiting for them to fix this issue or moved to a different db service?
We've talked to the Postgres product engineers many times. The proper solution is to run away from Single Server as quickly as possible. Flexible Server or Citus Hyperscale may be good options. We're currently using Patroni to manage VM-based clusters (but still have a lot of data on Single Server).
Personally, I'd look into a 3rd party if you want managed Postgres (assuming you don't have contractual obligations that might complicate 3rd party access). There are vendors like EnterpriseDB, Scalegrid, etc. that provide various solutions (I don't have any recommendations here; Postgres has a list of managed providers by country: https://www.postgresql.org/support/professional_hosting/nort...)
The hard part for us is figuring out how to migrate away from Single Server while it's used in production. It takes an eternity to migrate data away from the thing; we are looking at ~24 hours just to get the data out, and then we need to figure out how to do a live cutover or backfill.
Absolutely agree on a third party. Azure is just a let down overall.
The gods are angry, I think. Woke up today and all our PG servers were unavailable. Checked Service Health and Azure shows a global incident impacting PG servers.
And the funny thing? Status.azure.com is all green. No events in the activity overview. No Service Health alerts within the affected instance.
The workaround advised by Azure? Upgrade to the next plan. We already reached the maximum size. Maybe it's time for Citus. More $$$ for M$.
> I think people need to look more carefully at these PaaS guarantees and look at what that 99.999% reliability Microsoft are claiming actually means.
Hypercloud managed-service SLAs: all the fun of novel, complex technical solutions in production + the transparency of cast iron + the pedantry of a contract lawyer.
Which leaves exactly zero people who are excited to be at that intersection.
Back around 2017-2018 unannounced breaking changes in Azure services were so common, my team coined a term "Cloud Monday" (echoing Patch Tuesday) because usually our integration tests would break between 8-10AM Pacific Time on Mondays. (They did eventually become far less frequent.)
Azure being a shade of blue, you should've called it "Blue Monday"[0]. Could've even rigged up something to play the song when integration tests mysteriously failed. How does it feel/ to treat me like you do?/ When you've laid your hands upon me/ and told me who you are?...
So, as someone who was in the midst of planning a migration of a multi-billion $ revenue platform to using CosmosDB...
Alternatives? LOL
Basically we're just looking for geo-redundancy and high read & write throughput. Our intention was to leverage Azure Event Grid/Kafka Connect for event streaming to coordinate writes between Redis (cache), Cosmos (transactional DB), and our systems of record (legacy). The majority of reads/writes would occur via our API, but some would occur via the systems of record, hence the use of a log-based architecture.
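For what it's worth, a rough sketch of the fan-out that kind of log-based coordination implies (topic name, event shape, endpoints, and keys are all hypothetical, and error handling is omitted): a consumer reads change events and applies them to both the cache and the transactional store before committing the offset.

```python
import json

import redis
from azure.cosmos import CosmosClient
from confluent_kafka import Consumer

# All names below (topic, event shape, endpoints, keys) are hypothetical.
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "write-coordinator",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["entity-changes"])

cache = redis.Redis(host="mycache.redis.cache.windows.net", port=6380,
                    ssl=True, password="<key>")
cosmos = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
container = cosmos.get_database_client("appdb").get_container_client("entities")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())              # e.g. {"id": ..., "pk": ..., ...}
    container.upsert_item(event)                 # transactional store (Cosmos)
    cache.set(f"entity:{event['id']}", msg.value(), ex=3600)  # cache copy
    consumer.commit(msg)                         # commit only after both writes
```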
Azure is our cloud provider. Interface is flexible, since our current implementation leverages Prisma ORM connected to Postgres & SQL Server. We're going to have to rebuild it anyway.
Got it, thank you! CockroachDB is the only one I know offhand that does what you're looking for. Another comment mentioned Vitess, which might also work.
It seems like there are a lot of options for large-scale analytics, but I don't know of many for high-throughput, geo-redundant transaction processing.
Vitess, maybe? From what I have demoed, and for being open source, that's the one I would choose. Unfortunately there isn't a Postgres-compatible option that is as mature yet, and certainly not a free one.
I believe it is easy for well-made software to immediately detect and report what goes wrong, with Sentry, ELK, or whatever else.
So, let's say I'm woken up in the middle of the night because my black box database as a service suddenly returns errors. If I'm not incompetent, I should have error messages and stacktraces available in a few seconds. If I'm a rich cloud customer, I can call the premium cloud support and ask for an explanation. If not, I would probably have to debug it myself.
With your service, I understand that I can blame the cloud provider faster. Maybe it can make the debugging session slightly faster when your monitoring also reports errors. End users don't care whether it's my code or the cloud provider's code crashing, so it's a developer tool for emergencies. Did I understand that correctly?
You got it right, it's a developer tool. It's not hard to get alerted about an issue, or to suspect a cloud dependency. Verifying it, which is typically required to take action, is what can take 10-30 minutes.
Funny, I was just last week having an argument with one of our team leads. I'd told him to create a specific container without a partition key (which I wouldn't recommend except in certain circumstances), and he said he couldn't. I assumed he was just doing it wrong.
In a document store, what does it even mean to create a container without a partition key? The document store has to partition the data somehow, and doing so implicitly sounds dangerous to me since all you're doing is creating a hotspot on one of the partitions...
It essentially means everything goes in the same partition, which for the scenario in question was exactly what we wanted. Not everything needs to be hyper-scalable, and usually there's an amount of overhead associated with making things so. In a demanding-SLA sort of environment, an extra hop to a new physical partition can mean the difference between getting paid and not.
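For context, a minimal sketch of what that looks like with the current azure-cosmos Python SDK (account details are placeholders): newer API versions make you declare a partition key path, but you can still put everything in one logical partition by giving every document the same value.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Account details are placeholders.
client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
db = client.create_database_if_not_exists("appdb")

# Newer API versions insist on a partition key path at container creation...
container = db.create_container_if_not_exists(
    id="small-lookup-data",
    partition_key=PartitionKey(path="/pk"),
)

# ...but you can still route every document to a single logical partition by
# always writing the same value, which is effectively the old "no partition
# key" behaviour (subject to the per-logical-partition storage limit).
container.create_item({"id": "item-1", "pk": "all", "payload": {"answer": 42}})
```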
The error messages from Azure are literally just "One of the specified inputs is invalid"? I get annoyed at AWS error messages because they aren't really machine-readable (unless you're okay with parsing a string that is subject to change), but at least they are almost always human-readable with all the relevant details...
Correct me if I'm wrong, but the article does not mention which "outdated SDK" version was used. In addition, every API call requires a version, which is also not mentioned in the article [1].
It is not clear to me whether the issue was an old SDK using the newest API version in its calls, or something else.
I have a similar opinion to some other comments here. Some Azure services - like Application Insights - I absolutely loved; some I hated, CosmosDB being among the latter.
They needed years to finally introduce PATCH in CosmosDB; Request Units feel obscure on purpose, to hide the insane cost of using this storage; Stored Procedures can only be used against a single partition key (while /id is the default...); and requests would often fail with 429 Too Many Requests even when the container was set to Autoscale with obscene limits that were never hit.
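For reference, this is roughly what the long-awaited partial update looks like with a recent azure-cosmos Python SDK (container and field names are placeholders; older SDK versions have no equivalent and require a full document replace):

```python
from azure.cosmos import CosmosClient

# Placeholders: account endpoint, key, database/container names, document id.
client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
container = client.get_database_client("appdb").get_container_client("orders")

# Partial document update: each operation targets a JSON path instead of
# requiring a full document replace (the pre-PATCH workaround).
container.patch_item(
    item="order-123",
    partition_key="order-123",  # /id used as the partition key here
    patch_operations=[
        {"op": "set", "path": "/status", "value": "shipped"},
        {"op": "incr", "path": "/revision", "value": 1},
    ],
)
```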
Just set up Marten with Postgres and get it over with, for a fraction of the cost.
Breaking changes are very common with the Azure SDK for Python.
The first versions were not PEP 8 compliant, so when they decided to respect it, everything broke.
Azure Service Bus in Python went from version 0.50.3 to 7.0.0 with almost everything renamed, classes moved, and so on.
I find it pretty interesting that this company/product (Metrist) has created a monitoring tool for other vendors' cloud products, because the vendors' own monitoring is so bad. Honestly a good idea, but a bit sad these companies can't do this themselves.
We used our own product to learn about and debug the issue. It's rather wild that they'd roll out this change so incrementally, which my colleague outlines here.
Gradual rollouts are pretty typical to give the team a chance to do a rollback before they cause a complete outage. This particular usage pattern probably just didn't appear as a significant enough spike in error rates.
Yeah, that makes sense; it really isn't a normal use case. I wish we had kept tracking the other regions to see whether this change has rolled out to them yet, or if it's still rolling out slowly.
What are you doing here if you're not going to RTFA? The fifth paragraph pretty clearly describes the issue before they go into depth on how they determined that Azure did indeed publish a backwards-incompatible change without notice.
I don't think it's appropriate to ask personal questions on HN. But since we're getting personal, I'd like to note that "RTFA" is not professional language.
The paragraph you mention states the issue, but it's pretty far from being a summary.
What an "executive summary" section would include in this case in a very short text on:
* What the problem is.
* How long it took to resolve, and for how many people.
* How it affected the company/project.
You could add more relevant points, depending on what the article says. There are some good resources on the internet on how to write summaries.
People who are looking for interesting stuff to read in their 10 minute coffee break might read the article after the summary instead of skipping it altogether.
Reading through your past comments, it's clear that you have a strong dislike of Google[0] and a history of reactionary comments lacking both substance and clarification when challenged[1,2,3,4,5,6,7]. If you're not going to post anything worthwhile, perhaps it's best for you to skip over posts about Google since it's clear you have an axe to grind and nothing more.
>HN used to be a place for interesting discussions. Now it's a grievance forum for entitled freeloaders.[8]
Thanks for pointing this out. As a self-admitted Google disliker, I would prefer to at least see more variation in the rhetoric. The same message spouted over and over makes for extremely dull reading.
No data was lost; it will be lost when Stadia shuts down unless game developers do something about it (and they're the only ones who could anyway; it's not like Google can port your Assassin's Creed Odyssey saves to Xbox).
What about Google Keep? Does this come with an enterprise licensing agreement?
I agree this is a good bar to test against, although I don't believe consumers are savvy enough to be aware of such matters. Nor should they be required to be, in order to have confidence that what Google offers them today will remain tomorrow. With that said, "should" is an extremely arbitrary word.
> The service will remain live for players until January 18th, 2023. Google will be refunding all Stadia hardware purchased through the Google Store as well as all the games and add-on content purchased from the Stadia store.
I know it's trendy to hate on Google but this doesn't look unannounced to me.
Cosmos was originally created for hosting massive datasets internally within Microsoft. For example, they use it for the OS telemetry sent in from customer machines, and raw data for threat intelligence. As part of Microsoft's move of everything hosted on-premise to their cloud, they decided to open up Cosmos to other users of Azure. But the primary customer is and will likely always be Microsoft themselves. Which is probably why we see these breaking changes; it'll most likely be in response to some internal ticket.
The two have nothing in common (and trust me, it sure is fun having to constantly make sure which of the two someone is actually referring to every time...).