Cloud Postgres had some issues last year though -- when some UK datacenters went down because of an HVAC failure, HA PSQL instances got stuck during failover and there was no manual way of resolving it.
"Customers experienced downtime on Tuesday, 19 July 2022 starting at 09:25 US/Pacific. 36% of zonal (non-HA) instances in europe-west2-a were affected. Additionally, 31% of regional (HA) instances whose primaries were located in europe-west2-a experienced extended downtime because they were unable to successfully fail over to another zone. Finally, customers experienced some failures during the incident for backup, instance creation, update, delete, restart, export, and Database Migration Service operations. The total impact duration was 17 hours, 30 minutes."
Has Cloud Functions solved their logging issues yet?
Last I checked, the lifetime of the function was bound to the lifetime of the request. So you have to fully flush your logs to their log ingress service before you respond to a request, otherwise GCP will suspend the function and eat your logs.
Originally, to my surprise, this left me with very few logs coming from my functions.
Once I discovered what was going on, I had to add latency to every response while I waited for logging to flush to GCP's log ingress service.
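For anyone hitting the same thing, here is a minimal sketch of the "flush before you respond" pattern using only the Python standard library. `FakeIngestHandler` and `handle_request` are hypothetical names, and the fake handler stands in for whatever network exporter you actually use; this is not GCP's client library.

```python
import logging
import logging.handlers


class FakeIngestHandler(logging.Handler):
    """Hypothetical stand-in for a handler that ships logs over the network."""

    def __init__(self):
        super().__init__()
        self.delivered = []

    def emit(self, record):
        self.delivered.append(record.getMessage())


ingest = FakeIngestHandler()

# MemoryHandler holds records in memory and forwards them to `target` only
# when flush() is called, the buffer fills up, or a record reaches flushLevel.
buffered = logging.handlers.MemoryHandler(
    capacity=1000, flushLevel=logging.CRITICAL, target=ingest
)

logger = logging.getLogger("fn")
logger.setLevel(logging.INFO)
logger.addHandler(buffered)
logger.propagate = False


def handle_request(payload):
    """Hypothetical request handler: always drain buffered logs before returning."""
    try:
        logger.info("handling %s", payload)
        return {"status": "ok"}
    finally:
        # If the platform suspends the function right after the response,
        # anything still sitting in the buffer would be lost.
        buffered.flush()
```

The `finally` block is where the extra response latency comes from: with a real exporter, the flush is a network round trip that has to finish before the response goes out.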
Are you using a logging service directly, or stdout? Stdout has no such issue. On Cloud Run (and therefore Cloud Functions 2nd gen) you get a SIGTERM before the container is shut down. OpenTelemetry is pretty easy to configure to flush on SIGTERM in the background.
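A rough sketch of both options, using only the standard library. It assumes the platform parses JSON lines on stdout using the `severity` and `message` keys, and that the shutdown signal is SIGTERM. `format_log_line`, `log` and `install_flush_on_sigterm` are made-up names, and `flush` is whatever callable drains your exporter.

```python
import json
import signal
import sys


def format_log_line(severity, message):
    """Build one structured log line as a JSON string."""
    return json.dumps({"severity": severity, "message": message})


def log(severity, message):
    """Write the line to stdout and flush right away, so nothing sits in a buffer."""
    sys.stdout.write(format_log_line(severity, message) + "\n")
    sys.stdout.flush()


def install_flush_on_sigterm(flush, exit_fn=sys.exit):
    """Register a SIGTERM handler that calls flush() and then exits.

    `flush` is a hypothetical callable that drains your log or trace exporter.
    `exit_fn` is injectable so the handler can be exercised without exiting.
    """

    def _on_sigterm(signum, frame):
        flush()
        exit_fn(0)

    # signal.signal() may only be called from the main thread.
    signal.signal(signal.SIGTERM, _on_sigterm)
    return _on_sigterm
```

With stdout there is nothing of your own to drain, so the handler matters mostly when an exporter batches in the background.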
My guess would be that their app isn’t running as PID 1. This can happen when services are launched from an entrypoint script in Docker without using exec. If the request is fulfilled and the system sends a termination signal to the container (SIGTERM on Cloud Run), it is delivered to whatever is running as PID 1. If that’s their app, it’ll flush logs and quit. If their app is running under another PID, it may never see the signal, and the container gets killed by force once the grace period runs out.
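One cheap way to check this theory is to have the app report its own PID at startup. Below is a small sketch; `signal_delivery_warning` is a hypothetical helper, not part of any library. If it does warn, the usual fix is to launch the app with `exec` in the entrypoint script so it replaces the shell.

```python
import os


def signal_delivery_warning(pid=None):
    """Return a warning string if this process is not PID 1, else None.

    `pid` is injectable for testing; by default the current process ID is used.
    """
    pid = os.getpid() if pid is None else pid
    if pid == 1:
        return None
    return (
        f"running as PID {pid}, not PID 1: termination signals sent to the "
        "container may not reach this process"
    )
```

Logging the result once at startup is enough to confirm or rule out the entrypoint-script explanation.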
I built a service with a few functions (golang based) that ran just fine for over a year with no log flushing issues.
I did actually have an issue where it was logging two blank lines, which was annoying. It definitely wasn't coming from my code and it got fixed at one point, then reappeared later on. I gave up trying to resolve it.
That said, their logging system is fantastic. I found it really easy to deal with.
Although it works and is solid, I wouldn’t say it’s fantastic. My impression is that Google makes limited investment in it to steer customers towards their own services such as Cloud Spanner.
- Major versions are 6 months late
- Small instances are horribly slow
- Integration with other services is poor (e.g. the Cloud Run integration doesn’t work with a database private IP, so you have to fall back to configuring a VPC and connecting the standard way)
- IAM authentication, although great when it works, is complicated and poorly documented
- The UI has very few features, for example it isn’t possible to query the database from it
- Although I’ve never seen any provider have it, automatic upgrades between major versions would have been nice.
¯\_(ツ)_/¯ I started with a small instance, moved up to the next size so I could get more connections and it ran flawlessly for over a year with 50-60 RPS from Cloud Functions, hitting it 24/7. Total price was under $40 a month. Zero regrets and would do it again in a heartbeat.