This may be overly negative towards a whole field, but I sometimes feel that platform teams add more hurdles than "stability and velocity".
At places with basically no platform team, no advanced cloud setup etc, I as a dev could understand everything, and deploying mostly meant getting my jar file or whatever running on some ec2 instance.
Now I need to write a dockerfile, loads of manifests, know about ingresses, use whatever custom tooling and abstractions they've built on top of the cloud provider.
And what used to be a single guy is now a huge team just adding more complexity. And I'm having a hard time understanding if it's worth it, or just a cost sink.
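For context, the Dockerfile part of that is roughly this; a minimal sketch for a fat jar, where the base image, paths and port are placeholders rather than my actual setup:

```dockerfile
# Minimal sketch: package a pre-built fat jar into a container image.
# Base image, jar path and port are placeholders.
FROM eclipse-temurin:21-jre
COPY target/app.jar /app/app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "/app/app.jar"]
```

The manifests, ingress rules and custom tooling are the layers that come on top of this.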
Coming at it from the other side, we have 20-25 Java teams deploying some 50-70 different Java services across a number of software environments. With the removal of a number of systems on the older infra, we're expecting another 10-15 services to be onboarded to the platform.
At that point, if every dev-team brews their own bespoke solution for each service, you're looking at a dozen or two solutions and you'll end up with a nontrivial number of people just working on deployments and automation of their own "just scp jarfile to server". And if the solution of any team fails and the right person is unavailable, you're suddenly bleeding money because no one knows how to get it back working. Yes, "It's just a java service", but I've dealt with at least 6 ways of "just restart a java service" that aren't default init.d or systemd over the years. And most of them didn't work and were followed shortly by an "Oh yeah, I forgot, you also have to...".
And then - at least in our line of business - you come to the fun section of "Customer Presales Questions". What's the access management for the server? Leaver/Joiner process? Separation of duties and roles? Patch cycles? Failover strategies? Backup strategies, geo-redundant backups, RTO, RPO, ... Business Continuity Plans?
I'd have to clock it, but doing 1-2 of these 100+ question sheets costs more time than integrating with our platform - if you're used to the question sheets. And then one of us can answer 90%+ of these questions based on the standards of the platform.
> and automation of their own "just scp jarfile to server". And if the solution of any team fails and the right person is unavailable, you're suddenly bleeding money because no one knows how to get it back working.
I think that's the heart of the issue. At a certain point, people other than the application developers bear responsibility for the operation of the application.
The situation is not _fundamentally_ different: when it's 1-5 devs, everyone is responsible for the uptime. If one of those devs doesn't know how to deploy, you're still in the same boat as a larger team when something breaks.
But the situation becomes _politically_ different: management decides there needs to be an additional layer of responsibility: the platform team. They are the backstop. And so, instead of the platform team agreeing to bear responsibility for N different bespoke systems, they prescribe some common API (kubernetes, etc).
Of course they COULD prescribe "scp to a server and restart systemd unit" but then that's just not flexible enough. Some teams want different restart strategies, and it spirals out of control. At least kubernetes supports all that flexibility in a well-documented, battle-tested package.
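As a rough illustration of what that common API buys you, here's a minimal Kubernetes Deployment sketch; the service name, image and probe path are made up for illustration, but replicas, restart behaviour and health checks are all declared once, in a format every team reads the same way:

```yaml
# Minimal sketch of the "common API": replicas, restarts and health checks
# declared in one well-known format. Name, image and probe path are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders-service
  template:
    metadata:
      labels:
        app: orders-service
    spec:
      containers:
        - name: app
          image: registry.example.com/orders-service:1.0.0
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
```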
So, the parties to blame are two, with different degrees of culpability:
1. Management decides there needs to be more shared ownership (otherwise you bleed money, which is worse when you're bigger and make more money)
2. Platform team agrees to support kubernetes, because OF COURSE our developers need all the bells and whistles, and elastic beanstalk isn't good enough because what if we need* Feature X 3 years from now?
> 2. Platform team agrees to support kubernetes, because OF COURSE our developers need all the bells and whistles, and elastic beanstalk isn't good enough because what if we need* Feature X 3 years from now?
This is valid criticism. It's leading to an interesting rift or tension, at least at my current workplace: Infrastructure like this enables teams to be agile, but the infrastructure itself rapidly stops being agile and you need to have a 1-2 year planning horizon.
A good platform enables you to deploy several times a day (if you want, you can always choose to go slower). But changing a core infrastructure piece that dozens of teams and dozens of dozens of service deployments depend on... a year is suddenly a rather short time. In fact, I'd say you're not replacing such a core piece, you're rather adding a new deployment infrastructure and now you have 2 for a long time.
As such, going for all the bells and whistles you might need 2-3 years down the road can be a very valid point - as long as you have evidence and reason to believe the complexity will be utilized.
Otherwise, I'm totally in favor of having some Jenkins job or GitHub Action to rsync jar files from Artifactory to servers. Standardize the app server and the 1-2 Java versions in use, have a proper systemd unit for it, and have at it. We've run the company software like that for years. It just wouldn't work at the current scale.
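For reference, "a proper systemd unit" really isn't much. A sketch along these lines (unit name, user and paths are placeholders) already gives you restart-on-failure and start-at-boot:

```ini
# /etc/systemd/system/orders.service -- sketch only; name, user and paths are placeholders.
[Unit]
Description=Orders Java service
After=network.target

[Service]
User=appuser
ExecStart=/usr/bin/java -jar /opt/apps/orders/orders.jar
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```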
> Of course they COULD prescribe "scp to a server and restart systemd unit" but then that's just not flexible enough
But what if the server disappears, or the disk fails and it won't start? It doesn't capture the entire process.
I remember at some point we needed to deploy to a new region for a customer and yes, it was "too difficult" because some of those manual steps were forgotten and lost. It became an x-month-long effort to work out all the small details.
> 2. Platform team agrees to support kubernetes, because OF COURSE our developers need all the bells and whistles, and elastic beanstalk isn't good enough because what if we need* Feature X 3 years from now?
What's so difficult about k8s? Kubernetes, and especially the managed versions, allows you to be as simple or complex as you want to be. What's the difficult part?
I think it depends on the size of the org. If there are only 1-5 devs, then yes, they would be doing the devops. But in OP's case, he is managing the infra for 70 engineers; there needs to be some formality in place, otherwise everything will spin out of control. If every engineer rolls his own server, that would quickly lead to chaos.
> At places with basically no platform team, no advanced cloud setup etc, I as a dev could understand everything, and deploying mostly meant getting my jar file or whatever running on some ec2 instance.
With an ec2 instance, how do you, for example, update the Java version? Store the database password? Add URLs the service is served at? If it’s done manually how do you add a second instance or upgrade the os?
Though, I agree the infra setups are usually overly complicated, and using a “high-level” service such as Heroku or one of its competitors for as long as possible, and even longer, is usually better, especially for velocity.
You stop your service, apt-get upgrade the Java package, and then start it again? New URLs? Update your nginx config file and restart nginx. Second instance? Dunno, provision a VM, ssh into it, FTP the jar over and stick a load balancer in front of the two. When you get to 3 instances, we can maybe talk about a shell script to automate it. Heck, before we do that, we can just snapshot an image of the VM and ask EC2 to start another one up.
Literally 100's of ways to do it.
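The shell-script flavour of that flow is correspondingly short; a rough sketch, with the host, paths and unit name as placeholders:

```sh
#!/bin/sh
# Rough sketch of the "copy the jar and restart" flow described above.
# Host, paths and the unit name are placeholders.
set -e
HOST=app-02.internal
scp target/app.jar "$HOST:/opt/apps/app/app.jar"
ssh "$HOST" 'sudo systemctl restart app.service'
```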
All this IaC and YAML config and K8s is exactly like DI and IoC. You get sold on "simple", you start implementing it, and every single hurdle only has one answer: add more of it, or add more of this ecosystem into your stack (the one you just wanted to dip your toes into).
Before you know it, everything is taken over and your whole stack is now complicated, run by 50 different json yaml configs, and you now need tooling and templating to get it all working or to make one tiny change.
And if you have 3 services with 3 different people you'll have 3 different ways of doing it in your team. Suddenly you need 15 different tools at the right versions with the right configs to update a URL.
> Before you know it, everything is taken over and your whole stack is now complicated, run by 50 different json yaml configs, and you now need tooling and templating to get it all working or to make one tiny change.
I'm a developer as opposed to an "ops" person but in my career I've had far more issues with "well the machine for X has a very specific version of Y installed on it. We've tried upgrading it before but we had to manually roll it back" than I have had this. Those configs exist _somewhere_ if you're using AWS or something similar. If you want to avoid the complexity, use IAC (terraform) and simple managed services (DigitalOcean is the sweet spot for me).
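As a sketch of how small that can stay, a minimal Terraform config for a single DigitalOcean droplet is about all you need to get those configs into version control; the droplet name, region and size here are placeholders, and the API token is supplied via the DIGITALOCEAN_TOKEN environment variable:

```hcl
# Minimal sketch: one droplet, tracked in version control.
# Droplet name, region and size are placeholders.
terraform {
  required_providers {
    digitalocean = {
      source = "digitalocean/digitalocean"
    }
  }
}

resource "digitalocean_droplet" "app" {
  name   = "app-server"
  image  = "ubuntu-22-04-x64"
  region = "fra1"
  size   = "s-1vcpu-1gb"
}
```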
If you don't have the problem it solves, don't use it. It's for when you need clustered services that scale up and down quickly. It's not the best way to deploy one server; it's the best way to deploy 10k servers, turn them off, and deploy them again. That's not even mentioning monitoring etc.
Sounds a bit like those places aren’t at the scale where a platform team makes sense.
What about when the one dev deploying their jar to an ec2 instance moves on to another company? How does the next dev even understand what this jar stuff is when they just want to push their SPA to Vercel?
Allowing devs to do whatever they want works at small new orgs but you need to put some kind of shape on it as they grow.
> What about when the one dev deploying their jar to an ec2 instance moves on to another company? How does the next dev even understand what this jar stuff is
I think that's more a problem of one person vs a team, as opposed to deploying JARs on EC2s vs cloud and a bunch of custom tooling.
If you have a team of devs and they all deploy JARs to EC2, one of them leaving won't be a problem, the rest will still know how to do it. If you were to have a single platform engineer who's built a bunch of custom tooling over a bunch of Kubernetes files, and nobody knows how it works or where the files are, and then they leave, you've got the same problem as the solo EC2 dev leaving.
> I think that's more a problem of one person vs a team
Not always. The "team" you imagine might also be 1 person deploying the jar and the rest copying them. When that person leaves suddenly know 1 knows what to do anymore either.
> Sounds a bit like those places aren’t at the scale where a platform team makes sense.
But so many orgs want to believe they are Google scale, and you're stuck with premature teams such as Platform Engineering. Then it just explodes, and suddenly you have a Director of Platform Engineering and multiple sub-teams, and then OKRs, ADRs, RFCs, team charters and perpetual Kubernetes upgrades, and nobody can question its existence anymore.
SOC2 is behind a lot of that. Companies need to be compliant to sell into enterprises. They don’t want to go through the compliance process for each individual team, so they try to standardise. Lots of locally optimised teams don’t necessarily make an effective and efficient system.
Lots of companies do random things that just look good from the outside. I've been to places where code review wasn't a thing until SOC2 forced their hand and even then it wasn't well enforced.
If your whole company knows infrastructure well enough to build their own stuff and integrate it with existing solutions that are more or less islands of their own, then you don't need a platform engineering team.
PE teams are more necessary when your development teams grow to include people who don't know infrastructure, or when your compliance and security requirements need to scale past most developers' knowledge. At that point the overhead and abstractions are worth it.
> Now I need to write a dockerfile, loads of manifests, know about ingresses, use whatever custom tooling and abstractions they've built on top of the cloud provider.
Generally speaking, on every platform team I've worked on we've been able to maintain the ability of developers to continue to interact with the raw infrastructure as code. That's a careful dance done solely for the benefit of power users. Not every PE team knows this lesson though.
It’s hard. I’ve been on one of these kind of teams for a few years now.
We have probably 50-70 teams and well over 2000 deployable products.
There are good things, for sure. But of the five teams, we're the only one that is focused on the 'customer' (application developers).
The devops/infra teams provide ways for AppDevs to build what they need, but there seem to be no good abstractions being made.
Our team is named and presented as a team that provides common libraries, templates, 'golden paths', etc. But the reality is we have barely any time for that. Instead we get tasked with projects that are indeed important from a $$$$ perspective, but that don't fit well into an existing capability team.
Which is fine, but it feels dishonest to the rest of the engineers that are using our products thinking that’s the main thing we do.
It's fun for hacking, but it won't scale gracefully. It's like saying all you need is create-react-app; usually a few months into the project you realize that it won't scale well, and then the shit show begins.
But our 6x ec2.small instances could serve the whole country buying public transport tickets every day. The k8s setup at a different place serves like 1/100th the amount of daily purchases, but has over 150 Python pods to handle different stuff. Yeah, Python is slow, but the complexity of the infrastructure is just insane. Yes, it's infinitely scalable, but we would never need that.
> Now I need to write a dockerfile, loads of manifests, know about ingresses, use whatever custom tooling and abstractions they've built on top of the cloud provider.
Doesn’t sound like there’s a whole lot of platform engineering here.
The general aim for a platform engineering team should be a UI/CLI that allows a dev to get a new service into production in minutes. Metrics, tracing, monitoring, alert routing, logging, DB, CI/CD, service/RPC stubs etc all done for you so you can get to writing code fast and not worry about the tower of complexity underlying all of this.
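To make that concrete, a purely hypothetical session with such a CLI might look like the following; the "platform" command and all of its flags are invented for illustration, not a real tool:

```sh
# Hypothetical internal CLI; the "platform" command and its flags are made up.
platform service create checkout-api --template java-spring   # scaffolds repo, CI/CD, dashboards, alerts
platform deploy checkout-api --env staging                     # builds, ships and rolls out the service
platform logs checkout-api --env staging --tail                # streams logs without touching the cluster
```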
If "stability and velocity" are anything but metrics with empirical data as to their validity compared to historical metrics, perhaps you are misinformed about the reasoning behind them.
Adoption of products that enable one particular flavor of org structure that is shown to be useful won't help you if you don't adjust your org structure.
Without adoption of practices that take advantage of those shifts in complexity you will never receive the full benefits from them.
I highly encourage you to research the reasons behind these shifts in complexity, particularly the ways they are intended to increase independence of teams to increase organizational scaling ability.
Empathy is another area to perhaps work on. Because I was that 'single guy' for years and guess what, I missed every family graduation, wedding and funeral for a decade; had months where I had to wake up and restart your JBoss instances every 45 minutes, 24 hours a day, and then still had to be in the office at 9am for meetings where you would punt the ticket to fix it to the next sprint.
Platform engineering done right is like an internal SaaS provider, and you should have embeds to help you with interfacing with them. Abstraction to mitigate vendor lock-in using tools like Terraform is a good practice, but don't go super custom.
But you can be bitter and complain, working for a company who chooses solutions on the golf course; or you can figure out what your org needs, find sponsors and allies and make positive changes.
If you feel the platform group is adding hurdles, you aren't working at a place that is doing platform engineering, or you aren't understanding why some requirements need to be implemented.
That said, as all cloud providers are SoA-based, large amounts of custom tooling are a red flag that your org is not SoA-based, and you are going to have a bad time anyway.
As for manifests and egress, if you didn't care about them before, you probably were releasing insecure, unreliable balls of mud. So yes there will be an adjustment to becoming more of a software engineer. But that is just the reality of working in larger systems on more professional teams.
If you dislike a SoA model, ITIL and ITSM would have been true hell.
The company wide meetings to add basic services were way more subject to bike shedding and blocking in those days.
Anyways being proactive and helping a company alignment with modern practices doesn't happen by itself. If you are passionate about this, run with it.
For some services, copying a jar file is a completely valid pattern in SoA FWIW, and adding complexity without value is an anti-pattern.
Just like with programming, it is easy to forget you aren't the customer and to implement features that your customers don't want or need. Assuming that you don't have a reputation for being a Karen, try communicating with the platform team and let them know what your pain points are. If they won't let you in the room, make your case to someone who will sponsor you to get in the room.
But first realize that having a single person be a single point of failure, then expecting them to sacrifice their entire life to be on call for an entire organization simply isn't realistic.
I wish I hadn't lost two long term relationships and decades of family time doing it. I didn't do me or the companies I was at any favors.