Almost ALL manuals for setting something up on AWS, whether for third-party tools or sometimes even AWS's official tools, effectively have a step that requires admin permissions, like being able to attach arbitrary policies to an instance.
At that point you can either:
- give everybody Admin access
- involve the same 2 or 3 trusted people in processes that they shouldn't really be involved in
- dig down into permissions to try to build something yourself that apparently the people writing the manuals gave up on doing
Looks very bleak. Does anybody have a different view on this? It seems unlikely that big organizations work like this on AWS.
* Have a separate AWS account for each developer that allows for experimentation without risking shared environments. Engineers can have admin access on their own account to enable rapid prototyping and experimentation.
* Have separate AWS accounts for separate services, ideally a separate account per service/stage pairing. If you're more mature in your cloud operations you may go for one account per service/stage/region, or per cell if you're running a cellular architecture. If you're just a small startup, maybe just go for separate accounts for beta and prod stages.
* Because running services have their own accounts, the blast radius of an engineer modifying permissions is already quite limited. Still, changes to infra (including IAM changes) should all go through a CI/CD pipeline: engineers can only change infra by submitting a PR to update the IaC definition and passing your peer review and automated checks. Tools like AWS CDK with self-mutating pipelines, where the pipeline itself is modeled via IaC, are great for this (see the pipeline sketch after this list).
* Use a higher level of abstraction to manage IAM permissions, like AWS CDK. Manually figuring out the permissions needed is a nightmare and an exercise in frustration.
* Try to keep AWS credentials ephemeral with whatever third-party services you're working with.
* If you're using AWS CodePipeline, run all your pipelines in a separate DevOps account where all the pipelines live. Set up a role on that account that engineers can use to debug pipelines without mutating them. If you're using GitHub Actions or something else, you obviously need to guard permissioning of that other service and follow decent practices for getting AWS credentials. In the case of GitHub, for example, use OIDC to assume ephemeral credentials for an IAM role rather than saving an access key/secret key pair as secrets (see the OIDC sketch below).
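To make the self-mutating pipeline point concrete, here's a minimal CDK Pipelines sketch, assuming a placeholder repo name and a CodeStar connection you'd create separately; treat it as a starting point, not a full setup:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as pipelines from 'aws-cdk-lib/pipelines';
import { Construct } from 'constructs';

// Minimal self-mutating pipeline: the pipeline's own definition lives in the
// repo, so changes to it (and to the app's IAM) go through PR review.
export class DeployPipelineStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const pipeline = new pipelines.CodePipeline(this, 'Pipeline', {
      // selfMutation defaults to true: the pipeline re-deploys itself
      // whenever its definition changes in the repo.
      synth: new pipelines.ShellStep('Synth', {
        // 'my-org/my-service' and the connection ARN are placeholders.
        input: pipelines.CodePipelineSource.connection('my-org/my-service', 'main', {
          connectionArn: 'arn:aws:codestar-connections:us-east-1:111111111111:connection/example',
        }),
        commands: ['npm ci', 'npm run build', 'npx cdk synth'],
      }),
    });

    // Application stages (beta, prod, ...) would be added via pipeline.addStage(...).
  }
}
```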
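And for the GitHub OIDC point, a rough CDK sketch of the role a workflow would assume; the org/repo and the attached permissions are placeholders, the idea being that no long-lived access keys are ever stored in GitHub:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

// GitHub Actions assumes this role via OIDC instead of using stored
// access/secret keys; credentials are short-lived STS tokens.
export class GithubOidcStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const provider = new iam.OpenIdConnectProvider(this, 'GithubProvider', {
      url: 'https://token.actions.githubusercontent.com',
      clientIds: ['sts.amazonaws.com'],
    });

    const deployRole = new iam.Role(this, 'GithubDeployRole', {
      assumedBy: new iam.WebIdentityPrincipal(provider.openIdConnectProviderArn, {
        StringEquals: {
          'token.actions.githubusercontent.com:aud': 'sts.amazonaws.com',
        },
        StringLike: {
          // Only workflows in this (placeholder) repo can assume the role.
          'token.actions.githubusercontent.com:sub': 'repo:my-org/my-service:*',
        },
      }),
      maxSessionDuration: cdk.Duration.hours(1),
    });

    // Grant only what the deployment actually needs, e.g. permission to
    // assume the CDK bootstrap roles rather than broad admin.
    deployRole.addToPolicy(new iam.PolicyStatement({
      actions: ['sts:AssumeRole'],
      resources: ['arn:aws:iam::*:role/cdk-*'],
    }));
  }
}
```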
Major caveat. You’re going to want to have at least usage monitoring tools, and ideally usage limiting tools, set up before you give developers their own individual AWS accounts. At the sort of scale where sharing a dev account stops being viable, accidentally creating and leaving expensive resources around to rack up a bill becomes far too easy.
This is the way. I’ve seen this happen countless times. It’s happened to me too. It’s happened to colleagues.
The worst case I’m aware of from first-hand knowledge was a large cluster of resources that got deployed for a product demo by a sales engineer and forgotten about. Turned into a nice ~$100,000 surprise in the quarterly budget.
Yup. The scope of the discussion was around permissioning/security, so I didn't get into billing, but you're absolutely right.
You should have CloudTrail, billing alarms, and dashboards all set up. It may also be a good idea to set up automatic spring cleaning that nukes resources every two weeks or so unless they carry special tags marking them for retention.
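For the billing-alarm part, a small CDK sketch; the threshold and topic are made up, and note the AWS/Billing metric only exists in us-east-1 and requires billing alerts to be enabled on the account:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cw_actions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as sns from 'aws-cdk-lib/aws-sns';
import { Construct } from 'constructs';

// Alarms when the account's estimated charges cross a threshold.
// Requires "Receive Billing Alerts" enabled; deploy in us-east-1.
export class BillingAlarmStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const topic = new sns.Topic(this, 'BillingAlerts');

    const estimatedCharges = new cloudwatch.Metric({
      namespace: 'AWS/Billing',
      metricName: 'EstimatedCharges',
      dimensionsMap: { Currency: 'USD' },
      statistic: 'Maximum',
      period: cdk.Duration.hours(6),
    });

    const alarm = new cloudwatch.Alarm(this, 'MonthlySpendAlarm', {
      metric: estimatedCharges,
      threshold: 500, // placeholder: pick a number per dev account
      evaluationPeriods: 1,
      comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    });

    alarm.addAlarmAction(new cw_actions.SnsAction(topic));
  }
}
```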
I came to realize that AWS is built for people whose full-time job is AWS. It's like Jira: to get the most out of it you need a Jira workflow/configuration expert. You can make do as a smart cookie with shit to get done, but ultimately you'll be making compromises by virtue of not having time to dedicate to fully grokking it.
So I guess my experience is that the answer is supposed to be option 3 on your list, but that only happens if you have someone to delegate as the AWS expert and give them time to actually do that role.
This is also why just 'use the cloud' is not really good advice; as with any broad, scaled system, you can't just tack it onto a project you happen to be doing. Conversely, not using the cloud because you need to know what you're doing isn't a good argument either: if you are going to build something that has to scale, be a bit elastic, and build on existing knowledge, you will end up using a larger system that requires domain-specific knowledge either way.
AWS isn't a hosting company (and that includes the let-us-pretend-we-are services; looking at you, Lightsail), just like Azure and GCP aren't. But even if you go with a hosting company (Vultr? prgmr? DO?) you'll still need to know how to build virtual machine images, how to deploy them, and how to cycle them. None of that knowledge will be native to whatever project you are building.
The only thing that gets somewhat close is the Deno and Vercel style of stuff, but even then you'll need to know how those work in order to make your project fit.
No, this is exactly how it works. There’s the technically ideal policy, and a bunch of exceptions which are behind a “break glass in case of emergency” account. Realistically, nothing works however many hours you spend in IAM unless you use the emergency account, so that’s what everyone does.
In practice, you manually bootstrap your CI/CD, and if you really want to be secret squirrel secure, you can audit that.
Everything going forward gets committed as code, goes through peer review, uses least-privilege access, and is automatically deployed by the robot.
At that point, you might hand out some break-glass credentials for a few items, or for when your CI/CD goes down, but usage of those should alert and be manually reviewed.
You now have a pretty decent setup, as everything is checked.
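One way to get the "usage should alert" part above, sketched as an EventBridge rule over CloudTrail management events (assuming a trail is logging them); the account ID, role name, and topic are placeholders:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as sns from 'aws-cdk-lib/aws-sns';
import { Construct } from 'constructs';

// Fires whenever the break-glass role is assumed, so every use gets
// a notification and can be reviewed after the fact.
export class BreakGlassAlertStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const topic = new sns.Topic(this, 'BreakGlassAlerts');

    const rule = new events.Rule(this, 'BreakGlassAssumeRole', {
      eventPattern: {
        source: ['aws.sts'],
        detailType: ['AWS API Call via CloudTrail'],
        detail: {
          eventName: ['AssumeRole'],
          requestParameters: {
            // Placeholder account ID and role name.
            roleArn: ['arn:aws:iam::111111111111:role/BreakGlass'],
          },
        },
      },
    });

    rule.addTarget(new targets.SnsTopic(topic));
  }
}
```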
That's what we say we do. The reality is that most of the time the entire team is logged in via delegated admin accounts, working out what the automation fucked up, or deleting resources which don't want to go away when the automation craps out, or futzing with things that are broken or missing in Terraform.
I don't think it's a lie; just a pain in the arse.
I've worked in, and defined, secure environments where your only option is to raise another PR. For me this often degenerates into the PR spinning up a copy in another account, and the PR history being "test, test, test, ffs, test".
The more practical way around the problem is to have non-production accounts where you can delegate admin without the risk to the production and tooling accounts, and use those to make sure your code works before you actually raise the PR and merge.
It's prudent to remember that such security should be risk-based, so while there's a risk that you leak non-prod credentials to a crypto farmer, the risk of your teams being non-productive is greater.
Largely agree with you though. I usually just run k8s environments and make concessions to use the cloud stuff when it makes sense. Not being able to run locally is akin to punching yourself in the nuts. Creating a namespace in k8s and deploying ephemeral tests either locally or in a dev cluster is a better experience in my opinion.
But you do have a lot of folks who treat their cloud provider like their favourite sports team.
We have a 20:1 ratio of developers to platform engineers, where developers cannot do anything in AWS except via Infrastructure as Code. Works very well. Yes, you do need to make sure you only platform things you actually need, and not try to abstract the entire service catalog. But if you just pick 4 RDBMS options, 2 document-oriented options, 2 queues, 2 pub/subs, 1 object store, 1 traffic management option, and 1 catch-all observability option, you really don't need to do all that much over time. Granted, this doesn't work for small-scale stuff or immature organisations.
IAM is a big pain. When deploying "something new" (say, a lambda executing another lambda that accesses dynamo...) I spend more time screwing around with Terraform to configure IAM roles, permissions, and other garbage than I do actually debugging the code. Many junior developers don't understand it at all. "It works with my credentials!" (Of course it does: you're running it with admin access in your own dev account.)
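For what it's worth, that exact scenario is where the CDK grant methods mentioned earlier in the thread earn their keep; a rough sketch (function and table names are made up), where the grant calls generate the scoped policies instead of you hand-writing them:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

// "A lambda executing another lambda that accesses dynamo": the grant
// calls below synthesize least-privilege IAM policies for each role.
export class OrdersStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const table = new dynamodb.Table(this, 'OrdersTable', {
      partitionKey: { name: 'pk', type: dynamodb.AttributeType.STRING },
    });

    const worker = new lambda.Function(this, 'WorkerFn', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/worker'), // placeholder path
      environment: { TABLE_NAME: table.tableName },
    });

    const api = new lambda.Function(this, 'ApiFn', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/api'), // placeholder path
      environment: { WORKER_FN: worker.functionName },
    });

    table.grantReadWriteData(worker); // dynamodb:GetItem/PutItem/... on this table only
    worker.grantInvoke(api);          // lambda:InvokeFunction on the worker only
  }
}
```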
We have a pretty solid set of organisation roles for devops, devs, and others across several tens (maybe hundreds) of accounts. Our accounts are divided into prod and nonprod accounts, so if you want the devs to have fewer permissions in prod, that’s perfectly possible.
The only thing we have to be careful of is cross-account AssumeRole chaining.
But really it’s a lot of work, and I guess a competitive advantage, and quite specific to every organisation, so nobody has ever made something like that open source.
Automated processes can help. But I’m not sure how you get around this universal feature of computing platforms. Someone has to have access to grant restricted access.
If this feels wrong to you, then you’re probably right that AWS is not the appropriate level of abstraction for your problem.