Go's default toolchain is fine, everything else is optional. Some questionable advice in the article:
- Vendoring dependencies using "go mod vendor" is not a good default workflow - it bloats the repo, the checked in code is impossible to review, and is generally a pain to keep up to date. Don't, unless you really have to.
- There's no point in stripping a binary or even using UPX on it unless you're targeting extremely low-memory environments (in which case Go is the wrong tool anyway); all it'll do is make it harder to debug.
I'm on the vendor bandwagon; always have been. I don't want a github outage to dictate when I can build/deploy. Yes, that happened. That is why we vendor :).
Now you could set up a proxy server, but I don't want to do that. I'm pretty sure I have a few vendored packages that no longer exist at their original import path. For code reviews, we put off checking in the vendor path until the end if possible.
I have to strongly agree. Third party repos move, code on the internet disappears or silently changes, connectivity goes away at the most awkward time. You always want a point-in-time copy of your code and all dependencies under your control. Sometimes even for legal or security reasons.
Always vendor your dependencies in your private Git repo or a proxy you control. Or heck, even in some long term backup solution if you must. Experience trumps theory.
> I don't want a github outage to dictate when I can build/deploy. ...I'm pretty sure I have a few vendored packages that no longer exist at their original import path.
Go now has an automatic, transparent caching proxy at proxy.golang.org (the module proxy the toolchain uses by default; pkg.go.dev is the companion documentation site). If your build has ever worked, it should continue to work even if the original source goes away. Your build should only break if proxy.golang.org goes down and the upstream source is also unavailable (down or moved).
I do all my vendoring through a "cache-proxy" box (covering lots of upstreams). That box always runs; I only need the upstream the first time I fetch a package. It doesn't bloat my code, guarantees packages stay available, and makes audits of vendored stuff easy.
UPX only means smaller files on disk, and it comes at a cost: it tends to increase memory requirements, because the compressed binary can no longer be mapped directly into memory (unless it's first decompressed somewhere in the filesystem).
Worse, if you run multiple instances of the same binary, none of them can be shared.
A bit simplified: without UPX, 100 processes of a 100 MB binary require only 100 MB of RAM for the code; with UPX, 10 GB.
Edit: In reality, likely only a fraction of that 100 MB actually needs to be mapped into memory, so without UPX the true memory consumption is even less than 100 MB.
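The back-of-the-envelope arithmetic above can be sketched as a tiny Go program (the numbers are the comment's illustrative ones, not measurements):

```go
package main

import "fmt"

// residentCodeMB estimates the RAM needed for the code pages of n copies of
// a binary. A plain binary's file-backed pages are shared by the kernel, so
// the cost is paid once; a UPX'd binary decompresses into private memory in
// every process, so the cost scales linearly.
func residentCodeMB(binaryMB, processes int, upxed bool) int {
	if upxed {
		return binaryMB * processes
	}
	return binaryMB
}

func main() {
	fmt.Println("plain:", residentCodeMB(100, 100, false), "MB") // shared pages
	fmt.Println("upx'd:", residentCodeMB(100, 100, true), "MB")  // ~10 GB
}
```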
All true, but I think a compressed iso9660fs can actually support dynamic paging - the pages are decompressed into memory, obviously, but can be demand paged without staging them to media.
Can you expand on this a bit? I use upx at work to ship binaries. Are you saying these binaries have different memory usage upx’d than they do otherwise?
Normally the operating system simply maps binaries, executables, and loadable libraries (.dylib, .so, .dll, etc.) into memory. The cost is approximately the same whether you do this once or 1000 times, and the code is executed from the mapped area as-is.
However, when a binary is compressed, this cannot work, because the file holds compressed data. The only workaround is to allocate some memory, decompress the binary there, map the region as executable, and run it from there. This results in a non-shareable copy of the data for each running instance.
Also impacts startup time. Really it's only appropriate for situations like games where you're very confident there will be just one instance of it, and it'll be long-running.
And even then, it's of dubious value when game install footprints are overwhelmingly dominated by assets rather than executable code.
I'm curious, was the practice of using upx there before you got there? We generally A/B test changes like this pretty thoroughly by running load tests against our traffic and looking at things like CPU and Memory pressure in our deploys.
While there are valid arguments against vendoring dependencies, I’m not convinced this is one of them in the typical case. It’s exceptionally easy to ignore certain directories when reviewing PRs in GitHub (although I still wish this was available as a repo-level preference), and I’d hope at least this would be the same in Gitlab, BitBucket, etc. I don’t review vendored dependencies, and I wouldn’t expect anyone else to, although the utility of that is admittedly domain-dependent.
Go also has the benefit that its dependencies tend not to be in deep chains, so the level of repo bloat when vendoring is usually not too terrible, at least relatively speaking.
Yeah, if you have a problem with it split it into two separate commits to review separately.
But WTF is this about not reading your dependencies? Read your dependencies! It is the most amazing superpower when someone says "uh, I don't know how Redux handles that" and you can just tell them, because you have read Redux. It's also how you'll know: do they have tests? Are they doing weird things like monkeypatching classes or binaries at runtime? "Oh, the request is lazy: it doesn't get sent unless you attach a listener for the response." What would it look like for the debugger to step through their code, and is that reasonable for me to do, or will I end up 50 layers deep in the call stack before the code actually does the thing?
I get it, this dependency is 100,000 LOC; if you printed it out, that's basically 5 textbooks of code, and you'd need a year to read all of it and truly understand it... Well, don't use that dependency! "But I need it for dependency injection..." If that's all, then use a lightweight one, or roll a quick solution in a day, or explicitly inject your dependencies in 5 pages of code. My point is just that you have so many options. If that thing is coming in at 5 or 50 textbooks or whatever it is, what it actually means is that you are pulling in something with a huge number of bells and whistles and you plan on using 0.1% of them.
In this context, what would be useful is something like linker pruning at the source level.
That is, when your code is compiled, the linker can prune code that is never called. A feedback mechanism could then show which parts of the code are actually used (like looking at the linker's .map file).
Google's Closure compiler was doing this for JavaScript, where it matters because network bandwidth is a limited resource in some places. There it was called “tree shaking” if you want the jargon name for it.
There is a benefit to using "go mod vendor". Some corporate environments lock down their CI/CD pipelines. By vendoring everything, the CI/CD does not need to make external HTTP calls.
So, I don't bother with vendoring my dependencies (usually), but you have it the wrong way round.
Vendoring makes it more likely you're going to review the changes, because you can quickly eyeball whether or not the changes look significant, which is something you often won't get out of a go.sum change.
That's not totally without cost though, as it can break workflows that cherry-pick commits between branches, e.g. a main/master branch vs. stable release branches.
I don't think anyone is saying it's without cost, just that there are certain circumstances where you might want to bear the cost.
There's a general question of how you build confidence that your dependencies aren't compromised, and there are steps you can take to mitigate that without reading code. But if everyone adopted that stance, we'd likely have no mitigations at all.
If the problem is distribution, what's wrong with gzip? All the upside of UPX and none of the downsides. If your distribution method is HTTP, you don't even have to write any code beyond setting a Content-Encoding header.
I don't really believe that; at NIC speeds it makes pretty much zero difference, even on 30k servers. Shaving a couple of ms, at worst a few seconds, versus modifying a binary: definitely not worth it.
The servers are not all on gige. Many are on 100mbit and yes, that saturates the network when they are all updating. I learned through trial and error.
The updates are not pushed, they are pulled. Why? Because the machines might be in some sort of rebooting state at any point. So trying to first communicate with the machine and timeouts from that, would just screw everything up.
So, the machines check for an update on a somewhat random schedule and then update if they need to. This means that a lot of them updating at the same time would also saturate the network.
I’m curious why you’ve got servers on 100Mb. Last time I ran a server on 100Mb was more than 20 years ago. I remember the experience well because we needed AppleTalk support which wasn’t trivial on GbE (for reasons unrelated to GbE — but that’s another topic entirely).
What’s your use case for having machines on 100Mb? Are you using GbE hardware but dropping down to 100Mb, and if not, where are you getting the hardware from?
Sounds like you might work in a really interesting domain :)
Not the GP but edge devices on wifi/m2m are another scenario where you're very sensitive to deployment size.
Which can also be solved with compression at various other stages of the pipeline as mentioned by other commenters, but just to say that that's an easy case where this matters.
For large-ish scale distributed updates like that, maybe some kind of P2P type of approach would work well?
IBM used to use a variant of Bittorrent to internally distribute OS images between machines. That was more than a decade ago though, when I was last working with that stuff.
Another issue with that is that the systems I was running can go offline at any time. P2P, which could work, kind of wants a lot more uptime than what we had. It would just add some complexity to deal with individual downtime.
CI would run and build a binary that was stored as an asset in GitHub. Since the project is private, I had to build a proxy in front of it to pass the auth token, so I used CF Workers. GH also limits the number of downloads, so CF also worked as a proxy to reduce the connections to GH.
I then had another private repo with a json file in it where I could specify CIDR ranges and version numbers. It also went through a similar CF worker path.
Machines regularly/randomly hit a CF worker with their current version and ip address. The worker would grab the json file and then if a new version was needed, in the same response, return the binary (or return a 304 not modified). The binary would download, copy itself into position and then quit. The OS would restart it a minute later.
It worked exceptionally well. With CIDR based ranges, I could release a new version and only update a single machine or every machine. It made testing really easy. The initial install process was just a single line bash/curl to request to get the latest version of the app.
I also had another 'ping' endpoint where I could send commands to the machine, to be executed by my golang app (running as root). The machine would ping, and the pong response would be some json that I could use to do anything on the machine. I had a Postgres database running in GCP and used GCP Cloud Functions. I stored machine metrics and other individual worker data in there that just needed to be updated on every ping. So I could just update a column, and the machine would eventually ping, grab the command out of the column, and then erase it. It was all eventually consistent and idempotent.
At ~30k workers, we had about 60 requests per second 24/7 and cost us at most about $300 a month total. It worked flawlessly. If anything on the backend went down, the machines would just keep doing their thing.
Sounds like an interesting problem to have. Would something peer-to-peer like BitTorrent work to spread the load? Utilize more of the networks' bisectional bandwidth, as opposed to just saturating a smaller number of server uplinks. I recall reading many years ago that Facebook did this (I think it was them?)
> Vendoring dependencies using "go mod vendor" is not a good default workflow - it bloats the repo, the checked in code is impossible to review, and is generally a pain to keep up to date. Don't, unless you really have to.
Vendoring dependencies is a nice way of using private Go repositories as dependencies in CI builds without importing any security keys. Vendor everything from dev machine, and build it in CI. You don't even need an internet connection.
Sure, it makes sense. But that's another moving part in the machinery that you have to configure and maintain. It also makes sense to just keep things simple and vendor dependencies, sacrificing some extra space for simplicity of configuration. It just depends on what tradeoff you're looking for.
A Go vendoring pattern that I've found very useful is to use two repositories, the first for the main "project" repository, then a second "vendoring" repository that imports the first as a module, and also vendors everything.
This may require a few extra tricks to plumb through, for example, to make all cmd's be externally importable (i.e. in the project repository, transform "cmd/foo/%.go" from being an unimportable "package main" into an importable "cmd/foo/cmdfoo/%.go", then have a parallel "cmd/foo/main.go" in the vendoring repository that is just "func main() { cmdfoo.Main() }", same as you have in the project repository in fact).
Vendoring aside, this is also a useful pattern if you're "go:embed"ing a collection of build artefacts coming from another source, like a frontend HTML/JS/CSS project.
At this point, why not do the clean thing and have a forked repo per dependency. Setting up your "monorepo" like construct is as easy as a gitignore and a json file listing your dependencies and the specific hash, then have a script pull them and do a checkout.
This lifecycle is vastly cleaner and easier to update/control than vendoring, and also forces you to actually have explicit copies of everything your build needs in the same way that vendoring does, but in a cleaner, separated, traceable, manageable way.
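A minimal sketch of that manifest-driven setup (the json format, field names, and forge URL are invented): a file pins each fork to an exact commit, and a script turns it into clone/checkout steps.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dep pins one forked dependency to an exact commit in your own forge.
type dep struct {
	Name   string `json:"name"`
	URL    string `json:"url"`
	Commit string `json:"commit"`
}

// commands renders the git steps that materialize the pinned deps into a
// gitignored third_party/ directory; a real script would exec these
// instead of printing them.
func commands(manifest []byte) ([]string, error) {
	var deps []dep
	if err := json.Unmarshal(manifest, &deps); err != nil {
		return nil, err
	}
	var out []string
	for _, d := range deps {
		out = append(out,
			fmt.Sprintf("git clone %s third_party/%s", d.URL, d.Name),
			fmt.Sprintf("git -C third_party/%s checkout %s", d.Name, d.Commit))
	}
	return out, nil
}

func main() {
	manifest := []byte(`[
	  {"name": "chi", "url": "git@yourforge:forks/chi.git", "commit": "0c9f4bd"}
	]`)
	cmds, err := commands(manifest)
	if err != nil {
		panic(err)
	}
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```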
> - Vendoring dependencies using "go mod vendor" is not a good default workflow - it bloats the repo, the checked in code is impossible to review, and is generally a pain to keep up to date. Don't, unless you really have to.
Go's setup is that if you don't vendor your dependencies then your build might break at any time, no?
> proxy.golang.org does not save all modules forever. There are a number of reasons for this, but one reason is if proxy.golang.org is not able to detect a suitable license.
If you're vendoring something without an appropriate license, you're skating on thin ice legally.
That's just one possible reason. The disclaimer does not specify all the possible reasons the proxy would drop a saved version. Treating it more like a cache seems appropriate.
If you're doing something stupid like "create a clean virtual environment for every build", then yeah, your build might break when you lose the internet or the packages disappear. Just don't ever do that stupid thing.
You're not expected to review the committed dependencies any more than you're expected to review the external repositories every time you update go.mod/sum. If you don't care, just ignore those parts - if you do care, you were already doing it.
I'd go way farther than "a bit of a project smell." I literally cannot think of a single instance in which vendoring a dependency for any reason other than, say, caching it for CI so you don't have to worry that the maintainer pulls a `left-pad` on you, has gone well.
If the package has bugs, you're far better off either waiting for upstream fixes, working around the bug in your application code, or just switching to a different library. That goes double if the library you're using is missing a feature you need, even if it's scheduled for the next version release.
Unless you're prepared to maintain a full-on fork of the dependency (and, if you do, please make it public), everything about vendoring for these reasons is 100% bad for you for very little incremental benefit. It's like the joke about regular expressions ("You have a problem and think 'I'll use regexes to solve it.' Now you have two problems"), except it's not a joke, and it sucks way more.
TL;DR: Vendoring to cache for CI/build servers, yes. Any other reason, just don't; it's not worth the headaches.
> - Vendoring dependencies using "go mod vendor" is not a good default workflow - it bloats the repo, the checked in code is impossible to review, and is generally a pain to keep up to date. Don't, unless you really have to.
> - There's no point in stripping a binary or even using UPX on it unless you're targeting extremely low memory environments (in which case Go is the wrong tool anyways), all it'll do it make it harder to debug.