On building 30k Debian packages (moyix.blogspot.com)
139 points by zdw on Feb 6, 2022 | 42 comments


> Use an SSD:

If you can get your hands on a buttload of memory, try using a tmpfs mount instead. For I/O-heavy builds it's like turning on ludicrous speed.
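For example (a rough sketch; the mount point and size are placeholders to adjust for your machine):

    # Back the build area with RAM; writes never touch the disk
    mount -t tmpfs -o size=64G tmpfs /mnt/build
    # ...run the builds with their work dirs under /mnt/build...
    # Unmounting (or a reboot) throws the contents away
    umount /mnt/build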

> trying to automate everything in a language where failures are silent

  set -u -e -o pipefail

> and can do exciting things like call "rm -rf /" when you meant "rm -rf ${foo}/${bar}"

I mean, if you're calling any recursive-file-unlinking thing from Python you face the same problem. The same solution applies: normalize the path (readlink, realpath) and test whether it's "/" before adding it as an argument to the command. (Also, don't use -f unless you need to; it ignores errors.)

  set -e -u -o pipefail
  _rm_r () {
      # Resolve each argument to a canonical path and drop anything
      # that normalizes to "/" so the worst case can never happen.
      declare -a rmargs=()
      for arg in "$@" ; do
          fullpath="$(readlink -f "$arg")"
          if [ ! "$fullpath" = "/" ] ; then
              rmargs+=("$fullpath")
          fi
      done
      # Only call rm if anything survived the filter; expanding an empty
      # array trips 'set -u' on older bash, and rm errors out with no args.
      if [ "${#rmargs[@]}" -gt 0 ] ; then
          rm -r "${rmargs[@]}"
      fi
  }
  _rm_r "${foo}/${bar}"

But I agree with OP that you should use whichever language you're most familiar with, to produce the best results possible in the least amount of time.


Yeah, the rancor-for-cheap-laughs schtick fell flat for me.

>Avoid shell hackery. This is probably controversial

You bet. It takes less than 5 minutes to learn how to make shell _not_ fail silently, and an additional 10 minutes (I'm overstating the time) to figure out how to check whether variables will evaluate to empty strings -- to avoid deletion catastrophes.
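A minimal sketch of that guard, using the ${var:?} expansion (variable names here are just illustrative):

    # Abort (even without set -u) if foo or bar is unset or empty,
    # so this can never collapse into "rm -rf /":
    rm -rf -- "${foo:?}/${bar:?}"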

Not saying anyone should be "forced" to learn this. It's simply more accurate and helpful to say, "I don't know and I don't want to learn" instead of blaming a language.


> "I don't know and I don't want to learn" instead of blaming a language.

I'm a staunch proponent of "anything over 100 lines should not be written in bash". You can do many more things more easily, and with less google-fu spent hunting down that niche SO answer.

Unless you have absolutely no clue about the other scripting languages that already exist on Debian, just don't do it; there are so many better options that actually last and can even be read by other people.


I used to say the same thing, back when the only bash I knew was from whatever tidbits I picked up as I needed them. Then I got more familiar with bash. Now, nearly everything I write is bash. I have realized that more complex languages mostly just make things.... more complex, rather than simplifying them. For simple solutions I stick to simple tools, even if it takes more than 100 lines.

Here's one tool I wrote in Bash. It's an alternative to Terragrunt. 616 lines. https://github.com/pwillis-els/terraformsh

Here's another one. A static-app package manager/versioner/pinner thing. 409 lines of POSIX shell (the code is intentionally ugly, don't take this as an example of good code). https://github.com/peterwwillis/clinst/blob/main/clinst

For both of these tools, 90% of what they do is just managing environment variables and calling other programs, and that's what shell scripts are best at.


Sure, sorry if I was too harsh on bash here :) It really wasn't the right tool for the job in some parts of my use case though – parsing info out of Sources.gz using awk and sed and friends is just plain bad compared to using python-debian to get the info directly from the apt database.


I read it as somebody who knows they're sufficiently bad at bash to not be safe with it admitting that and doing something more sensible for their specific context.

NB: When I say "sufficiently bad at bash" I think of myself first. Just because I know about 'set -e' and friends doesn't mean I trust myself to get it right, nor do I believe anybody else should trust me either.


But Python works great when called from shell! I learned Python first, then shell, so my shell scripts largely use Python instead of sed/awk (except for the most basic cases).

This includes both inline invocations like python -c, but more commonly simply a .py file in the same directory as the shell script.

So I think of shell as the main(), and then it often calls Python and R. (One big reason I started using shell was for preparing data in Python and then analyzing it in R.)

That said, I think shell needs better support for JSON and TSV so you can interoperate more smoothly with languages like Python, JS, and R. Things like the 'read' builtin aren't great for this.
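As a sketch of that shell-as-main() pattern, with a hypothetical JSON step pushed into an inline python -c (jq would work too; the file names here are made up):

    #!/bin/sh
    set -eu
    # Hypothetical pipeline: prepare data in Python, analyze it in R
    python3 prepare_data.py --out data.tsv
    Rscript analyze.R data.tsv
    # JSON glue goes through python -c, since 'read' can't sensibly parse JSON
    version=$(python3 -c 'import json,sys; print(json.load(sys.stdin)["version"])' < meta.json)
    echo "analyzed data for version $version"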


Thanks for taking my criticism as constructive! (that's how I meant it.) I found myself very interested in the exploration and discoveries that guided your process.


I do have a buttload(ish) of memory (256GB), but given that I'm running 32 builds in parallel (which themselves often run in parallel) I was a bit worried about the amount of RAM needed for compilation (depending on the linker, things like LLVM can consume huge amounts of RAM during compilation). It's a cool idea though, maybe I'll give it a try for my next rebuild :)

The pipefail bit is a nice trick; I should remember to do this by habit when writing bash scripts (and to send the script through shellcheck)!


Maybe you should also look into eatmydata. It can speed up apt (when installing dependencies) by an order of magnitude. It will also speed up your builds in general. (I am guessing you don't care about a half-built package in case of a power failure.)
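Roughly, you just wrap the command (eatmydata LD_PRELOADs a shim that turns fsync and friends into no-ops):

    # fsync()/fdatasync() from anything run under eatmydata become no-ops
    eatmydata apt-get install -y build-essential
    eatmydata dpkg-buildpackage -us -uc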


This does look pretty nice, although I am not sure how well it would play with bear, which also uses LD_PRELOAD to intercept some calls.


Holy crap that's a lot of parallel builds. One thing to try would be fewer parallel builds, but increase the parallelization of the build tool per package, e.g. make -j16. Combined with the tmpfs and less resource contention from other parallel builds, it may end up faster, and you can clean up after each package so your tmpfs only has a couple of packages' worth of build files at once.

Also, are you using ccache / distcc?


The Debian build system already runs `make -j$(nproc)` whenever the package supports it. So in principle I'm sometimes running 1024 (32*32) concurrent compilations, but it's rare for that many to actually be active at once (most builds have large portions that can't be parallelized, like running ./configure or tarring up the build artifacts at the end) – right now I have 280 processes running in 32 containers.

Not using ccache/distcc since I'm just running this on my own workstation. I'm not sure ccache would help here because each build runs in an isolated container environment, to prevent cross-contamination with other builds. I'm also instrumenting each build using bear to generate a compilation database, and I think distcc would mess with that? (I can't imagine distcc knows how to transfer over the bear wrapper binaries / shared libs?)


Well the idea with ccache/distcc is to store the cache on the host or a shared volume and reference it from your containers. Your container builds remain isolated but don't waste time rebuilding objects that were already compiled (and re-compiling obvs gets much faster). Assuming same architecture, compiler, linker, etc I think it won't affect the isolated-build-ness. Here's some examples: https://cinaq.com/blog/2020/05/10/speed-up-docker-builds-wit... https://stackoverflow.com/questions/39650056/using-ccache-wh...
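A rough sketch of the container side with Docker, assuming a shared cache directory on the host (the paths and image name here are made up):

    # Every build container mounts the same host cache and puts Debian's
    # ccache compiler wrappers (/usr/lib/ccache) first in PATH
    docker run --rm \
      -v /srv/ccache:/ccache \
      -e CCACHE_DIR=/ccache \
      -e PATH="/usr/lib/ccache:/usr/local/bin:/usr/bin:/bin" \
      my-build-image dpkg-buildpackage -us -uc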

To be honest it's been... a very long time... since I've used these. But the theory is that it all "just works", in that if it can compile something with cache or remotely, it does, but if not, it just works as normal. So I think it would work for your use case? Maybe a Gentoo user can correct me :)


Have you thought of using something like container snapshots to build the full dep tree of things off shared container images? For example: you build ncurses, snapshot the layer it produces, then use that to build nano, vi, etc.
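Sketched with Docker (the image names and build script are hypothetical; podman/buildah would look about the same):

    # Build ncurses once and snapshot the resulting container as an image...
    docker run --name build-ncurses base-builddeps /build.sh ncurses
    docker commit build-ncurses snapshot/ncurses
    # ...then start everything that depends on it from that snapshot
    docker run --rm snapshot/ncurses /build.sh nano
    docker run --rm snapshot/ncurses /build.sh vim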


At the moment all the build-deps are installed as binary packages, so I think this wouldn't work? I thought of trying to figure out which packages had shared build dependencies and building a tree of images that way, but solving the dependency constraints to get an efficient tree sounded unpleasant.

I did experiment with just preinstalling the top 300 or so most common build dependencies, but this ended up causing more problems due to conflicts (and not just version conflicts – some packages behave differently at build time depending on what's already installed; an example is any R package which will use my old nemesis xvfb-run if it's present on the system [1]) than it saved in time or space.

I do feel a bit guilty about how much I'm hammering the Debian repos, though, so I will probably set up a local apt mirror and point my sources.list at that instead next time.
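A caching proxy like apt-cacher-ng is probably the lowest-effort version of that; a sketch (3142 is its default port, and the suite is whatever you're building against):

    # On the host:
    apt install apt-cacher-ng
    # In each container's sources.list, route fetches through the cache
    # (use the host's address, not localhost, from inside a container):
    deb http://<host>:3142/deb.debian.org/debian sid main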

[1] https://salsa.debian.org/r-pkg-team/dh-r/-/blob/master/dh/R....


Not exactly the same issue, but this general category of silliness is why I use Zsh for most of my scripts, instead of Bash or "Posix" Sh, whenever I can get away with it. The programmer experience in Zsh is much better than Bash around anything related to parameter expansion, quoting, and arrays, while still more or less using the same idioms and syntax.

In addition, the Zsh equivalent of the Bash "strict mode" is a bit stricter. Here's mine:

    emulate zsh
    setopt \
     err_exit \
     pipe_fail \
     warn_create_global \
     warn_nested_var \
     no_unset
https://git.sr.ht/~wintershadows/dotfiles/tree/master/item/l...


> If you can get your hands on a buttload of memory, try using a tmpfs mount instead.

If you don't have enough memory, using an nbdkit tmpdisk (https://libguestfs.org/nbdkit-tmpdisk-plugin.1.html) is a middle ground. It's backed by disk, but disables flushes which gives you a good 30% speed boost. (Obviously only use this for stuff you don't care about / can reproduce, because disabling flushes is dangerous and will lose data if the machine crashes).


In addition: you can use overlayfs + tmpfs to run the same build with multiple build args without using any extra memory. I had an experiment that used this same principle to run tests for mutation testing; I want to open source it at some point.
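A sketch of the mounts involved (paths are placeholders; note that upperdir and workdir must live on the same filesystem, here the tmpfs):

    # Writable scratch space in RAM
    mount -t tmpfs tmpfs /tmp/scratch
    mkdir /tmp/scratch/upper /tmp/scratch/work
    # Overlay it on top of a shared, effectively read-only source tree
    mount -t overlay overlay \
        -o lowerdir=/srv/src,upperdir=/tmp/scratch/upper,workdir=/tmp/scratch/work \
        /mnt/build
    # Each build variant gets its own upper/work dirs but shares the lower layer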


`tmpfs` for output files of huge distro builds has been great, IME. (This was the first thing I tried when I got a big new compute server at home.)

For source files, I'm thinking SSD, leaving some of the RAM for filesystem caches.


This sounds like 1. fantastic fun, and 2. an effective way to find a number of bugs; remarks like "Unfortunately, not every package respects the build options" and "things that expect dbus to be present" (to build!?) sound like there are clear improvements begging to be worked out. (Not that I'm implying that the author, or anyone else, is obligated to do so; chasing bugs and deficiencies in packaging is easily a full-time job if allowed to be!)


I wonder if also doing this for FreeBSD's 46792 Ports would give any extra 'perspective':

* https://en.wikipedia.org/wiki/FreeBSD_Ports

* https://www.freebsd.org/ports/

There's already a 'pre-canned' bulk build infrastructure:

* https://docs.freebsd.org/en/books/handbook/ports/#ports-poud...

* https://wiki.freebsd.org/VladimirKrstulja/Guides/Poudriere


Nice idea! The more data the better, from my perspective :) Presumably a lot of the code overlaps, but there's also probably plenty that's only in FreeBSD.

I would love to extend this even further to include projects on GitHub, but once you step outside the distro build systems it gets impossible to actually compile them in any standardized way. I might try to do some kind of best-effort (try ./configure && make, cmake, etc.) though.
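A best-effort driver could be as dumb as probing for the usual suspects in order; a sketch (no claim it covers much of GitHub):

    # Try the common build systems in rough order of likelihood
    try_build() (
        cd "$1" || return 1
        if [ -x ./configure ]; then
            ./configure && make
        elif [ -f CMakeLists.txt ]; then
            cmake -B build . && cmake --build build
        elif [ -f Makefile ] || [ -f makefile ]; then
            make
        elif [ -f setup.py ]; then
            python3 setup.py build
        else
            return 1   # no recognizable build system; give up
        fi
    )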


Or you could try a package repository for one of the C/C++ package managers. For example, for build2[1] we currently have ~300 packages[2] which you can all build in a uniform way.

[1] https://build2.org

[2] https://cppget.org


Neat, I will give those a shot!


> […] but once you step outside the distro build systems it gets impossible to actually compile them in any standardized way.

FreeBSD Ports allows you to specify GH information to pull down code/tarballs in a standardized fashion:

* https://wiki.freebsd.org/Ports/SimpleGithub

* https://docs.freebsd.org/en/books/porters-handbook/makefiles...

Then you'd list build- and run-time dependencies.


Right, and I definitely want to try building existing ports. But unless I'm missing something, this still requires you to write down by hand the dependencies and steps for actually building each project? That's not really feasible at the scale of GitHub.


True: there is no embedded metadata (à la Dublin Core) in the source code, or in/on GitHub, that tells you how to compile the source code. There is still a human-involved step in looking at the source code; even when a recipe lives in a build infrastructure (like Ports or Homebrew), it still had to be put there by a human.

The closest thing to a source hub with an automated build system is probably CPAN and the like (PyPI, CRAN).

Another build infrastructure, focused more in the HPC niche, is Spack:

* https://spack.io

* https://spack-tutorial.readthedocs.io/en/latest/index.html


Have you considered talking to the Debian infrastructure and build teams about the possibility of augmenting the Debian package builders to generate, collect, and publish this data as a standard part of the build process?

It would provide an ongoing and always up-to-date data set, rather than relying on the interest, goodwill, and resources of external parties (such as yourself) to do it, and it could provide significant insights into the entire Debian project.


I will be chatting with some folks who work on reproducible builds (turns out some of them are my colleagues here at NYU! :)) to see if I can use their infrastructure to extract some of this information. That would no doubt be a big boon for the reliability of the build process too – they've been at this for years whereas my own infrastructure was put together over the past couple weeks.


It would be interesting to know if at least part of the work could somehow be integrated into Software Heritage (https://www.softwareheritage.org/) or CodeMeta (https://codemeta.github.io).


What is the point of the compile_commands.json database? I've tried running bear on my makefile and then read the resulting JSON. It seemed to me like a bash script with extra steps.

Is there some added value?


The main use of it is for things like language servers. In order to properly show things like compiler errors in your editor, your language server (or linting engine, or whatever) needs to know EXACTLY the command you used to compile. The most obvious thing is preprocessor macros, but there are also things like which C/C++ standard you're using, whether you've turned off some number of warnings, etc.

compile_commands.json is a way to solve that problem in a "build system independent" way. It's just a list of exactly what commands were used for what files. As long as you generate that file correctly, you can use whatever build system you want (cmake, a hand-written makefile, 1000s of lines of Perl 5, a shell script from 1983, whatever) and your editor can correctly show errors and warnings, provide auto-completion, all that good stuff.
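Concretely, each entry just records the working directory, the exact command line, and the file it applies to (the paths and flags below are a made-up example):

    [
      {
        "directory": "/build/foo-1.2.3/src",
        "command": "gcc -DNDEBUG -I../include -std=gnu11 -Wall -c util.c -o util.o",
        "file": "util.c"
      }
    ]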

An alternative (at least for clangd) is to use compile_flags.txt, where you can just specify whatever flags you want the compiler to use.

EDIT: if you're asking why this person in particular wants compile_commands.json files, it seems like he just wants a unified format for the compile commands of all the files in all the packages so he can use it for his analysis or whatever.


I think the idea is to avoid the whole dependency-checking dance and fast-rebuild your whole system. Suppose you build a new compiler plugin/pass: you just want to force-recompile most stuff without going through all the build machinery and autotools stuff.

It also helps with building dependency graphs, so that you can run static analysis tools. Coverity, IIRC, does something like that to get the list of translation units and the build options (preprocessor directives, etc.).


Yup, pretty much. If you want a static analysis tool to be able to actually parse your C code and see it like the compiler would, it needs to have all the info that the compiler did – header files, defines (including ones passed in with -D options), what the working directory was, what environment variables were in effect, etc. Otherwise there are all sorts of problems like:

- Is this code even used? (Maybe it's only included for some specific configurations, or #ifdef'd out)

- What does MAGIC_MACRO(x,y) do? Maybe this is only in a header file that's generated at build time, and you can't understand/analyze code that uses it without knowing its definition.

- What compiler version was used? Code may be interpreted differently depending on the compiler (particularly if gcc or clang-specific extensions are used, but also just due to things like compiler bugs or idiosyncrasies)

Coverity wrote a nice article about some of these issues back in 2010: https://cacm.acm.org/magazines/2010/2/69354-a-few-billion-li...

> Law: You can't check code you don't see. It seems too trite to note that checking code requires first finding it... until you try to do so consistently on many large code bases. Probably the most reliable way to check a system is to grab its code during the build process; the build system knows exactly which files are included in the system and how to compile them. This seems like a simple task. Unfortunately, it's often difficult to understand what an ad hoc, homegrown build system is doing well enough to extract this information

[...]

> The right approach, which we have used for the past seven years, kicks off the build process and intercepts every system call it invokes. As a result, we can see everything needed for checking, including the exact executables invoked, their command lines, the directory they run in, and the version of the compiler (needed for compiler-bug workarounds). This control makes it easy to grab and precisely check all source code, to the extent of automatically changing the language dialect on a per-file basis.

[...]

> Law: You can't check code you can't parse. Checking code deeply requires understanding the code's semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.

> The C language does not exist; neither does Java, C++, and C#. While a language may exist as an abstract idea, and even have a pile of paper (a standard) purporting to define it, a standard is not a compiler. What language do people write code in? The character strings accepted by their compiler. Further, they equate compilation with certification. A file their compiler does not reject has been certified as "C code" no matter how blatantly illegal its contents may be to a language scholar. Fed this illegal not-C code, a tool's C front-end will reject it. This problem is the tool's problem.


> dpkg-buildpackage -rfakeroot

-rfakeroot has been the default since dpkg 1.14.7, released in 2007:

https://lists.debian.org/E1IekzH-0006ul-Cy@ries.debian.org


Thanks! I think I started using debian about 7 years before that, so I guess I never caught on :)


Why not use NixOS, which is at least guaranteed to have all packages building?


It very much doesn't work like that. I work on Guix a little, and I remember someone saying ~5% of packages were broken at any given time. There is something of an equilibrium between packages breaking due to changes and someone fixing them. Packages are also not truly functional in the functional-programming sense. For practical reasons, Nix/Guix just try their best to remove as much state as possible to make things behave consistently. But, for example, the number of cores available differs between machines, and race conditions can occur.


Builds break over time[0]. Generally this is from big updates like gcc, glibc, llvm, openssl, etc.

Most software which is maintained upstream and within nixpkgs is fine. But there's a bunch of "cruft" that you accumulate over time.

[0]: https://hydra.nixos.org/jobset/nixpkgs/trunk

P.S. The link above builds for x86_64-linux, aarch64-linux, aarch64-darwin, x86_64-darwin, which also contributes to the failure rate.


This was my first thought as well. Their tool for managing builds (Hydra [1]) was pretty simple to set up last time I tried.

In practice a few percent of packages will fail to build (packaging is hard!) but it’d be good enough for getting large quantities of compiled code.

[1] https://nixos.wiki/wiki/Hydra


what a g, maybe you can teach a class on operating systems now



