musicmatzes blog

nixos


Please keep in mind that I cannot go into too much detail about my company's customer-specific requirements; thus, some details here are rather vague.

Also: All statements and thoughts in this article are my own and do not necessarily represent those of my employer.


In my dayjob, I package FLOSS software with special adaptations, patches and customizations for enterprises on a range of linux distributions. As an example, we package Bash in up-to-date versions for SLES 11 installations of our customers.

Most software isn't as easy to build and package as, for example, bash. Also, our customers have special requirements, ranging from type of package format (rpm, deb, targz, or even custom formats...) to installation paths, compile time flags, special library versions used in the dependencies of a software and so on...

Our main Intellectual Property is not the packaging mechanisms, as there are many tools already available for packaging. Our IP is the knowledge of how to make libraries work together in custom environments, special sites, with non-standard requirements and how to package properly for the needs of our customers.

Naturally, we have tooling for the packaging. That tooling grew over years and years of changing requirements, even target platforms (almost nobody uses IRIX/HPUX/AIX/Solaris anymore, right? at least our customers moved from those systems to Linux completely) and also technology available for packaging.

End of 2020, I got the opportunity to re-think the problem at hand and develop a prototype for solving it (again) with some more state-of-the-art approaches. This article documents the journey up to this day.


The target audience of this article is technical users with a background in linux and especially in linux packaging, either with Debian/Ubuntu (.deb, APT) or RedHat/CentOS/SuSE (.rpm, YUM/ZYPPER) packages. A basic understanding of the concepts of software packages and docker is helpful, but not strictly required.


Finding the requirements

When I started thinking about how to re-solve the problem, I had already worked for about 1.5 years in my team. I had compiled and packaged quite a bunch of software for customers and, in the process of rethinking the approach, also improved some parts of the tooling infrastructure. So it was not that hard to find requirements. Still, patience and thoroughness were key: had I missed a critical point, the new software I wanted to develop might have ended up a mess.

Docker

The first big requirement was rather obvious to find. The existing tooling used docker containers as build hosts. For example, if we needed to package a software for debian 10 Buster, we would spin up a docker container for that distribution, mount everything we needed to that container, copy sources and our custom build configuration file to the right location and then start our tool to do its job. After some automatic dependency resolving, some magic and a fair bit of compiling, a package was born. That package was then tested by us and, if found appropriate, shipped to the customer for installation.

In about 90% of the cases, we had to investigate some build failures if the package was new (we had not packaged that software before). In about 75% of the cases, the build just succeeded if the package was a mere update of the software, i.e. git 2.25.0 was already packaged and now we needed to package 2.26.0 – the customizations from the old version were just copied over to the new one, and most of the time it just worked fine that way.

Still, the basic approach used in the old tooling was stateful as hell. Tooling, scripts, sources and artifacts were mounted to the container, resulting in state which I wanted to avoid in the new tool. Even the filesystem layout was one big pile of state, with symlinks that were required to point to certain locations or else the tooling would fail in weird (and sometimes non-obvious) ways.

I knew from years of programming (and a fair amount of system configuration experience with NixOS) that state is the devil. So my new tool should use docker for the build procedure, but inside the running container, there should be a defined state before the build, and a defined state after the build, but nothing in between. Especially if a build failed, it shouldn't clutter the filesystem with stuff that might break the next or some other build.

Package scripting

The next requirement was a bit more difficult to establish – and by that I mean acceptance in the team. The old tooling used bash scripts for build configuration. These bash scripts had a lot of variables that implicitly (at least from my point of view) told the calling tool what to do. That felt like another big pile of state to me, and I thought of ways to improve that situation without giving up on the IP that was in those bash scripts, while also providing us with a better way to define structured data attached to a package.

The scheme of our bash scripts was “phases” (bash functions) that were called in a certain order to do certain things. For example, there was a function to prepare the build (usually calling the configure script with the desired parameters, or preparing the cmake build). Then, there was a build function that actually executed the build, and a package function that created the package after a successful build.

I wanted to keep that scheme, possibly expanding on it.

Also, I needed a (new) way of specifying package metadata. This metadata then must be available during the build process so that the build-mechanism could use it.

Because of the complexity of the requirements of our customers for a package, but also because we package for multiple customers, all with different requirements, the build procedure itself must be highly configurable. Re-using our already existing bash scripts seemed like a good way to go here.

Something like the nix expression language first came to my mind, but was then discarded as too complex and too powerful for our needs.

So I developed a simple, yet powerful mechanism based on hierarchical configuration files using “Tom's Obvious, Minimal Language”. I will come back to that later.

Parallelization of Builds

Most of the time, we are building software where the dependencies are few in depth, but many in breadth. Visually, that means that we'd rather have this:


                              A
                              +
                              |
              +---------------+----------------+
              |                                |
              v                                v
              B                                C
              +                                +
              |                      +--------------------+
       +-------------+               |         |          |
       v      v      v               v         v          v
       D      E      F               G         H          I

instead of this:

              A
              +
              |
       +-------------+
       v      v      v
       B      C      D
       +             +
       |             |
       v             v
       E             F
       +             +
    +--+---+         |
    v      v         v
    G      H         I

Which means we could gain a lot by parallelizing builds. In the first visualization, six packages can be built in parallel in the first step, and two in the second build step. In the second visualization, four packages could be built in parallel in the first step, and so on.

Our old tooling, though, would build all packages in sequence, in one container, on one host.

How much optimization could benefit us is easily calculated: if each package took one minute to build, then without considering the tool runtime overhead, our old tooling would take 9 minutes to build all 9 packages. Not considering parallelization overhead, a parallel build procedure would result in 3 minutes for the first visualization and 4 minutes for the second.
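
How that first approximation falls out of the graph structure: with unlimited parallelism and uniform build times, the total wall time is just the length of the longest dependency chain. A stdlib-only Rust sketch (the package names mirror the visualizations above):

```rust
use std::collections::HashMap;

/// Length of the longest dependency chain below `pkg`, counting `pkg` itself.
/// With unlimited parallelism and uniform build times, this is the number of
/// build "waves" – and thus minutes, at one minute per package.
fn critical_path(pkg: &str, deps: &HashMap<&str, Vec<&str>>) -> usize {
    1 + deps
        .get(pkg)
        .map(|ds| ds.iter().map(|d| critical_path(d, deps)).max().unwrap_or(0))
        .unwrap_or(0)
}

fn main() {
    // First visualization: A -> {B, C}, B -> {D, E, F}, C -> {G, H, I}
    let wide: HashMap<&str, Vec<&str>> = HashMap::from([
        ("A", vec!["B", "C"]),
        ("B", vec!["D", "E", "F"]),
        ("C", vec!["G", "H", "I"]),
    ]);
    // Second visualization: A -> {B, C, D}, B -> {E}, D -> {F}, E -> {G, H}, F -> {I}
    let deep: HashMap<&str, Vec<&str>> = HashMap::from([
        ("A", vec!["B", "C", "D"]),
        ("B", vec!["E"]),
        ("D", vec!["F"]),
        ("E", vec!["G", "H"]),
        ("F", vec!["I"]),
    ]);
    println!("{}", critical_path("A", &wide)); // 3 build waves
    println!("{}", critical_path("A", &deep)); // 4 build waves
}
```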

Of course, parallelizing builds on one host does not result in the same duration because of sharing of CPU cores, and also the tooling overhead must be considered as well as waiting for dependencies to finish – but as a first approximation, this promised a big speedup (which turned out to be true in real life), especially because we got the option to parallelize over multiple build hosts, which is not possible at all with the old tooling.

Packaging targets

Because we provide support for not only one linux distribution, but a range of different distributions, the tool must be able to support all of them. This not only means that we must support APT- and RPM-based distributions, but also that we must be able to extend our tooling later to support other packaging tools. Also, the tool should be able to package into tar archives or other – proprietary – packaging formats, because apparently that's still a thing in the industry.

So having the option of packaging for different package formats was another dimension in our requirements matrix.

My idea here was simple and proved to be effective: not to care about packaging formats at all! All these formats have one thing in common: at the end of the day, they are just files. And thus, the tool should only care about files. So the actual packaging is just the final phase, which ideally results in artifacts in all formats we provide.

Replicability

Having used #nixos for several years now, I knew about the benefits of replicability (and reproducibility – patience!).

In the old tooling, there was some replicability, because after all, the scripts that implemented the packaging procedure were there. But artifacts were only loosely associated with the scripts. Also, as dependencies were implicitly calculated, updating a library broke replicability for packages that depended on that library – consider the following package dependency chain:

A 1.0 -> B 1.0 -> C 1.0

If we'd update B to 2.0, nothing in A would tell us that an a-1.0.tar.gz was built with B 1.0 instead of 2.0, because dependencies were only specified by name, not by version. Only very careful investigation of filesystem timestamps and implicitly created files somewhere in the filesystem could lead to the right information. It was there – buried, hard to find and implicitly created during the packaging procedure.

There clearly was big room for improvement here as well. The most obvious step was to make dependencies explicit, with their version numbers. Updating dependencies, though, had to be easy with the new tooling, because you want to be able to update a lot of dependencies easily. Think of openssl, which releases bugfixes more than once a month. You want to be able to update all your packages without a hassle.

Another big point in the topic of replicability is logs. Logs are important to keep, because they are the one thing that can tell you the most about that old build job you did two months ago for some customer – the one that resulted in this one package with that very specific configuration which you now need to reproduce with a new version of the source.

Yep, logs are important.

From time to time we did a cleanup of the build directories and just deleted them completely along with all the logs. That was, in my opinion, not good enough. After all, why not keep all logs?

The next thing was, of course, the build scripts. With the old tool, each package was a git repository with some configuration. On each build, a commit was automatically created to ensure the build configuration was persisted. I didn't like how that worked at all. Why not have one big repository where all build configurations for all packages reside? Maybe that's just me coming from #nixos where #nixpkgs is just this big pile of packages – maybe. I just felt that this was the way it should work. Also, with some careful engineering, this would result in way less code duplication, because reusing stuff would be easy – it wouldn't depend on some file that should be somewhere in the filesystem, but on a file that is just there because it was committed to the repository.

Also, replicability could be easily ensured by

  1. not allowing any build with uncommitted stuff in the package repository
  2. recording the git commit hash a package was built from

This way, if you have a package in a format that allows recording arbitrary data in its meta-information (.tar.gz does not, of course), you can just check out the commit the package was built from, get the logs for the relevant build-job, and off you go debugging your problem!
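
A minimal sketch of how those two rules can be enforced on top of git (the function names are mine, not butido's actual API): a working tree is clean iff `git status --porcelain` prints nothing, and `git rev-parse HEAD` yields the commit hash to record.

```rust
use std::process::Command;

/// A working tree is clean iff `git status --porcelain` prints nothing.
fn is_clean(porcelain_output: &str) -> bool {
    porcelain_output.trim().is_empty()
}

/// Returns the commit hash to record for this build, or an error if the
/// package repository has uncommitted changes. (A sketch, not butido's API.)
fn commit_for_build(repo: &str) -> Result<String, String> {
    let status = Command::new("git")
        .args(["-C", repo, "status", "--porcelain"])
        .output()
        .map_err(|e| e.to_string())?;
    if !is_clean(&String::from_utf8_lossy(&status.stdout)) {
        return Err("refusing to build: uncommitted changes in package repository".into());
    }
    let head = Command::new("git")
        .args(["-C", repo, "rev-parse", "HEAD"])
        .output()
        .map_err(|e| e.to_string())?;
    Ok(String::from_utf8_lossy(&head.stdout).trim().to_string())
}

fn main() {
    match commit_for_build(".") {
        Ok(hash) => println!("building from commit {hash}"),
        Err(e) => eprintln!("{e}"),
    }
}
```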

Reproducibility

With everything recorded in one big git repository for all packages, one big step towards reproducibility was made.

But there was more.

First of all, because of our requirements with different customers, different package configurations and so on, our tooling must be powerful when it comes to parametrization. We also needed to be able to record the package parameters. For example, if we package a software for a customer “CoolCompany” and need, for whatever reason, to change some CFLAGS for some or all packages, these parameters need to be recorded, along with any default parameters we provide for individual packages anyway.

If we package, for whatever reason, some package with some library dependencies explicitly set to old versions of said library, that needs to be recorded as well.

Selecting Technologies

The requirements we defined for the project were not decisive for the selected technologies; rather, the history of our tooling was, and how our intellectual property and existing technology stack were built: bash scripts for the package configuration(s) and docker for our build infrastructure. Nevertheless, a goal we had in mind while developing our new tooling was always that we might, one day, step away from docker. A new and better-suited technology might surface (not that I'm saying docker is particularly well suited for what we do here), or docker might just become too much of a hassle to work with – you never know! Either way, the “backend” of our tool should be convertible to not only talk to docker, but also to other services/tools, e.g. kubernetes or even real virtual machines.

As previously said (see “Package scripting”), the existing tooling (mostly) used bash scripts for package definition. Because these scripts contained a lot of intellectual property, and an easy transfer of knowledge from the old tooling to the new tooling was crucial, the new tooling had to use bash scripts as well. Structural and static package metadata, though, could be moved to some other form. TOML, as an established markup language in the Rust community, was considered and found appropriate. But because we knew from the old tooling that repetition was an easy habit to fall into, we needed a setup that made re-use of existing definitions easy and enabled flexibility, but also gave us the option to fine-tune settings of individual packages, all while not giving away our IP. Thus, the idea of hierarchical package definition came up. Rather than inventing our own DSL for package definition, we would use a set of predefined keys in the TOML data structures and apply some “layering” technique on top, to be able to have a generic package definition and alter it per-package (or at a more fine-grained level). The details on this follow below in the section “Defining the Package format”.

For replicability, aggregation of log output and tracking of build parameters, we wanted some form of database. The general theme we quickly recognized was that all the data we were interested in was immutable once created: a line of log output is of course not to be altered but recorded as-is, and the same goes for the parameters we submit to the build jobs, dependency versions, and every form of flags or other metadata. Thus, an append-only storage format was considered best. Due to the nature of the project (mostly being a “let's see what we can accomplish in a certain timeframe”), the least complicated approach for data storage was to use what we knew: postgres – and never alter data once it hit the database. (To date, we have no UPDATE SQL statements in the codebase and only one DELETE statement.)

Rust

Using Rust as the language to implement the tool was a no-brainer.

First of all, and I admit that was the main reason that drove me, I am most familiar with the Rust programming language. Other languages I am familiar with are not applicable for that kind of problem (that is: Ruby, Bash or C). Other languages that would be available for the problem domain we're talking about here would be Python or Go (or possibly even a JVM language), but I'd not consider any of them, mostly because not a single one of these languages gives me the expressiveness and safety that Rust can provide. After all, I needed a language where I could be certain that things actually worked after I wrote the code.

The Rust ecosystem gives me a handful of awesome crates (read: libraries) to solve certain types of problems:

  • The serde framework for easy de- and serialization of different formats – most notably, for the problem at hand, TOML, with the toml crate as an implementation of that format
  • The tokio framework for async/await, which is also needed for
  • shiplift, a library to talk to the Docker API from native Rust, leveraging the async/await capabilities of Rust
  • handlebars for rendering of templates, used for the package format implementation (read later in “Defining the Package format”)
  • The config crate, for which I got maintainer rights during the implementation of butido, to ensure the continued development of that awesome configuration-handling library
  • the diesel SQL framework for talking to the PostgreSQL database
  • ... and many, many more libraries which made development a breeze (at the time of writing this article, butido is using 51 crates as dependencies, 3 of which I am the author of and 2 of which I got maintainer rights for during the development of butido – after all, bus factor is a thing!)

Defining the Package format

As stated in one of the previous sections, we had to define a package format that could be used for describing the metadata of packages while keeping the possibility to customize packages per customer and also giving us the ability to use the knowledge from our old bash-based tooling without having to convert everything to the new environment.

Most of the metadata, which is rather static per package, could easily be defined in the format:

  • The name of the package
  • The version of the package
  • A list of sources of a package because packages can have multiple sources – think git, where sources and manpages are distributed as two individual tarballs
  • Buildtime dependencies (dependencies that are only needed for building the package, a common one being “make”)
  • Runtime dependencies (dependencies which need to be installed on a system for the package itself to be usable; could be system packages or packages we've built ourselves)
  • A list of patches we apply to the package before building
  • A key-value list of environment variables for the build procedure that were local for the package but not for all packages in the tree
  • An allowlist of distributions (or rather: docker images) a package can be built on
  • A blocklist of distributions (or rather: docker images) a package cannot be built on
  • A key-value list of additional meta information for the package, e.g. a description, a license field or the website of the project that develops the package

All these settings are rather static per package. The real complexity, though, comes with the definition of the build script.

The idea of “phases” in the build script was valuable. So we decided that each package should be able to define a list of “phases” that could be compiled into a “script” which, when executed, transformed the sources of a package into some artifacts (for example, a .rpm file). The idea was that the script alone knew about packaging formats, because (if you remember from earlier in the article) we needed flexibility in packaging targets: rpm, deb, tar.gz or even proprietary formats must be supported.

So, each package had a predefined list of phases that could be set to a string. These strings are then concatenated by butido and the resulting script is, along with the package sources, artifacts from the build of the package dependencies and the patches for the package, copied to the build-container and executed. To visualize this:

Sources      Patches      Script
   │            │            │
   │            │            │
   └────────────┼────────────┘
                ▼
         ┌─────────────┐
         │             │
         │  Container  │◄────┐
         │             │     │
         └──────┬──────┘     │
                │            │
                │            │
                ▼            │
            Artifacts────────┘
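
The concatenation step itself can be sketched in a few lines of Rust (a simplification, not butido's actual code; the phase names are the ones used in the examples further down): the phases defined for a package are rendered in a fixed order and joined into one shell script that is shipped into the container.

```rust
use std::collections::HashMap;

/// Fixed execution order of the phases; each package may define any subset.
const PHASE_ORDER: &[&str] = &["unpack", "build", "install", "package"];

/// Join the defined phases, in order, into one executable shell script.
fn compile_script(phases: &HashMap<&str, &str>) -> String {
    let mut script = String::from("#!/bin/bash\nset -e\n");
    for phase in PHASE_ORDER {
        if let Some(body) = phases.get(phase) {
            script.push_str(&format!("### phase: {phase}\n{body}\n"));
        }
    }
    script
}

fn main() {
    let phases = HashMap::from([
        ("unpack", "tar xf $sourcefile"),
        ("build", "make -j 4"),
    ]);
    print!("{}", compile_script(&phases));
}
```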

One critical point, though, was repetition. Having to repeat parts of a script in multiple packages was a deal-breaker, therefore being able to reuse parts of a script was necessary. Because we did not want to invent our own scripting language/DSL for this, we decided to use the layering mentioned before to implement reuse of parts of scripts.

Consider the following tree of package definition files:

/pkg.toml

/packageA/pkg.toml
/packageA/1.0/pkg.toml
/packageA/2.0/pkg.toml
/packageA/2.0/1/pkg.toml
/packageA/2.0/2/pkg.toml

/packageB/pkg.toml
/packageB/0.1/pkg.toml
/packageB/0.2/pkg.toml

The idea with that scheme was that we implement a high-level package definition (/pkg.toml), where variables and build-functionality was predefined, and later alter variables and definitions as needed in the individual packages (/packageA/pkg.toml or /packageB/pkg.toml), in different versions of these packages (/packageA/1.0/pkg.toml, /packageA/2.0/pkg.toml, /packageB/0.1/pkg.toml or /packageB/0.2/pkg.toml), or even in different builds of a single version of a package (/packageA/2.0/1/pkg.toml, /packageA/2.0/2/pkg.toml).

Here, the top level /pkg.toml would define, for example, CFLAGS = ["-O2"], so that all packages had that CFLAGS passed to their build by default. Later, this environment variable could be overwritten in /packageB/0.2/pkg.toml, only having an effect in that very version. Meanwhile, /packageA/pkg.toml would define name = "packageA" as the name of the package. That setting automatically applies to all sub-directories (and their pkg.toml files). Package-local environment variables, special build-system scripts and metadata would be defined once and reused in all sub-pkg.toml files, so that repetition is not necessary.

That scheme is also true for the script phases of the packages. That means, that we implement a generic framework of how a package is built in /pkg.toml, with a lot of bells and whistles – but not tied to a build tool (autotools or cmake or...) or a packaging target (rpm or deb or...), but with flexibility to handle all of these cases gracefully. Later, we customize parts of the script by overwriting environment variables to configure the generic implementation, or we overwrite whole phases of the generic implementation with a specialized version to meet the needs of the specific package.

In reality, this looks approximately like this:

# /pkg.toml
# no name = ""
# no version = ""

[phases]
unpack.script = '''
    tar xf $sourcefile
'''

build.script = '''
    make -j 4
'''

install.script = '''
    make install
'''

package.script = '''
    # .. you get the hang of it
'''

and later:

# in /tmux/pkg.toml
name = "tmux"
# still no version = ""

[phases]
build.script = '''
    make -j 8
'''

and even later:

# in /tmux/3.2/pkg.toml
version = "3.2"

[phases]
install.script = '''
	make install PREFIX=/usr/local/tmux-3.2
'''

Not that the above example is accurate or sane, but it demonstrates the power of the approach: in the top-level /pkg.toml, we define a generic way of building packages. In /tmux/pkg.toml we overwrite some settings that are equal for all packaged tmux versions: name and the build part of the script. And in the /tmux/3.2/pkg.toml file we define the last bits of the package definition: the version field and, for some reason, we overwrite the install part of the script to install to a certain location.

One could even go further and have a /tmux/3.2/fastbuild/pkg.toml, where version is set to "3.2-fast" and tmux is built with -O3 in that case.
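
Mechanically, the layering is a fold over the pkg.toml files from the repository root down to the leaf, where each more specific layer overwrites the keys it defines. A stdlib-only sketch of just that overwrite semantics (the real implementation of course works on parsed TOML structures, not flat string maps):

```rust
use std::collections::BTreeMap;

/// Merge pkg.toml layers from the repository root down to a leaf:
/// later (more specific) layers overwrite keys defined by earlier ones.
fn merge_layers<'a>(layers: &[BTreeMap<&'a str, &'a str>]) -> BTreeMap<&'a str, &'a str> {
    let mut merged = BTreeMap::new();
    for layer in layers {
        for (k, v) in layer {
            merged.insert(*k, *v); // the leaf wins
        }
    }
    merged
}

fn main() {
    let root = BTreeMap::from([("CFLAGS", "-O2")]);                    // /pkg.toml
    let pkg = BTreeMap::from([("name", "tmux")]);                      // /tmux/pkg.toml
    let leaf = BTreeMap::from([("version", "3.2"), ("CFLAGS", "-O3")]); // /tmux/3.2/fastbuild/pkg.toml
    let merged = merge_layers(&[root, pkg, leaf]);
    assert_eq!(merged["CFLAGS"], "-O3"); // leaf overrides the root default
    println!("{merged:?}");
}
```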

Templating package scripts

The approach described above is very powerful and flexible. It has one critical problem, though: what if we needed information from the lowest pkg.toml file in the tree (e.g. /tmux/3.2/fastbuild/pkg.toml), but that information had to be available in /pkg.toml?

There are two solutions to this problem. The first one would be that we would define a phase at the very beginning of the package script that would define all the variables. The /tmux/3.2/fastbuild/pkg.toml file would overwrite that phase and define all the variables, and later phases would use them.

That approach had one critical problem: it would render the layering of pkg.toml files meaningless, because each pkg.toml file would need to overwrite that phase with the appropriate settings for the package: if /tmux/3.2/pkg.toml defined all the variables, but one variable needed to be overwritten for /tmux/3.2/fastbuild/pkg.toml, the latter would still need to overwrite the complete phase. This basically undermined the don't-repeat-yourself idea and was thus a no-go.

So we asked ourselves: what data do we need in the top-level generic scripts that gets defined in the more specific package files? Turns out: only the static stuff! The generic script phases need the name of a package, or the version of a package, or meta-information of the package... and all this data is static. So, we could just sprinkle a bit of templating over the whole thing and be done with it!

That's why we added handlebars to our dependencies: To be able to access variables of a package in the build script. Now, we can define:

build.script = '''
    cd /build/{{this.name}}-{{this.version}}/
    make
'''

And even more complicated things, for example iterating over all defined dependencies of a package and checking whether they are installed correctly in the container. All that can be scripted in an easy, generic way, without knowing the package-specific details.
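
Butido uses the actual handlebars crate for this; to illustrate just the substitution idea without pulling in the engine, here is a stdlib-only stand-in that replaces `{{this.<field>}}` placeholders from a package's static metadata (the real templating supports much more, e.g. iteration and conditionals):

```rust
use std::collections::HashMap;

/// Replace `{{this.<key>}}` placeholders with package metadata.
/// A toy stand-in for the handlebars engine butido actually uses.
fn render(template: &str, pkg: &HashMap<&str, &str>) -> String {
    let mut out = template.to_string();
    for (key, value) in pkg {
        out = out.replace(&format!("{{{{this.{key}}}}}"), value);
    }
    out
}

fn main() {
    let pkg = HashMap::from([("name", "tmux"), ("version", "3.2")]);
    let script = render("cd /build/{{this.name}}-{{this.version}}/\nmake\n", &pkg);
    print!("{script}");
    // cd /build/tmux-3.2/
    // make
}
```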

PostgreSQL

We decided to use PostgreSQL for logging structured information about the build processes.

After identifying the entities that needed to be stored, setting up the database with the appropriate scheme was not too much of a hassle, given the awesome diesel crate.

Before we started with the implementation of butido, we identified the entities with the following diagram:

+------+ 1             N +---+ 1          1 +-----------------------+
|Submit|<--------------->|Job|-------+----->|Endpoint *             |
+--+---+                 +---+       |    1 +-----------------------+
   |                                 +----->|Package *              |
   |                                 |    1 +-----------------------+
   |                                 +----->|Log                    |
   |  1  +-----------------------+   |    1 +-----------------------+
   +---->|Config Repository HEAD |   +----->|OutputPath             |
   |  1  +-----------------------+   |  N:M +-----------------------+
   +---->|Requested Package      |   +----->|Input Files            |
   |  1  +-----------------------+   |  N:M +-----------------------+
   +---->|Requested Image Name   |   +----->|ENV                    |
   | M:N +-----------------------+   |    1 +-----------------------+
   +---->|Additional ENV         |   +----->|Script *               |
   |  1  +-----------------------+          +-----------------------+
   +---->|Timestamp              |
   |  1  +-----------------------+
   +---->|Unique Build Request ID|
   |  1  +-----------------------+
   +---->|Package Tree (JSON)    |
      1  +-----------------------+

Which is explained in a few sentences:

  1. Each job builds one package
  2. Each job runs on one endpoint
  3. Each job produces one log
  4. Each job results in one output path
  5. Each job runs one script
  6. Each job has N input files, and each file belongs to M jobs
  7. Each job has N environment variables, and each environment variable belongs to M jobs
  8. One submit results in N jobs
  9. Each submit was started from one config repository commit
  10. Each submit has one package that was requested
  11. Each submit runs on one image
  12. Each submit has one timestamp
  13. Each submit has a unique ID
  14. Each submit has one Tree of packages that needed to be built

I know that this method is not necessarily “by the book” as far as developing software goes, but this was the very first sketch of how our data needed to be structured. Of course, this is not the final database layout that is implemented in butido today, but it nevertheless hasn't changed fundamentally since. The idea of a package “Tree” that is stored in the database was removed, because after all, the packages do not form a tree but a DAG (more details below). What hasn't changed is that the script that was executed in a container is stored in the database, as well as the log output of that script.

Implementing an MVP

After a considerable planning phase and several whiteboards full of sketches, the implementation started. Initially, I was allowed to spend 50 hours of work on the problem. That was enough to get some basics done and to form a plan on how to reach the MVP. After those 50 hours, I was able to say that with approximately another 50 hours I could build a prototype that could be used to actually build a software package.

Architecture

The architecture of butido is not as complex as it might seem. As these things are best described with a visualization, here we go:

      ┌─────────────────────────────────────────────────┐
      │                                                 │
      │                    Orchestrator                 │
      │                                                 │
      └─┬──────┬───────┬───────┬───────┬───────┬──────┬─┘
        │      │       │       │       │       │      │
        │      │       │       │       │       │      │
    ┌───▼─┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌─▼───┐
    │     │ │     │ │     │ │     │ │     │ │     │ │     │
    │ Job │ │ Job │ │ Job │ │ Job │ │ Job │ │ Job │ │ Job │
    │     │ │     │ │     │ │     │ │     │ │     │ │     │
    └───┬─┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └─┬───┘
        │      │       │       │       │       │      │
      ┌─▼──────▼───────▼───────▼───────▼───────▼──────▼─┐
      │                                                 │
      │                    Scheduler                    │
      │                                                 │
      └────────┬───────────────────────────────┬────────┘
               │                               │
      ┌────────▼────────┐             ┌────────▼────────┐
      │                 │             │                 │
      │    Endpoint     │             │    Endpoint     │
      │                 │             │                 │
      └────────┬────────┘             └────────┬────────┘
               │                               │
┌─────┬────────▼────────┬─────┐ ┌─────┬────────▼────────┬─────┐
│     │ Docker Endpoint │     │ │     │ Docker Endpoint │     │
│     └─────────────────┘     │ │     └─────────────────┘     │
│                             │ │                             │
│       Physical Machine      │ │       Physical Machine      │
│                             │ │                             │
│ ┌───────────┐ ┌───────────┐ │ │ ┌───────────┐ ┌───────────┐ │
│ │           │ │           │ │ │ │           │ │           │ │
│ │ Container │ │ Container │ │ │ │ Container │ │ Container │ │
│ │           │ │           │ │ │ │           │ │           │ │
│ └───────────┘ └───────────┘ │ │ └───────────┘ └───────────┘ │
│                             │ │                             │
└─────────────────────────────┘ └─────────────────────────────┘

One part I could not visualize properly without messing up the whole diagram is that each job talks to some other jobs. Also, some helpers are not visualized here, but they do not play a part in the overall architecture. But let's start at the beginning.

From top to bottom: The orchestrator uses a Repository type to load the package definitions from the filesystem. It then collects the packages that need to be built using said Repository type, which does a recursive traversal of the dependencies. That process results in a DAG of packages. For each package, a Job is created, which is a set of variables that need to be associated with the Package so that it can be built: the environment variables, but also the image that should be used for executing the build script and some more settings.

Each of those jobs is then given to a “Worker” (named “Job” in the above visualization). Each of these workers is associated with the jobs that need to be built successfully before the job itself can run – that's how dependencies are resolved. This association is itself a DAG, and it is automatically executed in the right order because each job waits on its dependencies. If an error happens, either during the execution of the script in the container or while processing the job itself, the job sends the error to its parent job in the DAG. This way, errors propagate through the DAG and all jobs exit, either with success, with an error from a child, or with their own error.

Each of those jobs knows a “scheduler” object, which can be used to submit work to a docker endpoint. This scheduler keeps track of how many builds are running at any point in time and blocks the spawning of further builds if too many are already running (the limit is a configuration option for the user of butido).
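The blocking behaviour of the scheduler can be pictured as a counting semaphore around build submissions. Here is a minimal std-only sketch of that idea – the names are made up for illustration, and butido's real scheduler is built on async tasks rather than threads:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// A counting semaphore: at most `max` builds may run at once.
// (Hypothetical type; butido's real scheduler is async.)
struct BuildLimit {
    max: usize,
    running: Mutex<usize>,
    freed: Condvar,
}

impl BuildLimit {
    fn new(max: usize) -> Self {
        BuildLimit { max, running: Mutex::new(0), freed: Condvar::new() }
    }

    // Block until a slot is free, then claim it.
    fn acquire(&self) {
        let mut running = self.running.lock().unwrap();
        while *running >= self.max {
            running = self.freed.wait(running).unwrap();
        }
        *running += 1;
    }

    // Release the slot and wake one waiting job.
    fn release(&self) {
        *self.running.lock().unwrap() -= 1;
        self.freed.notify_one();
    }
}

fn main() {
    // Five jobs, but at most two "builds" run concurrently.
    let limit = Arc::new(BuildLimit::new(2));
    let handles: Vec<_> = (0..5).map(|i| {
        let limit = Arc::clone(&limit);
        thread::spawn(move || {
            limit.acquire();
            println!("job {} building", i);
            limit.release();
        })
    }).collect();
    for h in handles { h.join().unwrap(); }
    println!("all jobs done");
}
```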

The scheduler knows each configured and connected Endpoint and uses them for submitting builds to a docker endpoint. These docker endpoints can be physical machines (as in the visualization), VMs, or anything else.

Of course, there are some helper objects involved in the whole process: the database connection is passed around, progress-bar API objects are passed around, and a lot of necessary type-safety work is done. But architecturally, that really is the essence.

Problems during implementation

During the implementation of butido, some problems were encountered and dealt with. Nothing serious enough to make us redo the architecture, and nothing too complicated. Still, I want to highlight some of the problems and how they were solved.

Artifacts and Paths

One problem we encountered, and which is a constant source of “happiness” in every software project, was the handling of the paths to our input and output artifacts.

Luckily, Rust has a very convenient and concise path handling API in the standard library, so over several refactorings, we were able to not introduce more bugs than we squashed.

There's no individual commit I can link here, because quite a few commits were made that changed how we track(ed) our Artifact objects and the paths to artifacts during the execution of builds. git log --grep=path --oneline | wc -l lists 37 of them at the time of writing this article.

It is not enough for code to work. (Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship)

The lesson learned is: make use of strong types as much as you can when working with paths. For example, we have a StoreRoot type that points to a directory where artifacts are stored. Because artifacts are identified by their filename, there's an ArtifactPath type. If StoreRoot::join(artifact_path) is called, the caller gets a FullArtifactPath, an absolute path to the actual file on disk. Objects of the aforementioned types cannot be altered; only new ones can be created from them. That, plus some careful API crafting, makes some classes of bugs impossible.
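A condensed sketch of how such path newtypes work – the type names match the ones mentioned above, but the method bodies are illustrative, not butido's actual implementation:

```rust
use std::path::{Path, PathBuf};

// Root directory of an artifact store. Immutable once created.
#[derive(Clone, Debug)]
struct StoreRoot(PathBuf);

// An artifact is identified by its (relative) file name.
#[derive(Clone, Debug)]
struct ArtifactPath(PathBuf);

// An absolute path to an artifact on disk. It can only be
// obtained by joining a StoreRoot with an ArtifactPath.
#[derive(Clone, Debug, PartialEq)]
struct FullArtifactPath(PathBuf);

impl StoreRoot {
    fn new(root: PathBuf) -> Self {
        StoreRoot(root)
    }

    // The only way to construct a FullArtifactPath:
    // no raw path joins are possible elsewhere.
    fn join(&self, artifact: &ArtifactPath) -> FullArtifactPath {
        FullArtifactPath(self.0.join(&artifact.0))
    }
}

impl FullArtifactPath {
    fn as_path(&self) -> &Path {
        &self.0
    }
}

fn main() {
    let root = StoreRoot::new(PathBuf::from("/var/lib/store"));
    let artifact = ArtifactPath(PathBuf::from("bash-5.1.tar.gz"));
    let full = root.join(&artifact);
    println!("{}", full.as_path().display());
}
```

Because the compiler rejects any code that passes an ArtifactPath where a FullArtifactPath is expected, mixing up relative and absolute paths becomes a type error instead of a runtime bug.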

Orchestrator rewrite(s)

In the above section on the general architecture of butido, the “Orchestrator” was already mentioned. This type takes care of loading the repository from disk, selecting the package that needs to be built, finding all the (reverse) dependencies of the package, and transforming each package into a runnable job that can be submitted to the scheduler – and from there to the endpoints – using workers to orchestrate the whole thing.

The Orchestrator itself, though, was rewritten twice.

In the beginning, the orchestrator was still working on a Tree of packages. This tree was processed layer-by-layer:

    A
   / \
  B   E
 / \   \
C   D   F

So this tree resulted in the following sets of jobs:

[
    [ C, D, F ],
    [ B, E ],
    [ A ],
]

and because the packages within one list do not depend on each other, each list could be processed in parallel, one list after another.
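The layer-splitting of that first implementation can be sketched as follows: a package's layer is its height above the leaves, and layers are built in order, deepest first. This is illustrative code, not butido's actual implementation:

```rust
use std::collections::HashMap;

// Split a dependency tree into layers: a package's layer is its
// height above the leaves, so layer 0 can be built first.
fn layers(deps: &HashMap<&str, Vec<&str>>, root: &str) -> Vec<Vec<String>> {
    fn height(deps: &HashMap<&str, Vec<&str>>, pkg: &str) -> usize {
        deps.get(pkg)
            .map(|cs| cs.iter().map(|c| height(deps, c) + 1).max().unwrap_or(0))
            .unwrap_or(0)
    }
    fn visit(deps: &HashMap<&str, Vec<&str>>, pkg: &str, out: &mut Vec<Vec<String>>) {
        let h = height(deps, pkg);
        while out.len() <= h {
            out.push(Vec::new());
        }
        out[h].push(pkg.to_string());
        for c in deps.get(pkg).into_iter().flatten() {
            visit(deps, c, out);
        }
    }
    let mut out = Vec::new();
    visit(deps, root, &mut out);
    out
}

fn main() {
    // The example tree from the text: A -> {B, E}, B -> {C, D}, E -> {F}
    let mut deps = HashMap::new();
    deps.insert("A", vec!["B", "E"]);
    deps.insert("B", vec!["C", "D"]);
    deps.insert("E", vec!["F"]);
    for layer in layers(&deps, "A") {
        println!("{:?}", layer);
    }
}
```

For the example tree this yields the three job sets shown above – and it also exposes the weakness: E lands in the same layer as B even though E only depends on F.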

This was a simple implementation for a simple case of the problem, and it worked very well up to the first prototype. It was far from optimal though: in the above tree, the package named E could not start building before both C and D were finished, even though its only dependency, F, might already be done.

The first rewrite mentioned above solved that by reimplementing the job-spawning algorithm to perform better on such and similar cases – which are in fact not that uncommon for us.

The second rewrite of the Orchestrator, which happened shortly after the first one (after we understood the problem at hand even better), optimized the algorithm to the best possible solution. The new implementation uses a trick: it spawns one worker for each job. Each of those workers has “incoming” and “outgoing” channels (actually multi-producer, single-consumer queues). The orchestrator associates each job in the dependency DAG with its parent by connecting the child's “outgoing” channel to the parent's “incoming” channel. Leaf nodes in the DAG have no “incoming” channel (it is closed right away after instantiation), and the “outgoing” channel of the root node of the DAG sends to the orchestrator itself.

The channels are used to send either successfully built artifacts to the parent, or an error.

Each worker then waits on its “incoming” channels for the artifacts it depends on. If it receives an error, it does nothing but forward that error to its parent. If all artifacts are received, it schedules its own build on the scheduler and sends the result of that process to its parent.

This way, artifacts and/or errors propagate through the DAG until the Orchestrator gets all results. And the tokio runtime, which is used for the async-await handling, orchestrates the execution of the processes automatically.
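The channel wiring described above can be sketched like this – for brevity it uses std::sync::mpsc and threads instead of tokio's async channels and tasks, and artifacts are plain strings:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

type Artifact = String;
type JobResult = Result<Artifact, String>;

// A worker waits for the results of its dependencies, then "builds"
// and forwards its own result to the parent.
fn worker(name: &'static str,
          incoming: Vec<Receiver<JobResult>>,
          outgoing: Sender<JobResult>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for rx in &incoming {
            if let Err(e) = rx.recv().unwrap() {
                // A dependency failed: propagate the error, build nothing.
                outgoing.send(Err(e)).unwrap();
                return;
            }
        }
        // All dependencies succeeded: run the build (stubbed here).
        outgoing.send(Ok(format!("{}.pkg", name))).unwrap();
    })
}

fn main() {
    // B depends on A; the orchestrator listens on B's outgoing channel.
    let (a_tx, a_rx) = channel();
    let (b_tx, b_rx) = channel();
    let a = worker("A", Vec::new(), a_tx);
    let b = worker("B", vec![a_rx], b_tx);
    a.join().unwrap();
    b.join().unwrap();
    println!("{:?}", b_rx.recv().unwrap()); // Ok("B.pkg")
}
```

The execution order falls out of the data flow: nobody schedules anything explicitly, each worker simply blocks until its dependencies have reported in.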

Tree vs. DAG

Another problem we encountered was not so much a problem as a simplification we allowed ourselves when writing the package-to-job conversion algorithm the first time. When we began implementing the package-loading mechanism, we used a tree data structure for the recursive loading of dependencies. That meant that a structure like this:

A ------> B
|         ^
|         |
 `-> C --´ 

Was actually represented like this:

A ----->  B
|
|
 `-> C -> B 

That meant that B was built twice: Once as a dependency of A and once as a dependency of C. That was a simplification we used to get a working prototype fast, because implementing the proper structure (a DAG) was considered too involved at the time, given the “prototype nature” the project had back then.

Because we used clear separation of concerns in the codebase, and the module implementing the datatype that holds the loaded package dependency collection was separated from the rest of the logic, replacing the tree structure with a DAG structure was not too involved (merge).

To put a bit more emphasis on that: when we changed the implementation of the package dependency collection handling, we didn't need to rewrite the Orchestrator (except for changing some interfaces), because the two parts of the codebase were separated well enough that changing the implementation on one side didn't force a rewrite on the other side.

Open sourcing

For me it was clear from the outset that this tool is not what we call our intellectual property. Our expertise is in the scripts that actually execute the builds, in our handling of the large number of special cases for every package/target-distribution/customer combination. The tool that orchestrates these builds is nothing that could be sold to someone, because it is not rocket science.

Because of that, I asked whether butido could be open sourced. Long story short: I was allowed to release the source code under an open source license, and since we had some experience with the Eclipse Public License (EPL 2.0), I released it under that license.

Before we were able to release it under that license, I had to add some package lints (cargo-deny) to verify that all our dependencies were compatible with that license, and I had to remove one package that was pulled in but had an unclear license. Because our CI checks the licenses of all (reverse) dependencies, we can be sure that such a package is not added again.

butido 0.1.0

After about 380 hours of work, butido 0.1.0 was released. Since then, four maintenance releases (v0.1.{1, 2, 3, 4}) have been published to fix some minor bugs.

butido is usable for our infrastructure; we've started packaging our software with it and started (re)implementing our scripts in the butido-specific setup. For now, everything seems to work properly, and at the time of writing this article, about 30 packages have already been successfully packaged and built.

In general, the longer you wait before fixing a bug, the costlier (in time and money) it is to fix. (Joel Spolsky, Joel on Software)

Of course there might still be some bugs, but I think the most serious problems are dealt with for now, and I'm eager to start using butido at a large scale.

Plans for the future

We also have plans for the future: some of these plans are more in the range of “maybe in a year”, but some of them are also rather short-term.

Some refactoring is always on such a list, and so it is with butido: when implementing certain parser helpers, we used the “pom” crate, for example. This should be replaced by the much more popular “nom” crate, just to be future-proof here. Some frontend (CLI) cleanup should be done as well, to ensure consistency in our user experience.

Logging was implemented using the “log” crate, but as we are talking about an async-await codebase here, we might switch to “tracing”.

Some more general improvements are needed as well. For example, we store the “script” that gets compiled and sent to the containers to execute the build in the database, for each build. This is not very efficient: at some point we basically have hundreds of copies of the script in the database, with only small changes from one build to another (or even none).
But because the script is in the git history of the package repository, it would be easy to regenerate the script from an old checkout of the repository. This is a really nice thing that could be improved, as it would reduce database pressure by a nontrivial amount.

Still, the biggest pressure on the database comes from the fact that we write build logs to it. Some of these build logs have already reached 10,000 lines. Of course it is not very efficient to store them in a postgres TEXT field, though at the time of initial development it was the simple solution.
There are a few thoughts flying around on how we could improve that while upholding our guarantees – especially that a log cannot easily be modified after it was written. Yes, of course, it can be altered in the postgres database as well, but the hurdle (writing an SQL statement) is much higher than with, for example, plaintext files (vim <file>). And do not get the impression that this scenario is far-fetched!

whatever can go wrong, will go wrong.

(Murphy's Law)

Having a more structured storage format for the log (especially with timestamps on each line) could enable some interesting forecasting features, where butido approximates when a build will finish by matching the output lines of the running build against the output lines of an old build.

The biggest idea for butido, though, is refactoring the software into a service.

The first step for that would be refactoring the endpoint-handling modules into services. This way, if several developers submit builds, butido would talk to a handful of remote services that represent one endpoint each. Such a service could then decide whether there are still resources available on the host or whether the incoming job should be scheduled at a later time.

The next step would then be to refactor the core software component into a service as well. In that world, the developer would only have a thin client on the command line that could be used to submit builds to the service. The service would then schedule the builds appropriately, taking care of resources automatically.
Of course, integration with other services, such as prometheus for gathering statistics or nexus as an artifact repository, would then be easily possible.

All these things are more mid- or long-term ideas rather than things that will happen within just a few months.

Improvements over the status-quo

In the following section, I want to highlight some improvements over our old setup. Some changes cannot be measured, because they have an impact on quality-of-life, joy of use and, for example, the organization of our setup. So none of the following is hard science, because the tools compared here are too different in their approach, functionality and behaviour. Still, this section should give a good impression of how things improved.

Non-measurable improvements

The package definition format is now typed, contrary to being only a bash script in our old setup. That gives us stronger guarantees about its correctness. The interpolation of package variables into the package scripts fails if a required variable is not present. The package definition now holds metadata about the package (such as a description, the project homepage, source URL of the package, license, dependencies). All these things could've been possible before, but were never implemented. Dependencies were not hardcoded into a package, but semi-automatically searched by the old tooling. Although we had to tell each package what it depended on, the exact version was heuristically searched for. Now it is hardcoded and we always know exactly which version will be used.
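To give an impression of what a typed package definition with this metadata could look like, here is a rough pkg.toml sketch – note that the field names and values are purely illustrative and do not necessarily match butido's actual schema:

```toml
# Illustrative sketch only; field names may differ from butido's real format.
name = "example"
version = "1.0.0"
description = "An example library"
homepage = "https://example.org"
license = "MIT"

# Dependencies are hardcoded with exact versions, not searched heuristically.
dependencies = ["zlib =1.2.11"]

[source]
url = "https://example.org/example-1.0.0.tar.gz"
sha1 = "0000000000000000000000000000000000000000"
```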

The package scripts are then compiled into one big script that executes the whole build of one package. Each of these scripts is run through a linter (for example shellcheck) on each build, ensuring that we do not mess up minor things that would bite us in the long run. That was simply not possible before.

From experience I can say that updating a package in butido is way faster than updating it with our old infrastructure. If the updated package does not need any customizations beyond those of the prior version, a package update is as simple as:

  • mkdir package/new-version
  • cp package/{old,new}-version/pkg.toml
  • Updating the version string, source URL and SHA1-sum in that file
  • butido source download <package> && butido build ...

In our old setup, a whole new package had to be created. This is, of course, also scripted, but it is not as simple as the above steps:

  • mkdir package/new-version && cd package/new-version
  • create-new-project
  • wget "<SOURCES>"
  • tar xf <SOURCES>, renaming the source directory to match our naming scheme if necessary
  • Manually adapting the generated configuration of that package, although sometimes copying it from an old version worked
  • Call the tooling to build

Of course, none of these are measurable improvements. So let's have a look at the measurable ones!

Measurable improvements

Because of the new package description format, we can configure each package build individually, for example how many make threads should be used to build it (think of make -j 10). Building small packages with a huge number of threads only introduces overhead, while building large packages with a small number results in longer builds.

The layered nature of our package format highly encourages code reuse. Consider python: the full buildscript for python 3.7.5 is, in our new setup, 664 lines long. The actual package definition for the python interpreter is just 68 lines, and that includes all metadata. The rest of those 664 lines is the surrounding framework for building our packages, which is shared between all packages. The non-shared, python-3.7.5-specific part is only 11 lines. In our old setup, the package definition for python 3.7.5 is 167 lines of bash, not including the framework itself. So for a fair comparison, we must compare those 167 lines with the 68 lines – or rather the 11 lines – of the new setup, which means our new setup needs only 40%, or rather 6.6%, of the old one. Of course this comparison is not completely fair, but – in my opinion – it is a good approximation.

A full build of our python interpreter takes less than 10 minutes! That includes generating build scripts for 12 packages, linting all individual scripts, checking that each of the sources still has the expected SHA1 sum, putting all packages into a DAG and traversing it, and building the packages in parallel (on two dedicated machines), each package in its own docker container.

For reference, the package tree python looks like this (some packages are displayed multiple times because this is a tree representation of the DAG):

python 3.7.5
├─ sqlite 3.35.1
├─ readline 8.1
├─ openssl 1.1.1i
├─ libxslt 1.1.34
│  ├─ libgcrypt 1.9.2
│  │  └─ libgpg-error 1.42
│  └─ libxml2 2.9.10
├─ libxml2 2.9.10
├─ libtk 8.6.10
│  └─ libtcl 8.6.10
├─ libtcl 8.6.10
├─ libffi 3.2.1
└─ libbzip2 1.0.8

As soon as the libraries in this tree are released, butido automatically reuses the packages if it finds that a build would result in the same package, reducing the buildtime even more!

The overhead that butido itself introduces is minimal. Considering the mentioned python package and its dependencies: the actual builds take 9:22 minutes, while the call to butido build takes 9:40 until it returns with exit code 0. That overhead of 18 seconds is 3.1% of the total time – negligible compared to the roughly 10 minutes of overall build time!

With our old tooling, this is hardly measurable. Because of its in-sequence builds and the many things happening between the individual package building steps, as well as the fact that the tooling builds in one docker container on one host, the build for python takes considerably longer. Also, the gathering of meta-information and the general overhead of the tool itself might easily be a multiple of the overhead of butido.

Comparing two builds

To get hard numbers for a clean build of a package, I had to use cmake in version 3.20.4, because our package for python differs too much from one setup to the other (especially concerning dependencies) to make for a helpful comparison.

So, for the following numbers, I started a cmake 3.20.4 build for CentOS 7 with our old tooling, making sure that 24 threads were used (the machine has 24 cores). The cmake package has no dependencies, so we can compare the numbers directly here.

The old tooling took 3:46 minutes from calling the tool until it returned with exit code 0.

The call to butido was made with time butido build cmake --no-lint --no-verify -I local:rh7-default -E TARGET_FORMAT=rpm -- 3.20.4, where linting and hash-sum verification were turned off to be more comparable (our old tooling does neither). butido took 3:32 minutes, of which the actual build took 3:26 minutes. That means butido has an overhead of 6 seconds, whereas the old tooling had an overhead of 14 seconds.

The comparability of those numbers is still a bit limited, because of the following points:

  • butido starts the container on the remote host, whereas with the old tooling the container is already running
  • butido uploads all sources for the package and the build script over the network to the build host and into the container, whereas the old tooling already has all of that available
  • butido downloads the resulting package from the container via the network, whereas the old tooling simply puts it on the filesystem
  • butido stops the container after a successful build, the old tooling does not

Still, these numbers show a significant improvement in speed, especially considering these points.

Some more things

Now is the time to conclude this article.

First of all, I want to thank my employer for letting me write this piece of software and especially for letting me share it with the community as open source software! It was a very interesting journey, I learned a lot, I think I was very productive, and – as far as I can tell – it turned out to be a valuable piece of work for our future progress on the whole subject of building software packages.

I think I am allowed to speak on behalf of my team colleagues when I say that they/we are excited to use the new tooling.

I also want to thank everyone who is involved in the Rust ecosystem, either by hanging out on IRC, Matrix or other communication channels, commenting on github or, and especially, by contributing to crates in the ecosystem. The next – and last – chapter in this article will show that I tried to contribute back wherever I saw a chance!

One more thing that is very important to me and that I want to highlight quickly: this article, but also butido itself, would have been much harder to write if we hadn't committed cleanly from day one. The fact that we crafted commit messages carefully (for example see here, here, here, here, here or here, just for some cases) made navigating the history a breeze.

Contributions to the Rust ecosystem

During the development of butido, contributions to a few other crates from the ecosystem were made. Some of these are PRs that were created, some of these are discussions where I took part to improve something, some of these are issues I reported or questions I've asked the authors of a crate.

During the development of butido, I also got maintainer rights for the config-rs crate, which I did the 0.11.0 release for and where I'm eager to continue contributing.

I want to highlight that these contributions were not necessarily financed by my employer. A few of them were mission-critical for butido itself. Nevertheless, none of them were completely irrelevant for butido, and although some of them are trivial, they all benefited the implementation of butido in one way or another.

The following list is probably not exhaustive, although I tried to catch everything.

  • config-rs
    • Issue 163 on how relative paths are handled in the crate
    • Issue 167 on an edge case, as only signed 64 bit integers are supported
    • Issue 170 for implementing github actions in the repository
    • Issue 171 about an API inconvenience
    • Issue 182 about a floating-point misuse bug
    • PR 137 on interface extensions for environment variable usage
    • PR 142 on a new supported backend format
    • PR 152 on updating a dependency
    • PR 154 updated an example
    • PR 155 formatted the codebase
    • PR 156 made formatting enforced
    • PR 166 added a builder-pattern style helper function
    • PR 169 fixing tests
    • PR 172 adding a new feature
    • PR 174 integrating a maintenance fork of the crate
    • PR 175 replacing travis with github-actions
    • PR 177 Release v0.11.0
    • PR 178 for support for different-sized integers
    • PR 179 code cleanup
    • PR 180 changing the interface to be builder-pattern style
    • PR 181 documentation upgrades
    • PR 183 adding a test for a known bug
    • PR 184 removing an old interface
    • PR 185 fixing a bug with custom separators
    • PR 186 fixing an issue with CI
    • PR 187 adding a test
    • PR 188 adding a test for a known bug
    • PR 189 updating an inconsistent interface
    • PR 190 removing dead code
    • PR 191 removing unused import statements
    • PR 192 adding a test for a known bug
    • PR 193 misc changes
    • PR 195 adding a block-fixup-commits github-action
    • PR 196 on a new builder-pattern interface
    • PR 199 improving the github-actions workflow
    • PR 200 updating dependencies
    • PR 202 on a new supported backend format
  • shiplift
    • Issue 275 about an API inconvenience
    • PR 237 changed a library interface to be more ergonomic
    • PR 242 updated the error implementation to use the thiserror crate
    • PR 254 fixing a bug
    • PR 256 on more explicit initialization
    • PR 258 on interface deprecation and replacing it with an up-to-date one
    • PR 261 on a change of the default behaviour for pull options
    • PR 262 increasing type safety by using types for data instead of loosely typed json objects
    • PR 265 a rustfmt configuration fixup
    • PR 267 – discussion on code refactoring
    • PR 273 fixed a documentation typo
    • PR 274 – discussion about README fixes
    • PR 276 documentation improvements
    • PR 277 adding a new, safe interface
    • PR 278 updating dependencies
    • PR 279 adding documentation links
    • PR 284 fixing lints
    • PR 286 fixing a clippy lint
  • indicatif
    • Issue 212 asked about an issue with the general usage of the library
    • Issue 216 asked about double-rendered bars
    • Issue 239 opened a feature request that progress-bar colors should be changeable if the process succeeded
  • tokio
    • PR 3106 asked for a nicer interface to build a stream from an iterator
    • PR 3685 fixing a documentation inconsistency
  • handlebars-rust

Thanks for reading.


“Thoughts” is (will be) a weekly roll-up of my mastodon feed with some notable thoughts collected into a long-form blog post. “Long form” is relative here, as I will only expand a little on some selected subjects, not write screens and screens of text on each of these subjects.

If you think I toot too much to follow, this is an alternative to follow some of my thoughts.


This week (2021-05-08 – 2021-05-14) I thought about vendor lock once more and played a bit with my raspberry.

Github and Vendor-Lock

My discontent with #github continues, but I have to admit that github-actions is very nice, and I like it more every time I work with it. I'm not sure how this plays into the vendor-lock thing mentioned earlier.

I also voiced my discomfort... no, lets face it: my anger with people that cannot obey the simplest commit message rules.

Sensor stuff

I was able to boot my old Raspberry Pi (1) with raspbian (german) (unfortunately #nixos failed to boot and I don't know why), which makes my plans for sensors in my flat (last week, german) a little easier. It won't be able to run prometheus and grafana because 512MB of RAM is just not enough for these, I guess. Still, I am one step further towards the goal!

Movies

I watched “The Covenant” and asked the fediverse whether there's a database of movies with spiders, so I can avoid them. This would in fact be a really helpful database, and I am sad that no such thing exists. I am not actually arachnophobic, but I really don't like them and prefer to watch movies without any spiders or similar creatures (crabs, scorpions, ...).

Idiotic shopping

Well, I had some encounters (german) with overly stressed workers at my local grocery store this week.

The thread linked above tells you everything, I guess. I am still baffled by how idiotically some people in our society behave, even though these are not the worst kind of people running around these days (especially if you consider covid denialists and so on).

Happy new year!

I just managed to implement syncthing monitoring for my prometheus and grafana instance, so I figured to write a short blog post about it.

Note: This post is written for prometheus-json-exporter pre-0.1.0 and the configuration file format changed since.

Now, as you've read in the note above, I managed to do this using the prometheus-json-exporter. Syncthing has a status page that can be accessed with

$ curl localhost:22070/status

if enabled. This can then be scraped by prometheus via the prometheus-json-exporter mentioned above, using the following configuration for mapping the values from the JSON to prometheus metrics:

- name: syncthing_buildDate
  path: $.buildDate
  help: Value of buildDate

- name: syncthing_buildHost
  path: $.buildHost
  help: Value of buildHost

- name: syncthing_buildUser
  path: $.buildUser
  help: Value of buildUser

- name: syncthing_bytesProxied
  path: $.bytesProxied
  help: Value of bytesProxied

- name: syncthing_goArch
  path: $.goArch
  help: Value of goArch

- name: syncthing_goMaxProcs
  path: $.goMaxProcs
  help: Value of goMaxProcs

- name: syncthing_goNumRoutine
  path: $.goNumRoutine
  help: Value of goNumRoutine

- name: syncthing_goOS
  path: $.goOS
  help: Value of goOS

- name: syncthing_goVersion
  path: $.goVersion
  help: Value of goVersion

- name: syncthing_kbps10s1m5m15m30m60m
  path: $.kbps10s1m5m15m30m60m
  help: Value of kbps10s1m5m15m30m60m
  type: object
  values:
    time_10_sec: $[0]
    time_1_min: $[1]
    time_5_min: $[2]
    time_15_min: $[3]
    time_30_min: $[4]
    time_60_min: $[5]

- name: syncthing_numActiveSessions
  path: $.numActiveSessions
  help: Value of numActiveSessions

- name: syncthing_numConnections
  path: $.numConnections
  help: Value of numConnections

- name: syncthing_numPendingSessionKeys
  path: $.numPendingSessionKeys
  help: Value of numPendingSessionKeys

- name: syncthing_numProxies
  path: $.numProxies
  help: Value of numProxies

- name: syncthing_globalrate
  path: $.options.global-rate
  help: Value of options.global-rate

- name: syncthing_messagetimeout
  path: $.options.message-timeout
  help: Value of options.message-timeout

- name: syncthing_networktimeout
  path: $.options.network-timeout
  help: Value of options.network-timeout

- name: syncthing_persessionrate
  path: $.options.per-session-rate
  help: Value of options.per-session-rate

- name: syncthing_pinginterval
  path: $.options.ping-interval
  help: Value of options.ping-interval

- name: syncthing_startTime
  path: $.startTime
  help: Value of startTime

- name: syncthing_uptimeSeconds
  path: $.uptimeSeconds
  help: Value of uptimeSeconds

- name: syncthing_version
  path: $.version
  help: Value of version

When configured properly, one is then able to draw graphs using the syncthing-exported data in grafana.

There's nothing more to it.

tags: #nixos #grafana #prometheus #syncthing

My between-the-years project was trying to use my old Thinkpad to run some local services, for example MPD. I thought that the Thinkpad did not even have a drive anymore, and was surprised to find a 256GB SSD inside it – with nixos still installed!

AND RUNNING!

So, after almost two years, I booted my old Thinkpad, entered the crypto password for the harddrive, and got greeted with a login screen and an i3 instance. Firefox asked whether I want to start the old session again... everything just worked.

I was amazed.

Well, this is not the crazy thing I wanted to write about here. The problem now was: I update and deploy my devices using krops nowadays. This old installation had root login disabled, but krops requires root login to work...

But, because nixos is awesome, I did nothing more than check out the git commit the latest generation on the Thinkpad was booted from, modify some settings for the ssh server and the root user, rebuild the system and switch to the new build... and then I started the deployment of the new nixos 20.09 installation using krops.

All without hassle. I might run out of disk space now, because this deploys a full KDE Plasma 5 installation, but honestly that would surprise me... there should be enough space. I am curious, though, whether KDE Plasma 5 runs on the device. We'll see...

tags: #nixos #desktop

In the last few months, I was invited multiple times to join the NixOS organization on GitHub. I always rejected. Here's why.

Please notice

Please notice that I really try to write this down as constructive criticism. If any of this offends you in any way, please let me know and I'll rephrase the specific part of this article. I really do care about the NixOS community: I've been a user of NixOS (on all my devices except my phone) since mid-2014, I've been a contributor since January 2015, and I continue to be a user and an author of contributions.

I do think that Nix, or even NixOS, is the one true way to deploy systems that need to be reproducible, even if that means sacrificing certain comforts.

Context

Second, I need to provide some context about where I'm coming from, so the dear reader can understand my point of view in this article.

First of all, I did not start my journey with NixOS, of course. In fact, I was a late bloomer with regard to Linux. I was introduced to Ubuntu by a friend of mine in 11th grade. I started to use Kubuntu, but only a few weeks later my friend noticed that I was getting better and better with the terminal, so maybe not even half a year later I switched to Archlinux, which I used on my desktops until I was introduced to NixOS. In that time, I learned how to write Java (which I do not do anymore, btw), Ruby and C, started hacking on a lot of funny things and managed to contribute patches to the Linux kernel about two years later.

I'm not trying to show off here! That last bit is important for this article, especially if you know how the kernel community and its development process work. I guess you know where this is going.

I heard of NixOS in late 2014 at a conference in the Black Forest, where Joachim Schiele talked about it. A few months later, my LaTeX setup broke from an update and I was frustrated enough with Archlinux to try something new.

I never looked back.

The “early days”

When I started using NixOS, Nix, the package manager, had already existed for about ten years. Still, the community was small. When I went on the IRC channel or the mailinglist, I could easily remember the nicknames, and I was able to skim through the subjects of the mails on the list to see what was going on, even though I did not understand all of it.

That soon changed. I remember the 15.09 release when everyone was super excited and we were all “yeah, now we're beginning to fly” and so on. Fun times!

Problem 1: Commit access and development process

Now, lets get into the problems I have with the community and why I reject the invitations to join the github organization.

The problem

In fact, I started asking and telling people about this pretty early on: five(!) years ago, I started replying to an email thread with this message

Quote:

Generally, I think it would be best to prevent commit access as far as possible. Having more people able to commit to master results in many different opinions committing to master, which therefore results in discussions, eventually flamewars and everything.

Keeping commit access for only a few people does not mean that things get slower, no way!

[...]

What you maybe want, at least from my point of view, is staging branches. Some kind of a hierarchy of maintainers, as you have in the linux kernel. I fully understand that the linux kernel is a way more complex system than nixos/nixpkgs, no discussion here. But if you'd split up responsibilities, you may end up with

* A fast and secure development model, as people don't revert back and forth.

* Fewer “wars” because people disagree on things

* Less maintaining efforts, because the effort is basically split up in several small “problems”, which are faster to solve.

What I want to say is, basically, you want a well-defined and structured way of how to do things.

Also please note that there's another mail from Michael Raskin in that thread where we talked about 25 PRs for new packages. Right now we're at about 1.8k open pull requests, with over 580 of them for new packages.

I take that as proof that we did not manage to sharpen and improve the process.

Let's get to the point. I started telling people that the development process we had back then was not optimal. In fact, I saw it coming: the community started to grow at a great pace back then, and soon I talked to people on IRC and the mailinglist and thought “Who the hell is this? I've never seen this name before, and they don't seem to be new, because they already know how things work and teach me...”.

The community grew and grew, over 4500 stars on github (if that measures anything), over 4500 forks on github.

When we reached 1k open pull requests, some people started noticing that we might not be able to scale anymore at some point. “How could we possibly manage that amount of pull requests, ever?”

Now we're at about 1.8k open pull requests.

I proposed changes several times, including moving away from GitHub, which IMO does not scale to that number of issues and PRs, especially because of its centralized structure and its linear discussions.

I proposed switching to a kernel-style mailinglist. I was rejected with “We do not have enough people for that kind of development model”. I suspect that people did not understand what I meant by “kernel-style” back then (nor do I think they understand now). But I'm sure, now more than ever, that a switch to a mailinglist-based development model, with enough automation in place for CI and static analysis of patches, would have had the best possible impact on the community. Even if that'd mean that newcomers would be a bit thrown off at first!

The current state of affairs is even worse. Right now (as of this commit), we have

  • 1541 merges on master since 2020-01-01
  • 1601 patches pushed directly to master since 2020-01-01

Feel free to reproduce these numbers with

$ git log --oneline --first-parent --since 2020-01-01 --[no-]merges | wc -l

That means that we had 1601 possibly breaking patches pushed by people who think they are good enough and that their code never breaks. I'll leave it to the dear reader to google why pushing to master is a bad idea in a more-than-one-person project.

Another thing that sticks out to me is this:

$ git log --first-parent --since 2020-01-01 --merges | \
    grep "^Author" |    \
    sort -u |           \
    wc -l
74

74! 74 people have access to the master branch and can break it. I do not allege incompetence in any of these people, but we all know that not everything always works as smoothly as we expect, especially in software development. People are tired sometimes, people make mistakes, people miss things when reviewing. That's why we invented continuous integration in the first place! So that something can check whether the human part of the process did the right thing and report back if it didn't.

The solution

My dream scenario would be that nobody has access to master except for a bot like bors (or something equivalent for the Nix community). The Rust community, which uses bors heavily, does software development the right way. If all checks pass, merging is done automatically. If not, the bot finds the breaking change by using a clever bisecting algorithm and merges all other (non-breaking) changes.

In fact, I would go further and introduce teams. Each team would be responsible for one task in the community. For example, there are different packaging ecosystems within the nixpkgs repository, one for every language. Each language could get a team of 3 to 5 members who coordinate the patches that come in (from normal contributors) and apply them to a <language>-staging branch. That branch would be merged to master on a regular basis (like... every week) if all tests/builds succeed (just like the kernel community does it)!
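To make the proposed flow concrete, here is a tiny runnable sketch of the idea in plain git, with a stub standing in for the real CI run (branch and commit names are made up):

```shell
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q
git -c user.email=a@b.c -c user.name=test commit -q --allow-empty -m "init"
git branch -M master

# a team applies contributor patches to its staging branch
git checkout -q -b python-staging
git -c user.email=a@b.c -c user.name=test commit -q --allow-empty -m "pythonPackages.foo: update"

ci_check() { true; }   # stand-in for the real CI pipeline on the staging branch

# the weekly merge to master happens only if CI is green
git checkout -q master
ci_check && git -c user.email=a@b.c -c user.name=test merge -q --no-ff \
    -m "Merge python-staging into master" python-staging

git log --oneline -1
```

Master only ever moves through such gated merges; nobody pushes patches to it directly.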

A team could also be introduced for some subsets of packages... Qt packages, server software, but also nixpkgs-libs or even documentation (which is another subject in itself).

Problem 2: “Kill the Wiki”

In 2015, at the NixCon in Berlin, we had this “Kill the Wiki” moment. As far as I remember it was Rok who said that (not sure though). I was not a fan back then, and I'm even less a fan of that decision now.

Killing the wiki was the worst thing we could do documentation-wise. Every time I tell people about NixOS, I have to tell them that there is no decent documentation around. There is, of course, the documentation that is generated from the repository. That one is okay for the initial setup, but it is far away from being a good resource if you just want to look up how some things are done.

The nixos.wiki efforts fill the gap here a bit, sure. But we could really do better.

The solution would be rather simple: bring back a wiki software. Even if we start from scratch, or “just” merge the efforts from nixos.wiki – or make that one the official one – it would be an improvement either way!

Problem 3: “Kill the mailinglist”

Seriously, what is it with this community and killing its own infrastructure? They killed the wiki, they killed the mailinglist... both things that are really valuable... but GitHub, the one thing that actually slows us down, does not get killed... I am stunned, really.

The solution here is also really simple: bring it back. And not Google Groups or some other shitty provider; just host a Mailman instance and create a few mailinglists... like the kernel.

I hope I do not have to write down the benefits here because the reader should be aware of them already. But for short:

  • Threaded discussions (I can reply multiple times to one message, quote parts and reply to each part individually, creating a tree-style discussion where each branch focuses on one point)
  • Asynchronous discussions (I can reply to a message in the middle of a thread rather than appending)
  • Possibility to work offline (yeah, even in our age this is important)
  • User can choose their interface (I like to use mutt, even on my mobile if possible. Web UIs suck)

I am aware that the “replacement” (which it really isn't), Discourse, is capable of going into mailinglist mode. Ask me how great that is compared to a real mailinglist!

It is not.

The silver lining...

This article is a rather negative one, I know that. But I do not like to close on such a negative note.

In fact, we got the RFC process, which we did not have when I started using nixos. We have the Borg bot, which helps a bit and is a great effort. So, we're in the process of improving things.

I'm still positive that, at some point, we improve the rate of improvements as well and get to a point where we can scale up to the numbers of contributors we currently have, or even more.

Because right now, we can't.

Errata

I did make some mistakes here, and I want to thank everyone who pointed them out.

Numbers

Some nice folks on the NixOS IRC/Matrix channel suggested that my numbers for PRs vs. pushes to master were wrong, as GitHub's squash-and-merge feature is enabled on the nixpkgs GitHub repository.

It seems that about 4700 PRs were merged since 2020-01-01. This does prove my numbers wrong. Fact is: on my master branch of the nixpkgs GitHub repository, there are 3142 commits. It seems that not all pull requests were against master, which is of course true, because PRs can be and are filed against the staging branch of nixpkgs and also the stable branches.

GitHub does not offer a way to query PRs that are filed against a certain branch (at least not in the web UI), as far as I can see.

So let's do some more fine-granular analysis on the commits I see on master:

$ git log --oneline --first-parent --since 2020-01-01 | \
    grep -v "Merge pull request" | \
    wc -l
1650

As GitHub creates the commit message for the merge, we can grep that away and see what remains. I am assuming here that nobody ever changes the default merge commit message, which might not be entirely true. I assume, though, that it does not happen that often.

So we have 3142 commits, of which 1650 are not GitHub branch merges.

From time to time, master gets merged into staging and the other way round:

  • 20 merges from master to staging
  • 5 merges from staging to master

That leaves us with 1625 commits where the patch landed directly on master. How many of these patches were submitted via a pull request is not that easy to evaluate. One could write a crawler that finds the patches on GitHub and checks whether they appear in a PR... but honestly, my point still holds true: if only one breaking patch lands on master per week, that results in enough slow-down and pain for the development process.
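The arithmetic above, spelled out (all numbers are the ones quoted in this section):

```shell
commits_on_master=3142   # commits on master since 2020-01-01
non_merge=1650           # commits that are not GitHub "Merge pull request" commits
cross_merges=$((20 + 5)) # master<->staging merges, counted above
echo $((non_merge - cross_merges))   # prints 1625
```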

The inconsistency in the process is the real problem. Having a mechanism that handles and schedules CI jobs and merges, and a clear merge window for per-topic changesets from team-maintained branches, would give the community some structure. New contributors could be guided more easily, as they would have a counterpart to contact for topic-specific questions, and negotiations wouldn't be between individuals anymore but between teams, which would also clarify responsibilities.

tags: #nixos #community

This will be a short article.

I did it. I switched to Wayland. On my new device, which was installed with NixOS 19.03 (which came out just a few days ago), I just switched away from X and i3 to Wayland and sway.

And everything just works. How awesome is that?

tags: #nixos #wayland

On April 4th, NixOS 18.03 was released.

It is by far the best NixOS release so far, featuring Nix 2.0, major updates for the KDE and GNOME desktop environments, the Linux kernel updated from 4.9 to 4.14, and updates for glibc, gcc and systemd.

With this release, I switched from the unstable channel, which is basically the “rolling release” part of NixOS, to stable (18.03). I did that because I wanted to make sure I get even better stability. Don't get me wrong, even with the unstable channel, I had maybe two or three times in the last year where updating was not possible because one package did not build. But I want to be able to update always, and with 18.03 I get that (at least I hope so).

Also: as soon as I'm on vacation, possibly without the ability to connect to the internet (or without fast internet), I need a certain level of stability. And with the stable channel, the download size of updates shrinks, I guess, because the stable channel only gets small updates and security fixes.

I hope I will be able to switch from 18.03 to 18.09 in October without having too much trouble and downloading the world.

The update/upgrade process was surprisingly simple. The manual explains what to do, and it worked like every other unstable update: executing some shell commands and then relying on the guarantees NixOS gives me.

Now I'm running stable.

tags: #software #nixos

34c3 was awesome. I prepared a blog article as my recap, though I failed to provide enough content. That's why I will simply list my “toots” from mastodon here, as a short recap for the whole congress.

  • (2017-12-26, 4:04 PM) – Arrived at #34c3
  • (2017-12-27, 9:55 AM) – Hi #31c3 ! Arrived in Adams, am excited for the intro talk in less than 65 min! Yes, I got the tag wrong on this one
  • (2017-12-27, 10:01 AM) – Oh my god I'm so excited about #34c3 ... this is huge, girls and boys! The best congress ever is about to start!
  • (2017-12-27, 10:25 AM) – Be awesome to eachother #34c3 ... so far it works beautifully!
  • (2017-12-27, 10:31 AM) – #34c3 first mate is empty.
  • (2017-12-27, 10:46 AM) – #34c3 – less than 15 minutes. Oh MY GOOOOOOOOOD
  • (2017-12-27, 10:49 AM) – Kinda sad that #fefe won't do the Fnord this year at #34c3 ... but I also think that this year was to shitty to laugh about it, right?
  • (2017-12-27, 10:51 AM) – #34c3 oh my good 10 minutes left!
  • (2017-12-27, 11:02 AM) – #34c3 GO GO GO GO!
  • (2017-12-27, 11:16 AM) – Vom Spieltrieb zur Wissbegierig! #34c3
  • (2017-12-27, 12:17 PM) – People asked me things because I am wearing a #nixos T-shirt! Awesome! #34c3
  • (2017-12-27, 12:59 PM) – I really hope i will be able to talk to the #secushare people today #34c3
  • (2017-12-27, 1:44 PM) – I talked to even more people about #nixos ... and also about #rust ... #34c3 barely started and is already awesome!
  • (2017-12-27, 4:28 PM) – Just found a seat in Adams. Awesome! #34c3
  • (2017-12-27, 8:16 PM) – Single girls of #34c3 – where are you?
  • (2017-12-28, 10:25 AM) – Day 2 at #34c3 ... Yeah! Today there will be the #mastodon #meetup ... Really looking forward to that!
  • (2017-12-28, 12:32 PM) – Just saw ads for a #rust #wayland compositor on an info screen at #34c3 – yeah, awesome!
  • (2017-12-28, 12:37 PM) – First mate today. Boom. I'm awake! #34c3
  • (2017-12-28, 12:42 PM) – #mastodon ads on screen! Awesome! #34c3
  • (2017-12-28, 12:45 PM) – #taskwarrior ads on screen – #34c3
  • (2017-12-28, 3:14 PM) – I think I will not publish a blog post about the #34c3 but simply list all my toots and post that as an blog article. Seems to be much easier.
  • (2017-12-28, 3:15 PM) – #34c3 does not feel like a hacker event (at least not like the what I'm used to) because there are so many (beautiful) women around here.
  • (2017-12-28, 3:36 PM) – The food in the congress center in Leipzig at #34c3 is REALLY expensive IMO. 8.50 for a burger with some fries is too expensive. And it is even less than the Chili in Hamburg was.
  • (2017-12-28, 3:43 PM) – Prepare your toots! #mastodon meetup in less than 15 minutes! #34c3
  • (2017-12-28, 3:50 PM) – #34c3 Hi #mastodon #meetup !
  • (2017-12-28, 3:55 PM) – Whuha... there are much more people than I've expected here at the #mastodon #meetup #34c3
  • (2017-12-28, 4:03 PM) – Ok. Small #meetup – or not so small. Awesome. Room is packed. #34c3 awesomeness!
  • (2017-12-28, 4:09 PM) – 10 minutes in ... and we're already discussing pineapples. Community ftw! #34c3 #mastodon #meetup
  • (2017-12-28, 4:46 PM) – Limiting sharing of #toots does only work if all instances behave! #34c3 #mastodon #meetup
  • (2017-12-28, 4:56 PM) – Who-is-who #34c3 #mastodon #meetup doesn't work for me... because I don't know the 300 usernames from the top of my head...
  • (2017-12-28, 5:17 PM) – From one #meetup to the next: #nixos ! #34c3
  • (2017-12-28, 5:57 PM) – Unfortunately the #nixos community has no space for their #meetup at #34c3 ... kinda ad-hoc now!
  • (2017-12-28, 7:58 PM) – Now... Where are all the single ladies? #34c3
  • (2017-12-28, 9:27 PM) – #34c3 can we have #trance #music please?
  • (2017-12-28, 9:38 PM) – Where are my fellow #34c3 #mastodon #meetup people? Get some #toots posted, come on!
  • (2017-12-29, 1:44 AM) – Day 2 ends for me now. #34c3
  • (2017-12-29, 10:30 AM) – Methodisch Inkorrekt. Approx. 1k people waiting in line. Not nice. #34c3
  • (2017-12-29, 10:43 AM) – Damn. Notebook battery ran out of power last night. Cannot check mails and other unimportant things while waiting in line. One improvement proposal for #34c3 – more power lines outside hackcenter!
  • (2017-12-29, 10:44 AM) – Nice. Now the wlan is breaking down. #34c3
  • (2017-12-29, 10:57 AM) – LAOOOOLAAA through the hall! We did it #34c3 !
  • (2017-12-30, 3:45 AM) – 9h Party. Straight. I'm dead. #34c3
  • (2017-12-30, 9:08 PM) – After some awesome days at the #34c3 I am intellectually burned out now. That's why the #trance #techno #rave yesterday was exactly the right thing to do!
  • (2017-12-30, 11:35 PM) – Where can I get the set from yesterday night Chaos Stage #34c3 ??? Would love to trance into the next year with it!
  • (2017-12-31, 11:05 PM) – My first little #34c3 congress résumé: I should continue on #imag and invest even more time. Not that I do not continue it, but progress is slowing down with the last months of my masters thesis... Understandable I guess.

That was my congress. Yes, there are few toots after the 28th... because I was really tired by then and also had people to talk to all the time, so there was little time for microblogging. All in all: it was the best congress so far!

tags: #ccc #social

When working with Rust on NixOS, one always had the problem that the stable compiler was updated very sparsely. One time I had to wait six weeks after the rustc release until I finally got the compiler update for NixOS.

This is kind of annoying, especially because rustc is a fast-moving target and each new compiler release brings more awesomeness included. As an enthusiastic Rust programmer, I really want to be able to use the latest stable compiler all the time. I don't want to wait several days or even weeks until I can use it.

Use the overlay, Luke!

Finally, with NixOS 17.03 (and current unstable), we have overlays for nixpkgs. This means that one can “inject” custom Nix expressions into the nixpkgs set.

Meet the full awesomeness of nixpkgs!

Soon after overlays were introduced, the Mozilla nixpkgs overlay was announced on the nixos-dev mailinglist.

Now we can install rustc in stable, beta and nightly directly, with pre-built binaries from Mozilla. So we do not need to compile a rustc nightly from source on our local machines if we want the latest great Rust feature.
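As a sketch of how the setup looks, an ~/.config/nixpkgs/overlays.nix could pull in the overlay like this (the archive URL points at master here for brevity; in real setups you'd pin a revision):

```nix
# hedged sketch: wire the Mozilla overlay into the nixpkgs set
[
  (import (builtins.fetchTarball
    "https://github.com/mozilla/nixpkgs-mozilla/archive/master.tar.gz"))
]
```

With that in place, something like `nix-env -iA nixpkgs.latest.rustChannels.stable.rust` should install the current stable toolchain (the attribute path is the one from the overlay's README, as far as I remember it).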

This is so awesome. Everybody should use the Rust overlay now, in my opinion. It's really convenient to use and gives you more flexibility with your Rust installation on NixOS. So why not?

tags: #nixos #nix #rust #software

When the last semester came to an end, I noticed that my Thinkpad behaved weirdly. I couldn't nix-store --optimise it, and some other things began to fail silently. I suspected the SSD was dying, a Crucial C400 with 256GB. So I ran the smart tools with a short test – but it told me everything was alright. Then I ran the extended self-test on the drive, and after 40% of the check (60% remaining) it told me about dead sectors, nonrecoverable.

So I got a new SSD and installed NixOS from my old installation. Here's how.

So I got a nice new Samsung EVO 850 PRO with 256 GB. I was really amazed how light these things are today. No heavy metal in there like in an HDD!

Preparation

First off, you need to prepare your current installation. Make some backups, and be sure everything is fine with them.

Then, verify that your configuration.nix and your hardware-configuration.nix list your partitions not by UUID, but by device path (/dev/sda1 and so on). That could be really helpful later.
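In a hardware-configuration.nix, the by-device variant looks roughly like this (device names and filesystem types are examples, not my actual setup):

```nix
fileSystems."/" = {
  device = "/dev/sda2";
  fsType = "ext4";
};
fileSystems."/boot" = {
  device = "/dev/sda1";
  fsType = "vfat";
};
```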

If you have some crypto keys you need to keep, maybe make another backup of them.

The installation

First off, we need to format the new drive. Use gdisk for this if you have a UEFI setup, like with a Thinkpad X220. Format your partitions after that. Make sure that your boot partition is formatted as vfat (fat16). I don't know why, but it is only possible to boot from vfat, according to the NixOS documentation. Also, do your cryptsetup.

For simplicity, I refer to the boot partition as /dev/sda1 and to the root/home partition as /dev/sda2 – you can, of course, have more partitions, maybe for a separate /home. But I saw no need for it. With only one partition I do not have to take care of the size of /nix/store, and if I have few things in the store I can grow my music collection a bit – so I'm really flexible. And yes, I know about LVM, but I really don't need these things, do I?

Now, mount the partitions as follows:

  • /dev/sda2 in /mnt
  • /dev/sda1 in /mnt/boot (you might need to mkdir this directory first)

Ensure things are properly mounted. This bit me twice during my installation, as /mnt/boot wasn't mounted properly and I failed to rebuild the system. It took me some time to see this, actually.

Now you can nixos-generate-config --root /mnt. After that you might want to modify your configuration.nix file in the newly generated setup under /mnt/etc/nixos/configuration.nix – I did not! I nixos-install --root /mnted to get a minimal bootable system.

Then I rsync -aed my /home/$USER to /mnt/home/ and symlinked the configuration.nix (which lives in /home/$USER/config/nixos/$HOSTNAME.nix on my machines) to /etc/nixos/configuration.nix. I renamed the host as well, to avoid confusion.

Then I shut the machine down, swapped in the new SSD, and booted. I had some problems with failing mounts during boot (because I had mount operations specified by UUID rather than via /dev/sdaX). I got a rescue shell and was able to fix things up. After several reboots I got my system up and running.

When I was able to boot my minimal installation, I just followed the manual and created my user and so on. Then, I nixos-rebuild switched. And because I copied my whole configuration.nix setup from my old drive, everything got built for me.

After some more nix-env -iA calls (because some things only live in my user environment), I had fully restored my system. Awesome!

Conclusion

Installing NixOS from NixOS works really nice. You have to be careful with some things, UUIDs and so on, but overall it is rather simple.

Anyway, you benefit if you really know your system. I wouldn't necessarily recommend this to an inexperienced NixOS user – hacking things into the TTYs and getting a rescue shell for fixing the installation is not something a newbie really wants to do – except for learning, and if backups are at hand!

Because of the awesomeness of NixOS and the configuration.nix file, I was able to rebuild my complete system within a few minutes. Despite my extensive adaptions in my configuration.nix file – speaking of container setup, custom compile flags for packages, custom vim setup with plugins compiled into the vim derivation (and the same again for neovim), hundreds of packages and stuff – I was able to rebuild my system without much effort.

Overall, leaving out the UUID fail, I think I would be able to redo a complete setup (including syncing /home, which was ~100GB of data, and reinstalling everything) in maybe 90 minutes, depending on how fast the internet connection is for downloading binaries.

One could even mount the old /nix/store from the old installation and copy over derivations, which would be a hell of a lot faster and would result in a reinstallation without the need for internet access. But I don't know how to do it, so I leave it as an exercise for the reader.

tags: #desktop #linux #nix #nixos