Anscombe's quartet and monitoring

Anscombe’s quartet is a collection of four data sets of points in the plane. When graphed, the data sets are obviously very different, but they have identical descriptive statistics. Newer variations on the same theme include the delightful Datasaurus dozen, which adds animations that deform one data set into another while keeping the descriptive statistics the same throughout.

The Datasaurus dozen

I’ve seen Anscombe’s quartet cited in discussions around monitoring and observability to assert that the average and standard deviation (two of the statistical tools used in the quartet) are not useful for monitoring and alerting; instead one should use percentiles. I agree with the conclusion but think it does not follow from the premise, and would like to offer some thoughts on the theme from the perspective of a reformed geometer.

To do so, I’d like to work out some examples involving the average and standard deviation. The data sets above are collections of points in the plane, but I’d like us to discuss data modeled on timeseries instead. The difference isn’t important; everything we say applies in both settings, but the geometry we want to point out is easier to see for timeseries.

Setup

We’ll model a portion of a timeseries as a tuple of \(n\) real numbers \((x_1, \ldots, x_n)\). We can think of these as \(n\) measurements recorded at one-second (or minute, hour, …​) intervals. The set of all possible such timeseries is familiar; it is the real vector space \(\mathbb{R}^n\) of dimension \(n\). (Timeseries can be added and subtracted, there is a zero-timeseries that serves as an identity, and we can multiply a timeseries by a real number.)

The main characters of our story are the average and standard deviation. If \(x = (x_1, \ldots, x_n)\) is a timeseries, its average is \[ \mu(x) = (x_1 + \cdots + x_n) / n \] and its standard deviation is defined by the scary formula[1] \[ \sigma(x) = \sqrt{ ( (x_1 - \mu(x) )^2 + \cdots + (x_n - \mu(x) )^2 ) / n }. \]

We can now look at some examples. In every one of these, we’ll fix the average and standard deviation to some values and look at the collection of timeseries that have those values. The first couple of examples are just warm-up.

n = 1

We first look at \(n = 1\), that is, timeseries that just have one value \(x = (x_1)\). For such a timeseries we have \(\mu(x) = x_1\) and \(\sigma(x) = 0\) so each timeseries is uniquely determined by its average and they all have the same standard deviation.

n = 2

The next case is \(n = 2\), of timeseries with two values \(x = (x_1, x_2)\). This is slightly more interesting; we have the average \(\mu(x) = (x_1 + x_2) / 2\) and the standard deviation \[ \begin{align} \sigma(x) &= \sqrt{( (x_1 - \mu(x) )^2 + (x_2 - \mu(x) )^2 )/2} \\ &= \sqrt{((x_1 - x_2)^2 + (x_2 - x_1)^2)/8} = |x_1 - x_2|/2. \end{align} \] The second equality here is by substituting the definition of \(\mu(x)\).

This is more interesting because different timeseries can have different averages and standard deviations. However, I claim that if we fix the values of both the average and the standard deviation, say to \(\mu\) and \(\sigma\), there are at most two timeseries that have those values.

To see this, observe that the formula for the average implies that \(x_2 = 2 \mu - x_1\), so once we fix the value of the average, the value of \(x_2\) is determined by the choice of \(x_1\). If we substitute this into the formula for the standard deviation we get \[ \sigma(x) = |x_1 - 2 \mu + x_1|/2 = |x_1 - \mu|. \] We can solve this equation by splitting into the cases \(x_1 \geq \mu\) and \(x_1 < \mu\). In the former we get \(x_1 = \mu + \sigma\) and in the latter we get \(x_1 = \mu - \sigma\). Notice that in both cases the value of \(x_1\) is pinned down by the average and standard deviation, and the value of \(x_2\) is determined by that of \(x_1\) and the average, so there are at most two timeseries with the given average and standard deviation.

n = 3

We now get into the first case that has a hint of the general picture. Our timeseries have three points, \(x = (x_1, x_2, x_3)\). We could write down equations like in the previous case and calculate, but it’s worth taking a step back to think about what’s going on.

Let’s say we have fixed the value of the average of our timeseries to a constant \(\mu\). The formula for the standard deviation of our series is then \[ \sigma(x) = \sqrt{((x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2)/3}. \] If we form the point \((\mu, \mu, \mu)\), we see that \[ \sigma(x) = \|(x_1, x_2, x_3) - (\mu, \mu, \mu)\|/\sqrt{3} \] is just a multiple of the Euclidean distance between our timeseries and the fixed point \((\mu, \mu, \mu)\). If we also fix \(\sigma(x) = \sigma\), the set of points that satisfy this equation is \[ S := \{ (x_1, x_2, x_3) \mid \|(x_1, x_2, x_3) - (\mu, \mu, \mu)\| = \sqrt{3} \sigma \}. \] This should look familiar: it is the sphere whose center is \((\mu, \mu, \mu)\) and whose radius is \(\sqrt{3}\sigma\).

Similarly, the set of timeseries whose average is equal to \(\mu\) is \[ P := \{ (x_1, x_2, x_3) \mid (x_1 + x_2 + x_3)/3 = \mu \}. \] For those who know a little linear algebra, we can write the condition as \((x_1, x_2, x_3) \cdot (1, 1, 1) = 3 \mu\), that is, the inner product with \((1, 1, 1)\) must be equal to \(3 \mu\). The set of points that satisfies such a condition is a plane, that is, a flat two-dimensional space.

We’re interested in the set of points where both of these conditions are true, which is the intersection \(S \cap P\). We could work out what that is by calculation, but it’s easiest to visualize it. The intersection of a sphere and a plane in three dimensions is either empty (when they don’t meet), a single point (when the plane is tangent to the sphere), or a circle that lies on the sphere.

In three dimensions, as long as we pick \(\sigma > 0\), the plane \(P\) passes through the center of the sphere \(S\), so there is a whole circle of timeseries that have that average and standard deviation. They form a kind of Anscombe set.
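
We can even write this circle down explicitly. The vectors \(u = (1, -1, 0)/\sqrt{2}\) and \(v = (1, 1, -2)/\sqrt{6}\) are orthonormal and their coordinates sum to zero, so for \(\sigma > 0\) the timeseries \[ x(\theta) = (\mu, \mu, \mu) + \sqrt{3}\,\sigma (\cos\theta \, u + \sin\theta \, v), \quad 0 \leq \theta < 2\pi, \] all have average \(\mu\) and standard deviation \(\sigma\), and every timeseries with those values is of this form.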

Higher dimensions

In higher dimensions, basically the same thing happens as in dimension 3. The set of timeseries that have a fixed average is a hyperplane, and the set of timeseries that lie at a fixed distance from the constant timeseries \((\mu, \ldots, \mu)\), which is what fixing the standard deviation amounts to, is a sphere. Their intersection, when it is not empty or a single point, is not as easily visualized, but it is a subspace of dimension \(n-2\). As \(n\) grows, this space grows as well, and in general the space of timeseries that have a fixed average and standard deviation is enormous.
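
In symbols, fixing the average to \(\mu\) singles out the hyperplane \[ P := \{ x \mid x_1 + \cdots + x_n = n \mu \}, \] and fixing the standard deviation to \(\sigma\) on top of that puts us on the sphere \[ S := \{ x \mid \|x - (\mu, \ldots, \mu)\| = \sqrt{n}\,\sigma \}, \] exactly as in the three-dimensional case.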

Other descriptive statistics

The final point I’d like to make, and one you’ll have to take my word for to some extent, is that there is nothing special about the average or standard deviation here. Very similar things will happen for every collection of statistical tools we pick.

Every statistical tool can be viewed as a function that takes a timeseries (or perhaps a couple of timeseries, which complicates things a little but not a lot) and returns a real number. These functions are very well behaved (continuous, differentiable, and so on).

Given a collection of such functions \(f_1, \ldots, f_k\) on the space of timeseries of \(n\) points, we can ask what the Anscombe set of fixed values of these functions looks like. That is, if we fix values \(y_1, \ldots, y_k\), can we say anything about the set \(\{x = (x_1, \ldots, x_n) \mid f_1(x) = y_1, \ldots, f_k(x) = y_k \}\)?

With some hand waving, we can actually say this: Fixing the value of a single tool \(f_j\) creates a hypersurface \(X_j = \{x \mid f_j(x) = y_j\}\) inside the space of timeseries, that is, a subspace of dimension one less than the ambient space. All timeseries in \(X_j\) will have the same value with respect to this tool. If the intersection of all of these \(X_1 \cap \cdots \cap X_k\) is not empty, it will generally be a subspace of dimension \(n - k\). If \(n\) is much bigger than \(k\), the dimension of this space, the space of timeseries that all have the same statistical values with respect to our tools, is enormous.

Making the hand waving precise and figuring out when this intersection will be non-empty is the domain of algebraic geometry over the real numbers. Saying anything at all about that is beyond our scope here.

Outro

I hope I have convinced you that the statement "the average and standard deviation are useless for monitoring because of Anscombe’s quartet" is not true. The moral of Anscombe’s quartet, which boils down to intersection theory in algebraic geometry, is that the space of data that fits any given collection of statistical instruments (including any collection of percentiles) is gigantic and we have to look at the data to make sense of it.

In general, percentiles are more valuable for analyzing performance or traffic data than the average and standard deviation. The reason is not Anscombe’s quartet, the robustness of percentiles, or anything to do with normal distributions, but that the average and standard deviation are fundamentally tools for answering the questions "what would this be if it were constant?" and "how far away is this from being constant?".[2] For performance and traffic analysis, we don’t care about either of those questions, but about peaks in demand and how we can handle them. We should pick tools that can help with that, but other tools have their place.


1. Some people divide by \(n-1\) in the formula for the standard deviation. I’m sure they have their reasons, but the factor of \(n\) is more natural for geometers. The reason involves interpreting timeseries as functions and is a bit of a detour from what we really care about.
2. Exactly how involves some linear algebra and this post was already long. If people care, I can write another one explaining how.

How Silicon Valley will solve the trolley problem

The trolley problem asks how to decide between the lives of people in two groups. At the moment, it comes up in our industry in discussions around self-driving cars: Suppose a car gets into a situation where it must risk injuring either its passengers or pedestrians; which ones should it prioritize saving?

Always choosing one group leads to suboptimal outcomes. If we always save the pedestrians, they may perform attacks on car riders by willfully stepping into traffic. If we always save the passengers, we may run over a pedestrian whose life happens to be of greater worth than that of the passengers.

The crux of the trolley problem is how we should make the choice of what lives to save. As with all hard problems without clear metrics for success, it’s best to solve this by having the participants decide this for themselves.

Every time we view a website, our friendly ad networks must decide what ads we should see. This is done by putting our user profile up for auction for prospective advertisers. During a handful of milliseconds, participants may inspect our profile and place a bid for our attention according to what they see.

The same technology could be trivially repurposed for deciding the trolley problem in the context of self-driving cars.

Assume the identity of every participant in the trolley scenario is known. Practically, we know the identity of the passengers; that of the pedestrians could be known if their phones broadcast a special short-range identification signal. An incentive for broadcasting such a signal could be that we would have to assume that a person without one were one of no means.

Given this information, a car about to be involved in a collision could take a few milliseconds to send the identities of the people involved to an auction service. Participants who had the foresight of purchasing accident insurance would have agents bidding on their behalf. The winner of the bid would be safe from the forthcoming accident, and their insurance would pay some sum to the car manufacturer post-hoc as a reward for conforming to the rules.

The greater a person’s individual wealth, the better insurance coverage and latencies they could purchase, and the more accidents they could expect to survive. This aligns nicely with the values of societies like the United States, where the worth of a person’s life is proportional to their wealth.

As this system lets self-driving car manufacturers off the hook for any decisions taken, and would require coordinated ethical action on the part of software engineers to not be implemented, we expect to see it in action in the world in short order.

Infosec: A board game

I’m pleased to announce the release of my new card-drawing game Infosec.

The rules of the game are simple. It is for two or more players. The player with the fewest friends is the Infosec Expert; the other players are the Coworkers.

To start, the Infosec Expert deals three cards face down from a standard deck of cards. The Coworker on the Infosec Expert’s right hand should draw one card.

"Which card?" the Coworker may ask.

"This is a simple phishing exercise," the Infosec Expert should reply. "Just pick a card."

"But they all look the same," the Coworker may object.

"Draw one. And get it right."

This exchange should go on in increasingly hostile tones until the Coworker agrees to draw a card. It will have been the wrong card. The Infosec Expert should inform the Coworker:

"You got phished. You moron. You fucking idiot. You’re such a goddamn waste of space and time. How could you have gotten this so wrong? Were you even trying? Answer me. What the fuck was that?"

Feel free to ad-lib along these lines, or draw as many cards as you want from the accompanying Admonishment Deck (expansion packs available). Include ad-hominem attacks and use as many personal details as you know. Interrupt the Coworker if they try to reply. Don’t hold back.

Once the Coworker is silent, the Infosec Expert should collect the cards, deal new ones, and proceed to the next Coworker in line.

The game ends with the Infosec Expert’s victory when all the Coworkers have left.

Backing up data like the adult I supposedly am

Like so many things I’m supposed to do but don’t — getting exercise, eating right, sleeping well, standing up for women and minorities in public spaces — backing up my data has always been something I’ve half-assed at best.

I’ve lugged around an external hard drive with a few hundred gigabytes of data for the last 10 years, and made backups to it once every three or four years or so. Every time I’ve tried restoring anything from those backups I’ve regretted it, because of course I just bought the drive, plugged it in and copied stuff to it, so it is a FAT32 drive while I have mostly had EXT4 filesystems, which means all my file permissions get lost during the process.

I’ve written shameful little shell scripts to set file permissions to 0644 and directory permissions to 0755, recursively, many many times.

Part of my problem was that I both know just enough rsync to be dangerous and have a credit card so I can provision cloud VMs, so forever just around the corner was my perfect backup solution that I’d write myself and maintain and actually do instead of dealing with whatever I had going on in my life. I’ve come to accept that this will never happen, or perhaps more definitively, that I’d rather cut myself than write and maintain another piece of ad-hoc software for myself.

Luckily I recently found two things that have solved this whole problem for me: borg and rsync.net.

Borg is backup software. It compresses and deduplicates data at the block level, and strongly encourages (but does not force) you to encrypt data before backing it up. It is everything I’d want from my half-assed rsync and shell script abomination.

I read its documentation a couple of times and was impressed. I then set about comparing different VM hosts to see which one would give me the cheapest block storage option, when a random Google search led me to rsync.net. They are a company that stores backups, pretty cheaply, and even more cheaply if you use borg to take them. I guess they just really love borg and want us to love it too.

I signed up for their cheapest plan, which starts at 100GB stored for $18 per year. They have no network in- or egress costs, and the storage amount can be adjusted at any time. Once my account had been activated, I did a little password reset dance, and uploaded a public SSH key.

I wanted to back up my $HOME directory, so after installing borg I ran:

export BORG_REMOTE_PATH="borg1"
borg init --encryption repokey-blake2 UID@ch-s011.rsync.net:home

This created a remote borg repository called "home" on rsync.net’s servers. The environment variable is so we use a more recent version of borg on the remote server (version 1.1.11 at the time of writing), as the default version is rather old (version 0.29.0).

When choosing what encryption method to use, one can choose between a "repokey" or a "keyfile". They both create a private key locked with a passphrase; the difference is that with "repokey" the key is stored in the borg repo, while with "keyfile" it is stored outside of it. This boils down to whether we think a passphrase is enough security for our data, or whether we think having a secret keyfile is necessary. I figured my password manager could create a strong enough passphrase for my needs, and I didn’t want to think about losing the keyfile, so I chose "repokey-blake2".

To create my first backup, I ran

borg create --exclude "$HOME/.cache" UID@ch-s011.rsync.net:home::backup-1 "$HOME"

which created the archive "backup-1" in my "home" borg repository. I didn’t change the compression algorithm from the default one.

By default borg compresses data with lz4. It can use other compression methods (xz, zlib, zstd). I compared their compression ratios on some binary files I had and found no difference between them. I think this is because the large binary files I have are mostly audio and video files in lossy formats, which don’t seem to benefit very much from further compression. I have a lot of text files as well, but text takes up so little relative space on today’s hardware that it makes no sense to spend CPU cycles on compressing it better than lz4 does.

This backup command hummed along for a good while, and through a couple of reboot cycles. Doing a second backup right after it finished (or the day after) took a lot less time because of the deduplication:

borg create --exclude "$HOME/.cache" UID@ch-s011.rsync.net:home::backup-2 "$HOME"

Restoring from backup is also easy:

borg extract UID@ch-s011.rsync.net:home::backup-2

I set this up to run as a daily timed systemd service at noon (very easy on NixOS, which every Linux user should be using unless they hate themselves), and will never, ever think about this again. For a handful of bucks a year, that is a good deal.
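
For reference, here is roughly what that unit looks like on NixOS. This is a minimal sketch rather than exactly what I run: it assumes my user is called "user", that the repo passphrase lives in a file only that user can read, and it reuses the UID@ch-s011.rsync.net:home repository from above.

systemd.services.borg-backup = {
  description = "Daily borg backup of /home/user";
  startAt = "12:00"; # creates the accompanying systemd timer
  path = [ pkgs.openssh ]; # borg shells out to ssh
  environment = {
    BORG_REMOTE_PATH = "borg1";
    BORG_PASSCOMMAND = "cat /home/user/.config/borg/passphrase";
  };
  serviceConfig = {
    User = "user";
    ExecStart = "${pkgs.borgbackup}/bin/borg create --exclude /home/user/.cache UID@ch-s011.rsync.net:home::backup-{now} /home/user";
  };
};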

Review of home manager

A common theme among the people who fall in love with Nix (or its cousin NixOS) is that they want it to manage everything for them. Finding out the limits of that capability and where it breaks down is part of everyone’s journey.

An obvious enough limit to Nix’s reach is secret management. It can easily handle public keys, whether SSH or other cryptographic ones, but private keys cannot be managed by its store, as the store is readable by all.

The promise of Nix is that we will never break our system by updating it. We can always roll back to a previous working version, once we have a working version at all. This is in contrast to, say, every other Linux distribution. I have borked Ubuntu, Arch, Fedora and others during system updates. Often the safest way to update the system is to back up /home and reinstall the OS.

From somewhere comes the idea that a user on a Linux system should be able to install their own programs and manage their own system services. (See Chris Wellons for how far you can run with this idea if your sysadmin gives you a C compiler.) This seems odd from a historical perspective. In a true multi-user system, the system administrators normally do not want users to be able to install arbitrary software or run services. On a modern "multi"-user system, where there is only a single user, avoiding system packages and services seems like some kind of theater.

Yet we do it anyway. An argument I sometimes make to myself is that this is a cleaner separation between what I need the system to do versus what I do on it. I may want to be able to run traceroute as root, but I don’t care about root being able to run the Go compiler.

NixOS has facilities to enforce this separation. It happily creates users and their home directories, and can fill in some of the bits that go there, like public SSH keys. It can install packages for specific users, and allow them to define systemd services. It will not manage the configuration of specific user packages (like dotfiles) without some coercing. One can presumably create custom derivations of, say, ZSH with all the configuration one wants, but who has the time?

Home manager wants to fill this gap and bring the power of Nix to user environment and configuration management. It lets individual users say what packages they want installed; what services they want run; and what configuration files should go where. On the surface it seems like something I should love, but after using it for a month I wrote it out of my system and now use plain NixOS.

I used Home manager for three things:

  1. Installing packages for my user.

  2. Scheduling services for my user.

  3. Installing configuration files for my software.

The first one was never a big attraction. Home manager lets us define a list of packages to install in home.packages. If we control the system configuration, we can achieve the same by defining those packages in users.users.$username.packages, as sketched below.
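
As a sketch, assuming a user called "alice" and a couple of arbitrary packages, the system-level version is a few lines of configuration.nix:

users.users.alice.packages = with pkgs; [
  firefox
  ripgrep
];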

If a user controls the system configuration (directly or through a sysadmin), they can also define their own user services via systemd.user. Home manager’s selling point is that it comes with a large list of already defined services that we can enable with a boolean flag, instead of having to write our own service configuration. This is admittedly nice. In the end, I found that learning the idiosyncrasies of each home manager service definition was a less valuable use of my time than learning how to define NixOS systemd services once and for all. The latter is after all where the former end up.
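
To make the comparison concrete, here is a rough sketch of both sides, using syncthing as the example service (the exact option names may differ between versions):

# Home manager: flip a switch and get a ready-made user service.
services.syncthing.enable = true;

# Plain NixOS: write the user unit ourselves.
systemd.user.services.syncthing = {
  description = "Syncthing for my user";
  wantedBy = [ "default.target" ];
  serviceConfig = {
    ExecStart = "${pkgs.syncthing}/bin/syncthing -no-browser";
  };
};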

As a long-time sufferer of dotfile management, I had high hopes for the third point. And indeed, home manager will manage dotfiles just fine. It can do this in two modes: it can generate a config file from various options we fill out, if someone has written a home manager module for the program we’re trying to configure, or it can plop a file on the system verbatim from a source. I used the latter, as I didn’t think learning a configuration language just to partially configure program dotfiles was a good idea.
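
The verbatim mode is a one-liner per file in the home manager configuration. A sketch, with made-up paths:

home.file.".gitconfig".source = ./dotfiles/gitconfig;
home.file.".config/alacritty/alacritty.yml".source = ./dotfiles/alacritty.yml;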

This works well, until we want to change anything in a dotfile. This experiment with home manager coincided with a regular low point in my life in which I try to use emacs. This comes with quite a lot of .emacs changes, as I use Lisp for the only thing it’s ever been good for: configuring a text editor in the most complicated way imaginable. Now, the dotfiles that home manager (or Nix) puts on our systems are read-only, so every change involves changing the source file and running home-manager switch. This seems like unnecessarily many steps, especially after I saw this brilliant Hacker News comment, which after a week of use is a much better solution to this problem.

All in all home manager is nice software. I can see it being useful for people who either don’t control the system they run on but want to use Nix in user mode to run their corner of it, or for those Nix users whom gamers (a well-adjusted group of humans if there ever was one) would call "filthy casuals", that is, people who just want things to work and don’t care very much about learning how to write enough Nix to make that happen.

I’m not included in those groups, as I run this system and explicitly want to learn to use Nix in anger so I can try and fail to convince people to run it in production at work. Home manager is fine software and if it makes you happy, then please use it.

Use ad hoc structs for command-line flags in Go

The path of least resistance to command-line flag parsing in Go is to use the flag package from the standard library. A lot of times the result looks like this:

func main() {
  help := flag.Bool("help", false, "HALP")
  frobinate := flag.Int("frobinate", 0, "Amount to frobinate by")
  blargalarg := flag.String("blargalarg", "", "Social media comment")
  // [713 variables later]
  flag.Parse()

  // Much later
  if *blargalarg != "" {
    // Do things. We may or may not remember what this variable is.
  }
}

That is, we have a bunch of variables lying around that we don’t really care about and that take up perfectly good names. If we see one of them later in the program, we don’t have any context on where it comes from, so we have to start jumping around in the source.

In my projects I’ve used a little accounting trick to hold these flags. I find it helps me deal with them. We just define an anonymous struct to hold the flags:

func main() {
  flags := struct{
    help *bool
    frobinate *int
    blargalarg *string
    // [713 field definitions]
  }{
    help: flag.Bool("help", false, "HALP"),
    frobinate: flag.Int("frobinate", 0, "Amount to frobinate by"),
    blargalarg: flag.String("blargalarg", "", "Social media comment"),
    // [713 field instantiations]
  }
  flag.Parse()

  // Much later
  if *flags.blargalarg != "" {
    // AAAAAH YES IT'S A FLAG
  }
}

If I really need to, I can pull the anonymous struct out into its own global variable or type definition or whatever, and pass it around as arguments to functions that deal with its contents. That is not as handy with a litter of flag variables. But really I just find that defining all these flags clearly in one place makes the program easier to read later once I’ve forgotten what it does.

Kleroteria

I got picked in Kleroteria a couple of months ago. This is my contribution.


Three years ago, my wife and I moved to Amsterdam. We wanted to have kids, and thought the Netherlands was a better place to do that than Mexico, where my wife is from and we lived.

We’ve both moved around a lot, and have picked up the languages of the places we’ve lived. That’s how we speak French and Spanish, along with English. Learning those was always a necessity; no one in France is going to speak English voluntarily, and my in-laws don’t speak the best English so I had to learn Spanish.

So it was a change to come to the Netherlands. Everyone here speaks perfect English. They also have no patience for your attempts at Dutch pronunciation or verb conjugation. The second they hear you’re not from here, they switch. On paper this is great, but it means that after three years here I still don’t speak Dutch. I can never get anyone to have a whole conversation with me in it.

A couple of weeks after we came here, we got pregnant. We had a boy, who has grown up into a little man who loves cars and trains, makes funny faces at the dinner table, and keeps trying to show our dog his books. ("Pancho! See!", but Pancho doesn’t care.)

We speak to him each in our own language, and by now he understands what we say. He’s been going to daycare here since he was five months old, where they speak Dutch, so he understands that as well. Kids who grow up with more than one language start speaking later, which can be frustrating for everyone. He knows what he wants to say, but can’t figure out which words to use, and we have to keep guessing at sounds while other parents don’t.

He’s finally speaking now. He tells us everything he wants and knows. In Dutch. Every time. Every word comes out in Dutch. And we still don’t really speak it. It’s not really any less frustrating for anyone. There’s a lot of guessing at pronunciation and Google translate.

But what do you know. I did finally get someone to speak Dutch to me.

Bug-fixing checklist

It is OK not to do all of these things for a given bug. Some bugs are trivial, or nontrivial to reproduce or write tests for. But you should explicitly decide not to do some of these things for any given bug, and be able to explain why.

  1. Is there a (possibly old, closed) issue for the bug?

  2. Can you reproduce the bug manually?

  3. Can you write a regression test for the bug?

  4. Did you check that your change actually fixes the problem?

  5. Does the fix’s commit message explain the bug, the fix, and point to the issue?

It’s debatable whether checking for old or closed issues is the responsibility of the developer fixing the bug or a project manager who triages the backlog. Sometimes the second person doesn’t exist, but the job should still be done.

Remember that a bug report is a report of a symptom. A bug fix is directed at a cause. A software symptom may have more than one cause, making bug fixes that say "This fixes issue X" very optimistic about what they claim to achieve.

Exercise moderation and common sense in all things. Both 0% and 100% test coverage are awkward places to live in. Sometimes the information or effort needed to write a test make it too expensive to do so. However, when writing a test is cheap and easy or can be made so without causing harm, there should be a good reason not to do it.

Gitsplorer

Have you ever found yourself using Git and thinking "This is great, but I wish these filesystem operations were read-only and ten times slower?". Well, friend, do I have news for you.

A while ago I wanted to compare the API of a Golang codebase at different times in its history. To do this, I figured I’d clone the repo, check out commit A and analyze it, then check out commit B and analyze that, and boo! Hiss! That’s inelegant and leaves clutter that needs to be cleaned up all around the disk. There has got to be a better way!

Once I made a Git commit hash miner because I wanted to race it against a coworker to see who could get a commit with more leading zeros into a frequently used repository at work. That had the side effect of teaching me some about Git’s internals, like what its objects are (blobs, trees, commits, tags) and how they fit together. I figured that if I could convince the Golang AST parser to read the Git database instead of the filesystem, I could do what I wanted in a much better way.

Alas, doing that would have required monkey-patching the Go standard library, and I don’t want to hunt down every system call it ends up making to be sure I got them all. However, Git is famously a content-addressable filesystem, so what if we just made a filesystem that points to a given commit in a repo and pointed the parser at that?

This turns out to be pretty easy to do by combining libgit2 and libfuse. We use the former to read objects in the Git repository. (The objects are easy to read by hand, until you have to read packed objects. That’s doable, but a bit of a distraction in what is already quite the distraction.) We then use the latter to create a very basic read-only filesystem. In the end, we have a read-only version of git checkout that writes nothing to disk.

I put a prototype of this together in Python, because I’m lazy. It’s called gitsplorer and you should absolutely not use it anywhere near a production system. It scratches my itch pretty well, though. In addition to my API comparisons (which I still haven’t got to), I do sometimes want to poke around the state of a repository at a given commit and this saves me doing a stash-checkout dance or reading the git worktree manpage again.

For fun, and to see how bad of an idea this was, I came up with a very unscientific benchmark: we check out the Linux kernel repository at a randomly selected commit, run Boyter’s scc line-counting tool, and check out master again. We do this both with gitsplorer and with ye olde git checkout. The results speak for themselves:

git checkout: 62 seconds
gitsplorer: 567 seconds

The gitsplorer version is also remarkable for spending all its time using 100% of a CPU, which the git version does not. (It uses around 90% of a CPU while doing the checkouts, then all of my CPUs while counting lines. The Python FUSE filesystem is single-threaded, so beyond Python being slow it must also be a point of congestion for the line counting.) I did some basic profiling of this with the wonderful profiler Austin, and saw that the Python process spends most of its time reading Git blobs. I think, but did not verify, that this is because libgit2 decompresses the contents of the blobs on every such call, while most of the reads we make are in the FUSE getattr call where we are only interested in metadata about the blob. I made no attempts to optimize any of this.

So, friends, if you’ve ever wished git checkout was read-only and 10 times slower than it is, today is your lucky day.

Lyttle Lytton 2020

My Lyttle Lytton entry for 2020:

"Actually, you do like this," maverick CEO Eric Davies, Ph.D., insisted as he pulled my foreskin back and cunnilingussed my pee hole.

I’m not exactly proud of it, but I’m glad it’s no longer in my head.