Gitsplorer

Have you ever found yourself using Git and thinking, "This is great, but I wish these filesystem operations were read-only and ten times slower"? Well, friend, do I have news for you.

I was thinking about a project the other day that involved comparing the public API of a Golang codebase at different points in its history. To do this, I figured I’d clone the repo, check out commit A and analyze it, then check out commit B and analyze that, and boo! Hiss! That’s inelegant and leaves clutter all over the disk that needs to be cleaned up. There has got to be a better way!

I once made a Git commit hash miner because I wanted to race a coworker to see who could get a commit with more leading zeros into a frequently used repository at work. That had the side effect of teaching me a bit about Git’s internals, like what its objects are (blobs, trees, commits, tags) and how they fit together. I figured that if I could convince the Golang AST parser to read the Git database instead of the filesystem, I could do what I wanted in a much better way.

Alas, doing that would have required monkey-patching the Go standard library, and I don’t want to hunt down every system call it ends up making to be sure I got them all. However, Git is famously a content-addressable filesystem, so what if we just made a filesystem that points to a given commit in a repo and pointed the parser at that?

This turns out to be pretty easy to do by combining libgit2 and libfuse. We use the former to read objects in the Git repository. (The objects are easy to read by hand, until you have to read packed objects. That’s doable, but a bit of a distraction in what is already quite the distraction.) We then use the latter to create a very basic read-only filesystem. In the end, we have a read-only version of git checkout that writes nothing to disk.

I put a prototype of this together in Python, because I’m lazy. It’s called gitsplorer and you should absolutely not use it anywhere near a production system. It scratches my itch pretty well, though. In addition to my API comparisons (which I still haven’t got to), I do sometimes want to poke around the state of a repository at a given commit and this saves me doing a stash-checkout dance or reading the git worktree manpage again.

For fun, and to see how bad of an idea this was, I came up with a very unscientific benchmark: We check out the Linux kernel repository at a randomly selected commit, run Boyter’s scc line-counting tool, and check out master again. We do this both with gitsplorer and with ye olde git checkout. The results speak for themselves:

git checkout: 62 seconds
gitsplorer: 567 seconds

The gitsplorer version is also remarkable for spending all its time using 100% of a single CPU, which the git version does not. (The git version uses around 90% of a CPU while doing the checkouts, then all of my CPUs while counting lines. The Python FUSE filesystem is single-threaded, so beyond Python being slow, it must also be a bottleneck for the line counting.) I did some basic profiling of this with the wonderful profiler Austin, and saw that the Python process spends most of its time reading Git blobs. I think, but did not verify, that this is because libgit2 decompresses the contents of a blob on every such call, while most of the reads we make come from the FUSE getattr call, where we are only interested in metadata about the blob. I made no attempt to optimize any of this.

So, friends, if you’ve ever wished git checkout was read-only and 10 times slower than it is, today is your lucky day.

Lyttle Lytton 2020

My Lyttle Lytton entry for 2020:

"Actually, you do like this," maverick CEO Eric Davies, Ph.D., insisted as he pulled my foreskin back and cunnilingussed my pee hole.

I’m not exactly proud of it, but I’m glad it’s no longer in my head.

Not made for this world

It’s hot tonight, and this is going to be mostly the whiskey talking while I wait for it to get cool enough to sleep. I’ve killed the mosquitos I’ve seen, however, so until then I have only Haskell to keep me company.

This morning, tef tweeted about monads, which sent the Haskell pack his way with barks of not getting it. Just now, pinboard was reminded of some guy’s rage against Esperanto, from back in the 90’s when the web was fun and mostly devoted to things like explaining how "The Downward Spiral" is a concept album or destroying Unix instead of each other’s mental health.

For the Haskell pack: I did a PhD in the kind of math that necessitates a lot of category theory, I have looked at your use of category theory, and I judge it to be unnecessary and pretentious, mainly focused on making you look smart while being entirely trivial. But this is not that kind of blog post, the kind that gets so tangled up in whether category theory is useful that it never gets to the point. (If nothing else, Pijul proves that category theory is useful.) We’re here to discuss how Haskell as a whole is nonsense if you’re not an academic. Our claim is that Haskell is a useless language for writing software that has users.

Our point is simple, and focused on IO. We propose that you can measure how user-facing a program or language is by how much of its time it spends doing, or worrying about, IO. That is, after all, the medium through which anyone who is not the program’s author (of whom there may be many) interacts with the program. The IO can happen on the command line, via a GUI, over a network, or wherever; but to be a serious contender for user-facing programs, a language has to make IO easy.

C is a terrible language for most new things today. Anyone writing new software in C that they expect to be used by anyone other than thoroughly vetted people needs to be able to explain why they’ve chosen C. At the same time, a lot of us are still exposed to C through the BSD or Linux kernels and syscalls, the undying popularity of K&R, random software on the internet, or other vectors. The culture around C invented the modern language textbook, K&R, and the modern user-facing program, "Hello world", both of which spend most or all of their time dealing with IO to talk to you or other users.

I claim that making IO as simple as possible to do, which C does for all its faults, is analogous to making it as simple as possible for other people to talk to you in a designed language.[1] Esperanto shows you can fail at that goal, if you even had it, for it favors sounds native to European languages above others. Likewise, Haskell shows you can fail at the goal of making IO easy, if you even had it, for it does not.

Haskell is a purely functional, lazily evaluated language, with a type system. As tef explains, that is great, until you run into IO. Up until that point, you could rearrange computations in any order you liked, if they needed to be done at all. As soon as you need to do IO, though, you need something to happen before another thing, which makes you very unhappy if you’re Haskell. It in fact makes you so unhappy that you’ll drag the entire lost-at-sea community of category theorists into the orbit of your language just so you can have an abstraction for doing IO that fits into your model of the world. This abstraction, monads, then comes with the added benefit of being abstract enough that all of your programmers can spend their time explaining it to each other instead of writing programs that use the abstraction to do IO, and therefore deal with any actual users.

Haskell is where programmers go to not have users.


1. One could say that what I mean is something more like making FFI as easy as possible, but that’s missing the point and would just move this discussion to some other, less inflammatory, level that we’re keen to avoid.

Curvature as a bilinear form

Here’s something I thought of when I couldn’t sleep last night.

The curvature tensor of a Kähler metric can be viewed as a Hermitian form on \(\bigwedge^{1,1} T_X^*\) by mapping \(\operatorname{End} T_X \to \bigwedge^{1,1} T_X^*\) via the metric.

If we’re on a compact Kähler manifold with zero first Chern class, then for each Kähler class \(\omega\) and \((1,1)\)-classes \(u, v\), we can pick the Ricci-flat metric in \(\omega\) and the harmonic representatives of \(u, v\). If \(R\) is the curvature tensor of the metric, viewed as a Hermitian form, we can then set \[ b(u,v)(\omega) := \int_X R(u, v) \, dV_{\omega}. \] This defines a smooth bilinear form \(b\) on the tangent space of the Kähler cone of \(X\).

Besides being fun times, can we say anything interesting about \(b\)? For example, what is its norm with respect to the Riemannian metric on the Kähler cone, or its trace with respect to that metric? Can we integrate it over some subset of the cone?

Choices

I only have time to work on the arXiv project so often, so I’m taking a lot of time between sessions to think about what I’m doing. When I look at my system design notes, I feel like all the decisions I’m making are the obvious choices, but they also were not at all obvious when I started thinking about them. It’s a good feeling.

Queuesim

I haven’t made a lot of progress on my projects. I did create a Scaleway VM and shoved an OAI harvester on there that’s happily downloading the arXiv’s backlog of metadata. I can also parse the XML it fetches, and have some ideas about how I’m going to store it.

This lack of progress mostly comes from me being nerd-sniped into thinking about bounded queues under load. My old work project wanted to use a FIFO queue to hold its requests. That is a bad idea, because FIFO queues perform very poorly under load, as I went a little overboard in demonstrating.

Funnily enough, that very simple thing is one of my most popular GitHub projects ever.

Shame

To motivate me to actually finish some of the projects I start, I’m going to try announcing them to the world. Shame me into finishing these:

  • A nicer arXiv frontend for daily new additions.

  • A modern typesetting of Beauville’s Surfaces algébriques complexes.

Shame. Shame. Shame.

Request-level configuration invariance in Go

Suppose we have an HTTP service. The behaviour of the service depends on configuration that may change at runtime; it may reload a static configuration file on SIGHUP, need to react to changes in its service discovery mechanism, or have A/B test state or features toggled on and off.

In Go, a naive way of handling this is to keep our configuration state in a package-level struct and update it from background goroutines:

type State struct {
    frobinate bool
}

var state State

func handler(w http.ResponseWriter, r *http.Request) {
    if state.frobinate {
        fmt.Fprintln(w, "Great success")
    } else {
        fmt.Fprintln(w, "Great non-success")
    }
}

This naive approach has at least two problems:

  1. The state may change while we’re processing a request, causing us to process part of the request with one state, and another part with another. This isn’t a big deal in our example, but becomes more of a problem as the time needed to handle a request increases.

  2. There are no synchronization primitives in play, so updating the state from one goroutine while reading it from another is a data race.

Check out the working example in this commit to see the first problem in action.
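If you just want the flavour of it, a toy handler along these lines shows the same effect (this is a sketch, not the code in that commit; it assumes the naive global state above and a time import):

func slowHandler(w http.ResponseWriter, r *http.Request) {
    // Read the flag, pretend to do slow work, then read it again. If the
    // state flips while we sleep, part of the request was handled under one
    // configuration and part under another.
    before := state.frobinate
    time.Sleep(5 * time.Second)
    after := state.frobinate

    if before != after {
        fmt.Fprintln(w, "configuration changed mid-request")
    }
}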

One way to resolve these problems is to add a mutex and to pass copies of the state to the request handlers:

type State struct {
    frobinate bool

    *sync.Mutex
}

// Copy may be arbitrarily complicated if State contains slices, maps,
// pointers, or other structs.
func (s *State) Copy() State {
    s.Lock()
    defer s.Unlock()

    return *s
}

// The embedded mutex is a pointer, so it needs to be initialized explicitly.
var state = &State{Mutex: &sync.Mutex{}}

func handler(w http.ResponseWriter, r *http.Request) {
    s := state.Copy()

    if s.frobinate {
        fmt.Fprintln(w, "Great success")
    } else {
        fmt.Fprintln(w, "Great non-success")
    }
}

The background goroutine that updates the state then either does so through a dedicated method that locks/unlocks the mutex, or does the locking itself.
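For concreteness, the updating side could look something like the following sketch. The Toggle method and the SIGHUP wiring are illustrative, not part of the code above, and the os, os/signal, and syscall imports are assumed:

// Toggle flips frobinate under the lock, so concurrent Copy calls never see
// a half-applied update.
func (s *State) Toggle() {
    s.Lock()
    defer s.Unlock()

    s.frobinate = !s.frobinate
}

func main() {
    // Toggle the configuration whenever we receive SIGHUP.
    hup := make(chan os.Signal, 1)
    signal.Notify(hup, syscall.SIGHUP)

    go func() {
        for range hup {
            state.Toggle()
        }
    }()

    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}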

While this works fine, it relies on global state and uses none of Go’s built-in concurrency features. We were promised a brave new world, and encouraged to "share memory by communicating". This points the way to another solution to our two problems that leverages Go’s primitives better:

type State struct {
    frobinate bool
}

// Copy may be arbitrarily complicated if State contains slices, maps,
// pointers, or other structs.
func (s State) Copy() State {
    return s
}

var stateCh = make(chan State)
var toggle = make(chan struct{})

// stateManager owns the state; run it in its own goroutine (go stateManager())
// before serving requests.
func stateManager() {
    state := State{}

    for {
        select {
        case stateCh <- state.Copy():

        case <-toggle:
            state.frobinate = !state.frobinate
        }
    }
}

func handler(w http.ResponseWriter, r *http.Request) {
    s := <-stateCh

    if s.frobinate {
        fmt.Fprintln(w, "Great success")
    } else {
        fmt.Fprintln(w, "Great non-success")
    }
}

A complete working example is here. Note that the working example doesn’t rely on any global state to pass state to the handler functions, using closures instead. Modifying the state is also only possible within the stateManager function. The working example could also be extended to have middleware copy the state to each request-processing function, instead of the ad hoc way it’s done there.
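The closure version amounts to something like this sketch; the newHandler name is mine, and the linked example may differ in the details:

// newHandler binds the state channel into the handler via a closure, so no
// package-level variables are needed.
func newHandler(stateCh <-chan State) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        s := <-stateCh

        if s.frobinate {
            fmt.Fprintln(w, "Great success")
        } else {
            fmt.Fprintln(w, "Great non-success")
        }
    }
}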

It is still up to the developers to ensure the Copy method doesn’t pass mutable state around, but they no longer need to deal with mutexes and locks themselves. This also means that any future additions to the program that use the same pattern won’t need to worry about those locks either.
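As a sketch of what "arbitrarily complicated" can mean here, suppose State grows a map of feature flags (the featureFlags field is hypothetical); Copy then has to clone the map, or callers end up sharing mutable state again:

type State struct {
    frobinate    bool
    featureFlags map[string]bool // hypothetical field, for illustration
}

func (s State) Copy() State {
    c := s

    // The shallow copy above shares the map's backing storage, so clone it.
    c.featureFlags = make(map[string]bool, len(s.featureFlags))
    for k, v := range s.featureFlags {
        c.featureFlags[k] = v
    }

    return c
}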

There are no silver bullets in heavily concurrent systems, but in Go we can choose to not deal with some of the footguns we would need to handle in C, Java or other similar languages.