Gitsplorer

20 October, 2019

Have you ever found yourself using Git and thinking "This is great, but I wish these filesystem operations were read-only and ten times slower?". Well, friend, do I have news for you.

API of a Golang codebase at different times in its history. To do this, I figured I’d clone the repo, check out commit A and analyze it, then check out commit B and analyze that, and boo! Hiss! That’s inelegant and leaves clutter that needs to be cleaned up all around the disk. There has got to be a better way!

Once I made a Git commit hash miner because I wanted to race it against a coworker to see who could get a commit with more leading zeros into a frequently used repository at work. That had the side effect of teaching me some about Git’s internals, like what its objects are (blobs, trees, commits, tags) and how they fit together. I figured that if I could convince the Golang AST parser to read the Git database instead of the filesystem, I could do what I wanted in a much better way.

Alas, doing that would have required monkey-patching the Go standard library, and I don’t want to hunt down every system call it ends up making to be sure I got them all. However, Git is famously a content-addressable filesystem, so what if we just made a filesystem that points to a given commit in a repo and pointed the parser at that?

This turns out to be pretty easy to do by combining libgit2 and libfuse. We use the former to read objects in the Git repository. (The objects are easy to read by hand, until you have to read packed objects. That’s doable, but a bit of a distraction in what is already quite the distraction.) We then use the latter to create a very basic read-only filesystem. In the end, we have a read-only version of git checkout that writes nothing to disk.

I put a prototype of this together in Python, because I’m lazy. It’s called gitsplorer and you should absolutely not use it anywhere near a production system. It scratches my itch pretty well, though. In addition to my API comparisons (which I still haven’t got to), I do sometimes want to poke around the state of a repository at a given commit and this saves me doing a stash-checkout dance or reading the git worktree manpage again.

For fun, and to see how bad of an idea this was, I came up with a very unscientific benchmark: We checkout the Linux kernel repository at a randomly selected commit, run Boyter’s scc line-counting tool, and checkout master again. We do this both with gitsplorer and with ye olde git checkout. The results speak for themselves:

git checkout: 62 seconds
gitsplorer: 567 seconds

The gitsplorer version is also remarkable for spending all its time using 100% of a CPU, which the git version does not. (It uses around 90% of a CPU while doing the checkouts, then all of my CPUs while counting lines. The Python FUSE filesystem is single-threaded, so beyond Python being slow it must also be a point of congestion for the line counting.) I did some basic profiling of this with the wonderful profiler Austin, and saw that the Python process spends most of its time reading Git blobs. I think, but did not verify, that this is because libgit2 decompresses the contents of the blobs on every such call, while most of the reads we make are in the FUSE getattr call where we are only interested in metadata about the blob. I made no attempts to optimize any of this.

So, friends, if you’ve ever wished git checkout was read-only and 10 times slower than it is, today is your lucky day.