Kleroteria

I got picked in Kleroteria a couple of months ago. This is my contribution.


Three years ago, my wife and I moved to Amsterdam. We wanted to have kids, and thought the Netherlands was a better place to do that than Mexico, where my wife is from and we lived.

We’ve both moved around a lot, and have picked up the languages of the places we’ve lived. That’s how we speak French and Spanish, along with English. Learning those was always a necessity; no one in France is going to speak English voluntarily, and my in-laws don’t speak the best English so I had to learn Spanish.

So it was a change to come to the Netherlands. Everyone here speaks perfect English. They also have no patience for your attempts at Dutch pronunciation or verb conjugation. The second they hear you’re not from here, they switch. On paper this is great, but it means that after three years here I still don’t speak Dutch. I can never get anyone to have a whole conversation with me in it.

A couple of weeks after we came here, we got pregnant. We had a boy, who has grown up into a little man who loves cars and trains, makes funny faces at the dinner table, and keeps trying to show our dog his books. ("Pancho! See!", but Pancho doesn’t care.)

We speak to him each in our own language, and by now he understands what we say. He’s been going to daycare here since he was five months old, where they speak Dutch, so he understands that as well. Kids who grow up with more than one language start speaking later, which is can be frustrating for everyone. He knows what he wants to say, but can’t figure out what words to use, and we have to keep guessing at sounds while other parents don’t.

He’s finally speaking now. He tells us everything he wants and knows. In Dutch. Every time. Every word comes out in Dutch. And we still don’t really speak it. It’s not really any less frustrating for anyone. There’s a lot of guessing at pronunciation and Google translate.

But what do you know. I did finally get someone to speak Dutch to me.

Bug-fixing checklist

It is OK not to do all of these things for a given bug. Some bugs are trivial, or nontrivial to reproduce or write tests for. But you should explicitly decide not to do some of these things for any given bug, and be able to explain why.

  1. Is there a (possibly old, closed) issue for the bug?

  2. Can you reproduce the bug manually?

  3. Can you write a regression test for the bug?

  4. Did you check that your change actually fixes the problem?

  5. Does the fix’s commit message explain the bug, the fix, and point to the issue?

It’s debatable whether checking for old or closed issues is the responsibility of the developer fixing the bug or a project manager who triages the backlog. Sometimes the second person doesn’t exist, but the job should still be done.

Remember that a bug report is a report of a symptom. A bug fix is directed at a cause. A software symptom may have more than one cause, making bug fixes that say "This fixes issue X" very optimistic about what they claim to achieve.

Exercise moderation and common sense in all things. Both 0% and 100% test coverage are awkward places to live in. Sometimes the information or effort needed to write a test make it too expensive to do so. However, when writing a test is cheap and easy or can be made so without causing harm, there should be a good reason not to do it.

Gitsplorer

Have you ever found yourself using Git and thinking "This is great, but I wish these filesystem operations were read-only and ten times slower?". Well, friend, do I have news for you.

API of a Golang codebase at different times in its history. To do this, I figured I’d clone the repo, check out commit A and analyze it, then check out commit B and analyze that, and boo! Hiss! That’s inelegant and leaves clutter that needs to be cleaned up all around the disk. There has got to be a better way!

Once I made a Git commit hash miner because I wanted to race it against a coworker to see who could get a commit with more leading zeros into a frequently used repository at work. That had the side effect of teaching me some about Git’s internals, like what its objects are (blobs, trees, commits, tags) and how they fit together. I figured that if I could convince the Golang AST parser to read the Git database instead of the filesystem, I could do what I wanted in a much better way.

Alas, doing that would have required monkey-patching the Go standard library, and I don’t want to hunt down every system call it ends up making to be sure I got them all. However, Git is famously a content-addressable filesystem, so what if we just made a filesystem that points to a given commit in a repo and pointed the parser at that?

This turns out to be pretty easy to do by combining libgit2 and libfuse. We use the former to read objects in the Git repository. (The objects are easy to read by hand, until you have to read packed objects. That’s doable, but a bit of a distraction in what is already quite the distraction.) We then use the latter to create a very basic read-only filesystem. In the end, we have a read-only version of git checkout that writes nothing to disk.

I put a prototype of this together in Python, because I’m lazy. It’s called gitsplorer and you should absolutely not use it anywhere near a production system. It scratches my itch pretty well, though. In addition to my API comparisons (which I still haven’t got to), I do sometimes want to poke around the state of a repository at a given commit and this saves me doing a stash-checkout dance or reading the git worktree manpage again.

For fun, and to see how bad of an idea this was, I came up with a very unscientific benchmark: We checkout the Linux kernel repository at a randomly selected commit, run Boyter’s scc line-counting tool, and checkout master again. We do this both with gitsplorer and with ye olde git checkout. The results speak for themselves:

git checkout: 62 seconds
gitsplorer: 567 seconds

The gitsplorer version is also remarkable for spending all its time using 100% of a CPU, which the git version does not. (It uses around 90% of a CPU while doing the checkouts, then all of my CPUs while counting lines. The Python FUSE filesystem is single-threaded, so beyond Python being slow it must also be a point of congestion for the line counting.) I did some basic profiling of this with the wonderful profiler Austin, and saw that the Python process spends most of its time reading Git blobs. I think, but did not verify, that this is because libgit2 decompresses the contents of the blobs on every such call, while most of the reads we make are in the FUSE getattr call where we are only interested in metadata about the blob. I made no attempts to optimize any of this.

So, friends, if you’ve ever wished git checkout was read-only and 10 times slower than it is, today is your lucky day.

Lyttle Lytton 2020

My Lyttle Lytton entry for 2020:

"Actually, you do like this," maverick CEO Eric Davies, Ph.D., insisted as he pulled my foreskin back and cunnilingussed my pee hole.

I’m not exactly proud of it, but I’m glad it’s no longer in my head.

Not made for this world

It’s hot tonight, and this is going to be mostly the whiskey talking while I wait for it to get cool enough to sleep. I’ve killed the mosquitos I’ve seen, however, so until then I have only Haskell to keep me company.

This morning, tef tweeted about monads, which sent the Haskell pack his way with barks of not getting it. Just now, pinboard was reminded of some guy’s rage against Esperanto, from back in the 90’s when the web was fun and mostly devoted to things like explaining how "The Downward Spiral" is a concept album or destroying Unix instead of each other’s mental health.

For the Haskell pack, I did a PhD in the type of math that necessitates a lot of category theory, and I have looked at your use of category theory, and judged it to be unnecessary and pretentious and mainly focused on making you look smart while being entirely trivial. But this is not that kind of blog post, one that gets too tangled up in whether category theory is useful to get to the point. (If nothing else, Pijul proves that category theory is useful.) We’re here to discuss how Haskell as a whole is nonsense if you’re not an academic. Our claim is that Haskell is a useless language for writing software that has users.

Our point is simple, and focused on IO. We propose that you can measure how user-facing a program or language is by measuring how much of its time it spends or worries about doing IO. That is, after all, the medium through which anyone who is not a program’s author (of which there may be many) will interact with the program. The time spent doing IO can be on the command line, via a GUI, over a network, or wherever; but to be a serious contender for user-facing programs, a language has to make IO be easy.

C is a terrible language for most new things today. Anyone writing new software in C, that they expect to be used by other than thouroughly vetted people, needs to be able to explain why they’ve chosen C. At the same time, a lot of us are still exposed to C through the BSD or Linux kernels and syscalls, the undying popularity of K&R, random software on the internet, or other vectors. The culture around C invented the modern language textbook, K&R, and the modern user-facing program, "Hello world", both of which spend most or all of their time dealing with IO to talk to you or other users.

I claim making IO as simple as possible, which C does for all its faults, to do is analogous to trying to making it as simple as possible for other people to talk to you in designed languages as you can.[1]. Esperanto shows you can fail at that goal, if you even had it, for it favors sounds native to European languages above others. Likewise, Haskell shows you can fail at the goal of making IO easy, if you even had it, for it does not.

Haskell is a purely functional, lazily evaluated language, with a type system. Like tef explains, that is great, until you run into IO. Up until that point, you could rearrange computations in any order you liked, if they needed to be done at all. As soon as you need to do IO, though, you need something to happen before another thing, which makes you very unhappy if you’re Haskell. It in fact makes you so unhappy that you’ll drag the entire lost-at-sea community of category theorists into the orbit of your language just so you can have an abstraction for doing IO that fits into your model of the world. This abstraction, monoids, then comes with the added benefit of being abstract enough that all of your programmers can spend their time explaining it to each other instead of writing programs that use the abstraction to do IO, and therefore deal with any actual users.

Haskell is where programmers go to not have users.


1. One could say that what I mean is something more like making FFI as easy as possible, but that’s missing the point and would just move this discussion to some other, less inflammatory, level that we’re keen to avoid.

Curvature as a bilinear form

Here’s something I thought of when I couldn’t sleep last night.

The curvature tensor of a Kähler metric can be viewed as a Hermitian form on \(\bigwedge^{1,1} T_X^*\) by mapping \(\operatorname{End} T_X \to \bigwedge^{1,1} T_X^*\) via the metric.

If we’re on a compact Kähler manifold with zero first Chern class, then for each Kähler class \(\omega\) and \((1,1)\)-classes \(u, v\), we can pick the Ricci-flat metric in \(\omega\) and the harmonic representatives of \(u, v\). If \(R\) is the curvature tensor of the metric, viewed as a Hermitian form, we can then set \[ b(u,v)(\omega) := \int_X R(u, v) \, dV_{\omega}. \] This defines a smooth bilinear form \(b\) on the tangent space of the Kähler cone of \(X\).

Besides being fun times, can we say anything interesting about \(b\)? For example, what is its norm with respect to the Riemannian metric on the Kähler cone, or its trace with respect to that metric? Can we integrate it over some subset of the cone?

Choices

I only have time to work on the arXiv project so often, so I’m taking a lot of time between sessions to think about what I’m doing. When I look at my system design notes, I feel like all the decisions I’m making are the obvious choices, but they also were not at all obvious when I started thinking about them. It’s a good feeling.

Queuesim

I haven’t made a lot of progress on my projects. I did create a Scaleway VM and shoved an OAI harvester on there that’s happily downloading the arXiv’s backlog of metadata. I can also parse the XML it fetches, and have some ideas about how I’m going to store it.

This lack of progress mostly comes from me being nerd-sniped into thinking about bounded queues under load. My old work project wanted to use a FIFO queue to hold its requests. That is a bad idea, because FIFO queues perform very poorly under load, as I went a little overboard in demonstrating.

Funnily enough, that very simple thing is one of my most popular Github projects ever.

Shame

To motivate me to actually finish some of the projects I start, I’m going to try announcing them to the world. Shame me into finishing these:

  • A nicer arXiv frontend of daily new additions.

  • A modern typesetting of Beauville’s Surfaces algebriques complexes.

Shame. Shame. Shame.