The tech blog of David Adam, recovering Wheel member of the University Computer Club and committer on fish shell.

SQLite: ANALYZE this

tldr: an ANALYZE statement reduced a complicated query's runtime by 99.9%

I've been investigating some health informatics data, including the taxonomy system SNOMED CT and the Australian derivatives, SNOMED CT-AU and the Australian Medicines Terminology. One of the challenges of digital health in Australia is that it can take weeks and lots of signed paperwork to get access to development resources and data, even when they're available free of charge.

While waiting for my Ontoserver licensing to come through, I found the SNOMED CT MySQL sample scripts from the Australian Digital Health Agency and thought they would be a good starting point for some exploration.

Unfortunately, I have a significant aversion to MySQL/MariaDB. I've spent too many weekends restoring it with a text editor and a toothpick (extremely 90s sysadmin reference), and discovered too many surprising landmines in the default feature set. Plus, the usual MySQL cluster I use was offline.

I decided I'd port the scripts to the world's most widely-deployed database, SQLite. It's super easy to set up, allows you to iterate fast and is really well documented.

Other than a bunch of MySQL footguns (proving once again that it is the C of databases), like foreign key targets not having to be unique, and some limitations due to SQLite not supporting user-defined SQL functions, the scripts translated into SQLite fairly easily and produced a 2 GB database.

However, a monster statement to turn the hyper-normalised data into something actually usable as a data source was taking six hours. In 2021! Well, on a 2016 laptop (Core i5 1.6 GHz). So I shifted it all to a faster VM with a Xeon E5-2690, though still from 2016, and it took... five and a half hours.

A bit of digging around in the EXPLAIN QUERY PLAN (extremely 90s Australian politics reference) showed nothing obvious, but a look through the SQLite Query Optimizer Overview (SQOOO?) identified one opportunity:

To help the optimizer get a more accurate estimate of the work involved in using various indexes, the user may optionally run the ANALYZE command... after making significant changes it might be prudent to rerun ANALYZE.

As I'd just inserted a few million rows into various tables, this seemed like it might help. Four seconds to run ANALYZE, and the large query time dropped from six hours to fifteen seconds, an approximately 99.93% reduction.
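
The fix itself is a one-liner in the sqlite3 shell. Here's a minimal sketch; the query is a hypothetical stand-in for the real monster, and the table names don't match the actual SNOMED CT schema:

    -- gather statistics on tables and indexes (stored in sqlite_stat1),
    -- which the query planner uses to estimate index selectivity
    ANALYZE;

    -- check what the planner intends to do now
    EXPLAIN QUERY PLAN
    SELECT d.term
    FROM descriptions d
    JOIN concepts c ON c.id = d.conceptid
    WHERE c.active = 1;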

I've sometimes thought about SQLite as a "drop in and don't think about it" database, but it's clear that there are lots of little tweaks available to change the very conservative defaults.

Other useful commands have included PRAGMA journal_mode=WAL and the .timer interactive command.
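
For reference, both are tiny. The journal mode is persistent - it's stored in the database file - so it only needs to be set once, while .timer is a property of the interactive session:

    -- write-ahead logging: readers no longer block the writer,
    -- and bulk inserts spend far less time waiting on fsync
    PRAGMA journal_mode=WAL;
    .timer on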

Now the total runtime to create the database is about fifteen minutes, and I am tempted to try and improve it further, but for something I'm not going to do frequently or use long-term I should probably just enjoy a coffee while it runs.

Distributing fish shell

I do some work on fish shell, a command-line shell for UNIX-like platforms including Linux, OS X and Windows.

Although we don't have a release manager as such, I've done the bulk of the work for the last couple of releases, which included an urgent security fix and a major update. I also babysit the continuous delivery of "nightly" builds.

This is an attempt to snapshot our current distribution process. I spent a long car ride trying to explain this to a friend, and it is a testament both to his brilliance and the size of Australia that he understood it.

Like most open source projects, we distribute a source tarball. Everyone who distributes software eventually comes to the conclusion that having every computer in the world compile from source is not terribly efficient, so we also try and get compiled binary packages into the hands of our users.

Linux platforms

I've been a Linux sysadmin in earnest since about 2005, when I joined the Wheel group of the University Computer Club, so I have a reasonable amount of experience with various distributions.

Most Linux distributions do not make it easy to distribute your own software. This isn't necessarily a criticism - projects like Debian specifically promise to their users that they will make software available in a secure and well-maintained way, and they certainly do their best with this Sisyphean task.

As an "upstream", you can just wait for the maintainers of the distribution to pick up your source tarball, massage it into the highly variable requirements of the distribution and get it shipped. Our most awesome maintainers join the mailing list, chime in on GitHub and do their best to get any required patches upstream. The bigger distributions try and help us to help them by providing documents like Debian's upstream guide and Fedora's packaging guidelines.

Unfortunately there are other maintainers who don't have the time to get fully involved, and often the first we hear of fish being available in a distribution is when we get a problem report from a downstream user. The one thing I would love distribution maintainers to do is to email the mailing list (or edit our wiki) when they start packaging fish. That way we can get in touch when new versions are released (or forewarn them), keep an eye on patches that need to go upstream, and even directly ask if we have a question about their distribution.

However, relying on our distribution maintainers isn't enough. For many big distributions, the latest and greatest release can't or won't be shipped, but there are still users who are keen to use it.

There are two key challenges in this area: package formats and distribution channels.

As far as I can tell, the process of starting a new distribution goes like this:

  1. Decide to start a new distribution.
  2. Write a new package manager.

If you are thinking of making a Linux distribution, I implore you, please don't write a new package manager. The existing frontends are incredibly mature but still have problems, and you will inevitably run into issues that Debian or Red Hat solved in 1997.

If you are determined to write a new package manager, please consider at least using the existing package formats (dpkg or RPM) and just writing your own frontend. Both formats are versatile and have a lot more depth than you might realise.

We ship both dpkg and RPM packages. It's definitely a lot easier to create a spec file for RPMs, although the process is somewhat hampered by the lack of an up-to-date authoritative reference.
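
To illustrate, a spec file for a simple package fits on one screen. This is a minimal sketch, not our real spec - the version, build dependencies and file list are placeholders:

    # illustrative skeleton only - fields below are placeholders
    Name:           fish
    Version:        2.3.0
    Release:        1%{?dist}
    Summary:        Friendly interactive shell
    License:        GPLv2
    URL:            https://fishshell.com/
    Source0:        https://fishshell.com/files/%{version}/fish-%{version}.tar.gz
    BuildRequires:  ncurses-devel

    %description
    fish is a smart and user-friendly command-line shell.

    %prep
    %setup -q

    %build
    %configure
    make %{?_smp_mflags}

    %install
    %make_install

    %files
    %{_bindir}/fish
    %{_datadir}/fish/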

Creating dpkgs is very intimidating at first, because there are about a thousand files to edit before anything works, but this becomes more straightforward with practice. The other problem is that a Debian-based system is basically mandatory for the various Perl scripts and debhelper packages that are required for sane building, which is why most of our release management is done from a Debian machine.
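
For the curious, the minimum viable debian/ directory looks something like this (roles paraphrased; real packages usually carry several more files):

    debian/
        changelog    # version history; the top entry sets the package version
        control      # source and binary package metadata, including dependencies
        copyright    # licensing details, in the format Debian policy requires
        rules        # an executable makefile that drives the build via debhelper
        compat       # the debhelper compatibility level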

The days of the holy wars between dpkg and RPM seem to be over, and really there's only one thing I've found that you can do with dpkg that you can't do with RPM: alternative dependencies - that is, marking a package as requiring X or Y. We need this because we try and use the same package metadata for all our target platforms, which is irritating when some provide the hostname program in the net-tools package and others split it out into a package of its own. You can depend on filesystem paths instead, but some distributions ship it as /bin/hostname and others as /usr/bin/hostname, which runs into the same problem.
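
In a debian/control file, the hostname example above comes out as a one-liner (package names illustrative, since the split varies by distribution):

    # either package satisfies the dependency; apt prefers the first by default
    Depends: net-tools | hostname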

The second problem is distribution channels. Most distributions require a repository reachable over HTTP with packages cryptographically signed and with a whole bunch of metadata available. Here, we are enormously lucky to have two free services available.

openSUSE's Open Build Service (OBS) makes this much easier. It allows you to upload a source package and will build binary packages in dpkg and RPM formats for all the most popular distributions, and it also provides repository hosting and a reasonable front-end to help users install them.

We also use a Launchpad Personal Package Archive to ship similar dpkgs for Ubuntu. This seems like overkill - after all, OBS can generate and host Ubuntu packages - but the user experience for Ubuntu users is much better with Launchpad and generally more familiar.

However, installing via either of these methods still requires cutting and pasting relatively complex command lines into a superuser shell (except on openSUSE, which has a one-click install link). This isn't ideal - fish is supposed to be attractive to people who aren't comfortable at the shell. We could ship an installer package that understands how to add the distribution channel, mark it as trusted, update the package lists and download the new package (Chrome does this), but I've always been nervous - we don't really have the resources to test this correctly. There's probably room for a framework or similar project to automate this and get it right.
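
To make the complaint concrete, this is roughly what we ask a Debian user to paste into a root shell today. The URL follows OBS's naming scheme but is illustrative; the exact one varies by project and release:

    # add the repository, trust its signing key, then install
    echo 'deb http://download.opensuse.org/repositories/shells:/fish/Debian_8.0/ /' \
        > /etc/apt/sources.list.d/fish.list
    wget -qO - http://download.opensuse.org/repositories/shells:/fish/Debian_8.0/Release.key | apt-key add -
    apt-get update
    apt-get install fish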

One distribution we don't do our own packages for is Arch Linux. This is deliberate; OBS can produce Arch packages, but some Arch users told me a couple of years ago not to bother, as they have an active maintainer and are committed to always shipping the newest release. I'd be happy to turn the Arch switch back on at some stage, but there hasn't been any demand.

OS X

Unlike Linux, I am not comfortable with OS X. This is not for lack of trying - I've been using OS X since 10.0. I bought my first Mac last year and it had beachballed, requiring a hard reset, within ten minutes of being powered up. I don't know what happens, but whenever I get near OS X it stops working. Couple that with the fact that Apple "hate sysadmins" (according to the professional Mac sysadmins I know) and certainly seem to have some disdain for backward compatibility, and we are off to a bit of a rocky start. Luckily we have a few committed and brilliant contributors who do most of the wrangling for OS X.

We ship two kinds of packages for OS X, and we actively promote Homebrew as a third option.

The first kind is a self-contained .app bundle. Drag-and-drop installation is the greatest strength of the OS X ecosystem. However, it gives users minimal control over the way the shell is launched (forcing the use of Apple Terminal rather than the better iTerm) and doesn't allow for third-party integration with shared or standard paths.

The second package is a .pkg installer. This is a genuine Windows 3.1 experience - no way to uninstall, no way of seeing what's installed, no sensible update mechanism. I believe that's partly our fault, although being relatively new to OS X I haven't managed to look into it further.

Finally, there's Homebrew, the dominant free software installation tool for OS X. Homebrew's history is a great example of the dangers of reinventing the wheel, but it is moving impressively fast. Five years ago it was at the level of FreeBSD ports circa 2002; with the addition of binary packages and an increasingly sensible approach to repository management, there are fewer and fewer features I miss from APT.

Homebrew's maintenance model is a little unusual - there is no designated maintainer for a given formula, which means that we have had a couple of issues with less experienced packagers submitting updates that then don't work as intended, leading to confused users. I'm trying to be more proactive in having the changes to Homebrew submitted as soon as I publish the release, which will hopefully avoid this problem in the future.

Other UNIXen

We don't release official packages for platforms like FreeBSD or its cousins, although fish is available in repositories for FreeBSD, OpenCSW for Solaris, DragonFly BSD, and NetBSD. Partly this is just down to a lack of resources - OBS doesn't support any of these platforms, none of the fish committers use them on a daily basis in anger, and it's not immediately obvious how to support a third-party repository for them.

I don't think fish even builds properly on Solaris at the moment, and since it went x86-only I've really lost all interest.

Windows

Various versions of Windows have technically been POSIX-compliant over the years, but really the only option for getting fish working has been Cygwin, a terrifying reimplementation of large parts of Linux on top of Windows. The fact that it works at all is amazing, and we are blessed with an active and knowledgeable maintainer, but Cygwin continuously produces odd edge cases, will probably never work properly with Unicode, and requires a lot of fallback implementations in fish. Once again, there's no obvious way for us to build and ship our own packages for Cygwin should the maintainer ever disappear.

The landscape is changing enormously with the arrival of the bizarrely-titled BashOnWindows implementation. Despite the name, fish works basically perfectly from the Ubuntu packages that we are already committed to producing, and I suspect that will become the dominant method of installing fish on Windows in the future.

The ideal world

"Complaining is what I do."

"Complaining is all you do."

(Bill Bryson, Notes from a Big Country)

There are a number of factors at play in getting our software into the hands of users and I recognise that there's no easy answer.

In terms of the user experience, what we'd like is a one-click install method. Whether we build the binaries or not is largely a moot point, although I think it's clear that we can't expect our downstream maintainers to be constantly on the ball.

However, being able to provide an alternative distribution channel is important, as it also lets us ship beta and continuous-delivery builds via the standard package manager. I'd definitely encourage platform developers to consider how they can help users understand the implications of adding a third-party software source, and then make it straightforward to proceed.

Of course, if you're interested in helping us make fish integrate into your platform of choice, we'd be glad to hear from you!
