<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Jeff Stafford</title><link>https://jstaf.github.io/</link><description>Recent content on Jeff Stafford</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Jeff Stafford</copyright><lastBuildDate>Sun, 02 Jul 2023 00:00:00 +0000</lastBuildDate><atom:link href="https://jstaf.github.io/index.xml" rel="self" type="application/rss+xml"/><item><title>Red Hat licensing changes and the long, slow death of a community</title><link>https://jstaf.github.io/posts/rhel-death-of-a-community/</link><pubDate>Sun, 02 Jul 2023 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/rhel-death-of-a-community/</guid><description>If you weren&amp;rsquo;t already aware, the Linux community is currently up in a kerfuffle about Red Hat&amp;rsquo;s latest licensing changes. To summarize:
Red Hat makes an operating system called Red Hat Enterprise Linux (RHEL), and until a week or two ago, also published the source code publicly in the spirit of free and open source software. Red Hat makes money by selling RHEL subscriptions and support. Throughout the years, other organizations have republished RHEL for free (CentOS, Oracle Linux, Alma Linux, Rocky Linux, etc.</description><content>&lt;p>If you weren&amp;rsquo;t already aware, the Linux community is currently up in a
kerfuffle about Red Hat&amp;rsquo;s latest licensing changes. To summarize:&lt;/p>
&lt;ul>
&lt;li>Red Hat makes an operating system called Red Hat Enterprise Linux (RHEL), and until a
week or two ago, also published the source code publicly in the spirit of free
and open source software.&lt;/li>
&lt;li>Red Hat makes money by selling RHEL subscriptions and support.&lt;/li>
&lt;li>Throughout the years, other organizations have republished RHEL for free
(CentOS, Oracle Linux, Alma Linux, Rocky Linux, etc.).&lt;/li>
&lt;li>In an effort to kill the republished copies of RHEL, Red Hat will no longer
publish its source code publicly in a way that&amp;rsquo;s easy to rebuild. Instead, it
now publishes the source to CentOS Stream, a &amp;ldquo;testing OS&amp;rdquo; that is very similar
to RHEL, but not quite compatible with software built for it.&lt;/li>
&lt;/ul>
&lt;p>Anyhow, the community is upset, because it means RHEL as a free and open-source
product will likely no longer be available. Yes, you can still pay for RHEL.
Yes, you can get a free developer subscription covering up to 16 systems, as
long as you don&amp;rsquo;t use them for anything that could possibly earn you money. Yes, you can still get a different
free product with CentOS Stream. Yes, Red Hat&amp;rsquo;s changes
&lt;a href="https://sfconservancy.org/blog/2023/jun/23/rhel-gpl-analysis/">&lt;em>might&lt;/em>&lt;/a> even be
legal, even if they no longer follow the spirit of the GPL license that makes
open-source software possible.&lt;/p>
&lt;h2 id="but-this-sucks">But this sucks.&lt;/h2>
&lt;p>If you are a sysadmin, developer, or just anyone who works with computers
professionally, you&amp;rsquo;re going to learn a lot of &amp;ldquo;stuff&amp;rdquo; over the course of your
career. I&amp;rsquo;ll speak personally from my experience as a sysadmin here (call it
&amp;ldquo;devops&amp;rdquo;, &amp;ldquo;sre&amp;rdquo;, &amp;ldquo;platform engineer&amp;rdquo;, or whatever the job title of the month is,
it&amp;rsquo;s the same job). This &amp;ldquo;stuff&amp;rdquo; you need to learn includes operating systems
(like RHEL), programming languages, infrastructure-as-code tools, and other more
esoteric stuff like Kubernetes, cloud providers, databases, and more.&lt;/p>
&lt;p>This is a lot of stuff to learn, but over the course of your career, you&amp;rsquo;ll
build a &amp;ldquo;stack&amp;rdquo; of tools you&amp;rsquo;re intimately familiar with and can take with you
between jobs. It&amp;rsquo;s a very empowering feeling, and these skills are
basically your career and job mobility. You can take them anywhere and do
anything a computer can possibly do - for free!&lt;/p>
&lt;p>As a sysadmin, pretty much the first step of building this stack is picking an
operating system and becoming intimately familiar with it. Right now there are
sort of 3 major Linux OS ecosystems that people choose from for work:&lt;/p>
&lt;ul>
&lt;li>Red Hat land: RHEL, Fedora, and the RHEL clones (Alma Linux, Rocky Linux,
Oracle Linux, CentOS Stream).&lt;/li>
&lt;li>Debian and friends: Debian, Ubuntu, and Pop!_OS&lt;/li>
&lt;li>SUSE: OpenSUSE Tumbleweed, OpenSUSE Leap, and SLES&lt;/li>
&lt;/ul>
&lt;p>&lt;img alt="Linux world map" src="https://jstaf.github.io/images/linux-world-map-large.png#centre">&lt;/p>
&lt;p>There are a couple common elements here:&lt;/p>
&lt;ul>
&lt;li>Each landmass (er, ecosystem?) has a fast-moving, high-quality desktop OS you
can use on your laptop that&amp;rsquo;s similar to the ones you&amp;rsquo;ll use on servers. This
isn&amp;rsquo;t at issue here, so I&amp;rsquo;ll gloss over it.&lt;/li>
&lt;li>Each OS &amp;ldquo;ecosystem&amp;rdquo; has one OS with long-term support (2 years or more). This
is essential for businesses because once you have a lot of servers, it&amp;rsquo;s just
impractical to be upgrading the OS all the time.&lt;/li>
&lt;li>No licensing restrictions. All of these ecosystems have the ability for you to
deploy as many copies of an OS as you want, for whatever use case you want,
for free.&lt;/li>
&lt;/ul>
&lt;p>The last one is important, and is what&amp;rsquo;s at stake here in this case for RHEL and
its related tooling. We need to be honest here - no one wants to pay for Linux
itself. If we wanted to piss away hundreds of dollars a year for each server
we&amp;rsquo;d all be using Windows.&lt;/p>
&lt;p>Red Hat&amp;rsquo;s goal here is to convert &amp;ldquo;freeloaders&amp;rdquo; like myself (and basically all
the previous companies I&amp;rsquo;ve worked at) into paying customers by taking away the
ability to use their ecosystem for free. Red Hat has tried to do damage control
on all this and
&lt;a href="https://www.redhat.com/en/blog/red-hats-commitment-open-source-response-gitcentosorg-changes">the response by their &amp;ldquo;core systems VP&amp;rdquo; just makes things look even worse&lt;/a>:&lt;/p>
&lt;blockquote>
&lt;p>[&amp;hellip;] we have determined that there isn’t value in having a downstream
rebuilder.&lt;/p>
&lt;p>The generally accepted position that these free rebuilds are just funnels
churning out RHEL experts and turning into sales just isn’t reality. I wish we
lived in that world, but it’s not how it actually plays out. Instead, we’ve
found a group of users, many of whom belong to large or very large IT
organizations, that want the stability, lifecycle and hardware ecosystem of
RHEL without having to actually support the maintainers, engineers, writers,
and many more roles that create it. These users also have decided not to use
one of the many other Linux distributions.&lt;/p>
&lt;/blockquote>
&lt;p>Red Hat believes that the existence of Alma Linux and Rocky Linux is
cannibalizing sales of RHEL subscriptions. &amp;ldquo;Every Alma Linux and Rocky Linux
install is a lost sale! Maybe if we destroyed all of the rebuilds, all of the
people using them would buy RHEL instead? The community will be &lt;em>so eager&lt;/em> to
reward Red Hat by buying subscriptions to our products now that the alternatives
don&amp;rsquo;t exist anymore, right?&amp;rdquo;&lt;/p>
&lt;p>Wait - &amp;ldquo;buy subscriptions&amp;rdquo;?&lt;/p>
&lt;p>&lt;img alt="&amp;ldquo;We&amp;rsquo;re pirates, we don&amp;rsquo;t even know what that means!&amp;rdquo;" src="https://jstaf.github.io/images/hondo.png#centre">&lt;/p>
&lt;p>Red Hat has completely missed &lt;em>why&lt;/em> people use their software. No one cares
about the support subscriptions. No one cares about &amp;ldquo;Red Hat Enterprise Linux&amp;rdquo;
or &amp;ldquo;Red Hat&amp;rdquo; at all.&lt;/p>
&lt;h2 id="red-hats-actual-product-is-its-community">Red Hat&amp;rsquo;s actual product is its community.&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>As a software developer&lt;/strong> (either free or paid), RHEL-based OSes are an
attractive target to build for because they have a large number of users and
businesses using them (read: potential customers). If you built for Ubuntu
and RHEL, that basically covered 95% of Linux-based businesses in North
America with money.&lt;/li>
&lt;li>&lt;strong>As an IT professional&lt;/strong>, the dominance of RHEL-based OSes meant that
learning RHEL was a good investment of your time: the more businesses that
used it, the more attractive you&amp;rsquo;d be to employers and you&amp;rsquo;d be able to start
contributing to a company that used RHEL or a RHEL-clone on day one (instead
of needing to learn a new OS every time).&lt;/li>
&lt;li>&lt;strong>As a business&lt;/strong>, selecting a RHEL-based OS was attractive: there was a large
community which meant that most bugs got reported and fixed by community
members before you ran into them. There was lots of free+paid software
available with guides written by the community, so you didn&amp;rsquo;t have to install
any other OSes just to install some weird piece of software you needed. RHEL
skills were also relatively common, so you&amp;rsquo;d be able to hire knowledgeable
people and spend less time training people from scratch. RHEL and its clones
were sponsored by a company that appeared to be healthy and profitable so you
knew the OS wasn&amp;rsquo;t going to suddenly implode and you&amp;rsquo;d have to spend effort
jumping ship.&lt;/li>
&lt;li>&lt;strong>As for Red Hat and IBM itself&lt;/strong>, more users of free RHEL clones meant that
there&amp;rsquo;d be more chances to sell Red Hat and IBM&amp;rsquo;s other, much more valuable
software offerings. For instance, when I worked in supercomputing with Queen&amp;rsquo;s
University and Compute Canada, we were all rabid CentOS users that saw
absolutely zero value in RHEL, but we were more than happy to shell out
hundreds of thousands of dollars each year for GPFS (now IBM Spectrum Scale),
Tivoli Storage Manager (now IBM Spectrum Protect), and IBM and Lenovo&amp;rsquo;s servers
and hardware support. IBM made so much money off of us as CentOS users that they
bought my boss and me a free vacation to Vegas one year to go to their
conference and do lines with our account manager (that last part is a joke, my
old boss and I aren&amp;rsquo;t cool enough to get invited to those kinds of parties).&lt;/li>
&lt;/ul>
&lt;p>All of this community value from using RHEL is based on sheer numbers. The more
users there are, the more developers will write software for it. The more
software there is for RHEL, the more users there will be. The more stable this
community appears, the more likely businesses and professionals will invest in
it and stay long-term. The larger the community, the more chance for Red Hat
(and IBM) to sell whatever products they had. All of the other work with RHEL
that Red Hat claims is so valuable is just a bonus. Ubuntu, SUSE, Debian,
and Amazon &lt;a href="https://aws.amazon.com/linux/amazon-linux-2023/">(yes, Amazon)&lt;/a> tick
all the same checkboxes that RHEL does - the biggest factor keeping customers in
the RHEL ecosystem is the community that&amp;rsquo;s sprung up around it.&lt;/p>
&lt;p>The &amp;ldquo;community as the product&amp;rdquo; is particularly evident with another Red Hat
product: Ansible. Ansible is an automation tool that&amp;rsquo;s commonly used to
automatically configure servers and perform common operations tasks. Though the
Ansible software itself is nifty and you
&lt;a href="https://docs.ansible.com/ansible/latest/collections/index_module.html">can do a mind-blowing amount of shit with it&lt;/a>,
the real value of using Ansible is actually its community, specifically
community-generated &amp;ldquo;Ansible roles&amp;rdquo;. For the uninitiated, Ansible roles are neat
little self-contained bundles of Ansible code that set up a server to do advanced
things without you actually needing to know the specifics of how to do these
things yourself.&lt;/p>
&lt;ul>
&lt;li>Want to configure a Postgres server, but know nothing about Postgres?
&lt;a href="https://github.com/geerlingguy/ansible-role-postgresql">Blam - done.&lt;/a>&lt;/li>
&lt;li>Need to back up a server, but not sure where to start?
&lt;a href="https://github.com/roles-ansible/ansible_role_restic">This stranger has your back!&lt;/a>&lt;/li>
&lt;li>Need to pass a compliance audit or get official government certification for
something?
&lt;a href="https://github.com/ansible-lockdown/">There&amp;rsquo;s an entire company that does nothing but write server hardening Ansible roles to help you pass these type of audits.&lt;/a>&lt;/li>
&lt;/ul>
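&lt;p>To make this concrete, here&amp;rsquo;s a minimal sketch of what using one of these
roles looks like. This assumes geerlingguy&amp;rsquo;s PostgreSQL role installed from
Ansible Galaxy; the database and user names below are made up for illustration:&lt;/p>
&lt;pre>&lt;code># First: ansible-galaxy install geerlingguy.postgresql
# playbook.yml
- hosts: db_servers
  become: true
  vars:
    # Variables documented in the role's README
    postgresql_databases:
      - name: myapp
    postgresql_users:
      - name: myapp_user
        password: changeme
  roles:
    - geerlingguy.postgresql
&lt;/code>&lt;/pre>
&lt;p>One &lt;code>ansible-playbook playbook.yml&lt;/code> later and you have a working
Postgres server, without ever touching a Postgres config file yourself.&lt;/p>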
&lt;p>By using Ansible, you get to use all this work these people are giving back to
Red Hat and the community for free. If this community disappeared, the entire
point of using Ansible (and Ansible&amp;rsquo;s commercial value as a product) would
disappear overnight. And that&amp;rsquo;s what&amp;rsquo;s happening &lt;em>right now&lt;/em>.&lt;/p>
&lt;p>For better or worse, many FOSS software ecosystems have a cult of personality
that revolves around a single, ultra-productive community member. For instance
Linus Torvalds basically was Linux incarnate for several decades. The R
programming language revolves around &lt;a href="https://hadley.nz/">Hadley Wickham&lt;/a>. He
wrote so many amazing R packages it created an entire data science ecosystem
called the &amp;ldquo;Hadleyverse&amp;rdquo; (he asked to change the name to &amp;ldquo;tidyverse&amp;rdquo; because he
wanted to be modest). Ansible has one of these people too: Jeff Geerling - also
known as &amp;ldquo;geerlingguy&amp;rdquo;.&lt;/p>
&lt;p>&lt;a href="https://galaxy.ansible.com/geerlingguy">geerlingguy &lt;em>is&lt;/em> Ansible.&lt;/a> I would
conservatively estimate that &amp;gt;50% of the good (as in, you&amp;rsquo;d actually want to use
these instead of writing your own) Ansible roles on Ansible Galaxy are written
by him directly. He literally
&lt;a href="https://www.ansiblefordevops.com/">wrote the book&lt;/a> on Ansible. I&amp;rsquo;ve even used
&amp;ldquo;Do you know who geerlingguy is?&amp;rdquo; as an interview question - if someone doesn&amp;rsquo;t
know who he is, it&amp;rsquo;s obvious they&amp;rsquo;ve never spent any serious time with Ansible
(this question also absolutely fucks with people trying to use ChatGPT and read
off the screen to fake it during job interviews. &lt;em>Yes&amp;hellip; we&amp;rsquo;re on to you.&lt;/em>).&lt;/p>
&lt;p>Not only have Red Hat&amp;rsquo;s latest moves
&lt;a href="https://www.jeffgeerling.com/blog/2023/im-done-red-hat-enterprise-linux">alienated their largest contributor&lt;/a>,
he&amp;rsquo;s gone scorched-earth and
&lt;a href="https://github.com/geerlingguy/ansible-role-docker/commit/635061e0a44e94e7c855f45f96364f98af645fc9">begun actively removing support for RHEL from all of his Ansible roles&lt;/a>.
Jeff Geerling is now advertising on Twitter about just how easily you can use
Red Hat&amp;rsquo;s own Ansible product to migrate off of RHEL using the cross-platform
tooling he&amp;rsquo;s written (I&amp;rsquo;d link to the tweet, but Twitter is offline). RHEL&amp;rsquo;s
vendor lock-in isn&amp;rsquo;t an issue when you have automated tools like Ansible to
reproduce your RHEL servers on another distribution like Debian in a matter of
minutes.&lt;/p>
&lt;p>This is a disaster for Red Hat. And it&amp;rsquo;s not an isolated incident: EPEL
maintainers are
&lt;a href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/Q6LJZKB24D3IQZ7AMKO35NW6VIWENEK2/">leaving&lt;/a>
&lt;a href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/F35UTCBYF25XRE2HX32UEIRVZGMAXIBO/">in&lt;/a>
&lt;a href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/XSFQ5RFNRINM6RGEDEIUYDEAYP23XE66/">droves&lt;/a>.
The last comment is particularly telling about how most contributors to
Red Hat&amp;rsquo;s products are feeling right now:&lt;/p>
&lt;blockquote>
&lt;p>Why should we keep contributing to EPEL? To be forced to use 16 free RHEL
instances maximum? What is the advantage for us volunteer contributors? I
mean, we did not do it for personal advantage, we did it to help us each other
within the Enterprise Linux distros community, but this Red Hat move will kill
the Enterprise Linux distros community, leaving only with RHEL, which is
mostly a paid subscription distribution, let&amp;rsquo;s call things with their proper
name&lt;/p>
&lt;/blockquote>
&lt;p>These free EPEL contributors are essential to Red Hat&amp;rsquo;s business model. Without
them, Red Hat would lose a majority of the software it needs to compete with
Ubuntu and Debian&amp;rsquo;s massive software catalog. Despite being a multi-billion
dollar corporation, Red Hat has never had the resources to maintain all of this
software by itself. In an unrelated incident,
&lt;a href="https://www.theregister.com/2023/06/07/red_hat_drops_libreoffice/">RHEL also failed to retain it&amp;rsquo;s LibreOffice maintainer, and will stop shipping LibreOffice as a result.&lt;/a>
This leaves RHEL somewhat perplexingly as the only &amp;ldquo;enterprise&amp;rdquo; Linux
distribution without an enterprise office suite. Lennart Poettering, another
hyper-productive developer (responsible for PulseAudio and systemd) actually
&lt;a href="https://www.phoronix.com/news/Systemd-Creator-Microsoft">left Red Hat last year to go work at Microsoft.&lt;/a>
Red Hat even
&lt;a href="https://www.phoronix.com/news/Fedora-PM-Red-Hat-Laid-Off">fired the Fedora Program Manager&lt;/a>
who managed the upstream Fedora distribution that Red Hat repackages and
rebrands as RHEL itself. Red Hat&amp;rsquo;s most valuable staff and community
contributors are either being fired or jumping ship.&lt;/p>
&lt;p>Intentionally or not, Red Hat seems to be doing everything it can to destroy the
community that makes RHEL a product you&amp;rsquo;d want to consider purchasing in the
first place.&lt;/p>
&lt;hr>
&lt;h2 id="a-failure-to-monetize-and-having-your-skillset-put-behind-a-paywall">A failure to monetize, and having your skillset put behind a paywall&lt;/h2>
&lt;p>Let&amp;rsquo;s pretend that Red Hat is somehow successful in killing all of its
downstream rebuilders (Alma Linux, Oracle Linux, Rocky Linux, etc.). RHEL has an
extremely customer-hostile monetization scheme:&lt;/p>
&lt;ul>
&lt;li>You can use it for free if you&amp;rsquo;re learning it.&lt;/li>
&lt;li>As soon as you want to use it for anything commercial, Red Hat wants an
unjustifiably high licensing fee. Unlike with Ubuntu Pro, where you can
selectively choose to buy support for key systems, without a free RHEL-clone
available you&amp;rsquo;d need to pay a subscription fee for every single production
system.&lt;/li>
&lt;li>As a software vendor, if your stuff only runs on RHEL, then your customers are
forced to pay an extra fee for the OS as well, which makes your product less
competitive.&lt;/li>
&lt;li>Even if you can stomach the licensing fees, there is no way to convert from a
free install (CentOS Stream, Alma Linux, Rocky Linux, etc.) to a paid install
without reinstalling the OS. So not only do you have to pay Red Hat tons of
money, you also get the joy of reinstalling the operating system on every
single machine you have.&lt;/li>
&lt;li>If you switch companies and the new company doesn&amp;rsquo;t want to use RHEL, you are
out of the ecosystem permanently. There won&amp;rsquo;t be a way to onboard your new
company into the RHEL ecosystem for free anymore and using RHEL itself will be
a very hard sell (see above).&lt;/li>
&lt;li>RHEL professionals will &amp;ldquo;auto-convert&amp;rdquo; from RHEL to something else at an
extremely high rate, because having your career held hostage by a yearly
subscription just isn&amp;rsquo;t a very empowering feeling.&lt;/li>
&lt;/ul>
&lt;p>Speaking for myself, the last factor is the most significant. The people
responsible for advocating for Red Hat and IBM&amp;rsquo;s RHEL-based products in the
first place are being alienated. One of the attractive things about
building a career on RHEL-based OSes until now has been that you could pick up
and take your skills anywhere, for free. The latest moves to kill off the
RHEL-downstream OSes make it feel like an important part of your skillset is
getting put behind a paywall.&lt;/p>
&lt;hr>
&lt;h2 id="lets-sum-things-up">Let&amp;rsquo;s sum things up:&lt;/h2>
&lt;p>The main reason to use RHEL-based Linux these days is because of the really
great community. Most of RHEL&amp;rsquo;s software and useful tooling comes from free
labor by the community. The latest licensing changes are designed to squeeze out
every cent Red Hat can, driving away all of the &amp;ldquo;community value&amp;rdquo; that makes RHEL
an attractive product in the first place.&lt;/p>
&lt;p>&lt;img alt="&amp;ldquo;The more you tighten your grip, the more subscriptions will slip through your fingers&amp;rdquo;" src="https://jstaf.github.io/images/tarkin.jpg#centre">&lt;/p>
&lt;p>Like many other companies before it, Red Hat seems to have entered the
&lt;a href="https://www.wired.com/story/tiktok-platforms-cory-doctorow/">&amp;ldquo;enshittification&amp;rdquo; death-spiral&lt;/a>:&lt;/p>
&lt;blockquote>
&lt;p>Here is how platforms die: First, they are good to their users; then they
abuse their users to make things better for their business customers; finally,
they abuse those business customers to claw back all the value for themselves.
Then, they die.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;a href="https://rockylinux.org/news/keeping-open-source-open/">Red Hat hasn&amp;rsquo;t yet successfully killed its RHEL-clone downstreams&lt;/a>,
but the writing seems to be on the wall. There is a bad actor at the very core
of the Red Hat ecosystem: Red Hat itself. There doesn&amp;rsquo;t seem to be a long-term
future for the Red Hat community and RHEL now that the &amp;ldquo;enshittification&amp;rdquo;
process is in full swing (we are just starting the &amp;ldquo;abuse the business
customers&amp;rdquo; stage - Red Hat can&amp;rsquo;t put the squeeze on them if there are easy
alternatives). It seems like the future will just be many years of slowly
increasing RHEL license fees while people leave and the product gets worse and
worse.&lt;/p>
&lt;hr>
&lt;h2 id="why-stay">Why stay?&lt;/h2>
&lt;p>When I originally wrote this article, I was really irritated by Red Hat&amp;rsquo;s
decision to try to kill CentOS a second time. It was important to take a step
back, &amp;ldquo;touch grass&amp;rdquo; as they say, and think about why I felt this way: it&amp;rsquo;s just
an operating system&amp;hellip; why am I so upset about this that I would type all of
this out? (I don&amp;rsquo;t even depend on RHEL or RHEL-clones for work anymore.) I think
the reason I was so upset is that the best part about Linux is that it&amp;rsquo;s just a giant
community of people who help each other and try to make the world a better place
(or at least the world of computing) - for free! Do we have to monetize this to
death? Does everything have to end in a profit-seeking death spiral?&lt;/p>
&lt;p>Anyhow, I guess this article is basically just a really roundabout way of saying
I&amp;rsquo;m dropping official support for RHEL in
&lt;a href="https://github.com/jstaf/onedriver">the software I write&lt;/a>. Any continued
support is just a happy coincidence of the fact that SUSE and Fedora share the
same RPM build toolchains. I won&amp;rsquo;t pretend that I&amp;rsquo;m an important community
member or my contributions are so valuable that Red Hat will go under without
me, but it&amp;rsquo;s just not worth putting in free labor to support yet another
company that is doing everything possible to use everyone else&amp;rsquo;s work and give
nothing back.&lt;/p>
&lt;p>&lt;img alt="&amp;ldquo;This effort is no longer profitable!&amp;rdquo;" src="https://jstaf.github.io/images/no-longer-profitable.png#centre">&lt;/p></content></item><item><title>Near zero-downtime Postgres migrations and upgrades with pglogical</title><link>https://jstaf.github.io/posts/pglogical/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/pglogical/</guid><description>Databases are notoriously fussy to work with. Postgres is no exception. Though the software itself may be pretty solid, stuff like major version upgrades or migrations to &amp;ldquo;the cloud&amp;rdquo; (or back to on-prem) are really tricky to do without significant and costly downtime. Though there&amp;rsquo;s tools out there to make this process easier, many of these simply don&amp;rsquo;t work for anything more than small test databases, and will silently corrupt tables or fail spectacularly in real-world scenarios.</description><content>&lt;p>Databases are notoriously fussy to work with.
Postgres is no exception.
Though the software itself may be pretty solid,
stuff like major version upgrades or migrations to &amp;ldquo;the cloud&amp;rdquo; (or back to on-prem)
are really tricky to do without significant and costly downtime.
Though there are tools out there to make this process easier,
many of these simply don&amp;rsquo;t work for anything more than small test databases,
and will silently corrupt tables or fail spectacularly in real-world scenarios.&lt;/p>
&lt;p>This post is about how to safely migrate a real-world Postgres database without downtime using pglogical.
As a bonus, this procedure works to migrate an on-premise db to AWS RDS
(many tools don&amp;rsquo;t work with RDS),
and you can perform multiple major version upgrades as part of the process
(skip as many versions as you want!).&lt;/p>
&lt;p>I haven&amp;rsquo;t written any blog posts for a very long time.
Writing these posts is a lot of work -
I usually only sit down to write something when it&amp;rsquo;s of use to me
and public documentation doesn&amp;rsquo;t exist or is otherwise very sparse.
This is one of those articles.
(Looking to migrate a MySQL / MariaDB database in a similar manner?
&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.NonRDSRepl.html">Check out this guide from AWS.&lt;/a>)&lt;/p>
&lt;h2 id="pglogical-why-are-we-using-it">pglogical: Why are we using it?&lt;/h2>
&lt;p>The goal of migrating a database is to create an identical copy of it on a
separate piece of infrastructure, either on another VM, another datacenter,
or perhaps even another country.
There are a lot of different ways to migrate Postgres databases, and unfortunately all
of them have significant limitations.
Let&amp;rsquo;s quickly do an overview of the different Postgres tools
and demonstrate why pglogical is the least bad option
(at the time of writing).&lt;/p>
&lt;h3 id="dump-and-restore">Dump and restore&lt;/h3>
&lt;p>This is the most basic method of moving a database.
You create a database dump from the source database with a tool like &lt;code>pg_dumpall&lt;/code> or &lt;code>pg_basebackup&lt;/code>
and restore it on the target.
Obviously, this is not a great option when you want to avoid downtime.
Depending on database size, it can take hours to create the initial database backup,
and many hours to restore it on the new target instance.
Any writes that occur on the source instance after the backup is taken are lost.
Though this method is virtually foolproof and can perform upgrades as part of the process,
it obviously incurs significant downtime.
This isn&amp;rsquo;t an option for many businesses,
and likely everyone involved would prefer it
if there was no disruption to the business at all during the migration.&lt;/p>
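&lt;p>For reference, a basic dump and restore is a one-liner. This is just a sketch -
the hostnames are placeholders, and writes to the source must be stopped first
for the copy to be consistent:&lt;/p>
&lt;pre>&lt;code># Dump every database on the source server as SQL and replay it on the target.
pg_dumpall -h old-db.example.com -U postgres | psql -h new-db.example.com -U postgres
&lt;/code>&lt;/pre>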
&lt;h3 id="binary-replication--hot-standby-databases">Binary replication / &amp;ldquo;hot standby&amp;rdquo; databases&lt;/h3>
&lt;p>Binary replication is the easiest to set up, and lets you create a read-only replica db
from your original master. Out of all the replication options, this is far and away
the best option. It &amp;ldquo;just works&amp;rdquo;. If you&amp;rsquo;re looking to set up binary replication,
honestly the best starting point is the official documentation:&lt;/p>
&lt;ul>
&lt;li>Quick tutorial: &lt;a href="https://wiki.postgresql.org/wiki/Hot_Standby">https://wiki.postgresql.org/wiki/Hot_Standby&lt;/a>&lt;/li>
&lt;li>More detailed overview: &lt;a href="https://www.postgresql.org/docs/current/hot-standby.html">https://www.postgresql.org/docs/current/hot-standby.html&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Unfortunately, binary replication has several key disadvantages:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>It doesn&amp;rsquo;t work with most cloud providers.&lt;/strong>
If you want to migrate to a managed database like AWS RDS,
you won&amp;rsquo;t actually have the superuser permissions and tools to set this up.&lt;/li>
&lt;li>&lt;strong>Binary replication only works with DBs of the same major version.&lt;/strong>
If you want to replicate to a different version, well&amp;hellip; you can&amp;rsquo;t.&lt;/li>
&lt;li>&lt;strong>Replication is one-way:&lt;/strong>
you can only have a single &amp;ldquo;master&amp;rdquo; database active at any given time.
(Unpopular opinion: if you want master-master replication that doesn&amp;rsquo;t suck,
you should honestly &lt;a href="https://mariadb.com/kb/en/what-is-mariadb-galera-cluster/">just switch to MariaDB, where this is a solved problem.&lt;/a>)&lt;/li>
&lt;/ul>
&lt;p>If you are not upgrading to a new database major version
and are not trying to migrate to a managed service like AWS RDS,
stop reading this article now and just use binary replication.
It&amp;rsquo;s the simplest option and is the fastest path to a successful migration.&lt;/p>
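&lt;p>For the simple case, setting up a standby looks roughly like the following
sketch (Postgres 12 or newer; the hostname, user, and data directory are
placeholders, and a replication user plus a matching &lt;code>pg_hba.conf&lt;/code>
entry must already exist on the primary):&lt;/p>
&lt;pre>&lt;code># On the standby: clone the primary's data directory.
# -R writes standby.signal and primary_conninfo so the clone starts as a replica.
pg_basebackup -h primary.example.com -U replicator \
    -D /var/lib/postgresql/data -R -P

# Then start Postgres on the standby and it will begin streaming from the primary.
&lt;/code>&lt;/pre>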
&lt;h3 id="third-party-replication-tools">Third-party replication tools&lt;/h3>
&lt;p>There are a lot of other replication tools out there designed to address some of the
shortcomings of Postgres&amp;rsquo; built-in binary replication.
I won&amp;rsquo;t go too deeply into each one,
but whether they work or not is highly dependent on what your database looks like:
how big it is, what types of data you have in there, where you&amp;rsquo;re trying to migrate to
(when using managed databases like RDS,
you&amp;rsquo;ll frequently be missing the permissions necessary to set things up), etc.&lt;/p>
&lt;p>While trying things out, we explored trigger-based replication tools like
Slony, Bucardo, and Londiste.
All of these had significant issues with the databases we tried to migrate.
In particular, replication frequently broke, and there were numerous issues with
truncated or empty tables where a database that had supposedly been replicated was missing data.
As mentioned before, success is highly dependent on the database you&amp;rsquo;re trying to migrate.
Simple databases that don&amp;rsquo;t use special datatypes or triggers are much more likely to work.
It&amp;rsquo;s possible that one of these tools may work for your DB,
but you&amp;rsquo;ll need to try them out to know for sure.&lt;/p>
&lt;p>AWS has a managed &amp;ldquo;Database Migration Service&amp;rdquo; (DMS):
this is a proprietary AWS tool that live-replicates data from one database to another.
It also sucks.
In addition to taking an extremely long time to replicate large databases,
silent corruption and truncation of tables is extremely common -
even if a migration survives the initial copy phase
(in my experience, tables with more than a hundred million rows will consistently break a DMS replication instance irreparably,
especially if record validation is enabled),
many records will be altered in mysterious ways.
Some of the more entertaining failures I encountered included DMS shifting some, but not all,
timestamps in a MySQL database 3 hours into the future (no, this wasn&amp;rsquo;t timezone-related),
and DMS rounding all of the columns that used Postgres&amp;rsquo; money datatype to the nearest ten cents.
If you&amp;rsquo;ve been considering using DMS on any DB you care about&amp;hellip; don&amp;rsquo;t.
Your time is better spent investigating other migration options that actually work reliably.
DMS is only worthwhile to look into if you need to switch DB technologies completely
(such as from Oracle to Postgres).&lt;/p>
&lt;h2 id="postgres-logical-replication-and-pglogical">Postgres logical replication and pglogical&lt;/h2>
&lt;p>At this point, migrating a Postgres database without downtime of some kind
is looking increasingly impossible.
Enter pglogical:&lt;/p>
&lt;p>&lt;img src="https://jstaf.github.io/images/loki.gif#centre">&lt;/p>
&lt;p>pglogical is a &amp;ldquo;logical replication&amp;rdquo; tool for Postgres.
Instead of replicating filesystem-level changes (binary replication),
pglogical replicates the actual SQL statements from a source db to a target db.
Executing an insert on the master db will execute the same insert on any dbs &amp;ldquo;subscribed&amp;rdquo; to it.
Though the tool has its own limitations (we&amp;rsquo;ll get to those in a second),
logical replication has several MAJOR advantages over other methods:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>You can set filters and rules for what you want to replicate.&lt;/strong>
Instead of creating an exact binary copy of another DB,
you can selectively replicate only parts of it.&lt;/li>
&lt;li>&lt;strong>Replication works between major versions.&lt;/strong>
Because the actual SQL statements being replicated are not specific to a particular version,
you can replicate between different major versions without issue.
You can even skip multiple major Postgres versions in one go
(For what it&amp;rsquo;s worth, I&amp;rsquo;ve successfully done 9.4-&amp;gt;9.6 and 9.6-&amp;gt;11 without issue,
though your results may vary).&lt;/li>
&lt;li>&lt;strong>Cloud providers like AWS actually support it.&lt;/strong>
AWS RDS has the pglogical extension built into their RDS images on Postgres 9.6.10 and above.&lt;/li>
&lt;li>&lt;strong>It&amp;rsquo;s one of the only replication tools that supports really old versions of Postgres (9.4+).&lt;/strong>&lt;/li>
&lt;li>&lt;strong>It&amp;rsquo;s free and open source.&lt;/strong> Hopefully I don&amp;rsquo;t need to explain why this is awesome.&lt;/li>
&lt;li>&lt;strong>It actually works.&lt;/strong> Though the tool is still pretty tricky to use,
pglogical is remarkable for the fact that it hasn&amp;rsquo;t failed me yet.&lt;/li>
&lt;/ul>
&lt;p>But wait, what about &lt;a href="https://www.postgresql.org/docs/12/logical-replication.html">Postgres&amp;rsquo; native logical replication&lt;/a>?
Postgres recently gained the ability to perform logical replication on its own in version 10.
So how is pglogical different?
As it turns out, they&amp;rsquo;re actually the same tool under the hood.
Both Postgres&amp;rsquo; native logical replication and pglogical were developed by
&lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/">2ndQuadrant&lt;/a> -
pglogical is the upstream for Postgres&amp;rsquo; native logical replication,
and has significantly more features (esp. for Postgres 10 and 11).
pglogical is also notable for working on Postgres 9.4+,
whereas native logical replication isn&amp;rsquo;t supported until versions 10+.
To sum things up, the main difference between pglogical and Postgres native replication
is that pglogical will have more features on older versions of Postgres
(and managed services like RDS support using pglogical).&lt;/p>
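&lt;p>For comparison, here&amp;rsquo;s what the basic case looks like with native logical replication on Postgres 10+ - just two statements (a sketch with placeholder names and connection values; pglogical&amp;rsquo;s equivalents appear later in this article):&lt;/p>

```sql
-- On the publisher (Postgres 10+ native logical replication):
CREATE PUBLICATION my_pub FOR ALL TABLES;

-- On the subscriber (connection string values are placeholders):
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=source_host port=5432 dbname=my_db user=replicator password=secret'
    PUBLICATION my_pub;
```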
&lt;h3 id="so-what-are-the-downsides">So what are the downsides?&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>You still need to restart the database to install the pglogical extension.&lt;/strong>
I may have lied earlier when I said that the migration was &amp;ldquo;zero-downtime&amp;rdquo;,
but a single restart isn&amp;rsquo;t bad as far as these things go
(plus you can choose when to do the restart,
as opposed to being forced to do it as part of cutover).&lt;/li>
&lt;li>&lt;strong>Lots of stuff isn&amp;rsquo;t supported.&lt;/strong>
pglogical doesn&amp;rsquo;t migrate sequences very well, if at all.
&lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/#sequences">The documentation claims that sequences will be synchronized &amp;ldquo;periodically&amp;rdquo;&lt;/a>, but in practice,
I&amp;rsquo;ve never seen pglogical actually sync them unless you explicitly force it to.
There are more details on sequence migration later in this article,
but this is one area where using pglogical comes with major caveats.&lt;/li>
&lt;li>&lt;strong>Changes to database schema don&amp;rsquo;t get replicated whatsoever.&lt;/strong>
Any changes to database table structure or anything else need to be performed separately
on both the master DB and its replicas.&lt;/li>
&lt;li>&lt;strong>Replication is per-DB.&lt;/strong> If you have a postgres server with a bunch of DBs on it,
you&amp;rsquo;re going to need to set up and monitor replication for each one individually.
This can very quickly turn a migration into a lot of work.&lt;/li>
&lt;li>&lt;strong>A primary key is required to perform UPDATEs and DELETEs.&lt;/strong>
Tables without primary keys are going to be insert-only.
I have no idea why some developers keep creating tables without primary keys,
but if you are unfortunate enough to have any of these individuals at your company,
make sure they&amp;rsquo;re aware of this caveat &lt;em>before&lt;/em> you start the migration process.&lt;/li>
&lt;li>&lt;strong>Foreign keys are ignored during replication.&lt;/strong>
If you&amp;rsquo;re simply moving a database from one location to another,
this is probably not a huge concern,
but if you want to use pglogical in a master-master replication setup,
foreign key constraints aren&amp;rsquo;t going to do anything.
If you want good master-master replication, stop reading this article now and switch to MariaDB.&lt;/li>
&lt;li>&lt;strong>Documentation is really sparse.&lt;/strong>
Here&amp;rsquo;s the official documentation: &lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/">https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/&lt;/a>.
That&amp;rsquo;s it. There are a few blog posts out there of questionable veracity (this one included),
but as documentation goes, you&amp;rsquo;re more or less on your own.
There are very few how-tos out there, and good luck asking questions on Stack Overflow if you get yourself into trouble.
Please do not email me or ask me for help (no, not even if you have money).&lt;/li>
&lt;/ul>
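&lt;p>One of the caveats above is easy to check up front: you can list every table that lacks a primary key by querying the system catalogs (a sketch - run it on the source database before you start):&lt;/p>

```sql
-- List ordinary tables without a primary key
-- (these will be insert-only under pglogical)
SELECT n.nspname AS schema_name, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
    SELECT 1 FROM pg_constraint con
    WHERE con.conrelid = c.oid AND con.contype = 'p'
  )
ORDER BY 1, 2;
```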
&lt;p>Did you read all that and are still interested in migrating/upgrading a database?
Let&amp;rsquo;s get started.&lt;/p>
&lt;hr>
&lt;h1 id="migrate-a-database-using-pglogical">Migrate a database using pglogical&lt;/h1>
&lt;p>Before we start, make sure you&amp;rsquo;ve completed the following prerequisites:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Verify that both the source and target database are running Postgres 9.4+.&lt;/strong>
If you&amp;rsquo;re using Postgres 9.4,
be aware that there are several &lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/">special considerations&lt;/a> you need to take into account.
If you are using RDS, the RDS databases must be Postgres version 9.6.10 or above -
that is the earliest RDS version that supports pglogical.&lt;/li>
&lt;li>&lt;strong>You have a direct network connection between your source and target DB.&lt;/strong>
It doesn&amp;rsquo;t matter what connection you&amp;rsquo;ve got as long as it works: AWS VPC peering,
IPsec, Wireguard, etc. - in some cases you can even get by with an SSH tunnel.
Just make sure the target db instance is able to connect to the source db.&lt;/li>
&lt;li>&lt;strong>You have superuser or equivalent privileges on both the source and target db.&lt;/strong>
If you are using RDS, the &lt;code>rds_superuser&lt;/code> role is sufficient.&lt;/li>
&lt;li>&lt;strong>You have read, or at least skimmed &lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/">the official documentation&lt;/a>.&lt;/strong>
Don&amp;rsquo;t skip this step.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Note:&lt;/strong> if you want to use pglogical to perform a zero-downtime upgrade,
set up the target database on whatever Postgres version you wish to upgrade to.
If you wanted to upgrade from version 9.6 to version 12,
you would install Postgres 12 on the target.&lt;/p>
&lt;h2 id="initial-pglogical-setup">Initial pglogical setup&lt;/h2>
&lt;p>Before you do anything else, make sure you&amp;rsquo;ve installed the pglogical package on both dbs
(the next few setup steps need to be done for both source and target dbs).
Follow the official documentation here: &lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-installation-instructions/">https://www.2ndquadrant.com/en/resources/pglogical/pglogical-installation-instructions/&lt;/a>.&lt;/p>
&lt;p>Add the following to your Postgres config,
and restart the Postgres service to apply the changes
(a reload is not sufficient to load the pglogical package).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-ini" data-lang="ini">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># for more information on these values, see pglogical docs.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># these values are sufficient if you intend to migrate less than 10 databases from the source instance&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">wal_level&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;logical&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">shared_preload_libraries&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;pglogical&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">max_worker_processes&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">10 &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">max_replication_slots&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">10&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">max_wal_senders&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">10 # 10 + previous value, if one was there&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">track_commit_timestamp&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">on # leave this line out if using postgres 9.4&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Log in to the db and create a user to be used for replication
(we&amp;rsquo;ll arbitrarily call the user &lt;code>pglogical&lt;/code> here):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">ROLE&lt;/span> pglogical &lt;span style="color:#66d9ef">WITH&lt;/span> LOGIN REPLICATION SUPERUSER PASSWORD &lt;span style="color:#e6db74">&amp;#39;some_password_here&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Ensure that this user is able to connect from the target instance to the source in
the source instance&amp;rsquo;s &lt;code>pg_hba.conf&lt;/code> (please use a secure authentication method).
If you don&amp;rsquo;t know how to do this, see the official documentation for
&lt;a href="https://www.postgresql.org/docs/current/auth-pg-hba-conf.html">pg_hba.conf&lt;/a>.
Note that a reload is sufficient to apply changes to &lt;code>pg_hba.conf&lt;/code>
(restarting postgres is not necessary here).&lt;/p>
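&lt;p>As a sketch, a &lt;code>pg_hba.conf&lt;/code> entry permitting the replication user to connect from the target&amp;rsquo;s subnet might look like the following (the address is a placeholder for your own network; &lt;code>scram-sha-256&lt;/code> is only available on Postgres 10+, so older versions will need &lt;code>md5&lt;/code>):&lt;/p>

```
# TYPE  DATABASE             USER       ADDRESS        METHOD
host    database_to_migrate  pglogical  10.0.2.0/24    md5
```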
&lt;h3 id="initial-pglogical-setup-on-rds">Initial pglogical setup on RDS&lt;/h3>
&lt;p>If you are using RDS, you&amp;rsquo;ll need to add &lt;code>pglogical&lt;/code> to &lt;code>shared_preload_libraries&lt;/code>
in your parameter group and reboot your RDS instance - see
&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html">the RDS docs&lt;/a>
on how to make changes to a parameter group if you are unsure.
If it already has another value there, just add &lt;code>pglogical&lt;/code> to the end
(these values can be comma-separated).&lt;/p>
&lt;p>Create a &amp;ldquo;pglogical&amp;rdquo; user with RDS superuser privileges
(the &amp;ldquo;pglogical&amp;rdquo; user name is arbitrary):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">ROLE&lt;/span> pglogical &lt;span style="color:#66d9ef">WITH&lt;/span> LOGIN PASSWORD &lt;span style="color:#e6db74">&amp;#39;some_password_here&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GRANT&lt;/span> rds_replication &lt;span style="color:#66d9ef">TO&lt;/span> pglogical;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GRANT&lt;/span> rds_superuser &lt;span style="color:#66d9ef">TO&lt;/span> pglogical;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="configure-source-database">Configure source database&lt;/h2>
&lt;p>&lt;strong>Note:&lt;/strong> every step that follows is per-database.
If the source db instance has multiple databases you want to migrate,
you&amp;rsquo;ll need to repeat these steps for each database.&lt;/p>
&lt;p>At this point, we&amp;rsquo;re ready to actually set up replication.
pglogical has some important terminology that we need to understand before we can continue:&lt;/p>
&lt;ul>
&lt;li>A &amp;ldquo;node&amp;rdquo; represents a database. It can either be a publisher (source) or subscriber (target).&lt;/li>
&lt;li>A &amp;ldquo;replication set&amp;rdquo; is a set of tables and sequences to be migrated,
as well as what changes should be replicated
(stuff like &lt;code>INSERT&lt;/code>, &lt;code>UPDATE&lt;/code>, &lt;code>DELETE&lt;/code>, and/or &lt;code>TRUNCATE&lt;/code>).&lt;/li>
&lt;li>A &amp;ldquo;subscription&amp;rdquo; represents an actual replication connection.
&amp;ldquo;Subscriber&amp;rdquo; nodes sync changes from &amp;ldquo;publisher&amp;rdquo; nodes.
By default, all replication sets are migrated from the source to the target.&lt;/li>
&lt;/ul>
&lt;p>The replication process has three basic steps:
set up the provider node and select what data to replicate,
set up the subscriber node,
then create a replication connection.
With this in mind, let&amp;rsquo;s set up the source database now.
Log in to the database you wish to replicate on the source instance and perform the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create the pglogical extension
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">CREATE&lt;/span> EXTENSION pglogical;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> EXTENSION pglogical_origin; &lt;span style="color:#75715e">-- only on postgres 9.4, otherwise skip
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create the publisher node
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- The DSN represents how to connect to the source database
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.create_node(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> node_name :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;source&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> dsn :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;host=hostname_or_ip_address port=5432 dbname=database_to_migrate user=pglogical password=pglogical_password&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now we need to select which tables should be migrated and add them to a replication set.
Several replication sets were created when you ran &lt;code>CREATE EXTENSION pglogical;&lt;/code>:
&lt;code>default&lt;/code>, &lt;code>default_insert_only&lt;/code>, and &lt;code>ddl_sql&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>The &lt;code>default&lt;/code> replication set is what you should use by default and will replicate
&lt;code>INSERT&lt;/code>, &lt;code>UPDATE&lt;/code>, &lt;code>DELETE&lt;/code>, and &lt;code>TRUNCATE&lt;/code> (note that &lt;code>TRUNCATE CASCADE&lt;/code> doesn&amp;rsquo;t work).&lt;/li>
&lt;li>&lt;code>default_insert_only&lt;/code> only replicates &lt;code>INSERT&lt;/code> statements.
You use this for tables that don&amp;rsquo;t have a primary key.&lt;/li>
&lt;li>&lt;code>ddl_sql&lt;/code> is a special replication set designed to replicate schema changes.
You don&amp;rsquo;t need it here because this is a one-time migration
and will not be making changes to the source instance&amp;rsquo;s schema during the process.
(Using the &lt;code>ddl_sql&lt;/code> replication set is outside the scope of this article.)&lt;/li>
&lt;/ul>
&lt;p>Tables and sequences need to be added to the replication sets individually,
though there are helper functions to do this in one go:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Add all tables to the default replication set from the &amp;#39;public&amp;#39; schema
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.replication_set_add_all_tables(&lt;span style="color:#e6db74">&amp;#39;default&amp;#39;&lt;/span>, ARRAY[&lt;span style="color:#e6db74">&amp;#39;public&amp;#39;&lt;/span>])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Check which tables have been added to all replication sets
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#66d9ef">FROM&lt;/span> pglogical.replication_set_table;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Add all sequences to the default replication set from the &amp;#39;public&amp;#39; schema
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.replication_set_add_all_sequences(&lt;span style="color:#e6db74">&amp;#39;default&amp;#39;&lt;/span>, ARRAY[&lt;span style="color:#e6db74">&amp;#39;public&amp;#39;&lt;/span>]);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Check which sequences have been added to all replication sets
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#66d9ef">FROM&lt;/span> pglogical.replication_set_seq;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If a table doesn&amp;rsquo;t have a primary key,
you&amp;rsquo;ll need to remove it from the &lt;code>default&lt;/code> replication set
and add it to the &lt;code>default_insert_only&lt;/code> replication set.
Any tables created by an extension like &lt;code>postgis&lt;/code> should also be removed from replication
(extension-specific tables will be &amp;ldquo;migrated&amp;rdquo; when you import the db schema on the target db instance).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Remove a table from the &amp;#39;default&amp;#39; replication set and add it to &amp;#39;default_insert_only&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- (for other table manipulations see the official documentation):
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.replication_set_remove_table(&lt;span style="color:#e6db74">&amp;#39;default&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;table_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.replication_set_add_table(&lt;span style="color:#e6db74">&amp;#39;default_insert_only&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;table_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Before you proceed, check your work -
note that you can investigate and view the tables that pglogical uses for replication in the &lt;code>pglogical&lt;/code> schema:
&lt;code>\dt pglogical.*&lt;/code> will give you a list of tables you can look at.
Do not attempt to manipulate these tables yourself except through the functions that pglogical provides
&lt;em>(&amp;ldquo;Here be dragons.&amp;rdquo;)&lt;/em>.&lt;/p>
&lt;p>Finally, create a schema-only dump of the source database you wish to migrate with &lt;code>pg_dump&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>pg_dump -U pglogical -h source_database -s database_name &amp;gt; database_name_schema.sql
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="configure-the-target-database">Configure the target database&lt;/h2>
&lt;p>Create the database you want to migrate on the target:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">DATABASE&lt;/span> database_name;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Import the schema from the previous step on the target database.
If you encounter errors, feel free to drop the database on the target and reimport the schema
as many times as it takes to make sure you&amp;rsquo;ve fixed any errors.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>psql -U pglogical -h target_database -d database_name &amp;lt; database_name_schema.sql
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Recreate any users/roles that need to be imported
(you can also dump and restore these with &lt;code>pg_dumpall&lt;/code>, but this is not covered here).&lt;/p>
&lt;p>With all of that done, we can set up the pglogical subscriber node.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create the pglogical extension
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">CREATE&lt;/span> EXTENSION pglogical;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> EXTENSION pglogical_origin; &lt;span style="color:#75715e">-- only on postgres 9.4, otherwise skip
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create the subscriber node
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- The DSN describes how to connect to the target instance
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.create_node(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> node_name :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;target&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> dsn :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;host=hostname_or_ip_address port=5432 dbname=database_to_migrate user=pglogical password=pglogical_password&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="create-the-subscription">Create the subscription&lt;/h2>
&lt;p>Now that the source and target database nodes have been set up,
we can create the replication subscription.
Creating the subscription will immediately start replication,
so make sure you&amp;rsquo;re ready to start before beginning this step:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- This command is run on the target instance (the &amp;#34;subscriber node&amp;#34;)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.create_subscription(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> subscription_name :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> provider_dsn :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;host=hostname_or_ip_address port=5432 dbname=database_to_migrate user=pglogical password=pglogical_password&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now that you&amp;rsquo;ve created the subscription, check its status:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.show_subscription_status(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>There are three possible &amp;ldquo;subscription statuses&amp;rdquo;:&lt;/p>
&lt;ul>
&lt;li>&lt;code>down&lt;/code> - This is probably what you&amp;rsquo;re going to see the first time you set this up.
This means that replication has failed.
This can be due either to connectivity issues or an actual problem with replication.
Check the Postgres logs on the source and target databases to see why.&lt;/li>
&lt;li>&lt;code>initializing&lt;/code> - A status of &lt;code>initializing&lt;/code> means that the source database
is performing the initial copy of the table data from source to target.
Seeing this typically indicates success -
you just need to wait until the subscription reaches &lt;code>replicating&lt;/code> state.&lt;/li>
&lt;li>&lt;code>replicating&lt;/code> - This means that the target database node has fully replicated the entire source db,
and is now replicating ongoing changes.
A &amp;ldquo;replicating&amp;rdquo; db is &lt;em>almost&lt;/em> fully migrated -
there are several final steps you&amp;rsquo;ll need to take.&lt;/li>
&lt;/ul>
&lt;p>If replication is &lt;code>down&lt;/code>, see below for troubleshooting instructions.
Otherwise, if replication has been successful, feel free to skip the next section.&lt;/p>
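&lt;p>Once a subscription reports &lt;code>replicating&lt;/code>, it&amp;rsquo;s worth a quick sanity check that your data actually made it across. One low-effort sketch: run the same query on both the source and target databases and compare the output (&lt;code>n_live_tup&lt;/code> is an estimate, so expect small discrepancies on busy tables; use &lt;code>count(*)&lt;/code> per table if you need exact numbers):&lt;/p>

```sql
-- Approximate per-table row counts;
-- run on both source and target and diff the results
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY schemaname, relname;
```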
&lt;h2 id="troubleshooting-replication">Troubleshooting replication&lt;/h2>
&lt;p>If replication is &lt;code>initializing&lt;/code> or &lt;code>replicating&lt;/code>, skip this section: things are working.
Otherwise, continue reading.&lt;/p>
&lt;p>Before you do anything else, be aware that you can stop and start replication as needed via:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Pause replication
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.alter_subscription_disable(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Resume replication where you left off
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.alter_subscription_enable(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The Postgres server logs will always have the cause of replication failures -
do what they say to fix things.
The most common cause of replication failures is a networking problem.
Again, ensure that the target database can connect to the source database on port 5432
and that the replication user can connect to the source database from the target
in the source database&amp;rsquo;s &lt;code>pg_hba.conf&lt;/code>.&lt;/p>
&lt;p>Other issues may require resynchronizing a particular table.
You can forcibly resynchronize a table with the following command (run on the target instance, where the subscription lives):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- NOTE: will truncate the table on the target
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.alter_subscription_resynchronize_table(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;table_name&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Check the status of the resynchronized table, pretty self-explanatory
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.show_subscription_table(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;table_name&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The above command may not succeed if the table has foreign key constraints -
you&amp;rsquo;ll need to manually &lt;code>TRUNCATE&lt;/code> both the table you are attempting to resynchronize
as well as any tables that depend on it on the target
(obviously, make &lt;em>&lt;strong>EXTRA SURE&lt;/strong>&lt;/em> that you&amp;rsquo;re running the TRUNCATEs on the correct database).&lt;/p>
&lt;p>Perhaps you forgot a table when setting up the initial replication sets?
To fix this, you can still add the table during replication, then call
&lt;code>pglogical.alter_subscription_resynchronize_table()&lt;/code>.&lt;/p>
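&lt;p>As a sketch (the set, subscription, and table names here are placeholders), that looks like:&lt;/p>

```sql
-- On the provider: add the forgotten table to the replication set
SELECT pglogical.replication_set_add_table('default', 'public.missed_table');

-- On the subscriber: copy the table's existing data across
SELECT pglogical.alter_subscription_resynchronize_table('subscription_name_here', 'missed_table');
```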
&lt;p>For anything else, check the official
&lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/">pglogical documentation&lt;/a>.
Although there are very few how-tos in there,
there is a function for most anything you need to do.
Don&amp;rsquo;t be afraid to restart the process from the beginning if you have to
(drop the target database, reimport the schema,
and set up replication on the target instance again from scratch).
If you need to peek at pglogical&amp;rsquo;s current state,
remember that you can get a list of tables with &lt;code>\dt pglogical.*&lt;/code>.&lt;/p>
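&lt;p>For a quick health check, you can also ask pglogical directly (a sketch - run on the subscriber):&lt;/p>

```sql
-- Overall status of all subscriptions on this node
SELECT * FROM pglogical.show_subscription_status();

-- Per-table sync state lives in pglogical's internal tables, e.g.:
SELECT * FROM pglogical.local_sync_status;
```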
&lt;h2 id="completing-the-migration">Completing the migration&lt;/h2>
&lt;p>Once your setup has reached the &lt;code>replicating&lt;/code> state
(if it&amp;rsquo;s still &lt;code>initializing&lt;/code>, just wait for the initial copy to complete),
you&amp;rsquo;ll need to take several steps to complete the migration.&lt;/p>
&lt;p>At this point, the actual table data has been synced,
but the sequences are still out of sync
(the pglogical documentation claims they&amp;rsquo;ll be synced,
but in practice, it doesn&amp;rsquo;t seem to happen).
You can synchronize them with this command on the source instance:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Before running it, note that this will actually add 1000 to each sequence value on the target.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- This is actually by design - you can read about and complain about this here:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- https://github.com/2ndQuadrant/pglogical/issues/163
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.synchronize_sequence( seqoid ) &lt;span style="color:#66d9ef">FROM&lt;/span> pglogical.sequence_state;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- You can check individual sequences with this command on the subscriber database:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> last_value &lt;span style="color:#66d9ef">FROM&lt;/span> sequence_name;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you want an alternative way to sync sequences,
try the query from the Postgres wiki on the target database:
&lt;a href="https://wiki.postgresql.org/wiki/Fixing_Sequences">https://wiki.postgresql.org/wiki/Fixing_Sequences&lt;/a>&lt;/p>
&lt;p>Once the sequences have been synced, you can begin the cutover process to the new database.
I cannot help you with this part -
the exact cutover steps you need to perform will depend on what application is connected to Postgres.
One very important thing to note about the cutover process is that both the source and target are writable.
The replication subscription will continue replicating data until you issue the following command:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Optional - temporarily stop replication:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.alter_subscription_disable(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Permanently disable replication
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- (re-migrating data will require repeating the migration process from scratch):
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.drop_subscription(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>At this point replication is terminated -
you can drop the remaining pglogical nodes, extension, and roles at your convenience
(they do not impact normal operation as long as there is no active replication subscription).
Congratulations on a successful migration!&lt;/p></content></item><item><title>Going "Pro" with RStudio Server Open Source</title><link>https://jstaf.github.io/posts/rstudio-server-semi-pro/</link><pubDate>Wed, 20 Jun 2018 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/rstudio-server-semi-pro/</guid><description>RStudio is the go-to tool for programmers working in R. Frequently organizations will want to centralize their R work or provide web-based access to a compute environment. Although RStudio Server has an &amp;ldquo;open source edition&amp;rdquo;, most of the useful administrative functionality is locked behind the rather-expensive RStudio Server Pro version ($10k USD/year). This price isn&amp;rsquo;t sustainable for many organizations, or might not otherwise be worth it if there are only a few potential users.</description><content>&lt;p>RStudio is the go-to tool for programmers working in R. Frequently
organizations will want to centralize their R work or provide web-based
access to a compute environment. Although RStudio Server has an &amp;ldquo;open source
edition&amp;rdquo;, most of the useful administrative functionality is locked behind
the rather-expensive RStudio Server Pro version ($10k USD/year). This price
isn&amp;rsquo;t sustainable for many organizations, or might not otherwise be worth it
if there are only a few potential users. We will cover how to set up and
administer the free version of RStudio Server in a professional manner, and
use Linux&amp;rsquo;s features to unlock most of the functionality from the &amp;ldquo;Pro&amp;rdquo; version.&lt;/p>
&lt;p>And before you ask, yes, this is all perfectly in line with RStudio&amp;rsquo;s open
source licensing. Many of these changes are also useful if you&amp;rsquo;ve got a
license for RStudio Server Pro, particularly the reverse proxy configuration.&lt;/p>
&lt;hr>
&lt;h2 id="setting-up-a-base-installation">Setting up a base installation&lt;/h2>
&lt;p>I&amp;rsquo;m going to assume you&amp;rsquo;ve already got a fresh installation of CentOS 7 ready to go.
In my case, I&amp;rsquo;ve installed CentOS in a GNOME Boxes VM on my laptop;
normally you&amp;rsquo;d be SSH&amp;rsquo;ed into a server and setting things up that way.
We&amp;rsquo;ll start by installing R, RStudio, and several development headers required for many R packages,
in this case &lt;code>tidyverse&lt;/code> and &lt;code>devtools&lt;/code>.
Note that this tutorial assumes you are working as the root user
(since pretty much every command we will need to run requires sudo privileges).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>yum update
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>yum install epel-release
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># install R plus some useful development headers for R (required for tidyverse + devtools)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>yum install R openssl-devel libcurl-devel libxml2-devel wget
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># download RStudio Server and install it&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wget https://download2.rstudio.org/rstudio-server-rhel-1.1.453-x86_64.rpm
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>yum install rstudio-server-rhel-1.1.453-x86_64.rpm
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>RStudio Server should now be running at port 8787 on your server.
You can test that the installation worked
by visiting http://localhost:8787/ in a browser.&lt;/p>
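&lt;p>You can also confirm from the terminal that the service is up and listening (a quick sanity check; assumes &lt;code>curl&lt;/code> is installed):&lt;/p>

```bash
# The service should be active...
systemctl status rstudio-server

# ...and answering on port 8787 (fetch headers only)
curl -I http://localhost:8787/
```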
&lt;p>&lt;img alt="Initial RStudio screen" src="https://jstaf.github.io/images/rstudio-1.png">&lt;/p>
&lt;p>This is the basic installation of RStudio Server Open Source Edition.
However, there are a number of glaring issues with this installation:&lt;/p>
&lt;ul>
&lt;li>RStudio Server doesn&amp;rsquo;t know about LDAP users or any users not directly on the server
(i.e. any users not created with &lt;code>useradd&lt;/code>).&lt;/li>
&lt;li>RStudio is being hosted over a non-standard port (8787).&lt;/li>
&lt;li>The website is being served over HTTP -
any passwords entered and all network traffic will be sent in plain text. This is &lt;em>BAD&lt;/em>.&lt;/li>
&lt;li>There are no resource limits for users.
There&amp;rsquo;s a &lt;a href="https://github.com/rstudio/rstudio/issues/1633">known bug&lt;/a> in RStudio
(both Pro and Open Source Edition)
where loading &amp;gt;10GB of data into a session will
lock that user out of RStudio indefinitely.
(RStudio will try to save large sessions to disk, then time out while attempting to re-load them).&lt;/li>
&lt;li>We might want to host RStudio as part of another website (for example, &lt;a href="https://your.website.name/rstudio/)">https://your.website.name/rstudio/)&lt;/a>.&lt;/li>
&lt;/ul>
&lt;h2 id="authenticating-network-users-via-pam">Authenticating network users via PAM&lt;/h2>
&lt;p>RStudio Server uses PAM for authentication.
PAM (Pluggable Authentication Modules) are used on Linux to
break authentication and sign-in into a set of configurable modules.
Without going too deep into things,
we can change how RStudio authenticates users by changing its PAM configuration.
(If you don&amp;rsquo;t care about letting people use network credentials like LDAP,
feel free to skip this section.)&lt;/p>
&lt;p>RStudio&amp;rsquo;s PAM configuration is stored at &lt;code>/etc/pam.d/rstudio&lt;/code>.
Let&amp;rsquo;s look at the current config:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cat /etc/pam.d/rstudio
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code>#%PAM-1.0
auth requisite pam_succeed_if.so uid &amp;gt;= 500 quiet
auth required pam_unix.so nodelay
account required pam_unix.so
&lt;/code>&lt;/pre>&lt;p>Translating the PAM config into plain-english, this config does two things:&lt;/p>
&lt;ul>
&lt;li>Authentication can only succeed for users with a UID of 500 or greater
(this is done to prevent low-numbered system users from logging in -
you don&amp;rsquo;t want anyone logging in as root, for instance).&lt;/li>
&lt;li>Authentication and user accounts are handled by the UNIX authentication module (&lt;code>pam_unix.so&lt;/code>).&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Before you do anything else, create a backup of your old RStudio PAM module:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cp /etc/pam.d/rstudio /etc/pam.d/rstudio.bak
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If we want our installation to authenticate different types of users,
we&amp;rsquo;ll need to change RStudio&amp;rsquo;s PAM configuration. To change authentication
methods, say from UNIX users to LDAP, all we need to do is change the
authentication module from &lt;code>pam_unix.so&lt;/code> to a new module like &lt;code>pam_ldap.so&lt;/code>.
(Note: this will remove the ability of local UNIX users to log in to RStudio,
allowing only LDAP users to log in.)&lt;/p>
&lt;p>&lt;strong>Example &lt;code>/etc/pam.d/rstudio&lt;/code> LDAP auth config:&lt;/strong>&lt;/p>
&lt;pre tabindex="0">&lt;code>#%PAM-1.0
auth requisite pam_succeed_if.so uid &amp;gt;= 500 quiet
auth required pam_ldap.so nodelay
account required pam_ldap.so
&lt;/code>&lt;/pre>&lt;p>What happens if we want to allow a mix of both network (LDAP) and local
(UNIX) users to authenticate? Ideally, you&amp;rsquo;d want a config that matches how
the system normally authenticates users over SSH or the console. The good news is
that this config already exists: &lt;code>/etc/pam.d/password-auth&lt;/code>. We can use other
PAM files like this one in our existing RStudio config:&lt;/p>
&lt;pre tabindex="0">&lt;code>#%PAM-1.0
auth requisite pam_succeed_if.so uid &amp;gt;= 500 quiet
auth include password-auth
account include password-auth
&lt;/code>&lt;/pre>&lt;p>The changes should take effect immediately for all new sessions
(using either our LDAP or password-auth PAM config).
To be specific, a &amp;ldquo;new session&amp;rdquo; in the context of RStudio means either
logging in with no existing &lt;code>rsession&lt;/code> processes,
or clicking the &amp;ldquo;power&amp;rdquo; button in RStudio Server to start a new session/process on the server.
If something goes wrong, you can just restore the old RStudio PAM config
by copying over your backup from earlier.&lt;/p>
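&lt;p>Before pointing RStudio at a new PAM config, you can exercise it directly with the &lt;code>pamtester&lt;/code> utility (an assumption here: it&amp;rsquo;s available from EPEL, which we enabled earlier):&lt;/p>

```bash
# Test the "rstudio" PAM service directly.
# Replace "someuser" with a real account name.
yum install pamtester
pamtester rstudio someuser authenticate
# pamtester prompts for the password and reports success or failure
```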
&lt;h2 id="hosting-rstudio-server-securely-over-https">Hosting RStudio Server securely over HTTPS&lt;/h2>
&lt;p>You typically never want a web application like RStudio exposed directly to the general internet.
The best practice is to host RStudio behind a webserver like Apache httpd or Nginx
in what&amp;rsquo;s called a &lt;em>reverse proxy configuration&lt;/em>.
When you set up a reverse proxy for an application,
the only way of accessing the application
is via your proxy webserver
(which is typically more secure than the application itself).
We&amp;rsquo;ll set up access to RStudio Server in this manner using httpd,
and configure the firewall to allow access to only the ports we specify.&lt;/p>
&lt;p>First, let&amp;rsquo;s make sure that our firewall is up, running, and starts on boot.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>systemctl start firewalld
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl enable firewalld
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl status firewalld
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code># should show something like the following:
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2018-03-02 15:28:37 EST; 2 days ago
Docs: man:firewalld(1)
Main PID: 710 (firewalld)
CGroup: /system.slice/firewalld.service
└─710 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
&lt;/code>&lt;/pre>&lt;p>Let&amp;rsquo;s configure the firewall to allow access to our server over ports 80 (HTTP) and 443 (HTTPS).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>firewall-cmd --add-service&lt;span style="color:#f92672">=&lt;/span>http --permanent
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>firewall-cmd --add-service&lt;span style="color:#f92672">=&lt;/span>https --permanent
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>firewall-cmd --reload
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Alright, our firewall is running and will allow connections to ports 80 and 443 on our machine.
Let&amp;rsquo;s install and configure the Apache HTTPD server to host RStudio.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>yum install httpd mod_ssl
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl start httpd
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl enable httpd
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;img alt="httpd is up" src="https://jstaf.github.io/images/rstudio-2.png">&lt;/p>
&lt;p>Ok, we&amp;rsquo;ve got an HTTP server (if you want to check, visit &lt;code>localhost&lt;/code> in a
browser - it should appear just like the image above).
We just need to tell it how to host RStudio. Let&amp;rsquo;s create a new
Apache VirtualHost that exposes RStudio to the web. You&amp;rsquo;ll need an SSL
certificate for this step. If you don&amp;rsquo;t have an SSL certificate, you can get
one from Let&amp;rsquo;s Encrypt using the instructions here:
&lt;a href="https://certbot.eff.org/#centosrhel7-apache">https://certbot.eff.org/#centosrhel7-apache&lt;/a>.
If Let&amp;rsquo;s Encrypt isn&amp;rsquo;t an option (say if you&amp;rsquo;re doing this on a VM like me),
we can create a self-signed SSL certificate with the following.
For consistency&amp;rsquo;s sake, I&amp;rsquo;ll put it in &lt;code>/etc/rstudio&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>openssl req -x509 -newkey rsa:4096 -keyout /etc/rstudio/rstudio_key.pem -out /etc/rstudio/rstudio_cert.pem -days &lt;span style="color:#ae81ff">3650&lt;/span> -nodes
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># enter whatever you want for the questions since it&amp;#39;s a self-signed cert&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># NOTE: please ensure that your certificates are not world-readable, you do not&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># want random users to be able to read your certificates. Make sure that only&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># root can read the certificates.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>chmod &lt;span style="color:#ae81ff">700&lt;/span> /etc/rstudio/*.pem
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now that we have an SSL certificate, let&amp;rsquo;s set up our RStudio VirtualHost.
Create a new file &lt;code>/etc/httpd/conf.d/rstudio.conf&lt;/code>, with the following content.
This will host RStudio at your server&amp;rsquo;s base directory,
for instance &lt;a href="https://website.name.here">https://website.name.here&lt;/a>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&amp;lt;VirtualHost *:80&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># redirect all port 80 traffic to 443&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteEngine on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ReWriteCond %&lt;span style="color:#f92672">{&lt;/span>SERVER_PORT&lt;span style="color:#f92672">}&lt;/span> !^443$
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule ^/&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> https://%&lt;span style="color:#f92672">{&lt;/span>HTTP_HOST&lt;span style="color:#f92672">}&lt;/span>/$1 &lt;span style="color:#f92672">[&lt;/span>NC,R,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;/VirtualHost&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;VirtualHost *:443&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># configure SSL&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLEngine on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCertificateFile /etc/rstudio/rstudio_cert.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCertificateKeyFile /etc/rstudio/rstudio_key.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># use if you have a real cert&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># SSLCertificateChainFile /etc/rstudio/rstudio_cert_bundle.crt&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># disable weak SSL ciphers&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLProtocol -ALL +TLSv1.2
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCipherSuite HIGH:!MEDIUM:!aNULL:!MD5:!SEED:!IDEA:!RC4
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLHonorCipherOrder on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>TraceEnable off
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># host rstudio&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPreserveHost on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyRequests off
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteCond %&lt;span style="color:#f92672">{&lt;/span>HTTP:Upgrade&lt;span style="color:#f92672">}&lt;/span> &lt;span style="color:#f92672">=&lt;/span>websocket
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule /&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> ws://localhost:8787/$1 &lt;span style="color:#f92672">[&lt;/span>P,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteCond %&lt;span style="color:#f92672">{&lt;/span>HTTP:Upgrade&lt;span style="color:#f92672">}&lt;/span> !&lt;span style="color:#f92672">=&lt;/span>websocket
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule /&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> http://localhost:8787/$1 &lt;span style="color:#f92672">[&lt;/span>P,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPass / http://localhost:8787/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPassReverse / http://localhost:8787/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RequestHeader set X-Forwarded-Proto &lt;span style="color:#e6db74">&amp;#34;https&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;/VirtualHost&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you want to host RStudio under a subdirectory (say &lt;a href="https://website.name.here/rstudio/)">https://website.name.here/rstudio/)&lt;/a>,
your conf should look something like this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&amp;lt;VirtualHost *:80&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># redirect all port 80 traffic to 443&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteEngine on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ReWriteCond %&lt;span style="color:#f92672">{&lt;/span>SERVER_PORT&lt;span style="color:#f92672">}&lt;/span> !^443$
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule ^/&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> https://%&lt;span style="color:#f92672">{&lt;/span>HTTP_HOST&lt;span style="color:#f92672">}&lt;/span>/$1 &lt;span style="color:#f92672">[&lt;/span>NC,R,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;/VirtualHost&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;VirtualHost *:443&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># configure SSL&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLEngine on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCertificateFile /etc/rstudio/rstudio_cert.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCertificateKeyFile /etc/rstudio/rstudio_key.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># use if you have a real cert&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># SSLCertificateChainFile /etc/rstudio/rstudio_cert_bundle.crt&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># disable weak SSL ciphers&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLProtocol -ALL +TLSv1.2
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCipherSuite HIGH:!MEDIUM:!aNULL:!MD5:!SEED:!IDEA:!RC4
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLHonorCipherOrder on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>TraceEnable off
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># extra redirects for the RStudio subdirectory&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Redirect /rstudio /rstudio/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Redirect /auth-sign-in /rstudio/auth-sign-in
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Redirect /auth-sign-out /rstudio/auth-sign-out
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># some redirects for RStudio Server Pro, if you&amp;#39;ve got a license&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Redirect /s /rstudio/s
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Redirect /admin /rstudio/admin
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Catch RStudio redirecting improperly from the auth-sign-in page&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;If &lt;span style="color:#e6db74">&amp;#34;%{HTTP_REFERER} =~ /auth-sign-in/&amp;#34;&lt;/span>&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> RedirectMatch ^/$ /rstudio/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;/If&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># host rstudio&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPreserveHost on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyRequests off
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteCond %&lt;span style="color:#f92672">{&lt;/span>HTTP:Upgrade&lt;span style="color:#f92672">}&lt;/span> &lt;span style="color:#f92672">=&lt;/span>websocket
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule /rstudio/&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> ws://localhost:8787/$1 &lt;span style="color:#f92672">[&lt;/span>P,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteCond %&lt;span style="color:#f92672">{&lt;/span>HTTP:Upgrade&lt;span style="color:#f92672">}&lt;/span> !&lt;span style="color:#f92672">=&lt;/span>websocket
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule /rstudio/&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> http://localhost:8787/$1 &lt;span style="color:#f92672">[&lt;/span>P,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPass /rstudio/ http://localhost:8787/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPassReverse /rstudio/ http://localhost:8787/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RequestHeader set X-Forwarded-Proto &lt;span style="color:#e6db74">&amp;#34;https&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;/VirtualHost&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note: the RStudio Admin Guide instructions on how to host RStudio under a
subdirectory are actually wrong here. This config solves a
&lt;a href="https://github.com/rstudio/rstudio/issues/1676">longstanding bug&lt;/a>
where RStudio does not properly redirect users to and from its authentication pages.&lt;/p>
&lt;p>To apply the new config, we&amp;rsquo;ll restart Apache
and perform a config change to SELinux to allow httpd to proxy RStudio.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>setsebool -P httpd_can_network_connect on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl restart httpd
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>RStudio should now be available over HTTPS when you visit the server.
Additionally, it will redirect from HTTP and force HTTPS automatically if
someone tries to visit the HTTP link.&lt;/p>
&lt;p>&lt;img alt="RStudio over HTTPS" src="https://jstaf.github.io/images/rstudio-3.png">&lt;/p>
&lt;h2 id="set-up-resource-limits">Set up resource limits&lt;/h2>
&lt;p>RStudio Server has a critical bug where any user who loads more than 5-10GB of data
&lt;a href="https://github.com/rstudio/rstudio/issues/1633">will be permanently locked out of their session&lt;/a>.
RStudio attempts to save the session to disk when it becomes inactive,
and then times out and fails to load it upon resuming, leaving the user locked out.
To work around this issue, we&amp;rsquo;ll need to set up some resource
limits (which also prevents any one user from hogging all the memory on the
system, of course).&lt;/p>
&lt;p>Although RStudio Server Pro has a lot of nifty utilities for implementing
resource limits, the Linux kernel does it better. We&amp;rsquo;ll set some resource
limits to bypass the above bug.&lt;/p>
&lt;p>Resource limits on Linux are set in &lt;code>/etc/security/limits.conf&lt;/code>. To set a
memory limit of 8GB for all non-system users, add the following line to the
file:&lt;/p>
&lt;pre tabindex="0">&lt;code>1000: - as 8388608
&lt;/code>&lt;/pre>&lt;p>Let&amp;rsquo;s break down the line above - it generally follows the format of:&lt;/p>
&lt;pre tabindex="0">&lt;code>who_to_apply_limits_to type_of_limit resource_to_limit limit_value
&lt;/code>&lt;/pre>&lt;p>For the &lt;code>who_to_apply_limits_to&lt;/code> value, we can specify a user (just use the username),
a group (specified as &lt;code>@groupname&lt;/code>), or a range of users/groups
(to use uid numbers, follow the format &lt;code>min_uid:max_uid&lt;/code>).
In this case, we have applied the limit to all users with uids of 1000 or higher.
System users on Linux are generally numbered below 1000,
and new users created by useradd/LDAP (i.e. real users) will always have uids
at or above this value.
Using &lt;code>1000:&lt;/code> will apply the limits to all non-system users.&lt;/p>
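&lt;p>To illustrate the domain field, a hypothetical &lt;code>limits.conf&lt;/code> could mix all three forms (the user and group names below are made up):&lt;/p>

```
# domain    type  item    value
jeff        hard  nproc   2048      # one specific user
@analysts   -     nofile  65536     # everyone in a group
1000:       -     as      8388608   # all uids of 1000 and up
```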
&lt;p>As for the &lt;code>type_of_limit&lt;/code>, this can be either &lt;code>hard&lt;/code> or &lt;code>soft&lt;/code>.
&lt;code>hard&lt;/code> limits are binding, and can not be altered by users.
&lt;code>soft&lt;/code> limits can be changed by users using the &lt;code>ulimit&lt;/code> command,
up to the value of the &lt;code>hard&lt;/code> limit.
The &lt;code>soft&lt;/code> limits are the ones in effect by default.
Realistically, none of your users are going to know or care about the &lt;code>ulimit&lt;/code> command.
Because of this, we might as well set both the hard and soft limits to the same value.
There&amp;rsquo;s a neat shortcut for this - we can specify both limits at the same time using &lt;code>-&lt;/code>.&lt;/p>
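&lt;p>As a quick sanity check of the soft/hard distinction, you can play with &lt;code>ulimit&lt;/code> in a throwaway bash shell (this only affects that shell, not the system):&lt;/p>

```shell
# Show the current soft and hard address-space limits (in KB);
# "unlimited" is common on systems with no limits configured.
echo "soft: $(ulimit -S -v)"
echo "hard: $(ulimit -H -v)"

# A user may lower their own soft limit at any time
# (raising it back above the hard limit is not permitted).
ulimit -S -v 4194304
echo "new soft: $(ulimit -S -v)"
```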
&lt;p>There are a lot of different resource limits - so which one do we use?
To make a very long story short, the only limits we are usually interested in are
&lt;code>as&lt;/code> (memory limit),
&lt;code>nofile&lt;/code> (open files, often needs to be increased for Hadoop/Spark),
and &lt;code>nproc&lt;/code> (number of processes a user is allowed to start).
In this case we want to set a memory limit using &lt;code>as&lt;/code>.&lt;/p>
&lt;p>Finally, the limit value differs depending on what limit you are trying to set.
In the case of &lt;code>as&lt;/code>, the limit is in kilobytes.
(If one were to calculate a reasonable memory limit in kilobytes in R: &lt;code>gb * 1024 ^ 2&lt;/code>).
In this case, we set a memory limit of 8GB with the value of 8388608 (KB).&lt;/p>
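&lt;p>You can sanity-check this arithmetic straight from the shell:&lt;/p>

```shell
# 8 GB expressed in KB for limits.conf: gb * 1024^2
echo $((8 * 1024 * 1024))   # prints 8388608
```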
&lt;p>&lt;strong>To make a long story short, we&amp;rsquo;ve set a memory limit of 8GB for all human users on the system.&lt;/strong>&lt;/p>
&lt;p>But wait, you may have tested this out and found it does not actually apply the memory limits!
(You can use &lt;code>object.size(some_variable)&lt;/code> to check the size of an object in R.
If a memory limit is hit, it will display &lt;code>Error: cannot allocate vector of size &amp;lt;some size&amp;gt;&lt;/code>.)
Why not?
As it turns out, session limits set in &lt;code>/etc/security/limits.conf&lt;/code> are applied
only if &lt;code>pam_limits.so&lt;/code> is enabled in the PAM config for the service the user logged in through.
In order to apply resource limits to RStudio,
you should add the following to &lt;code>/etc/pam.d/rstudio&lt;/code>:&lt;/p>
&lt;pre tabindex="0">&lt;code>session required pam_limits.so
&lt;/code>&lt;/pre>&lt;p>This line enforces resource limits on user sessions using PAM.
Without it, user sessions started using &lt;code>/etc/pam.d/rstudio&lt;/code>
will not respect the limitations in &lt;code>/etc/security/limits.conf&lt;/code>.
Once set, you can use &lt;code>/etc/security/limits.conf&lt;/code> to apply whatever resource
limits you want to RStudio.&lt;/p>
&lt;p>For reference, an example &lt;code>/etc/pam.d/rstudio&lt;/code> might now look like the following:&lt;/p>
&lt;pre tabindex="0">&lt;code>#%PAM-1.0
auth requisite pam_succeed_if.so uid &amp;gt;= 500 quiet
auth include password-auth
account include password-auth
session required pam_limits.so
&lt;/code>&lt;/pre>&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Over the course of this article, we&amp;rsquo;ve done the following:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Installed R and development headers necessary for the &lt;code>tidyverse&lt;/code> and &lt;code>devtools&lt;/code> packages.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Installed RStudio Server Open Source Edition.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set up RStudio&amp;rsquo;s PAM config to authenticate all users on the server, including network/LDAP users.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hosted RStudio on a standard port (no port 8787 weirdness).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hosted RStudio so that all traffic between the user and the server is encrypted over HTTPS.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Used PAM and &lt;code>/etc/security/limits.conf&lt;/code> to enforce resource limits on RStudio users.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>To make a long story short, we have applied multiple features from
RStudio Server Pro, including: authentication of network/LDAP users,
secure communication over HTTPS, and resource limits for RStudio sessions.
To underscore this, &lt;em>this is ten thousand dollars per year worth of features.&lt;/em>&lt;/p>
&lt;p>So why buy RStudio Server Pro?
As of this blog post,
RStudio Open Source Edition has all the key features of the Pro version,
except for the following:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Multiple sessions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Multiple R versions / custom R initialization logic (such as loading environment modules on an HPC cluster)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A very nice admin dashboard (that is not to be underestimated&amp;hellip; hnggggg)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Load-balancing across multiple servers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Supports the RStudio team financially&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>If one of these features is important for you, please buy RStudio Server Pro and support the RStudio team.
If not, the suggestions covered in this post will allow you to use
RStudio Server Open Source for any small- to medium-scale
RStudio Server deployment. Enjoy!&lt;/p></content></item><item><title>Reproducible science with Conda and Snakemake</title><link>https://jstaf.github.io/posts/conda-lessons-learned/</link><pubDate>Mon, 04 Jun 2018 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/conda-lessons-learned/</guid><description>Doing scientific computing is hard. Delivering results with fast, performant code is often the easy part. You know your tools and how to get results. Delivering your workflow to your target audience is where it gets tough. What happens if your clients want to re-run things themselves, on their own hardware? How do they configure your pipeline for a new problem set? What happens if they&amp;rsquo;ve never even used the command line before, much less understand what a server is?</description><content>&lt;p>Doing scientific computing is &lt;em>hard&lt;/em>.
Delivering results with fast, performant code is often the easy part.
You know your tools and how to get results.
Delivering your workflow to your target audience is where it gets tough.
What happens if your clients want to re-run things themselves, on their own hardware?
How do they configure your pipeline for a new problem set?
What happens if they&amp;rsquo;ve never even used the command line before, much less understand what a server is?
This post is more or less a &amp;ldquo;lessons learned&amp;rdquo;
on my approach to solving these types of workflow deployment problems.
It&amp;rsquo;s by no means a perfect solution,
but hopefully this will be useful to other groups struggling with the same issues.&lt;/p>
&lt;p>This is a tall order - you need to provide:&lt;/p>
&lt;ul>
&lt;li>Your analysis results.&lt;/li>
&lt;li>The pipeline itself and all supporting code.&lt;/li>
&lt;li>A foolproof method of deploying the software and the execution environment your pipeline requires.&lt;/li>
&lt;li>The training and documentation required to run things start to finish.
This is harder than it sounds -
you can&amp;rsquo;t force your target audience to care enough to learn UNIX or the basics of programming
(they&amp;rsquo;ve got other stuff to do, remember!).&lt;/li>
&lt;/ul>
&lt;p>You might say the last 3 are unnecessary
(they&amp;rsquo;ve got the results, right?),
but this is the most important part!
Once your clients can run the pipeline themselves,
your job is done and you can move on to your next project!&lt;/p>
&lt;p>&lt;strong>For readers looking for the quick summary:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Jupyter and R notebooks work really well for displaying results (nothing new here&amp;hellip;).&lt;/li>
&lt;li>Snakemake works well for managing and scaling pipeline execution.&lt;/li>
&lt;li>When deploying pipeline software,
Git + Conda environments work well initially, but do not age well.
There aren&amp;rsquo;t really any good solutions in this space right now, unfortunately
(Docker containers won&amp;rsquo;t pass muster for security-conscious organizations).&lt;/li>
&lt;li>Ideally, documentation gets done in the Git repository &lt;code>README.md&lt;/code>/wiki,
but hands-on training and follow ups are still a must.
Your workflow needs to be as simple as possible to reproduce and run.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="delivering-results">Delivering results&lt;/h2>
&lt;p>This is probably the easiest part
(chances are you&amp;rsquo;ve done this before!).
You need to deliver the actual result data files
along with supporting plots and explanations.
Personally, I find the best solution to this approach is
a report that interleaves summary statistics/plots with explanations
as they are generated by the pipeline.
The easiest way to do this is using
&lt;a href="https://jupyter.org/">Jupyter notebooks&lt;/a> or an
&lt;a href="http://rmarkdown.rstudio.com/">R Markdown report&lt;/a>.
(I won&amp;rsquo;t provide a full walkthrough on how to use tools here,
check out their respective documentation pages.)&lt;/p>
&lt;p>Either tool is great (I personally prefer R markdown notebooks),
but it&amp;rsquo;s important that you use these as a tool to document your workflow
(where possible) and how each plot was generated.
No one wants a folder full of plots with no explanation (aside from labeled axes).
For each step, write what you are doing and why you are doing it
along with what each statistic means.
As for your data, provide a description of what each output file contains and gzip it all up.&lt;/p>
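&lt;p>For the data hand-off itself, something as simple as the following does the trick (the file and directory names here are hypothetical stand-ins for your pipeline&amp;rsquo;s real outputs):&lt;/p>

```shell
# Stand-in outputs so this snippet runs end-to-end;
# in practice these come from your pipeline.
mkdir -p results
echo "per-sample summary statistics" > results/stats.tsv
echo "stats.tsv: summary statistics, one row per sample" > MANIFEST.md

# Bundle the manifest and the output directory for delivery.
tar -czf results.tar.gz MANIFEST.md results/
```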
&lt;p>All of that said,
Jupyter/R notebooks aren&amp;rsquo;t all that great for heavy-duty data-crunching.
So what do you make into a notebook and what can you leave as plain old scripts?
Again, notebooks are there to explain your results:
QC scripts, summary statistics, analysis conclusions, and so on -
anything that will be read by someone else should go into a notebook if possible.
Everything else can stay a script.&lt;/p>
&lt;h2 id="creating-a-reproducible-analysis-pipeline">Creating a reproducible analysis pipeline&lt;/h2>
&lt;p>Your analysis needs to run itself, automatically, without human input.
It&amp;rsquo;s not reproducible unless it can be run completely independently of your involvement.
Your client should also be able to swap out the dataset for a new one,
and your pipeline should update itself and handle the change in data appropriately.
Ideally, this should all execute in parallel and take advantage of all available hardware.&lt;/p>
&lt;p>There&amp;rsquo;s a lot of different tools for this,
but the one I&amp;rsquo;ve (relatively happily) settled on is &lt;a href="https://snakemake.readthedocs.io/">Snakemake&lt;/a>.
Snakemake works exactly like GNU Make,
where rules define how output files are created from input files.
The main difference between the two is that Snakemake workflows are written in Python
and support a lot of things that GNU Make doesn&amp;rsquo;t
(running on different OSes, submitting jobs to a cluster, etc.).&lt;/p>
&lt;p>An example Snakemake rule to produce a &lt;a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">FastQC&lt;/a>
report from an input FASTQ file might look like this.
Notice how there are only three ingredients to a rule: an input, an output, and a shell command.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>rule fastqc:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> input: &lt;span style="color:#e6db74">&amp;#39;&lt;/span>&lt;span style="color:#e6db74">{sample}&lt;/span>&lt;span style="color:#e6db74">.fastq&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> output: &lt;span style="color:#e6db74">&amp;#39;&lt;/span>&lt;span style="color:#e6db74">{sample}&lt;/span>&lt;span style="color:#e6db74">_fastqc.html&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> shell: &lt;span style="color:#e6db74">&amp;#39;fastqc &lt;/span>&lt;span style="color:#e6db74">{input}&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>There are a few big advantages of Snakemake vs. other tools I tried:&lt;/p>
&lt;ul>
&lt;li>Snakemake was probably the easiest pipelining software to get the hang of.
You can more or less learn it in an afternoon.&lt;/li>
&lt;li>It&amp;rsquo;s pure Python - anything Python can do, Snakemake can do as well.&lt;/li>
&lt;li>Scaling up a pipeline is effortless.
A serial workflow is identical to a parallel one - no changes needed.
To submit a job to a cluster, all you have to do is provide a cluster submission
command and it will do its thing and submit jobs for you.&lt;/li>
&lt;li>Workflows are really fast to write.&lt;/li>
&lt;li>It&amp;rsquo;s really easy to produce a workflow diagram that shows exactly how a pipeline gets executed.
This is really great for explaining to a professor or doctor how an analysis works.&lt;/li>
&lt;/ul>
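&lt;p>To make the &amp;ldquo;scaling is effortless&amp;rdquo; point concrete, here is a minimal Snakefile sketch building on the rule above (the sample names are made up). The same file runs serially with &lt;code>snakemake -j 1&lt;/code> or on 24 cores with &lt;code>snakemake -j 24&lt;/code> - no changes to the workflow needed:&lt;/p>

```python
# A top-level "all" rule lists the final outputs we want;
# Snakemake works backwards from these targets to decide what to run.
SAMPLES = ["sampleA", "sampleB"]

rule all:
    input: expand("{sample}_fastqc.html", sample=SAMPLES)

rule fastqc:
    input: "{sample}.fastq"
    output: "{sample}_fastqc.html"
    shell: "fastqc {input}"
```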
&lt;p>And some disadvantages of Snakemake I&amp;rsquo;ve run into:&lt;/p>
&lt;ul>
&lt;li>It&amp;rsquo;s not daemonized. There&amp;rsquo;s no real easy way to have a long-running Snakemake
workflow running in the background besides just &lt;code>nohup&lt;/code>-ing it,
which can be a little inconvenient.&lt;/li>
&lt;li>No dynamic job execution (if some file fails a quality check, do this after, etc.).&lt;/li>
&lt;li>Personal experience has shown that it can&amp;rsquo;t be installed on Windows without a C++ compiler,
which makes it harder to install on Windows users&amp;rsquo; computers.
Still, this is better than no Windows support at all (which is the case for most other tools).&lt;/li>
&lt;/ul>
&lt;p>All in all, after using Snakemake for several years,
I think it&amp;rsquo;s a great tool for bioinformatics and data science use cases
where analysis is done in a standard start-to-finish manner.
Anything involving continuous job execution is probably not a good fit -
for example, rerunning an analysis with new data every hour.
I have no serious regrets after using Snakemake and it&amp;rsquo;s a pretty great tool if
you want to deliver outputs reproducibly
and have other people understand the workflow (even non-technical types).&lt;/p>
&lt;h2 id="deploying-your-pipeline-with-conda">Deploying your pipeline with Conda&lt;/h2>
&lt;p>This is where things always get icky.
You&amp;rsquo;ve got a great software environment and it runs the pipeline happily,
but you want to get your client up and running too.
After all, it isn&amp;rsquo;t &amp;ldquo;reproducible science&amp;rdquo; if they can&amp;rsquo;t re-run things and verify
your results.
Usually the hardest part of this is just installing all of the software on your clients&amp;rsquo; computers.&lt;/p>
&lt;p>Are you really responsible for installing software on clients&amp;rsquo; computers?
Honestly, yes.
Even if you provide them with access to a system with all of the software installed,
at some point they will pick up a collaborator who needs to install the software,
or maybe migrate systems.
You&amp;rsquo;re going to get an email asking how to install the software at the end of the day.
So what&amp;rsquo;s the best way of ensuring that this happens?&lt;/p>
&lt;p>There are three ways of getting a set of software packages installed and running on a new system.
I&amp;rsquo;ll go through these each in order:&lt;/p>
&lt;ul>
&lt;li>Install every bit of your pipeline and all dependencies manually (oh god no.).&lt;/li>
&lt;li>Use a containerization tool like Docker.&lt;/li>
&lt;li>Use a reproducible software environment, like Anaconda.&lt;/li>
&lt;/ul>
&lt;h3 id="installing-things-by-hand">Installing things by hand&lt;/h3>
&lt;p>Don&amp;rsquo;t do it.
If it takes you two or three hours,
it will take your non-technically inclined colleagues two or three weeks
(and you&amp;rsquo;ll get a lot of &amp;ldquo;please help me&amp;rdquo; emails).&lt;/p>
&lt;h3 id="using-dockersingularity-containers">Using Docker/Singularity containers&lt;/h3>
&lt;p>Docker containers seem like an ideal way of implementing a new workflow.
You can install all of your dependencies in a Docker container,
and then have your clients run the analysis using that container.
I think Docker containers are awesome,
and use them for integration testing or any automated tests that run
against a web service or other special components.&lt;/p>
&lt;p>Though Docker containers aren&amp;rsquo;t that fun to build,
they make it really easy to reproduce a defined environment,
which makes them perfect for workflow deployment.
So what&amp;rsquo;s the catch?&lt;/p>
&lt;p>To make a long story short,
letting untrusted/semi-untrusted users run Docker is a massive security hole.
&lt;a href="https://docs.docker.com/engine/security/security/">Any Docker container can root its host machine&lt;/a>,
and by that same token
&lt;a href="https://docs.docker.com/install/linux/linux-postinstall/">any user able to launch Docker has the equivalent of root access&lt;/a>.
If your pipeline needs additional resources like those on an HPC cluster or
other shared system,
chances are that your workflow will not be allowed to run.
To use Docker containers in production, you need root access to the system you are running on.
This is a major security consideration, and is unlikely to pass muster for most research groups
unless they own the infrastructure they run on.&lt;/p>
&lt;p>Singularity is a nice alternative to Docker and solves most of its security issues.
In fact, it has a &amp;ldquo;rootless&amp;rdquo; run mode that lets it run entirely as a user.
The only two &amp;ldquo;gotchas&amp;rdquo; here are that Singularity still requires root privileges to install,
and there are still some &lt;a href="http://singularity.lbl.gov/release-2-5-0">security issues being ironed out&lt;/a>.&lt;/p>
&lt;p>So to sum things up, Docker is great if you (and your clients) own the infrastructure
and have been entrusted with sudo privileges.
If not, Singularity is the way to go
(though security issues still seem to crop up with it fairly frequently).&lt;/p>
&lt;h3 id="conda-environments">Conda environments&lt;/h3>
&lt;p>There is, of course, a third option:
instead of requiring lots of special security privileges or installing things manually,
why not just use Conda, Anaconda&amp;rsquo;s package manager?
For those unfamiliar with it, Anaconda is an all-inclusive Python distribution.
Though it used to ship just Python packages,
Anaconda now ships more or less every piece of scientific software.
Of particular interest to bioinformaticians is its &lt;a href="http://bioconda.github.io/">Bioconda&lt;/a>
channel, which ships more or less all bioinformatics software packages.&lt;/p>
&lt;p>Conda works more or less like a Python virtualenv,
though instead of using &lt;code>pip install&lt;/code>, you use &lt;code>conda install&lt;/code> to install everything.
To make a very long story short, I haven&amp;rsquo;t really found anything that isn&amp;rsquo;t conda-installable yet.
Once all is said and done, you can export your conda environment to YML with
&lt;code>conda env export &amp;gt; environment-name.yml&lt;/code>.
To reproduce the environment, another user would run &lt;code>conda env create -f environment-name.yml&lt;/code>
and then &lt;code>source activate environment-name&lt;/code> to load it.
All in all, this reduces your entire software pipeline to a single YML file.
Just add this to a Git repository, stuff it on GitHub/Bitbucket/Gitlab and you&amp;rsquo;re done.
To reproduce the pipeline execution environment, it&amp;rsquo;s just three lines:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>git clone https://github.com/username/project-name.git
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>conda env create -f project-name.yml
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source activate project-name
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>So what&amp;rsquo;s the catch?
This seems a little too easy.
I&amp;rsquo;ll say that this method of pipeline deployment worked really well initially.
It did not age gracefully, however.
After about a year of usage, some of my users began to report issues where certain
dependency versions could not be found.
As it turns out, Conda envs pin every version of every package and dependency.
Anaconda apparently stops shipping old package builds after a while,
which means that fresh installs of a pinned environment will eventually break.
After using conda environments as my go-to solution for a lot of projects,
the average time to first breakage
(where you need to supply a new &lt;code>conda-env.yml&lt;/code> file to users)
is about a year.&lt;/p>
&lt;p>I haven&amp;rsquo;t found a good way around this issue,
aside from providing the general conda installation instructions on how to re-create
the environment (&amp;quot;&lt;code>conda install&lt;/code> this list of packages&amp;hellip;&amp;quot;).
This was really disappointing,
because conda environments seemed like a rather promising method of long-term software installs.
Just add the environment.yml file to git and call it a day, right?
Unfortunately this only works for the first year or so, after which all bets are off.&lt;/p>
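&lt;p>One partial mitigation (an idea I haven&amp;rsquo;t battle-tested, so treat it as a suggestion) is to keep two files in the repository: the fully-pinned &lt;code>conda env export&lt;/code> output for exact reproduction while it lasts, plus a hand-written, loosely-pinned spec that only names the top-level packages and can be re-resolved against whatever Anaconda currently ships:&lt;/p>

```yaml
# environment.yml -- hand-maintained; top-level packages only.
# Leaving versions mostly unpinned lets conda resolve against
# whatever builds are still available for download.
name: project-name
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=3
  - snakemake
  - fastqc
```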
&lt;p>All in all, my work so far leads me to believe that Conda environments are the go-to
solution for short-term work.
Despite the issues with Conda environment longevity,
it&amp;rsquo;s so easy to use and install software that I think using them for your workflows is worth it.
For long-term projects (years or more) you should invest in some form of containerization solution,
along with all the security implications that go with it.
The next time I do a serious data science/bioinformatics project,
I&amp;rsquo;m probably going to do a long term sit down with Conda and
see if I can find a solution to the environment age problem,
because I&amp;rsquo;d really like to use that for all my work, all the time.&lt;/p>
&lt;h2 id="documenting-your-workflow">Documenting your workflow&lt;/h2>
&lt;p>This has been a long blogpost, so I&amp;rsquo;ll keep this short.
In order for users to be able to re-run your workflows,
they need instructions for doing so.
In terms of raw documentation, this pipeline (Snakemake + Conda)
generally boils down to only a few lines from installation to execution:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># install Miniconda from https://conda.io/miniconda.html&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>git clone https://github.com/your-pipeline.git
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>conda env create -f your-pipeline.yml
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source activate your-pipeline.yml
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># example execution for 24 cpus, actual snakemake execution command will likely differ&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>snakemake -j &lt;span style="color:#ae81ff">24&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This is really easy to shove in a &lt;code>README.md&lt;/code> on Github/Bitbucket/wherever.
That said, I&amp;rsquo;ve found that most users will want an in-person training session
where you walk them through the pipeline step-by-step
(&amp;ldquo;drop your files here&amp;rdquo;, let&amp;rsquo;s run through the following commands, etc.).
There&amp;rsquo;s not really any way around this -
you wouldn&amp;rsquo;t be performing the data analysis for them if they could do it themselves.
&lt;code>snakemake --dag | dot -Tsvg &amp;gt; dag.svg&lt;/code> is an incredibly useful command to produce
a workflow diagram to show your end user/data consumer how results are generated.
If you are the only user and all that matters is your end results,
the above installation instructions and a list of dependencies
are generally sufficient documentation for the future.&lt;/p>
&lt;p>I don&amp;rsquo;t have any magic tricks here,
but the above workflow generally simplifies and automates workflow deployment and execution
enough to make it doable for the average end-user to run.
All in all, the weakest point of this workflow is that Anaconda environments don&amp;rsquo;t age well -
if that ever gets fixed, I&amp;rsquo;d have few regrets.
Hopefully this was an informative read for those of you considering similar workflows.&lt;/p></content></item><item><title>Setting up LDAP auth for MariaDB</title><link>https://jstaf.github.io/posts/mariadb-ldap/</link><pubDate>Thu, 17 May 2018 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/mariadb-ldap/</guid><description>Having separate credentials to log onto a server and access a database on that network is a pain. Why not provide users just one set of credentials for both services? This is a quick how-to guide on how to setup LDAP authentication for MariaDB. As it turns out it&amp;rsquo;s insanely easy to setup. (The official MariaDB documentation on the subject can be quite hard to find however - which may or may not be the primary reason for this blog post&amp;hellip;).</description><content>&lt;p>Having separate credentials to log onto a server and access a database on that network is a pain.
Why not provide users just one set of credentials for both services?
This is a quick how-to guide on how to setup LDAP authentication for MariaDB.
As it turns out, it&amp;rsquo;s insanely easy to set up.
(The official MariaDB documentation on the subject can be quite hard to find however -
which may or may not be the primary reason for this blog post&amp;hellip;).&lt;/p>
&lt;p>This tutorial assumes that the database server has already been configured to authenticate users via LDAP
(blog post on this later!).
If you haven&amp;rsquo;t already, install MariaDB and set it up:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo yum install mariadb mariadb-server mariadb-devel
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo systemctl start mariadb
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo systemctl enable mariadb
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo mysql_secure_installation &lt;span style="color:#75715e"># yes to all prompts&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The next step is to login as the root user and enable the &lt;code>auth_pam&lt;/code> plugin.
&lt;code>auth_pam&lt;/code> delegates MariaDB user authentication to the base operating system through PAM.
PAM, or Pluggable Authentication Modules,
allow configuring authentication for different software packages via text file.
More on this later in this blog post.&lt;/p>
&lt;p>MariaDB ships with this plugin present, but not enabled.
You can install it with &lt;code>INSTALL SONAME 'auth_pam';&lt;/code>.
To use PAM authentication for a user,
create that user with &lt;code>IDENTIFIED VIA pam&lt;/code> in place of where you&amp;rsquo;d usually specify the user password.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>mysql -u root -p
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>INSTALL SONAME &lt;span style="color:#e6db74">&amp;#39;auth_pam&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>CREATE USER &lt;span style="color:#e6db74">&amp;#39;jstaf&amp;#39;&lt;/span>@&lt;span style="color:#e6db74">&amp;#39;%&amp;#39;&lt;/span> IDENTIFIED VIA pam;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If I wanted to create a test database for that user account
(I&amp;rsquo;ve named the database after the demo user in this case&amp;hellip;):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>CREATE DATABASE jstaf;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>GRANT ALL ON jstaf.* TO &lt;span style="color:#e6db74">&amp;#39;jstaf&amp;#39;&lt;/span>@&lt;span style="color:#e6db74">&amp;#39;%&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Ok so now that we&amp;rsquo;ve setup our demo user and our test database,
we&amp;rsquo;ll need to actually setup the PAM config for MariaDB.
MariaDB does its best to remain compatible with the original MySQL codebase it was forked from,
and in this case it is no different -
the PAM config for MariaDB is &lt;code>/etc/pam.d/mysql&lt;/code> by default.&lt;/p>
&lt;p>Create &lt;code>/etc/pam.d/mysql&lt;/code> with the following contents:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">#%PAM-1.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>auth required pam_ldap.so
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>account required pam_ldap.so
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>As PAM configs go, this is the absolute minimum.
After the first line of the file
(which merely identifies it as a PAM config to the OS),
the following two lines state that:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Authentication requires successfully authenticating against &lt;code>pam_ldap.so&lt;/code>,
the PAM module responsible for handling LDAP authentication.
A user will need to supply a valid password to pass the first line.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The account is indeed valid and meets any non-password authorization requirements
(also handled through &lt;code>pam_ldap.so&lt;/code>).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Clever users will notice that these same lines could be swapped out for other authentication modules.
As an example, &lt;code>pam_unix.so&lt;/code> covers standard authentication using local user accounts -
there&amp;rsquo;s a PAM module for pretty much every authentication mechanism out there.&lt;/p>
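&lt;p>For example, a hypothetical &lt;code>/etc/pam.d/mysql&lt;/code> that authenticates against local UNIX accounts instead of LDAP might look like the following (a minimal sketch - your distribution may ship extra PAM defaults worth including):&lt;/p>

```bash
#%PAM-1.0
# authenticate against local accounts in /etc/passwd and /etc/shadow
auth    required pam_unix.so
account required pam_unix.so
```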
&lt;p>Now all that&amp;rsquo;s left to do is login with your brand new LDAP-enabled user account:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>mysql -u jstaf -p
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Enter password:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Welcome to the MariaDB monitor. Commands end with ; or &lt;span style="color:#ae81ff">\g&lt;/span>.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Your MariaDB connection id is &lt;span style="color:#ae81ff">14&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Server version: 5.5.56-MariaDB MariaDB Server
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Copyright &lt;span style="color:#f92672">(&lt;/span>c&lt;span style="color:#f92672">)&lt;/span> 2000, 2017, Oracle, MariaDB Corporation Ab and others.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Type &lt;span style="color:#e6db74">&amp;#39;help;&amp;#39;&lt;/span> or &lt;span style="color:#e6db74">&amp;#39;\h&amp;#39;&lt;/span> &lt;span style="color:#66d9ef">for&lt;/span> help. Type &lt;span style="color:#e6db74">&amp;#39;\c&amp;#39;&lt;/span> to clear the current input statement.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>MariaDB &lt;span style="color:#f92672">[(&lt;/span>none&lt;span style="color:#f92672">)]&lt;/span>&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Success!&lt;/p></content></item><item><title>Remote backups with Borg and rsync</title><link>https://jstaf.github.io/posts/backups-with-borg-rsync/</link><pubDate>Mon, 12 Mar 2018 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/backups-with-borg-rsync/</guid><description>There&amp;rsquo;s a famous saying that &amp;ldquo;data that&amp;rsquo;s not backed up is data you&amp;rsquo;re prepared to lose.&amp;rdquo; I used Windows for a very long time, and managed to lose quite a bit of data back in the day because of either Windows Update bricking the system or just wanting to reinstall the OS (Windows has a habit of losing performance over time - easiest fix is a fresh install). I had been performing backups manually to an external hard disk, frequently forgot to backup something critical, and only had backups when I cared to make them (i.</description><content>&lt;p>There&amp;rsquo;s a famous saying that &amp;ldquo;data that&amp;rsquo;s not backed up is data you&amp;rsquo;re prepared to lose.&amp;rdquo;
I used Windows for a very long time, and managed to lose quite a bit of data
back in the day because of either Windows Update bricking the system or just
wanting to reinstall the OS (Windows has a habit of losing performance
over time - easiest fix is a fresh install).
I had been performing backups manually to an external hard disk,
frequently forgot to backup something critical, and only had backups when I cared to make
them (i.e. rarely). Fortunately, there&amp;rsquo;s a better way of doing things:
automated backups to a remote server with setup-and-forget tools like rsync and borg.
I haven&amp;rsquo;t lost data since.&lt;/p>
&lt;h2 id="before-we-start">Before we start&lt;/h2>
&lt;p>To use either of these tools, all you need is a UNIX system (Mac/Linux) and a
server or storage device to back up to. There are no other requirements. If
you don&amp;rsquo;t have a system of your own, I highly recommend
&lt;a href="http://rsync.net">Rsync.net&lt;/a>. Rsync.net is a very cheap/reliable backup
provider that simply gives you an SSH endpoint to dump your files in. Though plans
vary, the price is about $1 per GB stored per year, which is quite affordable, and the
service comes with free snapshots and support. If you choose this option,
be aware that there&amp;rsquo;s also a &lt;a href="http://www.rsync.net/products/attic.html">secret pricing tier&lt;/a>
for borg users which gives heavily discounted plans that do not include
support or snapshots (since borg does that for you).&lt;/p>
&lt;p>If you&amp;rsquo;ll be backing up to a remote system, you&amp;rsquo;ll want to set up passwordless SSH before you start
(all of the defaults are fine here, do not enter a passphrase for your key). This is more secure
than using a password to connect, and it means that backups over SSH can be performed non-interactively.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>ssh-keygen -t rsa
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># hit enter for all the prompts here, you typically do not want to set a passphrase&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ssh-copy-id username@server.web.address
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Before you do anything else, make sure that connecting over SSH
to your backup server no longer prompts you for a password.&lt;/p>
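&lt;p>One way to check this non-interactively (a hedged sketch - substitute your own username and server address) is to force SSH to fail rather than prompt:&lt;/p>

```bash
# BatchMode disables all interactive prompts; if key-based auth isn't
# working, this fails immediately instead of asking for a password
ssh -o BatchMode=yes username@server.web.address true && echo "passwordless SSH works"
```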
&lt;p>Alternatively, if you are backing up to a network drive, make sure you know how to mount and unmount
the drive via the command line (typically via &lt;code>mount&lt;/code> and &lt;code>umount&lt;/code>). Your backups should never be
mounted to your computer unless you are backing things up!
(This reduces the risk of attackers or accidents breaking your precious backups.)
Once this is done, you should be set.&lt;/p>
&lt;hr>
&lt;h2 id="simple-backups-with-rsync">Simple backups with rsync&lt;/h2>
&lt;p>&lt;code>rsync&lt;/code> is a very handy file-copying tool that performs easy, straightforward backups.
Backups are unencrypted and unauthenticated, but it&amp;rsquo;s trivial to set up and restore from.
If all you want is an up-to-date backup when things go bad, &lt;code>rsync&lt;/code> is the tool for you.&lt;/p>
&lt;h3 id="to-create-a-backup">To create a backup&lt;/h3>
&lt;p>This performs a very simple backup to any storage device.
Files are copied as-is, and all attributes (ownership, permissions, modification times, etc.)
are preserved. No authentication or encryption is performed,
meaning that anyone could get at your files if they can access your storage media.
Files that you delete are deleted from your backup,
and only files modified since the last backup are uploaded.&lt;/p>
&lt;p>&lt;strong>To back up to a local disk or network drive:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>rsync -az --delete /folder/to/back/up /destination/folder
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>To back up to a remote server:&lt;/strong>&lt;/p>
&lt;p>Note that this command connects to the remote server over SSH, meaning that
information is encrypted while being transferred to the remote server. The
actual backups themselves, however, are unencrypted.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>rsync -az --delete -e ssh /folder/to/back/up username@remote.host.address:/destination/folder
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="to-restore-from-a-backup">To restore from a backup&lt;/h3>
&lt;p>To restore files from your remote backup, no special magic is required - just reverse
the source and destination folders in the above command. Alternatively, you can
use tools like &lt;code>scp&lt;/code> or &lt;code>sftp&lt;/code> to restore individual files.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>rsync -az -e ssh username@remote.host.address:/destination/folder /folder/to/restore/in
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="pros-and-cons-of-rsync">Pros and cons of rsync&lt;/h3>
&lt;p>rsync is useful for its simplicity. You get easy-to-perform backups with virtually no learning curve,
and restoring files is a breeze (just copy them back!).&lt;/p>
&lt;p>There are some big drawbacks to this method, however. rsync is very space inefficient -
no compression or deduplication is performed. You only get access to a single backup as well.
If you want multiple backups for multiple dates, you&amp;rsquo;ll need to manage these manually, and
each extra backup will take up an equal amount of space (7 days&amp;rsquo; worth of backups == 7 times the storage usage).
Anyone with access to your storage media will also have access to your files.&lt;/p>
&lt;p>In light of this info, rsync is a great tool for fast and dirty backups to local storage media,
or when you are confident that your backup location is secure and cannot be accessed by anyone else.
If you want multiple backups and access controls, you&amp;rsquo;ll need a different tool.&lt;/p>
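&lt;p>That said, if you want to squeeze dated snapshots out of rsync anyway, its &lt;code>--link-dest&lt;/code> flag can hard-link unchanged files against the previous snapshot, so each extra day of backups costs almost no additional space. A rough local sketch (the throwaway &lt;code>mktemp&lt;/code> directories stand in for your real source folder and backup drive):&lt;/p>

```shell
# demo directories; swap these for your real source folder and backup drive
SRC=$(mktemp -d)
DEST=$(mktemp -d)
echo "important data" > "$SRC/notes.txt"

# copy today's snapshot, hard-linking unchanged files against the last one
TODAY=$(date -I)
rsync -a --delete --link-dest="$DEST/latest" "$SRC/" "$DEST/$TODAY/"

# point the "latest" symlink at the snapshot we just made
ln -sfn "$DEST/$TODAY" "$DEST/latest"
```

&lt;p>On the very first run rsync just warns that the &lt;code>--link-dest&lt;/code> target doesn&amp;rsquo;t exist yet and copies everything; subsequent runs only store changed files.&lt;/p>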
&lt;hr>
&lt;h2 id="secure-backups-with-borg">Secure backups with borg&lt;/h2>
&lt;p>Borg is a fantastic tool that covers the weaknesses of rsync without sacrificing much in terms of usability.
In particular, you&amp;rsquo;ll be able to keep multiple backups, save space through deduplication and compression,
and secure your data with either passwords or a keyfile.&lt;/p>
&lt;h3 id="setup">Setup&lt;/h3>
&lt;p>Borg requires a little bit of additional setup before you can start using it.
Having borg installed on the remote server will speed things up.
This is already done for you if you use Rsync.net, although you should specify
the environment variable &lt;code>BORG_REMOTE_PATH&lt;/code> to use the most recent version of borg available:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># only for Rsync.net users&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_REMOTE_PATH&lt;span style="color:#f92672">=&lt;/span>/usr/local/bin/borg1/borg1
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Additionally, you will need to initialize your repository before you can use it.
To create a new repository that&amp;rsquo;s password-protected, use the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># see &amp;#34;borg init --help&amp;#34; for more options like storage quotas, encryption options, etc.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>borg init -e repokey-blake2 username@remote.host.address:/destination/folder
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="creating-a-backup">Creating a backup&lt;/h3>
&lt;p>For your first backup, you may wish to do it interactively so you can watch the progress
and verify that things work. The following creates a backup titled &lt;code>backup-name&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>borg create --progress --stats username@remote.host.address:/destination/folder::backup-name /folder/to/back/up
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For subsequent backups, you will likely want to do things non-interactively. You can use the
following to create an automatically-named backup (computer name + date).
Note that further commands assume you&amp;rsquo;ve set the &lt;code>BORG_REPO&lt;/code> environment variable
(specifying a default repository to back up to).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># specify a password for non-interactive use&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_PASSPHRASE&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;your repository password&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># specify the default repository to use for backups&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_REPO&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;username@remote.host.address:/destination/folder&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>borg create ::&lt;span style="color:#66d9ef">$(&lt;/span>hostname&lt;span style="color:#66d9ef">)&lt;/span>-&lt;span style="color:#66d9ef">$(&lt;/span>date -I&lt;span style="color:#66d9ef">)&lt;/span> /folder/to/back/up
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="cleaning-up-old-backups">Cleaning up old backups&lt;/h3>
&lt;p>Chances are, you will not want to keep every backup ever made. You might want
to keep, say, only 7 days&amp;rsquo; worth of daily backups, 8 weeks of weekly backups,
and 12 months of monthly backups. To do so:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>borg prune --keep-daily &lt;span style="color:#ae81ff">7&lt;/span> --keep-weekly &lt;span style="color:#ae81ff">8&lt;/span> --keep-monthly &lt;span style="color:#ae81ff">12&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="inspecting-backups">Inspecting backups&lt;/h3>
&lt;p>To view a list of all backups:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>borg list
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># follows is a list of backups, dates, and ids&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To view all files within a backup (EXTREMELY VERBOSE, so output has been piped to &lt;code>head&lt;/code>):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>borg list ::backup-name | head
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="restoring-from-a-backup">Restoring from a backup&lt;/h3>
&lt;p>Restoring files is quite easy, although they are extracted to the current working directory
(so if you back up &lt;code>/home/youruser/some-folder&lt;/code>, expect it to recreate that directory structure
unless you &lt;code>cd&lt;/code> to the root directory).
To extract a single file:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cd /location/to/restore/to
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>borg extract ::archive file/to/restore
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To extract all files:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cd /location/to/restore/to
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>borg extract ::archive
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="automating-your-backups-and-sample-scripts">Automating your backups (and sample scripts!)&lt;/h2>
&lt;p>To run your backups automatically, you&amp;rsquo;ll want to create a script, and run it automatically through &lt;code>cron&lt;/code>.
Though &lt;code>crontab&lt;/code> is normally a great way to do this (run a task at a specified time and day), it is not
very flexible - if you set it to perform backups at 3am, and you&amp;rsquo;re not logged onto your laptop at 3am,
the backup won&amp;rsquo;t happen! Instead, we&amp;rsquo;ll create a script and put it in &lt;code>/etc/cron.daily&lt;/code>. On most
distributions, scripts here are run once a day by anacron, which catches up on any missed runs
shortly after the computer comes back online.
Here are some sample scripts that you can use for either rsync or borg.
Installation is the same - just copy these to &lt;code>/etc/cron.daily&lt;/code>.
You&amp;rsquo;ll note I&amp;rsquo;ve fully specified the path to programs here - this is a &amp;ldquo;best practice&amp;rdquo; when working with
scripts to be run under &lt;code>cron&lt;/code>.&lt;/p>
&lt;h3 id="rsync-example-script">rsync example script&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#75715e"># Backup a folder to a remote address using rsync.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Usage: backup-rsync.sh&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># To restore: rsync -az -e ssh username@remote.host.address:backups/$(hostname)/folder /restore/point&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>set -eu
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>/usr/bin/ssh username@remote.host.address mkdir -p backups/&lt;span style="color:#66d9ef">$(&lt;/span>hostname&lt;span style="color:#66d9ef">)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>/usr/bin/rsync -az --delete -e ssh /folder/to/back/up username@remote.host.address:backups/&lt;span style="color:#66d9ef">$(&lt;/span>hostname&lt;span style="color:#66d9ef">)&lt;/span>/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="borg-example-script">Borg example script&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#75715e"># Backup a folder to a remote address using borg.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Usage: backup-borg.sh&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># To restore: borg extract $BORG_REPO::computer-and-date&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>set -eu
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_REPO&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;username@remote.host.address:borg/repo/path&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_PASSPHRASE&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;your password&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_REMOTE_PATH&lt;span style="color:#f92672">=&lt;/span>/path/to/remote/borg
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>/usr/bin/borg create ::&lt;span style="color:#66d9ef">$(&lt;/span>hostname&lt;span style="color:#66d9ef">)&lt;/span>-&lt;span style="color:#66d9ef">$(&lt;/span>date&lt;span style="color:#66d9ef">)&lt;/span> /folder/to/back/up
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>/usr/bin/borg prune ::&lt;span style="color:#66d9ef">$(&lt;/span>hostname&lt;span style="color:#66d9ef">)&lt;/span>-&lt;span style="color:#66d9ef">$(&lt;/span>date&lt;span style="color:#66d9ef">)&lt;/span> --keep-daily&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">14&lt;/span> --keep-monthly&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">6&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="youre-set">You&amp;rsquo;re set!&lt;/h2>
&lt;p>Assuming you&amp;rsquo;ve added one of these scripts to your &lt;code>/etc/cron.daily/&lt;/code> folder,
all you have to do is wait. If you&amp;rsquo;ve added &lt;code>BORG_REPO&lt;/code> to your &lt;code>.bashrc&lt;/code>,
you can check in and verify that your backups are working properly with
&lt;code>borg list&lt;/code> (you should see a list of your current backups).&lt;/p></content></item><item><title>Installing FL Studio on Linux</title><link>https://jstaf.github.io/posts/flstudio-on-linux/</link><pubDate>Thu, 22 Feb 2018 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/flstudio-on-linux/</guid><description>Linux does a lot of things well. Music production is not usually one of them, mainly due to a lack of good programs available on Linux. Fortunately FL Studio, one of the most popular DAWs out there works flawlessly through Wine.
Wine is a Windows compatibility layer for Linux. You can often run Windows programs with it, though personally, my success has been mixed (especially for performance critical applications like videogames).</description><content>&lt;p>Linux does a lot of things well.
Music production is not usually one of them,
mainly due to a lack of good programs available on Linux.
Fortunately, FL Studio, one of the most popular DAWs out there, works flawlessly through Wine.&lt;/p>
&lt;p>Wine is a Windows compatibility layer for Linux.
You can often run Windows programs with it,
though personally, my success has been mixed
(especially for performance critical applications like videogames).
In this case though, we can use it to run FL Studio, and it works perfectly.&lt;/p>
&lt;p>Here&amp;rsquo;s a quick preview of our end product
(sorry for the terrible video quality, but it shows we have a nice, working install):&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe src="https://www.youtube.com/embed/pPJ4lRKLOHk" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" allowfullscreen title="YouTube Video">&lt;/iframe>
&lt;/div>
&lt;hr>
&lt;h2 id="setting-up-wine">Setting up Wine&lt;/h2>
&lt;p>The following instructions are written using Fedora,
but should work on any variety of Linux
(adapt the next command to your package manager of choice).&lt;/p>
&lt;p>To start, we&amp;rsquo;ll need &lt;code>wine&lt;/code> and &lt;code>winetricks&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf install wine winetricks
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Once this completes, you&amp;rsquo;ll need to install some fonts needed by FL Studio.
Run &lt;code>winetricks&lt;/code> in the console, and select &amp;ldquo;select the default wineprefix&amp;rdquo;.
Install the &amp;ldquo;core&amp;rdquo; Microsoft fonts.&lt;/p>
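&lt;p>If you&amp;rsquo;d rather skip the menus, winetricks can usually do the same thing in one shot from the command line (assuming the &lt;code>corefonts&lt;/code> verb is available in your winetricks version):&lt;/p>

```bash
# install the Microsoft core fonts into the default wineprefix
winetricks corefonts
```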
&lt;h2 id="install-fl-studio">Install FL Studio&lt;/h2>
&lt;p>Download the FL Studio installer from the official website.
Once it&amp;rsquo;s downloaded, run the following command and install with all of the default settings:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>wine flstudio_12.5.1.165.exe
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Protip: in terms of ASIO drivers, use FL Studio ASIO instead of ASIO4ALL (it&amp;rsquo;s just better).&lt;/p>
&lt;p>While things are installing, download your registration key from the FL Studio website
(&lt;code>FLRegkey.Reg&lt;/code>). You can import it into the Wine registry with Wine&amp;rsquo;s &lt;code>regedit&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>regedit FLRegkey.Reg
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Congrats, you now have a fully working version of FL Studio on Linux.
And before you ask, yes - all of your VST plugins will work out of the box.&lt;/p></content></item><item><title>About me</title><link>https://jstaf.github.io/about/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/about/</guid><description>Hi there! I am a sysadmin currently living in Toronto. I started my career as a neuroscientist, dabbled for a bit as a bioinformatician, but now I spend most of my time fiddling with databases and Kubernetes on various clouds. I collect job titles and hobbies, and my favorite pastime is coding while simultaneously binging B-grade scifi/fantasy tv shows (and making no progress on my personal projects). I really like bunnies.</description><content>&lt;p>&lt;img alt="Me" src="https://jstaf.github.io/images/me.jpg#floatright"> Hi there! I am a sysadmin currently living in
Toronto. I started my career as a neuroscientist, dabbled for a bit as a
bioinformatician, but now I spend most of my time fiddling with databases and
Kubernetes on various clouds. I collect job titles and hobbies, and my favorite
pastime is coding while simultaneously binging B-grade scifi/fantasy tv shows
(and making no progress on my personal projects). I &lt;em>really&lt;/em> like bunnies.&lt;/p>
&lt;p>The best way to reach me is via email at
&lt;code>jeff (dot) stafford (at) protonmail (dot) com&lt;/code> (I am bad at replying sometimes
but try to always eventually get back to everyone!).&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Github&lt;/strong>: &lt;a href="https://github.com/jstaf">https://github.com/jstaf&lt;/a>&lt;/li>
&lt;li>&lt;strong>Email&lt;/strong>: &lt;code>jeff (dot) stafford (at) protonmail (dot) com&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Phone&lt;/strong>: (Please email me for my number if you&amp;rsquo;d like to reach me by phone -
I unfortunately had to remove my number from this website after too many
unsolicited phone calls.)&lt;/li>
&lt;/ul></content></item><item><title>Showcase</title><link>https://jstaf.github.io/showcase/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/showcase/</guid><description>These are a bunch of public personal projects in various mediums that I&amp;rsquo;ve dabbled with over the years.
onedriver I was really irritated that Microsoft OneDrive didn&amp;rsquo;t support Linux, and all of the existing OneDrive clients were kind of bad at the time (you want to download my entire OneDrive account to my local computer? Yuck.). This was my first golang project and has kind of taken on a life of its own with several tens of thousands of users.</description><content>&lt;p>These are a bunch of public personal projects in various mediums that I&amp;rsquo;ve
dabbled with over the years.&lt;/p>
&lt;h1 id="onedriver">onedriver&lt;/h1>
&lt;p>I was really irritated that Microsoft OneDrive didn&amp;rsquo;t support Linux,
and all of the existing OneDrive clients were kind of bad at the time
(you want to download my entire OneDrive account to my local computer? Yuck.).
This was my first golang project and has kind of taken on a life of its own
with several tens of thousands of users.
I attribute a lot of this success to it being easy to use and install compared to
the existing alternatives.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Github repository:&lt;/strong> &lt;a href="https://github.com/jstaf/onedriver">https://github.com/jstaf/onedriver&lt;/a>&lt;/li>
&lt;li>&lt;strong>Fedora COPR .rpm repos:&lt;/strong> &lt;a href="https://copr.fedorainfracloud.org/coprs/jstaf/onedriver/">https://copr.fedorainfracloud.org/coprs/jstaf/onedriver/&lt;/a>&lt;/li>
&lt;li>&lt;strong>OpenSUSE Build Service .deb repos:&lt;/strong> &lt;a href="https://software.opensuse.org/download.html?project=home%3Ajstaf&amp;package=onedriver">https://software.opensuse.org/download.html?project=home%3Ajstaf&amp;package=onedriver&lt;/a>&lt;/li>
&lt;li>&lt;strong>AUR:&lt;/strong> &lt;a href="https://aur.archlinux.org/packages/onedriver">https://aur.archlinux.org/packages/onedriver&lt;/a>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="mayorate">mayorate&lt;/h1>
&lt;p>&lt;img alt="Playing with nukes" src="https://jstaf.github.io/images/nukes.gif#floatright">&lt;/p>
&lt;p>This was probably my first real coding project.
I was &lt;em>really&lt;/em> into this videogame called Starfarer many years ago
(now &lt;a href="https://fractalsoftworks.com/">Starsector&lt;/a>),
and decided to make a mod for it.
Though I don&amp;rsquo;t really actively play Starsector anymore,
I occasionally like to update the mod to work with the latest version
of the game and boot things up again for old times&amp;rsquo; sake.&lt;/p>
&lt;p>This project was very important to me because it was what made me realize that
computing was easy and more importantly, &lt;em>I enjoyed it&lt;/em>.
It was basically the start of what would later be a career.
(This was in contrast to science, which I actually hated working on -
I really just liked playing with the computers.)&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Github repository:&lt;/strong> &lt;a href="https://github.com/jstaf/mayorate">https://github.com/jstaf/mayorate&lt;/a>&lt;/li>
&lt;li>&lt;strong>Forum thread:&lt;/strong> &lt;a href="https://fractalsoftworks.com/forum/index.php?topic=7372.0">https://fractalsoftworks.com/forum/index.php?topic=7372.0&lt;/a>&lt;/li>
&lt;li>&lt;strong>Starsector website&lt;/strong> (you need to buy a copy if you want to try this out): &lt;a href="https://fractalsoftworks.com/">https://fractalsoftworks.com/&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;img alt="MDSV Narayana" src="https://jstaf.github.io/images/nara.png#centre">
An early rigging of the largest capital ship in the mod,
along with several of its fighters.&lt;/p>
&lt;p>&lt;img alt="The surface of Inir" src="https://jstaf.github.io/images/inir_surface.png#centre">
The hellish surface of the Mayorate mining planet of Inir.
Probably my favorite digital painting I did for this project.&lt;/p>
&lt;hr>
&lt;h1 id="ezldap">ezldap&lt;/h1>
&lt;p>I used to work at a supercomputing facility that used OpenLDAP as an identity provider
for its compute clusters. If you&amp;rsquo;ve ever used OpenLDAP, you&amp;rsquo;ll know it&amp;rsquo;s a royal pain
to work with and you basically have to come up with your own tooling and
directory structure, which is a TON of work compared to Active Directory or FreeIPA.
ezldap was a set of Python scripts and clever templating used to make managing users,
groups, etc. much easier than it otherwise was using stuff like
&lt;code>ldapmodify&lt;/code> and &lt;code>ldapsearch&lt;/code>.&lt;/p>
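&lt;p>The gist of the templating approach was something like the following sketch
(a simplified, hypothetical example using Python&amp;rsquo;s built-in
&lt;code>string.Template&lt;/code>, not the actual ezldap code or API): render an LDIF
snippet from a template instead of hand-writing &lt;code>ldapadd&lt;/code> input every time.&lt;/p>

```python
# Simplified sketch of ezldap-style LDIF templating (hypothetical template,
# not the real ezldap internals): fill in an "add user" LDIF from a template
# rather than hand-writing ldapmodify/ldapadd input each time.
from string import Template

# A hypothetical LDIF template for a POSIX user entry.
USER_LDIF = Template("""\
dn: uid=$uid,ou=People,$basedn
objectClass: inetOrgPerson
objectClass: posixAccount
uid: $uid
cn: $cn
uidNumber: $uidnumber
gidNumber: $gidnumber
homeDirectory: /home/$uid
""")

def render_user(uid, cn, uidnumber, gidnumber, basedn="dc=example,dc=com"):
    """Render an LDIF snippet ready to pipe to ldapadd."""
    return USER_LDIF.substitute(
        uid=uid, cn=cn, uidnumber=uidnumber,
        gidnumber=gidnumber, basedn=basedn,
    )

print(render_user("jdoe", "Jane Doe", 10001, 10001))
```

&lt;p>The rendered output can then be fed straight to &lt;code>ldapadd&lt;/code>, which is
the part that made day-to-day OpenLDAP administration bearable.&lt;/p>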
&lt;p>I stopped working on this project after rolling out FreeIPA and/or IAM-based access
at subsequent places of employment. FreeIPA does everything ezldap did for
OpenLDAP, but better.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Official documentation:&lt;/strong> &lt;a href="https://ezldap.readthedocs.io/en/latest/">https://ezldap.readthedocs.io/en/latest/&lt;/a>&lt;/li>
&lt;li>&lt;strong>Github repository:&lt;/strong> &lt;a href="https://github.com/jstaf/ezldap">https://github.com/jstaf/ezldap&lt;/a>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="teaching-materials">Teaching materials&lt;/h1>
&lt;p>While working at the supercomputing facility, roughly half of my job was not
maintaining our compute clusters and running analyses, but teaching scientists
and doctors how to use our systems.
I wrote a lot of teaching materials and ran a ton of workshops on various computing
topics, including teaching as part of some for-credit graduate courses at
Queen&amp;rsquo;s University.
Though I don&amp;rsquo;t really teach these types of workshops anymore, a lot of my teaching
materials have been adopted by the community and now see widespread use across the
world (the &amp;ldquo;Intro to HPC&amp;rdquo; and Snakemake courses I wrote for Software Carpentry are
&lt;em>extremely&lt;/em> popular).&lt;/p>
&lt;blockquote>
&lt;p>&amp;ldquo;Write a genomics pipeline for a scientist and you can frustrate them for a day.&amp;rdquo;&lt;/p>
&lt;p>&amp;ldquo;Teach a scientist how to program and you can frustrate them for a lifetime.&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>&lt;strong>Intro to High-Performance Computing:&lt;/strong> &lt;a href="https://carpentries-incubator.github.io/hpc-intro/">https://carpentries-incubator.github.io/hpc-intro/&lt;/a>&lt;/li>
&lt;li>&lt;strong>Intro to High-Performance Computing in Python (Snakemake):&lt;/strong> &lt;a href="http://www.hpc-carpentry.org/hpc-python/">http://www.hpc-carpentry.org/hpc-python/&lt;/a>&lt;/li>
&lt;li>&lt;strong>Data Science with R:&lt;/strong> &lt;a href="https://jstaf.github.io/r-data-science/">https://jstaf.github.io/r-data-science/&lt;/a>&lt;/li>
&lt;li>&lt;strong>HPC R:&lt;/strong> &lt;a href="https://jstaf.github.io/hpc-r/">https://jstaf.github.io/hpc-r/&lt;/a>&lt;/li>
&lt;li>&lt;strong>R Package Development:&lt;/strong> &lt;a href="https://github.com/jstaf/r-package-devel">https://github.com/jstaf/r-package-devel&lt;/a>&lt;/li>
&lt;li>&lt;strong>BioMaRt:&lt;/strong> &lt;a href="https://github.com/jstaf/biomaRt_tutorial">https://github.com/jstaf/biomaRt_tutorial&lt;/a>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="music">Music&lt;/h1>
&lt;p>Before I discovered programming, my hobby was writing music. I was never particularly
good at it. I would spend all my time playing with synthesizers (all-time favorite
is Audjoo Helix) and mixing the tracks so they sounded perfect, but writing a melody,
and the actual composition in general, was always a massive struggle. It was agony
to write a song from start to finish and keep the style consistent the whole
way through.&lt;/p>
&lt;p>I was a giant purist who thought that using any kind of drum loops, sampling
from existing songs, or even using pre-existing synthesizer presets was cheating,
so most of my time was spent tweaking knobs to generate my instruments before I
ever even got started. Obviously, I got very little done. Most of the time I
would write something, come back the next morning, decide I hated it, and
discard what I had before starting over. Even though I wrote a lot of different
stuff, generally the only songs I could reliably complete were soundtracks:
I had a direct use for them (I was playing around with creating videogames at
the time) and they also required genuine recorded instruments like strings, so I
would spend less time tweaking knobs and more time on music. (There&amp;rsquo;s probably a
lesson to be learned here somewhere&amp;hellip;)&lt;/p>
&lt;p>I&amp;rsquo;ve mostly stopped writing music now (gave myself tinnitus and decided I had to stop
if I didn&amp;rsquo;t want it to get worse), but these are a selection of the fully completed
tracks I like the most.&lt;/p>
&lt;hr>
&lt;p>&lt;iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay"
src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/265956876&amp;color=%231c1511&amp;auto_play=false&amp;hide_related=true&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false&amp;show_teaser=true">&lt;/iframe>
This track never really got a title beyond the original filename (&lt;code>nebula.flp&lt;/code>)
when I created it. It was supposed to be ambient background music
for while you&amp;rsquo;re flying around space in one of my videogame projects, but it
ended up as probably my best track and somewhat dominates the mood whenever this
song would play in-game. (This sounds like a failure for what&amp;rsquo;s supposed to be
&amp;ldquo;background music&amp;rdquo;, but it actually works really well.)
I was actively trying to avoid using drums here, as at the time I didn&amp;rsquo;t think I was
capable of writing a song without heavy drums all over the place.
Fortunately, I was wrong. This is probably the only track I am 100% pleased with
from start to finish.&lt;/p>
&lt;hr>
&lt;p>&lt;iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay"
src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/164824479&amp;color=%231c1511&amp;auto_play=false&amp;hide_related=true&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false&amp;show_teaser=true">&lt;/iframe>
One of very few tracks I finished that qualifies as anything close to normal music.
I wasted a ton of time with synthesizers here; every non-piano/drum instrument
in this one was again synthesized by hand. The middle bit beginning at 2:00
is probably the closest I&amp;rsquo;ve gotten to releasing a professional-grade trance track.&lt;/p>
&lt;hr>
&lt;p>&lt;iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay"
src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/136247231&amp;color=%231c1511&amp;auto_play=false&amp;hide_related=true&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false&amp;show_teaser=true">&lt;/iframe>
This is a weird one. It doesn&amp;rsquo;t lend itself well to use in soundtracks (the
whole song is a giant crescendo) or as normal music and there is a lot of odd
stuff going on. The weird pulsing sound at 1:37 is actually a kick drum
resampled in a horrible way and an ungodly amount of FX plugins mashing up what
remains. I synthesized virtually all sounds in this track from scratch aside
from the opening pad and guitars (even those got pretty warped though).&lt;/p>
&lt;hr>
&lt;p>&lt;iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay"
src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/132964843&amp;color=%231c1511&amp;auto_play=false&amp;hide_related=true&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false&amp;show_teaser=true">&lt;/iframe>
Steel Rain was probably my most popular song because it works perfectly for what
it was meant to be (a videogame battle soundtrack) and also evokes the same feeling
as the classic game Homeworld (notable for its very atmospheric,
space-y sound). This was the easiest of the tracks to write. I found
some exceptional taiko samples on a random corner of the internet, added in an African
drum set, and then just went to town with them and a couple of very menacing soundfonts
and pads. It basically wrote itself once I had the right sounds to start with.&lt;/p>
&lt;hr>
&lt;h1 id="actmon">actmon&lt;/h1>
&lt;p>This was an R package I wrote during grad school to compute statistics on
&lt;em>Drosophila melanogaster&lt;/em> behavior as measured by TriKinetics&amp;rsquo;
&lt;em>Drosophila&lt;/em> Activity Monitor. It&amp;rsquo;s the one and only R package I&amp;rsquo;ve released to date.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Github repository:&lt;/strong> &lt;a href="https://github.com/jstaf/actmon">https://github.com/jstaf/actmon&lt;/a>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="gcamp4d">gcamp4d&lt;/h1>
&lt;p>This was a MATLAB GUI application I wrote to better measure changes in neuronal
activity over time when using GCaMP (a fluorescent protein used to measure
neuron activity) and a very specific imaging setup on a confocal laser microscope.
I can&amp;rsquo;t imagine anyone outside of my old lab using this, but it made some pretty
sweet 3D images of neurons over time.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Github repository:&lt;/strong> &lt;a href="https://github.com/jstaf/GCaMP_4D">https://github.com/jstaf/GCaMP_4D&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;img alt="Neurons" src="https://jstaf.github.io/images/gcamp4d.png#centre">
Two neurons imaged &lt;em>in vivo&lt;/em> activating in response to a stimulus.&lt;/p></content></item></channel></rss>