<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Jeff Stafford</title><link>https://jstaf.github.io/</link><description>Recent content on Jeff Stafford</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Jeff Stafford</copyright><lastBuildDate>Sun, 02 Jul 2023 00:00:00 +0000</lastBuildDate><atom:link href="https://jstaf.github.io/index.xml" rel="self" type="application/rss+xml"/><item><title>Red Hat licensing changes and the long, slow death of a community</title><link>https://jstaf.github.io/posts/rhel-death-of-a-community/</link><pubDate>Sun, 02 Jul 2023 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/rhel-death-of-a-community/</guid><description>If you weren&amp;rsquo;t already aware, the Linux community is currently up in a kerfuffle about Red Hat&amp;rsquo;s latest licensing changes. To summarize:
Red Hat makes an operating system called Red Hat Enterprise Linux (RHEL), and until a week or two ago, also published the source code publicly in the spirit of free and open source software. Red Hat makes money by selling RHEL subscriptions and support. Throughout the years, other organizations have republished RHEL for free (CentOS, Oracle Linux, Alma Linux, Rocky Linux, etc.</description><content>&lt;p>If you weren&amp;rsquo;t already aware, the Linux community is currently up in a
kerfuffle about Red Hat&amp;rsquo;s latest licensing changes. To summarize:&lt;/p>
&lt;ul>
&lt;li>Red Hat makes an operating system called Red Hat Enterprise Linux (RHEL), and until a
week or two ago, also published the source code publicly in the spirit of free
and open source software.&lt;/li>
&lt;li>Red Hat makes money by selling RHEL subscriptions and support.&lt;/li>
&lt;li>Throughout the years, other organizations have republished RHEL for free
(CentOS, Oracle Linux, Alma Linux, Rocky Linux, etc.).&lt;/li>
&lt;li>In an effort to kill the republished copies of RHEL, Red Hat will no longer
publish its source code publicly in a way that&amp;rsquo;s easy to rebuild. Instead, it
now publishes the source to CentOS Stream, a &amp;ldquo;testing OS&amp;rdquo; that is very similar
to RHEL, but not quite compatible with software built for it.&lt;/li>
&lt;/ul>
&lt;p>Anyhow, the community is upset, because it means RHEL as a free and open-source
product will likely no longer be available. Yes, you can still pay for RHEL.
Yes, you can get a free developer subscription covering up to 16 systems, as
long as you don&amp;rsquo;t use them for anything that could possibly earn you money. Yes, you can still get a different
free product with CentOS Stream. Yes, Red Hat&amp;rsquo;s changes
&lt;a href="https://sfconservancy.org/blog/2023/jun/23/rhel-gpl-analysis/">&lt;em>might&lt;/em>&lt;/a> even be
legal, even if they no longer follow the spirit of the GPL license that makes
open-source software possible.&lt;/p>
&lt;h2 id="but-this-sucks">But this sucks.&lt;/h2>
&lt;p>If you are a sysadmin, developer, or just anyone who works with computers
professionally, you&amp;rsquo;re going to learn a lot of &amp;ldquo;stuff&amp;rdquo; over the course of your
career. I&amp;rsquo;ll speak personally from my experience as a sysadmin here (call it
&amp;ldquo;devops&amp;rdquo;, &amp;ldquo;sre&amp;rdquo;, &amp;ldquo;platform engineer&amp;rdquo;, or whatever the job title of the month is,
it&amp;rsquo;s the same job). This &amp;ldquo;stuff&amp;rdquo; you need to learn includes operating systems
(like RHEL), programming languages, infrastructure-as-code tools, and other more
esoteric stuff like Kubernetes, cloud providers, databases, and more.&lt;/p>
&lt;p>This is a lot of stuff to learn, but over the course of your career, you&amp;rsquo;ll
build a &amp;ldquo;stack&amp;rdquo; of tools you&amp;rsquo;re intimately familiar with and can take with you
between jobs. It&amp;rsquo;s a very empowering feeling, and these skills are
basically your career and job mobility. You can take them anywhere and do
anything a computer can possibly do - for free!&lt;/p>
&lt;p>As a sysadmin, pretty much the first step of building this stack is picking an
operating system and becoming intimately familiar with it. Right now there are
sort of 3 major Linux OS ecosystems that people choose from for work:&lt;/p>
&lt;ul>
&lt;li>Red Hat land: RHEL, Fedora, and the RHEL clones (Alma Linux, Rocky Linux,
Oracle Linux, CentOS Stream).&lt;/li>
&lt;li>Debian and friends: Debian, Ubuntu, and Pop!_OS&lt;/li>
&lt;li>SUSE: OpenSUSE Tumbleweed, OpenSUSE Leap, and SLES&lt;/li>
&lt;/ul>
&lt;p>&lt;img alt="Linux world map" src="https://jstaf.github.io/images/linux-world-map-large.png#centre">&lt;/p>
&lt;p>There are a couple common elements here:&lt;/p>
&lt;ul>
&lt;li>Each landmass (er, ecosystem?) has a fast-moving, high-quality desktop OS you
can use on your laptop that&amp;rsquo;s similar to the ones you&amp;rsquo;ll use on servers. This
isn&amp;rsquo;t at issue here, so I&amp;rsquo;ll gloss over it.&lt;/li>
&lt;li>Each OS &amp;ldquo;ecosystem&amp;rdquo; has one OS with long-term support (2 years or more). This
is essential for businesses because once you have a lot of servers, it&amp;rsquo;s just
impractical to be upgrading the OS all the time.&lt;/li>
&lt;li>No licensing restrictions. All of these ecosystems have the ability for you to
deploy as many copies of an OS as you want, for whatever use case you want,
for free.&lt;/li>
&lt;/ul>
&lt;p>The last one is important, and is what&amp;rsquo;s at stake here in this case for RHEL and
its related tooling. We need to be honest here - no one wants to pay for Linux
itself. If we wanted to piss away hundreds of dollars a year for each server
we&amp;rsquo;d all be using Windows.&lt;/p>
&lt;p>Red Hat&amp;rsquo;s goal here is to convert &amp;ldquo;freeloaders&amp;rdquo; like myself (and basically all
the previous companies I&amp;rsquo;ve worked at) into paying customers by taking away the
ability to use their ecosystem for free. Red Hat has tried to do damage control
on all this and
&lt;a href="https://www.redhat.com/en/blog/red-hats-commitment-open-source-response-gitcentosorg-changes">the response by their &amp;ldquo;core systems VP&amp;rdquo; just makes things look even worse&lt;/a>:&lt;/p>
&lt;blockquote>
&lt;p>[&amp;hellip;] we have determined that there isn’t value in having a downstream
rebuilder.&lt;/p>
&lt;p>The generally accepted position that these free rebuilds are just funnels
churning out RHEL experts and turning into sales just isn’t reality. I wish we
lived in that world, but it’s not how it actually plays out. Instead, we’ve
found a group of users, many of whom belong to large or very large IT
organizations, that want the stability, lifecycle and hardware ecosystem of
RHEL without having to actually support the maintainers, engineers, writers,
and many more roles that create it. These users also have decided not to use
one of the many other Linux distributions.&lt;/p>
&lt;/blockquote>
&lt;p>Red Hat believes that the existence of Alma Linux and Rocky Linux is
cannibalizing sales of RHEL subscriptions. &amp;ldquo;Every Alma Linux and Rocky Linux
install is a lost sale! Maybe if we destroyed all of the rebuilds, all of the
people using them would buy RHEL instead? The community will be &lt;em>so eager&lt;/em> to
reward Red Hat by buying subscriptions to our products now that the alternatives
don&amp;rsquo;t exist anymore, right?&amp;rdquo;&lt;/p>
&lt;p>Wait - &amp;ldquo;buy subscriptions&amp;rdquo;?&lt;/p>
&lt;p>&lt;img alt="&amp;ldquo;We&amp;rsquo;re pirates, we don&amp;rsquo;t even know what that means!&amp;rdquo;" src="https://jstaf.github.io/images/hondo.png#centre">&lt;/p>
&lt;p>Red Hat has completely missed &lt;em>why&lt;/em> people use their software. No one cares
about the support subscriptions. No one cares about &amp;ldquo;Red Hat Enterprise Linux&amp;rdquo;
or &amp;ldquo;Red Hat&amp;rdquo; at all.&lt;/p>
&lt;h2 id="red-hats-actual-product-is-its-community">Red Hat&amp;rsquo;s actual product is its community.&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>As a software developer&lt;/strong> (either free or paid), RHEL-based OSes are an
attractive target to build for because they have a large number of users and
businesses using them (read: potential customers). If you built for Ubuntu
and RHEL, that basically covered 95% of Linux-based businesses in North
America with money.&lt;/li>
&lt;li>&lt;strong>As an IT professional&lt;/strong>, the dominance of RHEL-based OSes meant that
learning RHEL was a good investment of your time: the more businesses that
used it, the more attractive you&amp;rsquo;d be to employers and you&amp;rsquo;d be able to start
contributing to a company that used RHEL or a RHEL-clone on day one (instead
of needing to learn a new OS every time).&lt;/li>
&lt;li>&lt;strong>As a business&lt;/strong>, selecting a RHEL-based OS was attractive: there was a large
community which meant that most bugs got reported and fixed by community
members before you ran into them. There was lots of free+paid software
available with guides written by the community, so you didn&amp;rsquo;t have to install
any other OSes just to install some weird piece of software you needed. RHEL
skills were also relatively common, so you&amp;rsquo;d be able to hire knowledgeable
people and spend less time training people from scratch. RHEL and its clones
were sponsored by a company that appeared to be healthy and profitable so you
knew the OS wasn&amp;rsquo;t going to suddenly implode and you&amp;rsquo;d have to spend effort
jumping ship.&lt;/li>
&lt;li>&lt;strong>As for Red Hat and IBM itself&lt;/strong>, more users of free RHEL clones meant that
there&amp;rsquo;d be more chances to sell Red Hat and IBM&amp;rsquo;s other, much more valuable
software offerings. For instance, when I worked in supercomputing with Queen&amp;rsquo;s
University and Compute Canada, we were all rabid CentOS users that saw
absolutely zero value in RHEL, but we were more than happy to shell out
hundreds of thousands of dollars each year for GPFS (now IBM Spectrum Scale),
Tivoli Storage Manager (now IBM Spectrum Protect), and IBM and Lenovo&amp;rsquo;s servers
and hardware support. IBM made so much money off of us as CentOS users that they
bought my boss and me a free vacation to Vegas one year to go to their
conference and do lines with our account manager (that last part is a joke, my
old boss and I aren&amp;rsquo;t cool enough to get invited to those kinds of parties).&lt;/li>
&lt;/ul>
&lt;p>All of this community value from using RHEL is based on sheer numbers. The more
users there are, the more developers will write software for it. The more
software there is for RHEL, the more users there will be. The more stable this
community appears, the more likely businesses and professionals will invest in
it and stay long-term. The larger the community, the more chance for Red Hat
(and IBM) to sell whatever products they had. All of the other work with RHEL
that Red Hat claims is so valuable is just a bonus. Ubuntu, SUSE, Debian,
and Amazon &lt;a href="https://aws.amazon.com/linux/amazon-linux-2023/">(yes, Amazon)&lt;/a> tick
all the same checkboxes that RHEL does - the biggest factor keeping customers in
the RHEL ecosystem is the community that&amp;rsquo;s sprung up around it.&lt;/p>
&lt;p>The &amp;ldquo;community as the product&amp;rdquo; is particularly evident with another Red Hat
product: Ansible. Ansible is an automation tool that&amp;rsquo;s commonly used to
automatically configure servers and perform common operations tasks. Though the
Ansible software itself is nifty and you
&lt;a href="https://docs.ansible.com/ansible/latest/collections/index_module.html">can do a mind-blowing amount of shit with it&lt;/a>,
the real value of using Ansible is actually its community, specifically
community-generated &amp;ldquo;Ansible roles&amp;rdquo;. For the uninitiated, Ansible roles are neat
little self-contained bundles of Ansible code that set up a server to do advanced
things without you actually needing to know the specifics of how to do these
things yourself.&lt;/p>
&lt;ul>
&lt;li>Want to configure a Postgres server, but know nothing about Postgres?
&lt;a href="https://github.com/geerlingguy/ansible-role-postgresql">Blam - done.&lt;/a>&lt;/li>
&lt;li>Need to back up a server, but not sure where to start?
&lt;a href="https://github.com/roles-ansible/ansible_role_restic">This stranger has your back!&lt;/a>&lt;/li>
&lt;li>Need to pass a compliance audit or get official government certification for
something?
&lt;a href="https://github.com/ansible-lockdown/">There&amp;rsquo;s an entire company that does nothing but write server hardening Ansible roles to help you pass these type of audits.&lt;/a>&lt;/li>
&lt;/ul>
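&lt;p>To make this concrete, here&amp;rsquo;s a minimal sketch of what using one of these
roles looks like. This assumes geerlingguy&amp;rsquo;s PostgreSQL role installed from
Ansible Galaxy; the database and user names below are made up for illustration:&lt;/p>
&lt;pre>&lt;code># First: ansible-galaxy install geerlingguy.postgresql
# playbook.yml
- hosts: db_servers
  become: true
  vars:
    # Variables documented in the role's README
    postgresql_databases:
      - name: myapp
    postgresql_users:
      - name: myapp_user
        password: changeme
  roles:
    - geerlingguy.postgresql
&lt;/code>&lt;/pre>
&lt;p>One &lt;code>ansible-playbook playbook.yml&lt;/code> later and you have a working
Postgres server, without ever touching a Postgres config file yourself.&lt;/p>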
&lt;p>By using Ansible, you get to use all this work these people are giving back to
Red Hat and the community for free. If this community disappeared, the entire
point of using Ansible (and Ansible&amp;rsquo;s commercial value as a product) would
disappear overnight. And that&amp;rsquo;s what&amp;rsquo;s happening &lt;em>right now&lt;/em>.&lt;/p>
&lt;p>For better or worse, many FOSS software ecosystems have a cult of personality
that revolves around a single, ultra-productive community member. For instance
Linus Torvalds basically was Linux incarnate for several decades. The R
programming language revolves around &lt;a href="https://hadley.nz/">Hadley Wickham&lt;/a>. He
wrote so many amazing R packages it created an entire data science ecosystem
called the &amp;ldquo;Hadleyverse&amp;rdquo; (he asked to change the name to &amp;ldquo;tidyverse&amp;rdquo; because he
wanted to be modest). Ansible has one of these people too: Jeff Geerling - also
known as &amp;ldquo;geerlingguy&amp;rdquo;.&lt;/p>
&lt;p>&lt;a href="https://galaxy.ansible.com/geerlingguy">geerlingguy &lt;em>is&lt;/em> Ansible.&lt;/a> I would
conservatively estimate that &amp;gt;50% of the good (as in, you&amp;rsquo;d actually want to use
these instead of writing your own) Ansible roles on Ansible Galaxy are written
by him directly. He literally
&lt;a href="https://www.ansiblefordevops.com/">wrote the book&lt;/a> on Ansible. I&amp;rsquo;ve even used
&amp;ldquo;Do you know who geerlingguy is?&amp;rdquo; as an interview question - if someone doesn&amp;rsquo;t
know who he is, it&amp;rsquo;s obvious they&amp;rsquo;ve never spent any serious time with Ansible
(this question also absolutely fucks with people trying to use ChatGPT and read
off the screen to fake it during job interviews. &lt;em>Yes&amp;hellip; we&amp;rsquo;re on to you.&lt;/em>).&lt;/p>
&lt;p>Not only have Red Hat&amp;rsquo;s latest moves
&lt;a href="https://www.jeffgeerling.com/blog/2023/im-done-red-hat-enterprise-linux">alienated their largest contributor&lt;/a>,
he&amp;rsquo;s gone scorched-earth and
&lt;a href="https://github.com/geerlingguy/ansible-role-docker/commit/635061e0a44e94e7c855f45f96364f98af645fc9">begun actively removing support for RHEL from all of his Ansible roles&lt;/a>.
Jeff Geerling is now advertising on Twitter about just how easily you can use
Red Hat&amp;rsquo;s own Ansible product to migrate off of RHEL using the cross-platform
tooling he&amp;rsquo;s written (I&amp;rsquo;d link to the tweet, but Twitter is offline). RHEL&amp;rsquo;s
vendor lock-in isn&amp;rsquo;t an issue when you have automated tools like Ansible to
reproduce your RHEL servers on another distribution like Debian in a matter of
minutes.&lt;/p>
&lt;p>This is a disaster for Red Hat. And it&amp;rsquo;s not an isolated incident: EPEL
maintainers are
&lt;a href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/Q6LJZKB24D3IQZ7AMKO35NW6VIWENEK2/">leaving&lt;/a>
&lt;a href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/F35UTCBYF25XRE2HX32UEIRVZGMAXIBO/">in&lt;/a>
&lt;a href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/XSFQ5RFNRINM6RGEDEIUYDEAYP23XE66/">droves&lt;/a>.
The last comment is particularly telling about how most contributors to
Red Hat&amp;rsquo;s products are feeling right now:&lt;/p>
&lt;blockquote>
&lt;p>Why should we keep contributing to EPEL? To be forced to use 16 free RHEL
instances maximum? What is the advantage for us volunteer contributors? I
mean, we did not do it for personal advantage, we did it to help us each other
within the Enterprise Linux distros community, but this Red Hat move will kill
the Enterprise Linux distros community, leaving only with RHEL, which is
mostly a paid subscription distribution, let&amp;rsquo;s call things with their proper
name&lt;/p>
&lt;/blockquote>
&lt;p>These free EPEL contributors are essential to Red Hat&amp;rsquo;s business model. Without
them, Red Hat would lose a majority of the software it needs to compete with
Ubuntu and Debian&amp;rsquo;s massive software catalog. Despite being a multi-billion
dollar corporation, Red Hat has never had the resources to maintain all of this
software by itself. In an unrelated incident,
&lt;a href="https://www.theregister.com/2023/06/07/red_hat_drops_libreoffice/">RHEL also failed to retain it&amp;rsquo;s LibreOffice maintainer, and will stop shipping LibreOffice as a result.&lt;/a>
This leaves RHEL somewhat perplexingly as the only &amp;ldquo;enterprise&amp;rdquo; Linux
distribution without an enterprise office suite. Lennart Poettering, another
hyper-productive developer (responsible for PulseAudio and systemd) actually
&lt;a href="https://www.phoronix.com/news/Systemd-Creator-Microsoft">left Red Hat last year to go work at Microsoft.&lt;/a>
Red Hat even
&lt;a href="https://www.phoronix.com/news/Fedora-PM-Red-Hat-Laid-Off">fired the Fedora Program Manager&lt;/a>
who managed the upstream Fedora distribution that Red Hat repackages and
rebrands as RHEL itself. Red Hat&amp;rsquo;s most valuable staff and community
contributors are either being fired or jumping ship.&lt;/p>
&lt;p>Intentionally or not, Red Hat seems to be doing everything it can to destroy the
community that makes RHEL a product you&amp;rsquo;d want to consider purchasing in the
first place.&lt;/p>
&lt;hr>
&lt;h2 id="a-failure-to-monetize-and-having-your-skillset-put-behind-a-paywall">A failure to monetize, and having your skillset put behind a paywall&lt;/h2>
&lt;p>Let&amp;rsquo;s pretend that Red Hat is somehow successful in killing all of its
downstream rebuilders (Alma Linux, Oracle Linux, Rocky Linux, etc.). RHEL has an
extremely customer-hostile monetization scheme:&lt;/p>
&lt;ul>
&lt;li>You can use it for free if you&amp;rsquo;re learning it.&lt;/li>
&lt;li>As soon as you want to use it for anything commercial, Red Hat wants an
unjustifiably high licensing fee. Unlike with Ubuntu Pro, where you can
selectively choose to buy support for key systems, without a free RHEL-clone
available you&amp;rsquo;d need to pay a subscription fee for every single production
system.&lt;/li>
&lt;li>As a software vendor, if your stuff only runs on RHEL, then your customers are
forced to pay an extra fee for the OS as well, which makes your product less
competitive.&lt;/li>
&lt;li>Even if you can stomach the licensing fees, there is no way to convert from a
free install (CentOS Stream, Alma Linux, Rocky Linux, etc.) to a paid install
without reinstalling the OS. So not only do you have to pay Red Hat tons of
money, you also get the joy of reinstalling the operating system on every
single machine you have.&lt;/li>
&lt;li>If you switch companies and the new company doesn&amp;rsquo;t want to use RHEL, you are
out of the ecosystem permanently. There won&amp;rsquo;t be a way to onboard your new
company into the RHEL ecosystem for free anymore and using RHEL itself will be
a very hard sell (see above).&lt;/li>
&lt;li>RHEL professionals will &amp;ldquo;auto-convert&amp;rdquo; from RHEL to something else at an
extremely high rate, because having your career held hostage by a yearly
subscription just isn&amp;rsquo;t a very empowering feeling.&lt;/li>
&lt;/ul>
&lt;p>Speaking for myself, the last factor is the most significant. The people
responsible for advocating for Red Hat and IBM&amp;rsquo;s RHEL-based products in the
first place are being alienated. One of the attractive things about
building a career on RHEL-based OSes until now has been that you could pick up
and take your skills anywhere, for free. The latest moves to kill off the
RHEL-downstream OSes make it feel like an important part of your skillset is
getting put behind a paywall.&lt;/p>
&lt;hr>
&lt;h2 id="lets-sum-things-up">Let&amp;rsquo;s sum things up:&lt;/h2>
&lt;p>The main reason to use RHEL-based Linux these days is because of the really
great community. Most of RHEL&amp;rsquo;s software and useful tooling comes from free
labor by the community. The latest licensing changes are designed to squeeze out
every cent Red Hat can, driving away all of the &amp;ldquo;community value&amp;rdquo; that makes RHEL
an attractive product in the first place.&lt;/p>
&lt;p>&lt;img alt="&amp;ldquo;The more you tighten your grip, the more subscriptions will slip through your fingers&amp;rdquo;" src="https://jstaf.github.io/images/tarkin.jpg#centre">&lt;/p>
&lt;p>Like many other companies before it, Red Hat seems to have entered the
&lt;a href="https://www.wired.com/story/tiktok-platforms-cory-doctorow/">&amp;ldquo;enshittification&amp;rdquo; death-spiral&lt;/a>:&lt;/p>
&lt;blockquote>
&lt;p>Here is how platforms die: First, they are good to their users; then they
abuse their users to make things better for their business customers; finally,
they abuse those business customers to claw back all the value for themselves.
Then, they die.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;a href="https://rockylinux.org/news/keeping-open-source-open/">Red Hat hasn&amp;rsquo;t yet successfully killed its RHEL-clone downstreams&lt;/a>,
but the writing seems to be on the wall. There is a bad actor at the very core
of the Red Hat ecosystem: Red Hat itself. There doesn&amp;rsquo;t seem to be a long-term
future for the Red Hat community and RHEL now that the &amp;ldquo;enshittification&amp;rdquo;
process is in full swing (we are just starting the &amp;ldquo;abuse the business
customers&amp;rdquo; stage - Red Hat can&amp;rsquo;t put the squeeze on them if there are easy
alternatives). It seems like the future will just be many years of slowly
increasing RHEL license fees while people leave and the product gets worse and
worse.&lt;/p>
&lt;hr>
&lt;h2 id="why-stay">Why stay?&lt;/h2>
&lt;p>When I originally wrote this article, I was really irritated by Red Hat&amp;rsquo;s
decision to try to kill CentOS a second time. It was important to take a step
back, &amp;ldquo;touch grass&amp;rdquo; as they say, and think about why I felt this way: it&amp;rsquo;s just
an operating system&amp;hellip; why am I so upset about this that I would type all of
this out? (I don&amp;rsquo;t even depend on RHEL or RHEL-clones for work anymore.) I think
the reason I was so upset is that the best part about Linux is that it&amp;rsquo;s just a giant
community of people who help each other and try to make the world a better place
(or at least the world of computing) - for free! Do we have to monetize this to
death? Does everything have to end in a profit-seeking death spiral?&lt;/p>
&lt;p>Anyhow, I guess this article is basically just a really roundabout way of saying
I&amp;rsquo;m dropping official support for RHEL in
&lt;a href="https://github.com/jstaf/onedriver">the software I write&lt;/a>. Any continued
support is just a happy coincidence of the fact that SUSE and Fedora share the
same RPM build toolchains. I won&amp;rsquo;t pretend that I&amp;rsquo;m an important community
member or my contributions are so valuable that Red Hat will go under without
me, but it&amp;rsquo;s just not worth putting in free labor to support yet another
company that is doing everything possible to use everyone else&amp;rsquo;s work and give
nothing back.&lt;/p>
&lt;p>&lt;img alt="&amp;ldquo;This effort is no longer profitable!&amp;rdquo;" src="https://jstaf.github.io/images/no-longer-profitable.png#centre">&lt;/p></content></item><item><title>Near zero-downtime Postgres migrations and upgrades with pglogical</title><link>https://jstaf.github.io/posts/pglogical/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/pglogical/</guid><description>Databases are notoriously fussy to work with. Postgres is no exception. Though the software itself may be pretty solid, stuff like major version upgrades or migrations to &amp;ldquo;the cloud&amp;rdquo; (or back to on-prem) are really tricky to do without significant and costly downtime. Though there&amp;rsquo;s tools out there to make this process easier, many of these simply don&amp;rsquo;t work for anything more than small test databases, and will silently corrupt tables or fail spectacularly in real-world scenarios.</description><content>&lt;p>Databases are notoriously fussy to work with.
Postgres is no exception.
Though the software itself may be pretty solid,
stuff like major version upgrades or migrations to &amp;ldquo;the cloud&amp;rdquo; (or back to on-prem)
are really tricky to do without significant and costly downtime.
Though there are tools out there to make this process easier,
many of these simply don&amp;rsquo;t work for anything more than small test databases,
and will silently corrupt tables or fail spectacularly in real-world scenarios.&lt;/p>
&lt;p>This post is about how to safely migrate a real-world Postgres database without downtime using pglogical.
As a bonus, this procedure works to migrate an on-premise db to AWS RDS
(many tools don&amp;rsquo;t work with RDS),
and you can perform multiple major version upgrades as part of the process
(skip as many versions as you want!).&lt;/p>
&lt;p>I haven&amp;rsquo;t written any blog posts for a very long time.
Writing these posts is a lot of work -
I usually only sit down to write something when it&amp;rsquo;s of use to me
and public documentation doesn&amp;rsquo;t exist or is otherwise very sparse.
This is one of those articles.
(Looking to migrate a MySQL / MariaDB database in a similar manner?
&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.NonRDSRepl.html">Check out this guide from AWS.&lt;/a>)&lt;/p>
&lt;h2 id="pglogical-why-are-we-using-it">pglogical: Why are we using it?&lt;/h2>
&lt;p>The goal of migrating a database is to create an identical copy of it on a
separate piece of infrastructure, either on another VM, another datacenter,
or perhaps even another country.
There are a lot of different ways to migrate Postgres databases, and unfortunately all
of them have significant limitations.
Let&amp;rsquo;s quickly do an overview of the different Postgres tools
and demonstrate why pglogical is the least bad option
(at the time of writing).&lt;/p>
&lt;h3 id="dump-and-restore">Dump and restore&lt;/h3>
&lt;p>This is the most basic method of moving a database.
You create a database dump from the source database with a tool like &lt;code>pg_dumpall&lt;/code> or &lt;code>pg_basebackup&lt;/code>
and restore it on the target.
Obviously, this is not a great option when you want to avoid downtime.
Depending on database size, it can take hours to create the initial database backup,
and many hours to restore it on the new target instance.
Any writes that occur on the source instance after the backup is taken are lost.
Though this method is virtually foolproof and can perform upgrades as part of the process,
it obviously incurs significant downtime.
This isn&amp;rsquo;t an option for many businesses,
and likely everyone involved would prefer it
if there was no disruption to the business at all during the migration.&lt;/p>
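&lt;p>For reference, a basic dump and restore is a one-liner. This is just a sketch -
the hostnames are placeholders, and writes to the source must be stopped first
for the copy to be consistent:&lt;/p>
&lt;pre>&lt;code># Dump every database on the source server as SQL and replay it on the target.
pg_dumpall -h old-db.example.com -U postgres | psql -h new-db.example.com -U postgres
&lt;/code>&lt;/pre>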
&lt;h3 id="binary-replication--hot-standby-databases">Binary replication / &amp;ldquo;hot standby&amp;rdquo; databases&lt;/h3>
&lt;p>Binary replication is the easiest to set up, and lets you create a read-only replica db
from your original master. Out of all the replication options, this is far and away
the best option. It &amp;ldquo;just works&amp;rdquo;. If you&amp;rsquo;re looking to set up binary replication,
honestly the best starting point is the official documentation:&lt;/p>
&lt;ul>
&lt;li>Quick tutorial: &lt;a href="https://wiki.postgresql.org/wiki/Hot_Standby">https://wiki.postgresql.org/wiki/Hot_Standby&lt;/a>&lt;/li>
&lt;li>More detailed overview: &lt;a href="https://www.postgresql.org/docs/current/hot-standby.html">https://www.postgresql.org/docs/current/hot-standby.html&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Unfortunately, binary replication has several key disadvantages:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>It doesn&amp;rsquo;t work with most cloud providers.&lt;/strong>
If you want to migrate to a managed database like AWS RDS,
you won&amp;rsquo;t actually have the superuser permissions and tools to set this up.&lt;/li>
&lt;li>&lt;strong>Binary replication only works with DBs of the same major version.&lt;/strong>
If you want to replicate to a different version, well&amp;hellip; you can&amp;rsquo;t.&lt;/li>
&lt;li>&lt;strong>Replication is one-way:&lt;/strong>
you can only have a single &amp;ldquo;master&amp;rdquo; database active at any given time.
(Unpopular opinion: if you want master-master replication that doesn&amp;rsquo;t suck,
you should honestly &lt;a href="https://mariadb.com/kb/en/what-is-mariadb-galera-cluster/">just switch to MariaDB, where this is a solved problem.&lt;/a>)&lt;/li>
&lt;/ul>
&lt;p>If you are not upgrading to a new database major version
and are not trying to migrate to a managed service like AWS RDS,
stop reading this article now and just use binary replication.
It&amp;rsquo;s the simplest option and is the fastest path to a successful migration.&lt;/p>
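&lt;p>For the simple case, setting up a standby looks roughly like the following
sketch (Postgres 12 or newer; the hostname, user, and data directory are
placeholders, and a replication user plus a matching &lt;code>pg_hba.conf&lt;/code>
entry must already exist on the primary):&lt;/p>
&lt;pre>&lt;code># On the standby: clone the primary's data directory.
# -R writes standby.signal and primary_conninfo so the clone starts as a replica.
pg_basebackup -h primary.example.com -U replicator \
    -D /var/lib/postgresql/data -R -P

# Then start Postgres on the standby and it will begin streaming from the primary.
&lt;/code>&lt;/pre>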
&lt;h3 id="third-party-replication-tools">Third-party replication tools&lt;/h3>
&lt;p>There are a lot of other replication tools out there designed to address some of the
shortcomings of Postgres&amp;rsquo; built-in binary replication.
I won&amp;rsquo;t go too deeply into each one,
but whether they work or not is highly dependent on what your database looks like:
how big it is, what types of data you have in there, where you&amp;rsquo;re trying to migrate to
(when using managed databases like RDS,
you&amp;rsquo;ll frequently be missing the permissions necessary to set things up), etc.&lt;/p>
&lt;p>While trying things out, we explored trigger-based replication tools like
Slony, Bucardo, and Londiste.
All of these had significant issues with the databases we tried to migrate.
In particular, replication frequently broke, and there were numerous issues with
truncated or empty tables where a database that had supposedly been replicated was missing data.
As mentioned before, success is highly dependent on the database you&amp;rsquo;re trying to migrate.
Simple databases that don&amp;rsquo;t use special datatypes or triggers are much more likely to work.
It&amp;rsquo;s possible that one of these tools may work for your DB,
but you&amp;rsquo;ll need to try them out to know for sure.&lt;/p>
&lt;p>AWS has a managed &amp;ldquo;Database Migration Service&amp;rdquo; (DMS):
this is a proprietary AWS tool that live-replicates data from one database to another.
It also sucks.
In addition to taking an extremely long time to replicate large databases,
silent corruption and truncation of tables is extremely common -
even if a migration survives the initial copy phase
(in my experience, tables with more than a hundred million rows will consistently break a DMS replication instance irreparably,
especially if record validation is enabled),
many records will be altered in mysterious ways.
Some of the more entertaining failures I encountered included DMS shifting some, but not all,
timestamps in a MySQL database 3 hours into the future (no, this wasn&amp;rsquo;t timezone-related),
and DMS rounding all of the columns that used Postgres&amp;rsquo; money datatype to the nearest ten cents.
If you&amp;rsquo;ve been considering using DMS on any DB you care about&amp;hellip; don&amp;rsquo;t.
Your time is better spent investigating other migration options that actually work reliably.
DMS is only worthwhile to look into if you need to switch DB technologies completely
(such as from Oracle to Postgres).&lt;/p>
&lt;h2 id="postgres-logical-replication-and-pglogical">Postgres logical replication and pglogical&lt;/h2>
&lt;p>At this point, migrating a Postgres database without downtime of some kind
is looking increasingly impossible.
Enter pglogical:&lt;/p>
&lt;p>&lt;img src="https://jstaf.github.io/images/loki.gif#centre">&lt;/p>
&lt;p>pglogical is a &amp;ldquo;logical replication&amp;rdquo; tool for Postgres.
Instead of replicating filesystem-level changes (binary replication),
pglogical replicates the actual SQL statements from a source db to a target db.
Executing an insert on the master db will execute the same insert on any dbs &amp;ldquo;subscribed&amp;rdquo; to it.
Though the tool has its own limitations (we&amp;rsquo;ll get to those in a second),
logical replication has several MAJOR advantages over other methods:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>You can set filters and rules for what you want to replicate.&lt;/strong>
Instead of creating an exact binary copy of another DB,
you can selectively replicate only parts of it.&lt;/li>
&lt;li>&lt;strong>Replication works between major versions.&lt;/strong>
Because the actual SQL statements being replicated are not specific to a particular version,
you can replicate between different major versions without issue.
You can even skip multiple major Postgres versions in one go
(For what it&amp;rsquo;s worth, I&amp;rsquo;ve successfully done 9.4-&amp;gt;9.6 and 9.6-&amp;gt;11 without issue,
though your results may vary).&lt;/li>
&lt;li>&lt;strong>Cloud providers like AWS actually support it.&lt;/strong>
AWS RDS has the pglogical extension built into their RDS images on Postgres 9.6.10 and above.&lt;/li>
&lt;li>&lt;strong>It&amp;rsquo;s one of the only replication tools that supports really old versions of Postgres (9.4+).&lt;/strong>&lt;/li>
&lt;li>&lt;strong>It&amp;rsquo;s free and open source.&lt;/strong> Hopefully I don&amp;rsquo;t need to explain why this is awesome.&lt;/li>
&lt;li>&lt;strong>It actually works.&lt;/strong> Though the tool is still pretty tricky to use,
pglogical is remarkable for the fact that it hasn&amp;rsquo;t failed me yet.&lt;/li>
&lt;/ul>
&lt;p>But wait, what about &lt;a href="https://www.postgresql.org/docs/12/logical-replication.html">Postgres&amp;rsquo; native logical replication&lt;/a>?
Postgres recently gained the ability to perform logical replication on its own in version 10.
So how is pglogical different?
As it turns out, they&amp;rsquo;re actually the same tool under the hood.
Both Postgres&amp;rsquo; native logical replication and pglogical were developed by
&lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/">2ndQuadrant&lt;/a> -
pglogical is the upstream for Postgres&amp;rsquo; native logical replication,
and has significantly more features (esp. for Postgres 10 and 11).
pglogical is also notable for working on Postgres 9.4+,
whereas native logical replication isn&amp;rsquo;t supported until versions 10+.
To sum things up, the main difference between pglogical and Postgres native replication
is that pglogical will have more features on older versions of Postgres
(and managed services like RDS support using pglogical).&lt;/p>
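&lt;p>For comparison, here&amp;rsquo;s what the basic case looks like with native logical replication on Postgres 10+ - just two statements (a sketch with placeholder names and connection values; pglogical&amp;rsquo;s equivalents appear later in this article):&lt;/p>

```sql
-- On the publisher (Postgres 10+ native logical replication):
CREATE PUBLICATION my_pub FOR ALL TABLES;

-- On the subscriber (connection string values are placeholders):
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=source_host port=5432 dbname=my_db user=replicator password=secret'
    PUBLICATION my_pub;
```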
&lt;h3 id="so-what-are-the-downsides">So what are the downsides?&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>You still need to restart the database to install the pglogical extension.&lt;/strong>
I may have lied earlier when I said that the migration was &amp;ldquo;zero-downtime&amp;rdquo;,
but a single restart isn&amp;rsquo;t bad as far as these things go
(plus you can choose when to do the restart,
as opposed to being forced to do it as part of cutover).&lt;/li>
&lt;li>&lt;strong>Lots of stuff isn&amp;rsquo;t supported.&lt;/strong>
pglogical doesn&amp;rsquo;t migrate sequences very well, if at all.
&lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/#sequences">The documentation claims that sequences will be synchronized &amp;ldquo;periodically&amp;rdquo;&lt;/a>, but in practice,
I&amp;rsquo;ve never seen pglogical actually sync them unless you explicitly force it to.
There are more details on sequence migration later in this article,
but this is one area where using pglogical comes with major caveats.&lt;/li>
&lt;li>&lt;strong>Changes to database schema don&amp;rsquo;t get replicated whatsoever.&lt;/strong>
Any changes to database table structure or anything else need to be performed separately
on both the master DB and its replicas.&lt;/li>
&lt;li>&lt;strong>Replication is per-DB.&lt;/strong> If you have a postgres server with a bunch of DBs on it,
you&amp;rsquo;re going to need to set up and monitor replication for each one individually.
This can very quickly turn a migration into a lot of work.&lt;/li>
&lt;li>&lt;strong>A primary key is required to perform UPDATEs and DELETEs.&lt;/strong>
Tables without primary keys are going to be insert-only.
I have no idea why some developers keep creating tables without primary keys,
but if you are unfortunate enough to have any of these individuals at your company,
make sure they&amp;rsquo;re aware of this caveat &lt;em>before&lt;/em> you start the migration process.&lt;/li>
&lt;li>&lt;strong>Foreign keys are ignored during replication.&lt;/strong>
If you&amp;rsquo;re simply moving a database from one location to another,
this is probably not a huge concern,
but if you want to use pglogical in a master-master replication setup,
foreign key constraints aren&amp;rsquo;t going to do anything.
If you want good master-master replication, stop reading this article now and switch to MariaDB.&lt;/li>
&lt;li>&lt;strong>Documentation is really sparse.&lt;/strong>
Here&amp;rsquo;s the official documentation: &lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/">https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/&lt;/a>.
That&amp;rsquo;s it. There are a few blog posts out there of questionable veracity (this one included),
but as documentation goes, you&amp;rsquo;re more or less on your own.
There are very few how-tos out there, and good luck asking questions on Stack Overflow if you get yourself into trouble.
Please do not email me or ask me for help (no, not even if you have money).&lt;/li>
&lt;/ul>
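&lt;p>One of the caveats above is easy to check up front: you can list every table that lacks a primary key by querying the system catalogs (a sketch - run it on the source database before you start):&lt;/p>

```sql
-- List ordinary tables without a primary key
-- (these will be insert-only under pglogical)
SELECT n.nspname AS schema_name, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
    SELECT 1 FROM pg_constraint con
    WHERE con.conrelid = c.oid AND con.contype = 'p'
  )
ORDER BY 1, 2;
```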
&lt;p>Did you read all that and are still interested in migrating/upgrading a database?
Let&amp;rsquo;s get started.&lt;/p>
&lt;hr>
&lt;h1 id="migrate-a-database-using-pglogical">Migrate a database using pglogical&lt;/h1>
&lt;p>Before we start, make sure you&amp;rsquo;ve completed the following prerequisites:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Verify that both the source and target database are running Postgres 9.4+.&lt;/strong>
If you&amp;rsquo;re using Postgres 9.4,
be aware that there are several &lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/">special considerations&lt;/a> you need to take into account.
If you are using RDS, the RDS databases must be Postgres version 9.6.10 or above -
that is the earliest RDS version that supports pglogical.&lt;/li>
&lt;li>&lt;strong>You have a direct network connection between your source and target DB.&lt;/strong>
It doesn&amp;rsquo;t matter what connection you&amp;rsquo;ve got as long as it works: AWS VPC peering,
IPsec, Wireguard, etc. - in some cases you can even get by with an SSH tunnel.
Just make sure the target db instance is able to connect to the source db.&lt;/li>
&lt;li>&lt;strong>You have superuser or equivalent privileges on both the source and target db.&lt;/strong>
If you are using RDS, the &lt;code>rds_superuser&lt;/code> role is sufficient.&lt;/li>
&lt;li>&lt;strong>You have read, or at least skimmed &lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/">the official documentation&lt;/a>.&lt;/strong>
Don&amp;rsquo;t skip this step.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Note:&lt;/strong> if you want to use pglogical to perform a zero-downtime upgrade,
set up the target database on whatever Postgres version you wish to upgrade to.
If you wanted to upgrade from version 9.6 to version 12,
you would install Postgres 12 on the target.&lt;/p>
&lt;h2 id="initial-pglogical-setup">Initial pglogical setup&lt;/h2>
&lt;p>Before you do anything else, make sure you&amp;rsquo;ve installed the pglogical package on both dbs
(the next few setup steps need to be done for both source and target dbs).
Follow the official documentation here: &lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-installation-instructions/">https://www.2ndquadrant.com/en/resources/pglogical/pglogical-installation-instructions/&lt;/a>.&lt;/p>
&lt;p>Add the following to your Postgres config,
and restart the Postgres service to apply the changes
(a reload is not sufficient to load the pglogical package).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-ini" data-lang="ini">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># for more information on these values, see pglogical docs.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># these values are sufficient if you intend to migrate less than 10 databases from the source instance&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">wal_level&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;logical&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">shared_preload_libraries&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;pglogical&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">max_worker_processes&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">10 &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">max_replication_slots&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">10&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">max_wal_senders&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">10 # 10 + previous value, if one was there&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">track_commit_timestamp&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">on # leave this line out if using postgres 9.4&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Log in to the db and create a user to be used for replication
(we&amp;rsquo;ll arbitrarily call the user &lt;code>pglogical&lt;/code> here):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">ROLE&lt;/span> pglogical &lt;span style="color:#66d9ef">WITH&lt;/span> LOGIN REPLICATION SUPERUSER PASSWORD &lt;span style="color:#e6db74">&amp;#39;some_password_here&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Ensure that this user is able to connect from the target instance to the source in
the source instance&amp;rsquo;s &lt;code>pg_hba.conf&lt;/code> (please use a secure authentication method).
If you don&amp;rsquo;t know how to do this, see the official documentation for
&lt;a href="https://www.postgresql.org/docs/current/auth-pg-hba-conf.html">pg_hba.conf&lt;/a>.
Note that a reload is sufficient to apply changes to &lt;code>pg_hba.conf&lt;/code>
(restarting postgres is not necessary here).&lt;/p>
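&lt;p>As a sketch, a &lt;code>pg_hba.conf&lt;/code> entry permitting the replication user to connect from the target&amp;rsquo;s subnet might look like the following (the address is a placeholder for your own network; &lt;code>scram-sha-256&lt;/code> is only available on Postgres 10+, so older versions will need &lt;code>md5&lt;/code>):&lt;/p>

```
# TYPE  DATABASE             USER       ADDRESS        METHOD
host    database_to_migrate  pglogical  10.0.2.0/24    md5
```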
&lt;h3 id="initial-pglogical-setup-on-rds">Initial pglogical setup on RDS&lt;/h3>
&lt;p>If you are using RDS, you&amp;rsquo;ll need to add &lt;code>pglogical&lt;/code> to &lt;code>shared_preload_libraries&lt;/code>
in your parameter group and reboot your RDS instance - see
&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html">the RDS docs&lt;/a>
on how to make changes to a parameter group if you are unsure.
If it already has another value there, just add &lt;code>pglogical&lt;/code> to the end
(these values can be comma-separated).&lt;/p>
&lt;p>Create a &amp;ldquo;pglogical&amp;rdquo; user with RDS superuser privileges
(the &amp;ldquo;pglogical&amp;rdquo; user name is arbitrary):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">ROLE&lt;/span> pglogical &lt;span style="color:#66d9ef">WITH&lt;/span> LOGIN PASSWORD &lt;span style="color:#e6db74">&amp;#39;some_password_here&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GRANT&lt;/span> rds_replication &lt;span style="color:#66d9ef">TO&lt;/span> pglogical;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GRANT&lt;/span> rds_superuser &lt;span style="color:#66d9ef">TO&lt;/span> pglogical;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="configure-source-database">Configure source database&lt;/h2>
&lt;p>&lt;strong>Note:&lt;/strong> every step that follows is per-database.
If the source db instance has multiple databases you want to migrate,
you&amp;rsquo;ll need to repeat these steps for each database.&lt;/p>
&lt;p>At this point, we&amp;rsquo;re ready to actually set up replication.
pglogical has some important terminology that we need to understand before we can continue:&lt;/p>
&lt;ul>
&lt;li>A &amp;ldquo;node&amp;rdquo; represents a database. It can either be a publisher (source) or subscriber (target).&lt;/li>
&lt;li>A &amp;ldquo;replication set&amp;rdquo; is a set of tables and sequences to be migrated,
as well as what changes should be replicated
(stuff like &lt;code>INSERT&lt;/code>, &lt;code>UPDATE&lt;/code>, &lt;code>DELETE&lt;/code>, and/or &lt;code>TRUNCATE&lt;/code>).&lt;/li>
&lt;li>A &amp;ldquo;subscription&amp;rdquo; represents an actual replication connection.
&amp;ldquo;Subscriber&amp;rdquo; nodes sync changes from &amp;ldquo;publisher&amp;rdquo; nodes.
By default, all replication sets are migrated from the source to the target.&lt;/li>
&lt;/ul>
&lt;p>The replication process has three basic steps:
set up the provider node and select what data to replicate,
set up the subscriber node,
then create a replication connection.
With this in mind, let&amp;rsquo;s set up the source database now.
Log in to the database you wish to replicate on the source instance and perform the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create the pglogical extension
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">CREATE&lt;/span> EXTENSION pglogical;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> EXTENSION pglogical_origin; &lt;span style="color:#75715e">-- only on postgres 9.4, otherwise skip
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create the publisher node
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- The DSN represents how to connect to the source database
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.create_node(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> node_name :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;source&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> dsn :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;host=hostname_or_ip_address port=5432 dbname=database_to_migrate user=pglogical password=pglogical_password&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now we need to select which tables should be migrated and add them to a replication set.
Several replication sets were created when you ran &lt;code>CREATE EXTENSION pglogical;&lt;/code>:
&lt;code>default&lt;/code>, &lt;code>default_insert_only&lt;/code>, and &lt;code>ddl_sql&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>The &lt;code>default&lt;/code> replication set is what you should use by default and will replicate
&lt;code>INSERT&lt;/code>, &lt;code>UPDATE&lt;/code>, &lt;code>DELETE&lt;/code>, and &lt;code>TRUNCATE&lt;/code> (note that &lt;code>TRUNCATE CASCADE&lt;/code> doesn&amp;rsquo;t work).&lt;/li>
&lt;li>&lt;code>default_insert_only&lt;/code> only replicates &lt;code>INSERT&lt;/code> statements.
You use this for tables that don&amp;rsquo;t have a primary key.&lt;/li>
&lt;li>&lt;code>ddl_sql&lt;/code> is a special replication set designed to replicate schema changes.
You don&amp;rsquo;t need it here because this is a one-time migration
and will not be making changes to the source instance&amp;rsquo;s schema during the process.
(Using the &lt;code>ddl_sql&lt;/code> replication set is outside the scope of this article.)&lt;/li>
&lt;/ul>
&lt;p>Tables and sequences need to be added to the replication sets individually,
though there are helper functions to do this in one go:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Add all tables to the default replication set from the &amp;#39;public&amp;#39; schema
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.replication_set_add_all_tables(&lt;span style="color:#e6db74">&amp;#39;default&amp;#39;&lt;/span>, ARRAY[&lt;span style="color:#e6db74">&amp;#39;public&amp;#39;&lt;/span>])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Check which tables have been added to all replication sets
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#66d9ef">FROM&lt;/span> pglogical.replication_set_table;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Add all sequences to the default replication set from the &amp;#39;public&amp;#39; schema
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.replication_set_add_all_sequences(&lt;span style="color:#e6db74">&amp;#39;default&amp;#39;&lt;/span>, ARRAY[&lt;span style="color:#e6db74">&amp;#39;public&amp;#39;&lt;/span>]);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Check which sequences have been added to all replication sets
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#66d9ef">FROM&lt;/span> pglogical.replication_set_seq;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If a table doesn&amp;rsquo;t have a primary key,
you&amp;rsquo;ll need to remove it from the &lt;code>default&lt;/code> replication set
and add it to the &lt;code>default_insert_only&lt;/code> replication set.
Any tables created by an extension like &lt;code>postgis&lt;/code> should also be removed from replication
(extension-specific tables will be &amp;ldquo;migrated&amp;rdquo; when you import the db schema on the target db instance).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Remove a table from the &amp;#39;default&amp;#39; replication set and add it to &amp;#39;default_insert_only&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- (for other table manipulations see the official documentation):
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.replication_set_remove_table(&lt;span style="color:#e6db74">&amp;#39;default&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;table_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.replication_set_add_table(&lt;span style="color:#e6db74">&amp;#39;default_insert_only&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;table_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Before you proceed, check your work -
note that you can investigate and view the tables that pglogical uses for replication in the &lt;code>pglogical&lt;/code> schema:
&lt;code>\dt pglogical.*&lt;/code> will give you a list of tables you can look at.
Do not attempt to manipulate these tables yourself except through the functions that pglogical provides
&lt;em>(&amp;ldquo;Here be dragons.&amp;rdquo;)&lt;/em>.&lt;/p>
&lt;p>Finally, create a schema-only dump of the source database you wish to migrate with &lt;code>pg_dump&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>pg_dump -U pglogical -h source_database -s database_name &amp;gt; database_name_schema.sql
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="configure-the-target-database">Configure the target database&lt;/h2>
&lt;p>Create the database you want to migrate on the target:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">DATABASE&lt;/span> database_name;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Import the schema from the previous step on the target database.
If you encounter errors, feel free to drop the database on the target and reimport the schema
as many times as it takes to make sure you&amp;rsquo;ve fixed any errors.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>psql -U pglogical -h target_database -d database_name &amp;lt; database_name_schema.sql
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Recreate any users/roles that need to be imported
(you can also dump and restore these with &lt;code>pg_dumpall&lt;/code>, but this is not covered here).&lt;/p>
&lt;p>With all of that done, we can set up the pglogical subscriber node.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create the pglogical extension
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">CREATE&lt;/span> EXTENSION pglogical;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> EXTENSION pglogical_origin; &lt;span style="color:#75715e">-- only on postgres 9.4, otherwise skip
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create the subscriber node
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- The DSN describes how to connect to the target instance
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.create_node(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> node_name :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;target&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> dsn :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;host=hostname_or_ip_address port=5432 dbname=database_to_migrate user=pglogical password=pglogical_password&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="create-the-subscription">Create the subscription&lt;/h2>
&lt;p>Now that the source and target database nodes have been set up,
we can create the replication subscription.
Creating the subscription will immediately start replication,
so make sure you&amp;rsquo;re ready to start before beginning this step:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- This command is run on the target instance (the &amp;#34;subscriber node&amp;#34;)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.create_subscription(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> subscription_name :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> provider_dsn :&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;host=hostname_or_ip_address port=5432 dbname=database_to_migrate user=pglogical password=pglogical_password&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now that you&amp;rsquo;ve created the subscription, check its status:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.show_subscription_status(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>There are three possible &amp;ldquo;subscription statuses&amp;rdquo;:&lt;/p>
&lt;ul>
&lt;li>&lt;code>down&lt;/code> - This is probably what you&amp;rsquo;re going to see the first time you set this up.
This means that replication has failed.
This can be due either to connectivity issues or an actual problem with replication.
Check the Postgres logs on the source and target databases to see why.&lt;/li>
&lt;li>&lt;code>initializing&lt;/code> - A status of &lt;code>initializing&lt;/code> means that the source database
is performing the initial copy of the table data from source to target.
Seeing this typically indicates success -
you just need to wait until the subscription reaches &lt;code>replicating&lt;/code> state.&lt;/li>
&lt;li>&lt;code>replicating&lt;/code> - This means that the target database node has fully replicated the entire source db,
and is now replicating ongoing changes.
A &amp;ldquo;replicating&amp;rdquo; db is &lt;em>almost&lt;/em> fully migrated -
there are several final steps you&amp;rsquo;ll need to take.&lt;/li>
&lt;/ul>
&lt;p>If replication is &lt;code>down&lt;/code>, see below for troubleshooting instructions.
Otherwise, if replication has been successful, feel free to skip the next section.&lt;/p>
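&lt;p>Once a subscription reports &lt;code>replicating&lt;/code>, it&amp;rsquo;s worth a quick sanity check that your data actually made it across. One low-effort sketch: run the same query on both the source and target databases and compare the output (&lt;code>n_live_tup&lt;/code> is an estimate, so expect small discrepancies on busy tables; use &lt;code>count(*)&lt;/code> per table if you need exact numbers):&lt;/p>

```sql
-- Approximate per-table row counts;
-- run on both source and target and diff the results
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY schemaname, relname;
```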
&lt;h2 id="troubleshooting-replication">Troubleshooting replication&lt;/h2>
&lt;p>If replication is &lt;code>initializing&lt;/code> or &lt;code>replicating&lt;/code>, skip this section: things are working.
Otherwise, continue reading.&lt;/p>
&lt;p>Before you do anything else, be aware that you can stop and start replication as needed via:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Pause replication
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.alter_subscription_disable(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Resume replication where you left off
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.alter_subscription_enable(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The Postgres server logs will always have the cause of replication failures -
do what they say to fix things.
The most common cause of replication failures is a networking problem.
Again, ensure that the target database can connect to the source database on port 5432
and that the replication user can connect to the source database from the target
in the source database&amp;rsquo;s &lt;code>pg_hba.conf&lt;/code>.&lt;/p>
&lt;p>Other issues may require resynchronizing a particular table.
You can forcibly resynchronize a table with the following command (run on the target instance, where the subscription lives):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- NOTE: will truncate the table on the target
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.alter_subscription_resynchronize_table(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;table_name&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Check the status of the resynchronized table, pretty self-explanatory
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.show_subscription_table(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;table_name&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The above command may not succeed if the table has foreign key constraints -
you&amp;rsquo;ll need to manually &lt;code>TRUNCATE&lt;/code> both the table you are attempting to resynchronize
as well as any tables that depend on it on the target
(obviously, make &lt;em>&lt;strong>EXTRA SURE&lt;/strong>&lt;/em> that you&amp;rsquo;re running the TRUNCATEs on the correct database).&lt;/p>
&lt;p>Perhaps you forgot a table when setting up the initial replication sets?
To fix this, you can still add the table during replication, then call
&lt;code>pglogical.alter_subscription_resynchronize_table()&lt;/code>.&lt;/p>
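&lt;p>As a sketch (the set, subscription, and table names here are placeholders), that looks like:&lt;/p>

```sql
-- On the provider: add the forgotten table to the replication set
SELECT pglogical.replication_set_add_table('default', 'public.missed_table');

-- On the subscriber: copy the table's existing data across
SELECT pglogical.alter_subscription_resynchronize_table('subscription_name_here', 'missed_table');
```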
&lt;p>For anything else, check the official
&lt;a href="https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/">pglogical documentation&lt;/a>.
Although there are very few how-tos in there,
there is a function for most anything you need to do.
Don&amp;rsquo;t be afraid to restart the process from the beginning if you have to
(drop the target database, reimport the schema,
and set up replication on the target instance again from scratch).
If you need to peek at pglogical&amp;rsquo;s current state,
remember that you can get a list of tables with &lt;code>\dt pglogical.*&lt;/code>.&lt;/p>
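&lt;p>For a quick health check, you can also ask pglogical directly (a sketch - run on the subscriber):&lt;/p>

```sql
-- Overall status of all subscriptions on this node
SELECT * FROM pglogical.show_subscription_status();

-- Per-table sync state lives in pglogical's internal tables, e.g.:
SELECT * FROM pglogical.local_sync_status;
```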
&lt;h2 id="completing-the-migration">Completing the migration&lt;/h2>
&lt;p>Once your setup has reached the &lt;code>replicating&lt;/code> state
(if it&amp;rsquo;s still &lt;code>initializing&lt;/code>, just wait for the initial copy to complete),
you&amp;rsquo;ll need to take several steps to complete the migration.&lt;/p>
&lt;p>At this point, the actual table data has been synced,
but the sequences are still out of sync
(the pglogical documentation claims they&amp;rsquo;ll be synced,
but in practice, it doesn&amp;rsquo;t seem to happen).
You can synchronize them with this command on the source instance:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Before running it, note that this will actually add 1000 to each sequence value on the target.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- This is actually by design - you can read about and complain about this here:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- https://github.com/2ndQuadrant/pglogical/issues/163
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.synchronize_sequence( seqoid ) &lt;span style="color:#66d9ef">FROM&lt;/span> pglogical.sequence_state;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- You can check individual sequences with this command on the subscriber database:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> last_value &lt;span style="color:#66d9ef">FROM&lt;/span> sequence_name;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you want an alternative way to sync sequences,
try the query from the Postgres wiki on the target database:
&lt;a href="https://wiki.postgresql.org/wiki/Fixing_Sequences">https://wiki.postgresql.org/wiki/Fixing_Sequences&lt;/a>&lt;/p>
&lt;p>Once the sequences have been synced, you can begin the cutover process to the new database.
I cannot help you with this part -
the exact cutover steps you need to perform will depend on what application is connected to Postgres.
One very important thing to note about the cutover process is that both the source and target are writable.
The replication subscription will continue replicating data until you issue the following command:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Optional - temporarily stop replication:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.alter_subscription_disable(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Permanently disable replication
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- (re-migrating data will require repeating the migration process from scratch):
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> pglogical.drop_subscription(&lt;span style="color:#e6db74">&amp;#39;subscription_name_here&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>At this point replication is terminated -
you can drop the remaining pglogical nodes, extension, and roles at your convenience
(they do not impact normal operation as long as there is no active replication subscription).
Congratulations on a successful migration!&lt;/p></content></item><item><title>Going "Pro" with RStudio Server Open Source</title><link>https://jstaf.github.io/posts/rstudio-server-semi-pro/</link><pubDate>Wed, 20 Jun 2018 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/rstudio-server-semi-pro/</guid><description>RStudio is the go-to tool for programmers working in R. Frequently organizations will want to centralize their R work or provide web-based access to a compute environment. Although RStudio Server has an &amp;ldquo;open source edition&amp;rdquo;, most of the useful administrative functionality is locked behind the rather-expensive RStudio Server Pro version ($10k USD/year). This price isn&amp;rsquo;t sustainable for many organizations, or might not otherwise be worth it if there are only a few potential users.</description><content>&lt;p>RStudio is the go-to tool for programmers working in R. Frequently
organizations will want to centralize their R work or provide web-based
access to a compute environment. Although RStudio Server has an &amp;ldquo;open source
edition&amp;rdquo;, most of the useful administrative functionality is locked behind
the rather-expensive RStudio Server Pro version ($10k USD/year). This price
isn&amp;rsquo;t sustainable for many organizations, or might not otherwise be worth it
if there are only a few potential users. We will cover how to set up and
administer the free version of RStudio Server in a professional manner, and
use Linux&amp;rsquo;s features to unlock most of the functionality from the &amp;ldquo;Pro&amp;rdquo; version.&lt;/p>
&lt;p>And before you ask, yes, this is all perfectly in line with RStudio&amp;rsquo;s open
source licensing. Many of these changes are also useful if you&amp;rsquo;ve got a
license for RStudio Server Pro, particularly the reverse proxy configuration.&lt;/p>
&lt;hr>
&lt;h2 id="setting-up-a-base-installation">Setting up a base installation&lt;/h2>
&lt;p>I&amp;rsquo;m going to assume you&amp;rsquo;ve already got a fresh installation of CentOS 7 ready to go.
In my case, I&amp;rsquo;ve installed CentOS in a GNOME Boxes VM on my laptop;
normally you&amp;rsquo;d be SSH&amp;rsquo;ed into a server and setting things up that way.
We&amp;rsquo;ll start by installing R, RStudio, and several development headers required for many R packages,
in this case &lt;code>tidyverse&lt;/code> and &lt;code>devtools&lt;/code>.
Note that this tutorial assumes you are working as the root user
(since pretty much every command we will need to run requires sudo privileges).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>yum update
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>yum install epel-release
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># install R plus some useful development headers for R (required for tidyverse + devtools)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>yum install R openssl-devel libcurl-devel libxml2-devel wget
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># download RStudio Server and install it&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wget https://download2.rstudio.org/rstudio-server-rhel-1.1.453-x86_64.rpm
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>yum install rstudio-server-rhel-1.1.453-x86_64.rpm
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>RStudio Server should now be running at port 8787 on your server.
You can test that the installation worked
by visiting http://localhost:8787/ in a browser.&lt;/p>
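&lt;p>You can also confirm from the terminal that the service is up and listening (a quick sanity check; assumes &lt;code>curl&lt;/code> is installed):&lt;/p>

```bash
# The service should be active...
systemctl status rstudio-server

# ...and answering on port 8787 (fetch headers only)
curl -I http://localhost:8787/
```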
&lt;p>&lt;img alt="Initial RStudio screen" src="https://jstaf.github.io/images/rstudio-1.png">&lt;/p>
&lt;p>This is the basic installation of RStudio Server Open Source Edition.
However, there are a number of glaring issues with this installation:&lt;/p>
&lt;ul>
&lt;li>RStudio Server doesn&amp;rsquo;t know about LDAP users or any users not directly on the server
(i.e. any users not created with &lt;code>useradd&lt;/code>).&lt;/li>
&lt;li>RStudio is being hosted over a non-standard port (8787).&lt;/li>
&lt;li>The website is being served over HTTP -
any passwords entered and all network traffic will be sent in plain text. This is &lt;em>BAD&lt;/em>.&lt;/li>
&lt;li>There are no resource limits for users.
There&amp;rsquo;s a &lt;a href="https://github.com/rstudio/rstudio/issues/1633">known bug&lt;/a> in RStudio
(both Pro and Open Source Edition)
where loading &amp;gt;10GB of data into a session will
lock that user out of RStudio indefinitely.
(RStudio will try to save large sessions to disk, then time out while attempting to re-load them).&lt;/li>
&lt;li>We might want to host RStudio as part of another website (for example, &lt;a href="https://your.website.name/rstudio/)">https://your.website.name/rstudio/)&lt;/a>.&lt;/li>
&lt;/ul>
&lt;h2 id="authenticating-network-users-via-pam">Authenticating network users via PAM&lt;/h2>
&lt;p>RStudio Server uses PAM for authentication.
PAM (Pluggable Authentication Modules) are used on Linux to
break authentication and sign-in into a set of configurable modules.
Without going too deep into things,
we can change how RStudio authenticates users by changing its PAM configuration.
(If you don&amp;rsquo;t care about letting people use network credentials like LDAP,
feel free to skip this section.)&lt;/p>
&lt;p>RStudio&amp;rsquo;s PAM configuration is stored at &lt;code>/etc/pam.d/rstudio&lt;/code>.
Let&amp;rsquo;s look at the current config:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cat /etc/pam.d/rstudio
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code>#%PAM-1.0
auth requisite pam_succeed_if.so uid &amp;gt;= 500 quiet
auth required pam_unix.so nodelay
account required pam_unix.so
&lt;/code>&lt;/pre>&lt;p>Translating the PAM config into plain-english, this config does two things:&lt;/p>
&lt;ul>
&lt;li>Authentication can only succeed for users with a UID of 500 or greater
(this is done to prevent low-numbered system users from logging in -
you don&amp;rsquo;t want anyone logging in as root, for instance).&lt;/li>
&lt;li>Authentication and user accounts are handled by the UNIX authentication module (&lt;code>pam_unix.so&lt;/code>).&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Before you do anything else, create a backup of your old RStudio PAM module:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cp /etc/pam.d/rstudio /etc/pam.d/rstudio.bak
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If we want our installation to authenticate different types of users,
we&amp;rsquo;ll need to change RStudio&amp;rsquo;s PAM configuration. To change authentication
methods, say from UNIX users to LDAP, all we need to do is change the
authentication module from &lt;code>pam_unix.so&lt;/code> to a new module like &lt;code>pam_ldap.so&lt;/code>.
(Note: this will remove the ability of local UNIX users to log in to RStudio,
allowing only LDAP users to log in.)&lt;/p>
&lt;p>&lt;strong>Example &lt;code>/etc/pam.d/rstudio&lt;/code> LDAP auth config:&lt;/strong>&lt;/p>
&lt;pre tabindex="0">&lt;code>#%PAM-1.0
auth requisite pam_succeed_if.so uid &amp;gt;= 500 quiet
auth required pam_ldap.so nodelay
account required pam_ldap.so
&lt;/code>&lt;/pre>&lt;p>What happens if we want to allow a mix of both network (LDAP) and local
(UNIX) users to authenticate? Ideally, you&amp;rsquo;d want a config that matches how
the system normally authenticates users over SSH or the console. The good news is
that this config already exists: &lt;code>/etc/pam.d/password-auth&lt;/code>. We can use other
PAM files like this one in our existing RStudio config:&lt;/p>
&lt;pre tabindex="0">&lt;code>#%PAM-1.0
auth requisite pam_succeed_if.so uid &amp;gt;= 500 quiet
auth include password-auth
account include password-auth
&lt;/code>&lt;/pre>&lt;p>The changes should take effect immediately for all new sessions
(using either our LDAP or password-auth PAM config).
To be specific, a &amp;ldquo;new session&amp;rdquo; in the context of RStudio means either
logging in with no existing &lt;code>rsession&lt;/code> processes,
or clicking the &amp;ldquo;power&amp;rdquo; button in RStudio Server to start a new session/process on the server.
If something goes wrong, you can just restore the old RStudio PAM config
by copying over your backup from earlier.&lt;/p>
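&lt;p>Before pointing RStudio at a new PAM config, you can exercise it directly with the &lt;code>pamtester&lt;/code> utility (an assumption here: it&amp;rsquo;s available from EPEL, which we enabled earlier):&lt;/p>

```bash
# Test the "rstudio" PAM service directly.
# Replace "someuser" with a real account name.
yum install pamtester
pamtester rstudio someuser authenticate
# pamtester prompts for the password and reports success or failure
```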
&lt;h2 id="hosting-rstudio-server-securely-over-https">Hosting RStudio Server securely over HTTPS&lt;/h2>
&lt;p>You typically never want a web application like RStudio exposed directly to the general internet.
The best practice is to host RStudio behind a webserver like Apache httpd or Nginx
in what&amp;rsquo;s called a &lt;em>reverse proxy configuration&lt;/em>.
When you set up a reverse proxy for an application,
the only way of accessing the application
is via your proxy webserver
(which is typically more secure than the application itself).
We&amp;rsquo;ll set up access to RStudio Server in this manner using httpd,
and configure the firewall to allow access to only the ports we specify.&lt;/p>
&lt;p>First, let&amp;rsquo;s make sure that our firewall is up, running, and starts on boot.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>systemctl start firewalld
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl enable firewalld
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl status firewalld
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code># should show something like the following:
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2018-03-02 15:28:37 EST; 2 days ago
Docs: man:firewalld(1)
Main PID: 710 (firewalld)
CGroup: /system.slice/firewalld.service
└─710 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
&lt;/code>&lt;/pre>&lt;p>Let&amp;rsquo;s configure the firewall to allow access to our server over ports 80 (HTTP) and 443 (HTTPS).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>firewall-cmd --add-service&lt;span style="color:#f92672">=&lt;/span>http --permanent
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>firewall-cmd --add-service&lt;span style="color:#f92672">=&lt;/span>https --permanent
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>firewall-cmd --reload
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Alright, our firewall is running and will allow connections to ports 80 and 443 on our machine.
Let&amp;rsquo;s install and configure the Apache HTTPD server to host RStudio.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>yum install httpd mod_ssl
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl start httpd
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl enable httpd
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;img alt="httpd is up" src="https://jstaf.github.io/images/rstudio-2.png">&lt;/p>
&lt;p>Ok, we&amp;rsquo;ve got an HTTP server (if you want to check, visit &lt;code>localhost&lt;/code> in a
browser - it should appear just like the image above).
We just need to tell it how to host RStudio. Let&amp;rsquo;s create a new
Apache VirtualHost that exposes RStudio to the web. You&amp;rsquo;ll need an SSL
certificate for this step. If you don&amp;rsquo;t have an SSL certificate, you can get
one from Let&amp;rsquo;s Encrypt using the instructions here:
&lt;a href="https://certbot.eff.org/#centosrhel7-apache">https://certbot.eff.org/#centosrhel7-apache&lt;/a>.
If Let&amp;rsquo;s Encrypt isn&amp;rsquo;t an option (say if you&amp;rsquo;re doing this on a VM like me),
we can create a self-signed SSL certificate with the following.
For consistency&amp;rsquo;s sake, I&amp;rsquo;ll put it in &lt;code>/etc/rstudio&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>openssl req -x509 -newkey rsa:4096 -keyout /etc/rstudio/rstudio_key.pem -out /etc/rstudio/rstudio_cert.pem -days &lt;span style="color:#ae81ff">3650&lt;/span> -nodes
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># enter whatever you want for the questions since it&amp;#39;s a self-signed cert&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># NOTE: please ensure that your certificates are not world-readable, you do not&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># want random users to be able to read your certificates. Make sure that only&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># root can read the certificates.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>chmod &lt;span style="color:#ae81ff">700&lt;/span> /etc/rstudio/*.pem
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now that we have an SSL certificate, let&amp;rsquo;s set up our RStudio VirtualHost.
Create a new file &lt;code>/etc/httpd/conf.d/rstudio.conf&lt;/code>, with the following content.
This will host RStudio at your server&amp;rsquo;s base directory,
for instance &lt;a href="https://website.name.here">https://website.name.here&lt;/a>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&amp;lt;VirtualHost *:80&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># redirect all port 80 traffic to 443&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteEngine on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ReWriteCond %&lt;span style="color:#f92672">{&lt;/span>SERVER_PORT&lt;span style="color:#f92672">}&lt;/span> !^443$
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule ^/&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> https://%&lt;span style="color:#f92672">{&lt;/span>HTTP_HOST&lt;span style="color:#f92672">}&lt;/span>/$1 &lt;span style="color:#f92672">[&lt;/span>NC,R,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;/VirtualHost&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;VirtualHost *:443&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># configure SSL&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLEngine on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCertificateFile /etc/rstudio/rstudio_cert.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCertificateKeyFile /etc/rstudio/rstudio_key.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># use if you have a real cert&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># SSLCertificateChainFile /etc/rstudio/rstudio_cert_bundle.crt&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># disable weak SSL ciphers&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLProtocol -ALL +TLSv1.2
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCipherSuite HIGH:!MEDIUM:!aNULL:!MD5:!SEED:!IDEA:!RC4
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLHonorCipherOrder on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>TraceEnable off
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># host rstudio&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPreserveHost on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyRequests off
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteCond %&lt;span style="color:#f92672">{&lt;/span>HTTP:Upgrade&lt;span style="color:#f92672">}&lt;/span> &lt;span style="color:#f92672">=&lt;/span>websocket
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule /&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> ws://localhost:8787/$1 &lt;span style="color:#f92672">[&lt;/span>P,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteCond %&lt;span style="color:#f92672">{&lt;/span>HTTP:Upgrade&lt;span style="color:#f92672">}&lt;/span> !&lt;span style="color:#f92672">=&lt;/span>websocket
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule /&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> http://localhost:8787/$1 &lt;span style="color:#f92672">[&lt;/span>P,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPass / http://localhost:8787/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPassReverse / http://localhost:8787/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RequestHeader set X-Forwarded-Proto &lt;span style="color:#e6db74">&amp;#34;https&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;/VirtualHost&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you want to host RStudio under a subdirectory (say &lt;a href="https://website.name.here/rstudio/)">https://website.name.here/rstudio/)&lt;/a>,
your conf should look something like this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&amp;lt;VirtualHost *:80&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># redirect all port 80 traffic to 443&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteEngine on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ReWriteCond %&lt;span style="color:#f92672">{&lt;/span>SERVER_PORT&lt;span style="color:#f92672">}&lt;/span> !^443$
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule ^/&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> https://%&lt;span style="color:#f92672">{&lt;/span>HTTP_HOST&lt;span style="color:#f92672">}&lt;/span>/$1 &lt;span style="color:#f92672">[&lt;/span>NC,R,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;/VirtualHost&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;VirtualHost *:443&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># configure SSL&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLEngine on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCertificateFile /etc/rstudio/rstudio_cert.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCertificateKeyFile /etc/rstudio/rstudio_key.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># use if you have a real cert&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># SSLCertificateChainFile /etc/rstudio/rstudio_cert_bundle.crt&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># disable weak SSL ciphers&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLProtocol -ALL +TLSv1.2
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLCipherSuite HIGH:!MEDIUM:!aNULL:!MD5:!SEED:!IDEA:!RC4
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SSLHonorCipherOrder on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>TraceEnable off
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># extra redirects for the RStudio subdirectory&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Redirect /rstudio /rstudio/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Redirect /auth-sign-in /rstudio/auth-sign-in
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Redirect /auth-sign-out /rstudio/auth-sign-out
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># some redirects for RStudio Server Pro, if you&amp;#39;ve got a license&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Redirect /s /rstudio/s
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Redirect /admin /rstudio/admin
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Catch RStudio redirecting improperly from the auth-sign-in page&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;If &lt;span style="color:#e6db74">&amp;#34;%{HTTP_REFERER} =~ /auth-sign-in/&amp;#34;&lt;/span>&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> RedirectMatch ^/$ /rstudio/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;/If&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># host rstudio&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPreserveHost on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyRequests off
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteCond %&lt;span style="color:#f92672">{&lt;/span>HTTP:Upgrade&lt;span style="color:#f92672">}&lt;/span> &lt;span style="color:#f92672">=&lt;/span>websocket
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule /rstudio/&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> ws://localhost:8787/$1 &lt;span style="color:#f92672">[&lt;/span>P,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteCond %&lt;span style="color:#f92672">{&lt;/span>HTTP:Upgrade&lt;span style="color:#f92672">}&lt;/span> !&lt;span style="color:#f92672">=&lt;/span>websocket
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RewriteRule /rstudio/&lt;span style="color:#f92672">(&lt;/span>.*&lt;span style="color:#f92672">)&lt;/span> http://localhost:8787/$1 &lt;span style="color:#f92672">[&lt;/span>P,L&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPass /rstudio/ http://localhost:8787/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ProxyPassReverse /rstudio/ http://localhost:8787/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>RequestHeader set X-Forwarded-Proto &lt;span style="color:#e6db74">&amp;#34;https&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;lt;/VirtualHost&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note: the RStudio Admin Guide instructions on how to host RStudio under a
subdirectory are actually wrong here. This config solves a
&lt;a href="https://github.com/rstudio/rstudio/issues/1676">longstanding bug&lt;/a>
where RStudio does not properly redirect users to and from its authentication pages.&lt;/p>
&lt;p>To apply the new config, we&amp;rsquo;ll restart Apache
and perform a config change to SELinux to allow httpd to proxy RStudio.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>setsebool -P httpd_can_network_connect on
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl restart httpd
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>RStudio should now be available over HTTPS when you visit the server.
Additionally, it will redirect from HTTP and force HTTPS automatically if
someone tries to visit the HTTP link.&lt;/p>
&lt;p>&lt;img alt="RStudio over HTTPS" src="https://jstaf.github.io/images/rstudio-3.png">&lt;/p>
&lt;h2 id="set-up-resource-limits">Set up resource limits&lt;/h2>
&lt;p>RStudio Server has a critical bug where any user who loads more than 5-10GB of data
&lt;a href="https://github.com/rstudio/rstudio/issues/1633">will be permanently locked out of their session&lt;/a>.
RStudio attempts to save the session to disk when it becomes inactive,
and then times out and fails to load it upon resuming, leaving the user locked out.
To work around this issue, we&amp;rsquo;ll need to set up some resource
limits (which also prevents any one user from hogging all the memory on the
system, of course).&lt;/p>
&lt;p>Although RStudio Server Pro has a lot of nifty utilities for implementing
resource limits, the Linux kernel does it better. We&amp;rsquo;ll set some resource
limits to bypass the above bug.&lt;/p>
&lt;p>Resource limits on Linux are set in &lt;code>/etc/security/limits.conf&lt;/code>. To set a
memory limit of 8GB for all non-system users, add the following line to the
file:&lt;/p>
&lt;pre tabindex="0">&lt;code>1000: - as 8388608
&lt;/code>&lt;/pre>&lt;p>Let&amp;rsquo;s break down the line above - it generally follows the format of:&lt;/p>
&lt;pre tabindex="0">&lt;code>who_to_apply_limits_to type_of_limit resource_to_limit limit_value
&lt;/code>&lt;/pre>&lt;p>For the &lt;code>who_to_apply_limits_to&lt;/code> value, we can specify a user (just use the username),
a group (specified as &lt;code>@groupname&lt;/code>), or a range of users/groups
(to use uid numbers, follow the format &lt;code>min_uid:max_uid&lt;/code>).
In this case, we have applied the limit to all users with uids of 1000 or higher.
System users on Linux are generally numbered below 1000,
and new users created by useradd/LDAP (i.e. real users) will always have uids
at or above this value.
Using &lt;code>1000:&lt;/code> will apply the limits to all non-system users.&lt;/p>
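&lt;p>To illustrate the domain field, a hypothetical &lt;code>limits.conf&lt;/code> could mix all three forms (the user and group names below are made up):&lt;/p>

```
# domain    type  item    value
jeff        hard  nproc   2048      # one specific user
@analysts   -     nofile  65536     # everyone in a group
1000:       -     as      8388608   # all uids of 1000 and up
```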
&lt;p>As for the &lt;code>type_of_limit&lt;/code>, this can be either &lt;code>hard&lt;/code> or &lt;code>soft&lt;/code>.
&lt;code>hard&lt;/code> limits are binding, and can not be altered by users.
&lt;code>soft&lt;/code> limits can be changed by users using the &lt;code>ulimit&lt;/code> command,
up to the value of the &lt;code>hard&lt;/code> limit.
The &lt;code>soft&lt;/code> limits are the ones in effect by default.
Realistically, none of your users are going to know or care about the &lt;code>ulimit&lt;/code> command.
Because of this, we might as well set both the hard and soft limits to the same value.
There&amp;rsquo;s a neat shortcut for this - we can specify both limits at the same time using &lt;code>-&lt;/code>.&lt;/p>
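&lt;p>As a quick sanity check of the soft/hard distinction, you can play with &lt;code>ulimit&lt;/code> in a throwaway bash shell (this only affects that shell, not the system):&lt;/p>

```shell
# Show the current soft and hard address-space limits (in KB);
# "unlimited" is common on systems with no limits configured.
echo "soft: $(ulimit -S -v)"
echo "hard: $(ulimit -H -v)"

# A user may lower their own soft limit at any time
# (raising it back above the hard limit is not permitted).
ulimit -S -v 4194304
echo "new soft: $(ulimit -S -v)"
```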
&lt;p>There are a lot of different resource limits - so which one do we use?
To make a very long story short, the only limits we are usually interested in are
&lt;code>as&lt;/code> (memory limit),
&lt;code>nofile&lt;/code> (open files, often needs to be increased for Hadoop/Spark),
and &lt;code>nproc&lt;/code> (number of processes a user is allowed to start).
In this case we want to set a memory limit using &lt;code>as&lt;/code>.&lt;/p>
&lt;p>Finally, the limit value differs depending on what limit you are trying to set.
In the case of &lt;code>as&lt;/code>, the limit is in kilobytes.
(If one were to calculate a reasonable memory limit in kilobytes in R: &lt;code>gb * 1024 ^ 2&lt;/code>).
In this case, we set a memory limit of 8GB with the value of 8388608 (KB).&lt;/p>
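&lt;p>You can sanity-check this arithmetic straight from the shell:&lt;/p>

```shell
# 8 GB expressed in KB for limits.conf: gb * 1024^2
echo $((8 * 1024 * 1024))   # prints 8388608
```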
&lt;p>&lt;strong>To make a long story short, we&amp;rsquo;ve set a memory limit of 8GB for all human users on the system.&lt;/strong>&lt;/p>
&lt;p>But wait, you may have tested this out and found it does not actually apply the memory limits!
(You can use &lt;code>object.size(some_variable)&lt;/code> to check the size of an object in R.
If a memory limit is hit, it will display &lt;code>Error: cannot allocate vector of size &amp;lt;some size&amp;gt;&lt;/code>.)
Why not?
As it turns out, session limits set in &lt;code>/etc/security/limits.conf&lt;/code> are applied
only if &lt;code>pam_limits.so&lt;/code> is enabled in the PAM config for the service the user logged in through.
In order to apply resource limits to RStudio,
you should add the following to &lt;code>/etc/pam.d/rstudio&lt;/code>:&lt;/p>
&lt;pre tabindex="0">&lt;code>session required pam_limits.so
&lt;/code>&lt;/pre>&lt;p>This line enforces resource limits on user sessions using PAM.
Without it, user sessions started using &lt;code>/etc/pam.d/rstudio&lt;/code>
will not respect the limitations in &lt;code>/etc/security/limits.conf&lt;/code>.
Once set, you can use &lt;code>/etc/security/limits.conf&lt;/code> to apply whatever resource
limits you want to RStudio.&lt;/p>
&lt;p>For reference, an example &lt;code>/etc/pam.d/rstudio&lt;/code> might now look like the following:&lt;/p>
&lt;pre tabindex="0">&lt;code>#%PAM-1.0
auth requisite pam_succeed_if.so uid &amp;gt;= 500 quiet
auth include password-auth
account include password-auth
session required pam_limits.so
&lt;/code>&lt;/pre>&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Over the course of this article, we&amp;rsquo;ve done the following:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Installed R and development headers necessary for the &lt;code>tidyverse&lt;/code> and &lt;code>devtools&lt;/code> packages.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Installed RStudio Server Open Source Edition.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Set up RStudio&amp;rsquo;s PAM config to authenticate all users on the server, including network/LDAP users.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hosted RStudio on a standard port (no port 8787 weirdness).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hosted RStudio so that all traffic between the user and the server is encrypted over HTTPS.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Used PAM and &lt;code>/etc/security/limits.conf&lt;/code> to enforce resource limits on RStudio users.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>To make a long story short, we have applied multiple features from
RStudio Server Pro, including: authentication of network/LDAP users,
secure communication over HTTPS, and resource limits for RStudio sessions.
To underscore this, &lt;em>this is ten thousand dollars per year worth of features.&lt;/em>&lt;/p>
&lt;p>So why buy RStudio Server Pro?
As of this blog post,
RStudio Open Source Edition has all the key features of the Pro version,
except for the following:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Multiple sessions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Multiple R versions / custom R initialization logic (such as loading environment modules on an HPC cluster)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A very nice admin dashboard (that is not to be underestimated&amp;hellip; hnggggg)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Load-balancing across multiple servers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Supports the RStudio team financially&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>If one of these features is important for you, please buy RStudio Server Pro and support the RStudio team.
If not, the suggestions covered in this post will allow you to use
RStudio Server Open Source for any small- to medium-scale
RStudio Server deployment. Enjoy!&lt;/p></content></item><item><title>Reproducible science with Conda and Snakemake</title><link>https://jstaf.github.io/posts/conda-lessons-learned/</link><pubDate>Mon, 04 Jun 2018 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/conda-lessons-learned/</guid><description>Doing scientific computing is hard. Delivering results with fast, performant code is often the easy part. You know your tools and how to get results. Delivering your workflow to your target audience is where it gets tough. What happens if your clients want to re-run things themselves, on their own hardware? How do they configure your pipeline for a new problem set? What happens if they&amp;rsquo;ve never even used the command line before, much less understand what a server is?</description><content>&lt;p>Doing scientific computing is &lt;em>hard&lt;/em>.
Delivering results with fast, performant code is often the easy part.
You know your tools and how to get results.
Delivering your workflow to your target audience is where it gets tough.
What happens if your clients want to re-run things themselves, on their own hardware?
How do they configure your pipeline for a new problem set?
What happens if they&amp;rsquo;ve never even used the command line before, much less understand what a server is?
This post is more or less a &amp;ldquo;lessons learned&amp;rdquo;
on my approach to solving these types of workflow deployment problems.
It&amp;rsquo;s by no means a perfect solution,
but hopefully this will be useful to other groups struggling with the same issues.&lt;/p>
&lt;p>This is a tall order - you need to provide:&lt;/p>
&lt;ul>
&lt;li>Your analysis results.&lt;/li>
&lt;li>The pipeline itself and all supporting code.&lt;/li>
&lt;li>A foolproof method of deploying the software and the execution environment your pipeline requires.&lt;/li>
&lt;li>The training and documentation required to run things start to finish.
This is harder than it sounds -
you can&amp;rsquo;t force your target audience to care enough to learn UNIX or the basics of programming
(they&amp;rsquo;ve got other stuff to do, remember!).&lt;/li>
&lt;/ul>
&lt;p>You might say the last 3 are unnecessary
(they&amp;rsquo;ve got the results, right?),
but this is the most important part!
Once your clients can run the pipeline themselves,
your job is done and you can move on to your next project!&lt;/p>
&lt;p>&lt;strong>For readers looking for the quick summary:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Jupyter and R notebooks work really well for displaying results (nothing new here&amp;hellip;).&lt;/li>
&lt;li>Snakemake works well for managing and scaling pipeline execution.&lt;/li>
&lt;li>When deploying pipeline software,
Git + Conda environments work well initially, but do not age well.
There aren&amp;rsquo;t really any good solutions in this space right now, unfortunately
(Docker containers won&amp;rsquo;t pass muster for security-conscious organizations).&lt;/li>
&lt;li>Ideally, documentation gets done in the Git repository &lt;code>README.md&lt;/code>/wiki,
but hands-on training and follow ups are still a must.
Your workflow needs to be as simple as possible to reproduce and run.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="delivering-results">Delivering results&lt;/h2>
&lt;p>This is probably the easiest part
(chances are you&amp;rsquo;ve done this before!).
You need to deliver the actual result data files
along with supporting plots and explanations.
Personally, I find the best solution to this approach is
a report that interleaves summary statistics/plots with explanations
as they are generated by the pipeline.
The easiest way to do this is using
&lt;a href="https://jupyter.org/">Jupyter notebooks&lt;/a> or an
&lt;a href="http://rmarkdown.rstudio.com/">R Markdown report&lt;/a>.
(I won&amp;rsquo;t provide a full walkthrough on how to use tools here,
check out their respective documentation pages.)&lt;/p>
&lt;p>Either tool is great (I personally prefer R markdown notebooks),
but it&amp;rsquo;s important that you use these as a tool to document your workflow
(where possible) and how each plot was generated.
No one wants a folder full of plots with no explanation (aside from labeled axes).
For each step, write what you are doing and why you are doing it
along with what each statistic means.
As for your data, provide a description of what each output file contains and gzip it all up.&lt;/p>
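&lt;p>For the data hand-off itself, something as simple as the following does the trick (the file and directory names here are hypothetical stand-ins for your pipeline&amp;rsquo;s real outputs):&lt;/p>

```shell
# Stand-in outputs so this snippet runs end-to-end;
# in practice these come from your pipeline.
mkdir -p results
echo "per-sample summary statistics" > results/stats.tsv
echo "stats.tsv: summary statistics, one row per sample" > MANIFEST.md

# Bundle the manifest and the output directory for delivery.
tar -czf results.tar.gz MANIFEST.md results/
```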
&lt;p>All of that said,
Jupyter/R notebooks aren&amp;rsquo;t all that great for heavy-duty data-crunching.
So what do you make into a notebook and what can you leave as plain old scripts?
Again, notebooks are there to explain your results:
QC scripts, summary statistics, analysis conclusions, and so on -
anything that will be read by someone else should go into a notebook if possible.
Everything else can stay a script.&lt;/p>
&lt;h2 id="creating-a-reproducible-analysis-pipeline">Creating a reproducible analysis pipeline&lt;/h2>
&lt;p>Your analysis needs to run itself, automatically, without human input.
It&amp;rsquo;s not reproducible unless it can be run completely independently of your involvement.
Your client should also be able to swap out the dataset for a new one,
and your pipeline should update itself and handle the change in data appropriately.
Ideally, this should all execute in parallel and take advantage of all available hardware.&lt;/p>
&lt;p>There&amp;rsquo;s a lot of different tools for this,
but the one I&amp;rsquo;ve (relatively happily) settled on is &lt;a href="https://snakemake.readthedocs.io/">Snakemake&lt;/a>.
Snakemake works exactly like GNU Make,
where rules define how output files are created from input files.
The main difference between the two is that Snakemake workflows are written in Python
and support a lot of things that GNU Make doesn&amp;rsquo;t
(running on different OSes, submitting jobs to a cluster, etc.).&lt;/p>
&lt;p>An example Snakemake rule to produce a &lt;a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">FastQC&lt;/a>
report from an input FASTQ file might look like this.
Notice how there are only three ingredients to a rule: an input, an output, and a shell command.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>rule fastqc:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> input: &lt;span style="color:#e6db74">&amp;#39;&lt;/span>&lt;span style="color:#e6db74">{sample}&lt;/span>&lt;span style="color:#e6db74">.fastq&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> output: &lt;span style="color:#e6db74">&amp;#39;&lt;/span>&lt;span style="color:#e6db74">{sample}&lt;/span>&lt;span style="color:#e6db74">_fastqc.html&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> shell: &lt;span style="color:#e6db74">&amp;#39;fastqc &lt;/span>&lt;span style="color:#e6db74">{input}&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>There are a few big advantages of Snakemake vs. other tools I tried:&lt;/p>
&lt;ul>
&lt;li>Snakemake was probably the easiest pipelining software to get the hang of.
You can more or less learn it in an afternoon.&lt;/li>
&lt;li>It&amp;rsquo;s pure Python - anything Python can do, Snakemake can do as well.&lt;/li>
&lt;li>Scaling up a pipeline is effortless.
A serial workflow is identical to a parallel one - no changes needed.
To submit a job to a cluster, all you have to do is provide a cluster submission
command and it will do its thing and submit jobs for you.&lt;/li>
&lt;li>Workflows are really fast to write.&lt;/li>
&lt;li>It&amp;rsquo;s really easy to produce a workflow diagram that shows exactly how a pipeline gets executed.
This is really great for explaining to a professor or doctor how an analysis works.&lt;/li>
&lt;/ul>
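&lt;p>To make the &amp;ldquo;scaling is effortless&amp;rdquo; point concrete, here is a minimal Snakefile sketch building on the rule above (the sample names are made up). The same file runs serially with &lt;code>snakemake -j 1&lt;/code> or on 24 cores with &lt;code>snakemake -j 24&lt;/code> - no changes to the workflow needed:&lt;/p>

```python
# A top-level "all" rule lists the final outputs we want;
# Snakemake works backwards from these targets to decide what to run.
SAMPLES = ["sampleA", "sampleB"]

rule all:
    input: expand("{sample}_fastqc.html", sample=SAMPLES)

rule fastqc:
    input: "{sample}.fastq"
    output: "{sample}_fastqc.html"
    shell: "fastqc {input}"
```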
&lt;p>And some disadvantages of Snakemake I&amp;rsquo;ve run into:&lt;/p>
&lt;ul>
&lt;li>It&amp;rsquo;s not daemonized. There&amp;rsquo;s no real easy way to have a long-running Snakemake
workflow running in the background besides just &lt;code>nohup&lt;/code>-ing it,
which can be a little inconvenient.&lt;/li>
&lt;li>No dynamic job execution (if some file fails a quality check, do this after, etc.).&lt;/li>
&lt;li>Personal experience has shown that it can&amp;rsquo;t be installed on Windows without a C++ compiler,
which makes it harder to install on Windows users&amp;rsquo; computers.
Still, this is better than no Windows support at all (which is the case for most other tools).&lt;/li>
&lt;/ul>
&lt;p>All in all, after using Snakemake for several years,
I think it&amp;rsquo;s a great tool for bioinformatics and data science use cases
where analysis is done in a standard start-to-finish manner.
Anything involving continuous job execution is probably not a good fit -
for example, rerunning an analysis with new data every hour.
I have no serious regrets after using Snakemake and it&amp;rsquo;s a pretty great tool if
you want to deliver outputs reproducibly
and have other people understand the workflow (even non-technical types).&lt;/p>
&lt;h2 id="deploying-your-pipeline-with-conda">Deploying your pipeline with Conda&lt;/h2>
&lt;p>This is where things always get icky.
You&amp;rsquo;ve got a great software environment and it runs the pipeline happily,
but you want to get your client up and running too.
After all, it isn&amp;rsquo;t &amp;ldquo;reproducible science&amp;rdquo; if they can&amp;rsquo;t re-run things and verify
your results.
Usually the hardest part of this is just installing all of the software on your clients&amp;rsquo; computers.&lt;/p>
&lt;p>Are you really responsible for installing software on clients&amp;rsquo; computers?
Honestly, yes.
Even if you provide them with access to a system with all of the software installed,
at some point they will pick up a collaborator who needs to install the software,
or maybe migrate systems.
You&amp;rsquo;re going to get an email asking how to install the software at the end of the day.
So what&amp;rsquo;s the best way of ensuring that this happens?&lt;/p>
&lt;p>There are three ways of getting a set of software packages installed and running on a new system.
I&amp;rsquo;ll go through these each in order:&lt;/p>
&lt;ul>
&lt;li>Install every bit of your pipeline and all dependencies manually (oh god no.).&lt;/li>
&lt;li>Use a containerization tool like Docker.&lt;/li>
&lt;li>Use a reproducible software environment, like Anaconda.&lt;/li>
&lt;/ul>
&lt;h3 id="installing-things-by-hand">Installing things by hand&lt;/h3>
&lt;p>Don&amp;rsquo;t do it.
If it takes you two or three hours,
it will take your non-technically inclined colleagues two or three weeks
(and you&amp;rsquo;ll get a lot of &amp;ldquo;please help me&amp;rdquo; emails).&lt;/p>
&lt;h3 id="using-dockersingularity-containers">Using Docker/Singularity containers&lt;/h3>
&lt;p>Docker containers seem like an ideal way of implementing a new workflow.
You can install all of your dependencies in a Docker container,
and then have your clients run the analysis using that container.
I think Docker containers are awesome,
and use them for integration testing or any automated tests that run
against a web service or other special components.&lt;/p>
&lt;p>Though Docker containers aren&amp;rsquo;t that fun to build,
they make it really easy to reproduce a defined environment,
which makes them perfect for workflow deployment.
So what&amp;rsquo;s the catch?&lt;/p>
&lt;p>To make a long story short,
letting untrusted/semi-untrusted users run Docker is a massive security hole.
&lt;a href="https://docs.docker.com/engine/security/security/">Any Docker container can root its host machine&lt;/a>,
and by that same token
&lt;a href="https://docs.docker.com/install/linux/linux-postinstall/">any user able to launch Docker has the equivalent of root access&lt;/a>.
If your pipeline needs additional resources like those on an HPC cluster or
other shared system,
chances are that your workflow will not be allowed to run.
To use Docker containers in production, you need root access to the system you are running on.
This is a major security consideration, and is unlikely to pass muster for most research groups
unless they own the infrastructure they run on.&lt;/p>
&lt;p>Singularity is a nice alternative to Docker and solves most of its security issues.
In fact, it has a &amp;ldquo;rootless&amp;rdquo; run mode that lets it run entirely as a user.
The only two &amp;ldquo;gotchas&amp;rdquo; here are that Singularity still requires root privileges to install,
and there are still some &lt;a href="http://singularity.lbl.gov/release-2-5-0">security issues being ironed out&lt;/a>.&lt;/p>
&lt;p>So to sum things up, Docker is great if you (and your clients) own the infrastructure
and have been entrusted with sudo privileges.
If not, Singularity is the way to go
(though security issues still seem to crop up with it fairly frequently).&lt;/p>
&lt;h3 id="conda-environments">Conda environments&lt;/h3>
&lt;p>There is, of course, a third option:
instead of requiring lots of special security privileges or installing things manually,
why not just use Conda, Anaconda&amp;rsquo;s package manager?
For those unfamiliar with it, Anaconda is an all-inclusive Python distribution.
Though it used to ship just Python packages,
Anaconda now ships more or less every piece of scientific software.
Of particular interest to bioinformaticians is its &lt;a href="http://bioconda.github.io/">Bioconda&lt;/a>
channel, which ships more or less all bioinformatics software packages.&lt;/p>
&lt;p>Conda works more or less like a Python virtualenv,
though instead of using &lt;code>pip install&lt;/code>, you use &lt;code>conda install&lt;/code> to install everything.
To make a very long story short, I haven&amp;rsquo;t really found anything that isn&amp;rsquo;t conda-installable yet.
Once all is said and done, you can export your conda environment to YML with
&lt;code>conda env export &amp;gt; environment-name.yml&lt;/code>.
To reproduce the environment, another user would run &lt;code>conda env create -f environment-name.yml&lt;/code>
and then &lt;code>source activate environment-name&lt;/code> to load it.
All in all, this reduces your entire software pipeline to a single YML file.
Just add this to a Git repository, stuff it on GitHub/Bitbucket/Gitlab and you&amp;rsquo;re done.
To reproduce the pipeline execution environment, it&amp;rsquo;s just three lines:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>git clone https://github.com/username/project-name.git
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>conda env create -f project-name.yml
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source activate project-name
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>So what&amp;rsquo;s the catch?
This seems a little too easy.
I&amp;rsquo;ll say that this method of pipeline deployment worked really well initially.
It did not age gracefully, however.
After about a year of usage, some of my users began to report issues where certain
dependency versions could not be found.
As it turns out, Conda envs pin every version of every package and dependency.
Anaconda apparently stops shipping old package builds after a while,
which means that fresh installs of a pinned environment will eventually break.
After using conda environments as my go-to solution for a lot of projects,
the average time to first breakage
(where you need to supply a new &lt;code>conda-env.yml&lt;/code> file to users)
is about a year.&lt;/p>
&lt;p>I haven&amp;rsquo;t found a good way around this issue,
aside from providing the general conda installation instructions on how to re-create
the environment (&amp;quot;&lt;code>conda install&lt;/code> this list of packages&amp;hellip;&amp;quot;).
This was really disappointing,
because conda environments seemed like a rather promising method of long-term software installs.
Just add the environment.yml file to git and call it a day, right?
Unfortunately this only works for the first year or so, after which all bets are off.&lt;/p>
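&lt;p>One partial mitigation (an idea I haven&amp;rsquo;t battle-tested, so treat it as a suggestion) is to keep two files in the repository: the fully-pinned &lt;code>conda env export&lt;/code> output for exact reproduction while it lasts, plus a hand-written, loosely-pinned spec that only names the top-level packages and can be re-resolved against whatever Anaconda currently ships:&lt;/p>

```yaml
# environment.yml -- hand-maintained; top-level packages only.
# Leaving versions mostly unpinned lets conda resolve against
# whatever builds are still available for download.
name: project-name
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=3
  - snakemake
  - fastqc
```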
&lt;p>All in all, my work so far leads me to believe that Conda environments are the go-to
solution for short-term work.
Despite the issues with Conda environment longevity,
it&amp;rsquo;s so easy to use and install software that I think using them for your workflows is worth it.
For long-term projects (years or more) you should invest in some form of containerization solution,
along with all the security implications that go with it.
The next time I do a serious data science/bioinformatics project,
I&amp;rsquo;m probably going to do a long term sit down with Conda and
see if I can find a solution to the environment age problem,
because I&amp;rsquo;d really like to use that for all my work, all the time.&lt;/p>
&lt;h2 id="documenting-your-workflow">Documenting your workflow&lt;/h2>
&lt;p>This has been a long blogpost, so I&amp;rsquo;ll keep this short.
In order for users to be able to re-run your workflows,
they need instructions for doing so.
In terms of raw documentation, this pipeline (Snakemake + Conda)
generally boils down to only a few lines from installation to execution:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># install Miniconda from https://conda.io/miniconda.html&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>git clone https://github.com/your-pipeline.git
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>conda env create -f your-pipeline.yml
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source activate your-pipeline.yml
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># example execution for 24 cpus, actual snakemake execution command will likely differ&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>snakemake -j &lt;span style="color:#ae81ff">24&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This is really easy to shove in a &lt;code>README.md&lt;/code> on Github/Bitbucket/wherever.
That said, I&amp;rsquo;ve found that most users will want an in-person training session
where you walk them through the pipeline step-by-step
(&amp;ldquo;drop your files here&amp;rdquo;, let&amp;rsquo;s run through the following commands, etc.).
There&amp;rsquo;s not really any way around this -
you wouldn&amp;rsquo;t be performing the data analysis for them if they could do it themselves.
&lt;code>snakemake --dag | dot -Tsvg &amp;gt; dag.svg&lt;/code> is an incredibly useful command to produce
a workflow diagram to show your end user/data consumer how results are generated.
If you are the only user and all that matters is your end results,
the above installation instructions and a list of dependencies
are generally sufficient documentation for the future.&lt;/p>
&lt;p>I don&amp;rsquo;t have any magic tricks here,
but the above workflow generally simplifies and automates workflow deployment and execution
enough to make it doable for the average end-user to run.
All in all, the weakest point of this workflow is that Anaconda environments don&amp;rsquo;t age well -
if that ever gets fixed, I&amp;rsquo;d have few regrets.
Hopefully this was an informative read for those of you considering similar workflows.&lt;/p></content></item><item><title>Setting up LDAP auth for MariaDB</title><link>https://jstaf.github.io/posts/mariadb-ldap/</link><pubDate>Thu, 17 May 2018 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/mariadb-ldap/</guid><description>Having separate credentials to log onto a server and access a database on that network is a pain. Why not provide users just one set of credentials for both services? This is a quick how-to guide on how to setup LDAP authentication for MariaDB. As it turns out it&amp;rsquo;s insanely easy to setup. (The official MariaDB documentation on the subject can be quite hard to find however - which may or may not be the primary reason for this blog post&amp;hellip;).</description><content>&lt;p>Having separate credentials to log onto a server and access a database on that network is a pain.
Why not provide users just one set of credentials for both services?
This is a quick how-to guide on how to setup LDAP authentication for MariaDB.
As it turns out, it&amp;rsquo;s insanely easy to set up.
(The official MariaDB documentation on the subject can be quite hard to find however -
which may or may not be the primary reason for this blog post&amp;hellip;).&lt;/p>
&lt;p>This tutorial assumes that the database server has already been configured to authenticate users via LDAP
(blog post on this later!).
If you haven&amp;rsquo;t already, install MariaDB and set it up:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo yum install mariadb mariadb-server mariadb-devel
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo systemctl start mariadb
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo systemctl enable mariadb
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo mysql_secure_installation &lt;span style="color:#75715e"># yes to all prompts&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The next step is to login as the root user and enable the &lt;code>auth_pam&lt;/code> plugin.
&lt;code>auth_pam&lt;/code> delegates MariaDB user authentication to the base operating system through PAM.
PAM, or Pluggable Authentication Modules,
allow configuring authentication for different software packages via text file.
More on this later in this blog post.&lt;/p>
&lt;p>MariaDB ships with this plugin present, but not enabled.
You can install it with &lt;code>INSTALL SONAME 'auth_pam';&lt;/code>.
To use PAM authentication for a user,
create that user with &lt;code>IDENTIFIED VIA pam&lt;/code> in place of where you&amp;rsquo;d usually specify the user password.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>mysql -u root -p
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>INSTALL SONAME &lt;span style="color:#e6db74">&amp;#39;auth_pam&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>CREATE USER &lt;span style="color:#e6db74">&amp;#39;jstaf&amp;#39;&lt;/span>@&lt;span style="color:#e6db74">&amp;#39;%&amp;#39;&lt;/span> IDENTIFIED VIA pam;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If I wanted to create a test database for that user account
(I&amp;rsquo;ve named the database after the demo user in this case&amp;hellip;):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>CREATE DATABASE jstaf;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>GRANT ALL ON jstaf.* TO &lt;span style="color:#e6db74">&amp;#39;jstaf&amp;#39;&lt;/span>@&lt;span style="color:#e6db74">&amp;#39;%&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Ok so now that we&amp;rsquo;ve setup our demo user and our test database,
we&amp;rsquo;ll need to actually setup the PAM config for MariaDB.
MariaDB does its best to remain compatible with the original MySQL codebase it was forked from,
and in this case it is no different -
the PAM config for MariaDB is &lt;code>/etc/pam.d/mysql&lt;/code> by default.&lt;/p>
&lt;p>Create &lt;code>/etc/pam.d/mysql&lt;/code> with the following contents:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">#%PAM-1.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>auth required pam_ldap.so
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>account required pam_ldap.so
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>As PAM configs go, this is the absolute minimum.
After the first line of the file
(which merely identifies it as a PAM config to the OS),
the following two lines state that:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Authentication requires successfully authenticating against &lt;code>pam_ldap.so&lt;/code>,
the PAM module responsible for handling LDAP authentication.
A user will need to supply a valid password to pass the first line.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The account is indeed valid and meets any non-password authorization requirements
(also handled through &lt;code>pam_ldap.so&lt;/code>).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Clever users will notice that these same lines could be swapped out for other authentication modules.
As an example, &lt;code>pam_unix.so&lt;/code> covers standard authentication using local user accounts -
there&amp;rsquo;s a PAM module for pretty much every authentication mechanism out there.&lt;/p>
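&lt;p>For example, a hypothetical &lt;code>/etc/pam.d/mysql&lt;/code> that authenticates against local UNIX accounts instead of LDAP might look like the following (a minimal sketch - your distribution may ship extra PAM defaults worth including):&lt;/p>

```bash
#%PAM-1.0
# authenticate against local accounts in /etc/passwd and /etc/shadow
auth    required pam_unix.so
account required pam_unix.so
```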
&lt;p>Now all that&amp;rsquo;s left to do is login with your brand new LDAP-enabled user account:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>mysql -u jstaf -p
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Enter password:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Welcome to the MariaDB monitor. Commands end with ; or &lt;span style="color:#ae81ff">\g&lt;/span>.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Your MariaDB connection id is &lt;span style="color:#ae81ff">14&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Server version: 5.5.56-MariaDB MariaDB Server
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Copyright &lt;span style="color:#f92672">(&lt;/span>c&lt;span style="color:#f92672">)&lt;/span> 2000, 2017, Oracle, MariaDB Corporation Ab and others.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Type &lt;span style="color:#e6db74">&amp;#39;help;&amp;#39;&lt;/span> or &lt;span style="color:#e6db74">&amp;#39;\h&amp;#39;&lt;/span> &lt;span style="color:#66d9ef">for&lt;/span> help. Type &lt;span style="color:#e6db74">&amp;#39;\c&amp;#39;&lt;/span> to clear the current input statement.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>MariaDB &lt;span style="color:#f92672">[(&lt;/span>none&lt;span style="color:#f92672">)]&lt;/span>&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Success!&lt;/p></content></item><item><title>Remote backups with Borg and rsync</title><link>https://jstaf.github.io/posts/backups-with-borg-rsync/</link><pubDate>Mon, 12 Mar 2018 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/backups-with-borg-rsync/</guid><description>There&amp;rsquo;s a famous saying that &amp;ldquo;data that&amp;rsquo;s not backed up is data you&amp;rsquo;re prepared to lose.&amp;rdquo; I used Windows for a very long time, and managed to lose quite a bit of data back in the day because of either Windows Update bricking the system or just wanting to reinstall the OS (Windows has a habit of losing performance over time - easiest fix is a fresh install). I had been performing backups manually to an external hard disk, frequently forgot to backup something critical, and only had backups when I cared to make them (i.</description><content>&lt;p>There&amp;rsquo;s a famous saying that &amp;ldquo;data that&amp;rsquo;s not backed up is data you&amp;rsquo;re prepared to lose.&amp;rdquo;
I used Windows for a very long time, and managed to lose quite a bit of data
back in the day because of either Windows Update bricking the system or just
wanting to reinstall the OS (Windows has a habit of losing performance
over time - easiest fix is a fresh install).
I had been performing backups manually to an external hard disk,
frequently forgot to backup something critical, and only had backups when I cared to make
them (i.e. rarely). Fortunately, there&amp;rsquo;s a better way of doing things:
automated backups to a remote server with setup-and-forget tools like rsync and borg.
I haven&amp;rsquo;t lost data since.&lt;/p>
&lt;h2 id="before-we-start">Before we start&lt;/h2>
&lt;p>To use either of these tools, all you need is a UNIX system (Mac/Linux) and a
server or storage device to back up to. There are no other requirements. If
you don&amp;rsquo;t have a system of your own, I highly recommend
&lt;a href="http://rsync.net">Rsync.net&lt;/a>. Rsync.net is a very cheap/reliable backup
provider that simply gives you an SSH endpoint to dump your files in. Though plans
vary, the price is about $1 per GB stored per year, which is quite affordable, and the
service comes with free snapshots and support. If you choose this option,
be aware that there&amp;rsquo;s also a &lt;a href="http://www.rsync.net/products/attic.html">secret pricing tier&lt;/a>
for borg users which gives heavily discounted plans that do not include
support or snapshots (since borg does that for you).&lt;/p>
&lt;p>If you&amp;rsquo;ll be backing up to a remote system, you&amp;rsquo;ll want to set up passwordless SSH before you start
(all of the defaults are fine here, do not enter a passphrase for your key). This is more secure
than using a password to connect, and it means that backups over SSH can be performed non-interactively.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>ssh-keygen -t rsa
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># hit enter for all the prompts here, you typically do not want to set a passphrase&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ssh-copy-id username@server.web.address
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Before you do anything else, make sure that connecting over SSH
to your backup server no longer prompts you for a password.&lt;/p>
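&lt;p>One way to check this non-interactively (a hedged sketch - substitute your own username and server address) is to force SSH to fail rather than prompt:&lt;/p>

```bash
# BatchMode disables all interactive prompts; if key-based auth isn't
# working, this fails immediately instead of asking for a password
ssh -o BatchMode=yes username@server.web.address true && echo "passwordless SSH works"
```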
&lt;p>Alternatively, if you are backing up to a network drive, make sure you know how to mount and unmount
the drive via the command line (typically via &lt;code>mount&lt;/code> and &lt;code>umount&lt;/code>). Your backups should never be
mounted to your computer unless you are backing things up!
(This reduces the risk of attackers or accidents breaking your precious backups.)
Once this is done, you should be set.&lt;/p>
&lt;hr>
&lt;h2 id="simple-backups-with-rsync">Simple backups with rsync&lt;/h2>
&lt;p>&lt;code>rsync&lt;/code> is a very handy file-copying tool that performs easy, straightforward backups.
Backups are unencrypted and unauthenticated, but it&amp;rsquo;s trivial to set up and restore from.
If all you want is an up-to-date backup when things go bad, &lt;code>rsync&lt;/code> is the tool for you.&lt;/p>
&lt;h3 id="to-create-a-backup">To create a backup&lt;/h3>
&lt;p>This performs a very simple backup to any storage device.
Files are copied as-is, and all attributes (ownership, permissions, modification times, etc.)
are preserved. No authentication or encryption is performed,
meaning that anyone could get at your files if they can access your storage media.
Files that you delete are deleted from your backup,
and only files modified since the last backup are uploaded.&lt;/p>
&lt;p>&lt;strong>To back up to a local disk or network drive:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>rsync -az --delete /folder/to/back/up /destination/folder
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>To back up to a remote server:&lt;/strong>&lt;/p>
&lt;p>Note that this command connects to the remote server over SSH, meaning that
information is encrypted while being transferred to the remote server. The
actual backups themselves, however, are unencrypted.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>rsync -az --delete -e ssh /folder/to/back/up username@remote.host.address:/destination/folder
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="to-restore-from-a-backup">To restore from a backup&lt;/h3>
&lt;p>To restore files from your remote backup, no special magic is required - just reverse
the source and destination folders in the above command. Alternatively, you can
use tools like &lt;code>scp&lt;/code> or &lt;code>sftp&lt;/code> to restore individual files.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>rsync -az -e ssh username@remote.host.address:/destination/folder /folder/to/restore/in
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="pros-and-cons-of-rsync">Pros and cons of rsync&lt;/h3>
&lt;p>rsync is useful for its simplicity. You get easy-to-perform backups with virtually no learning curve,
and restoring files is a breeze (just copy them back!).&lt;/p>
&lt;p>There are some big drawbacks to this method, however. rsync is very space inefficient -
no compression or deduplication is performed. You only get access to a single backup as well.
If you want multiple backups for multiple dates, you&amp;rsquo;ll need to manage these manually, and
each extra backup will take up an equal amount of space (7 days&amp;rsquo; worth of backups == 7 times the storage usage).
Anyone with access to your storage media will also have access to your files.&lt;/p>
&lt;p>In light of this info, rsync is a great tool for fast and dirty backups to local storage media,
or when you are confident that your backup location is secure and cannot be accessed by anyone else.
If you want multiple backups and access controls, you&amp;rsquo;ll need a different tool.&lt;/p>
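&lt;p>That said, if you want to squeeze dated snapshots out of rsync anyway, its &lt;code>--link-dest&lt;/code> flag can hard-link unchanged files against the previous snapshot, so each extra day of backups costs almost no additional space. A rough local sketch (the throwaway &lt;code>mktemp&lt;/code> directories stand in for your real source folder and backup drive):&lt;/p>

```shell
# demo directories; swap these for your real source folder and backup drive
SRC=$(mktemp -d)
DEST=$(mktemp -d)
echo "important data" > "$SRC/notes.txt"

# copy today's snapshot, hard-linking unchanged files against the last one
TODAY=$(date -I)
rsync -a --delete --link-dest="$DEST/latest" "$SRC/" "$DEST/$TODAY/"

# point the "latest" symlink at the snapshot we just made
ln -sfn "$DEST/$TODAY" "$DEST/latest"
```

&lt;p>On the very first run rsync just warns that the &lt;code>--link-dest&lt;/code> target doesn&amp;rsquo;t exist yet and copies everything; subsequent runs only store changed files.&lt;/p>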
&lt;hr>
&lt;h2 id="secure-backups-with-borg">Secure backups with borg&lt;/h2>
&lt;p>Borg is a fantastic tool that covers the weaknesses of rsync without sacrificing much in terms of usability.
In particular, you&amp;rsquo;ll be able to keep multiple backups, save space through deduplication and compression,
and secure your data with either passwords or a keyfile.&lt;/p>
&lt;h3 id="setup">Setup&lt;/h3>
&lt;p>Borg requires a little bit of additional setup before you can start using it.
Having borg installed on the remote server will speed things up.
This is already done for you if you use Rsync.net, although you should specify
the environment variable &lt;code>BORG_REMOTE_PATH&lt;/code> to use the most recent version of borg available:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># only for Rsync.net users&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_REMOTE_PATH&lt;span style="color:#f92672">=&lt;/span>/usr/local/bin/borg1/borg1
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Additionally, you will need to initialize your repository before you can use it.
To create a new repository that&amp;rsquo;s password-protected, use the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># see &amp;#34;borg init --help&amp;#34; for more options like storage quotas, encryption options, etc.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>borg init -e repokey-blake2 username@remote.host.address:/destination/folder
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="creating-a-backup">Creating a backup&lt;/h3>
&lt;p>For your first backup, you may wish to do it interactively so you can watch the progress
and verify that things work. The following creates a backup titled &lt;code>backup-name&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>borg create --progress --stats username@remote.host.address:/destination/folder::backup-name /folder/to/back/up
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For subsequent backups, you will likely want to do things non-interactively. You can use the
following to create an automatically-named backup (computer name + date).
Note that further commands assume you&amp;rsquo;ve set the &lt;code>BORG_REPO&lt;/code> environment variable
(specifying a default repository to back up to).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># specify a password for non-interactive use&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_PASSPHRASE&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;your repository password&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># specify the default repository to use for backups&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_REPO&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;username@remote.host.address:/destination/folder&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>borg create ::&lt;span style="color:#66d9ef">$(&lt;/span>hostname&lt;span style="color:#66d9ef">)&lt;/span>-&lt;span style="color:#66d9ef">$(&lt;/span>date -I&lt;span style="color:#66d9ef">)&lt;/span> /folder/to/back/up
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="cleaning-up-old-backups">Cleaning up old backups&lt;/h3>
&lt;p>Chances are, you will not want to keep every backup ever made. You might want
to keep, say, only 7 days&amp;rsquo; worth of daily backups, 8 weeks of weekly backups,
and 12 months of monthly backups. To do so:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>borg prune --keep-daily &lt;span style="color:#ae81ff">7&lt;/span> --keep-weekly &lt;span style="color:#ae81ff">8&lt;/span> --keep-monthly &lt;span style="color:#ae81ff">12&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="inspecting-backups">Inspecting backups&lt;/h3>
&lt;p>To view a list of all backups:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>borg list
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># follows is a list of backups, dates, and ids&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To view all files within a backup (EXTREMELY VERBOSE, so output has been piped to &lt;code>head&lt;/code>):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>borg list ::backup-name | head
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="restoring-from-a-backup">Restoring from a backup&lt;/h3>
&lt;p>Restoring files is quite easy, although they are extracted to the current working directory
(so if you back up &lt;code>/home/youruser/some-folder&lt;/code>, expect it to recreate that directory structure
unless you &lt;code>cd&lt;/code> to the root directory).
To extract a single file:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cd /location/to/restore/to
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>borg extract ::archive file/to/restore
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To extract all files:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cd /location/to/restore/to
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>borg extract ::archive
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="automating-your-backups-and-sample-scripts">Automating your backups (and sample scripts!)&lt;/h2>
&lt;p>To run your backups automatically, you&amp;rsquo;ll want to create a script, and run it automatically through &lt;code>cron&lt;/code>.
Though &lt;code>crontab&lt;/code> is normally a great way to do this (run a task at a specified time and day), it is not
very flexible - if you set it to perform backups at 3am, and you&amp;rsquo;re not logged onto your laptop at 3am,
the backup won&amp;rsquo;t happen! Instead, we&amp;rsquo;ll create a script and put it in &lt;code>/etc/cron.daily&lt;/code>. On most
distributions, scripts here are run once a day by anacron, which catches up on any missed runs
shortly after the computer comes back online.
Here are some sample scripts that you can use for either rsync or borg.
Installation is the same - just copy these to &lt;code>/etc/cron.daily&lt;/code>.
You&amp;rsquo;ll note I&amp;rsquo;ve fully specified the path to programs here - this is a &amp;ldquo;best practice&amp;rdquo; when working with
scripts to be run under &lt;code>cron&lt;/code>.&lt;/p>
&lt;h3 id="rsync-example-script">rsync example script&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#75715e"># Backup a folder to a remote address using rsync.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Usage: backup-rsync.sh&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># To restore: rsync -az -e ssh username@remote.host.address:backups/$(hostname)/folder /restore/point&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>set -eu
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>/usr/bin/ssh username@remote.host.address mkdir -p backups/&lt;span style="color:#66d9ef">$(&lt;/span>hostname&lt;span style="color:#66d9ef">)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>/usr/bin/rsync -az --delete -e ssh /folder/to/back/up username@remote.host.address:backups/&lt;span style="color:#66d9ef">$(&lt;/span>hostname&lt;span style="color:#66d9ef">)&lt;/span>/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="borg-example-script">Borg example script&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#75715e"># Backup a folder to a remote address using borg.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Usage: backup-borg.sh&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># To restore: borg extract $BORG_REPO::computer-and-date&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>set -eu
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_REPO&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;username@remote.host.address:borg/repo/path&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_PASSPHRASE&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;your password&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export BORG_REMOTE_PATH&lt;span style="color:#f92672">=&lt;/span>/path/to/remote/borg
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>/usr/bin/borg create ::&lt;span style="color:#66d9ef">$(&lt;/span>hostname&lt;span style="color:#66d9ef">)&lt;/span>-&lt;span style="color:#66d9ef">$(&lt;/span>date&lt;span style="color:#66d9ef">)&lt;/span> /folder/to/back/up
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>/usr/bin/borg prune ::&lt;span style="color:#66d9ef">$(&lt;/span>hostname&lt;span style="color:#66d9ef">)&lt;/span>-&lt;span style="color:#66d9ef">$(&lt;/span>date&lt;span style="color:#66d9ef">)&lt;/span> --keep-daily&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">14&lt;/span> --keep-monthly&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">6&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="youre-set">You&amp;rsquo;re set!&lt;/h2>
&lt;p>Assuming you&amp;rsquo;ve added one of these scripts to your &lt;code>/etc/cron.daily/&lt;/code> folder,
all you have to do is wait. If you&amp;rsquo;ve added &lt;code>BORG_REPO&lt;/code> to your &lt;code>.bashrc&lt;/code>,
you can check in and verify that your backups are working properly with
&lt;code>borg list&lt;/code> (you should see a list of your current backups).&lt;/p></content></item><item><title>Installing FL Studio on Linux</title><link>https://jstaf.github.io/posts/flstudio-on-linux/</link><pubDate>Thu, 22 Feb 2018 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/posts/flstudio-on-linux/</guid><description>Linux does a lot of things well. Music production is not usually one of them, mainly due to a lack of good programs available on Linux. Fortunately FL Studio, one of the most popular DAWs out there works flawlessly through Wine.
Wine is a Windows compatibility layer for Linux. You can often run Windows programs with it, though personally, my success has been mixed (especially for performance critical applications like videogames).</description><content>&lt;p>Linux does a lot of things well.
Music production is not usually one of them,
mainly due to a lack of good programs available on Linux.
Fortunately, FL Studio, one of the most popular DAWs out there, works flawlessly through Wine.&lt;/p>
&lt;p>Wine is a Windows compatibility layer for Linux.
You can often run Windows programs with it,
though personally, my success has been mixed
(especially for performance critical applications like videogames).
In this case though, we can use it to run FL Studio, and it works perfectly.&lt;/p>
&lt;p>Here&amp;rsquo;s a quick preview of our end product
(sorry for the terrible video quality, but it shows we have a nice, working install):&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe src="https://www.youtube.com/embed/pPJ4lRKLOHk" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" allowfullscreen title="YouTube Video">&lt;/iframe>
&lt;/div>
&lt;hr>
&lt;h2 id="setting-up-wine">Setting up Wine&lt;/h2>
&lt;p>The following instructions are written using Fedora,
but should work on any variety of Linux
(adapt the next command to your package manager of choice).&lt;/p>
&lt;p>To start, we&amp;rsquo;ll need &lt;code>wine&lt;/code> and &lt;code>winetricks&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf install wine winetricks
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Once this completes, you&amp;rsquo;ll need to install some fonts needed by FL Studio.
Run &lt;code>winetricks&lt;/code> in the console, and select &amp;ldquo;select the default wineprefix&amp;rdquo;.
Install the &amp;ldquo;core&amp;rdquo; Microsoft fonts.&lt;/p>
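&lt;p>If you&amp;rsquo;d rather skip the menus, winetricks can usually do the same thing in one shot from the command line (assuming the &lt;code>corefonts&lt;/code> verb is available in your winetricks version):&lt;/p>

```bash
# install the Microsoft core fonts into the default wineprefix
winetricks corefonts
```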
&lt;h2 id="install-fl-studio">Install FL Studio&lt;/h2>
&lt;p>Download the FL Studio installer from the official website.
Once it&amp;rsquo;s downloaded, run the following command and install with all of the default settings:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>wine flstudio_12.5.1.165.exe
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Protip: in terms of ASIO drivers, use FL Studio ASIO instead of ASIO4ALL (it&amp;rsquo;s just better).&lt;/p>
&lt;p>While things are installing, download your registration key from the FL Studio website
(&lt;code>FLRegkey.Reg&lt;/code>). You can import it into the Wine registry with Wine&amp;rsquo;s &lt;code>regedit&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>regedit FLRegkey.Reg
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Congrats, you now have a fully working version of FL Studio on Linux.
And before you ask, yes - all of your VST plugins will work out of the box.&lt;/p></content></item><item><title>About me</title><link>https://jstaf.github.io/about/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/about/</guid><description>Hi there! I am a sysadmin currently living in Toronto. I started my career as a neuroscientist, dabbled for a bit as a bioinformatician, but now I spend most of my time fiddling with databases and Kubernetes on various clouds. I collect job titles and hobbies, and my favorite pastime is coding while simultaneously binging B-grade scifi/fantasy tv shows (and making no progress on my personal projects). I really like bunnies.</description><content>&lt;p>&lt;img alt="Me" src="https://jstaf.github.io/images/me.jpg#floatright"> Hi there! I am a sysadmin currently living in
Toronto. I started my career as a neuroscientist, dabbled for a bit as a
bioinformatician, but now I spend most of my time fiddling with databases and
Kubernetes on various clouds. I collect job titles and hobbies, and my favorite
pastime is coding while simultaneously binging B-grade scifi/fantasy tv shows
(and making no progress on my personal projects). I &lt;em>really&lt;/em> like bunnies.&lt;/p>
&lt;p>The best way to reach me is via email at
&lt;code>jeff (dot) stafford (at) protonmail (dot) com&lt;/code> (I am bad at replying sometimes
but try to always eventually get back to everyone!).&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Github&lt;/strong>: &lt;a href="https://github.com/jstaf">https://github.com/jstaf&lt;/a>&lt;/li>
&lt;li>&lt;strong>Email&lt;/strong>: &lt;code>jeff (dot) stafford (at) protonmail (dot) com&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Phone&lt;/strong>: (Please email me for my number if you&amp;rsquo;d like to reach me by phone -
I unfortunately had to remove my number from this website after too many
unsolicited phone calls.)&lt;/li>
&lt;/ul></content></item><item><title>Showcase</title><link>https://jstaf.github.io/showcase/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://jstaf.github.io/showcase/</guid><description>These are a bunch of public personal projects in various mediums that I&amp;rsquo;ve dabbled with over the years.
onedriver I was really irritated that Microsoft OneDrive didn&amp;rsquo;t support Linux, and all of the existing OneDrive clients were kind of bad at the time (you want to download my entire OneDrive account to my local computer? Yuck.). This was my first golang project and has kind of taken on a life of its own with several tens of thousands of users.</description><content>&lt;p>These are a bunch of public personal projects in various mediums that I&amp;rsquo;ve
dabbled with over the years.&lt;/p>
&lt;h1 id="onedriver">onedriver&lt;/h1>
&lt;p>I was really irritated that Microsoft OneDrive didn&amp;rsquo;t support Linux,
and all of the existing OneDrive clients were kind of bad at the time
(you want to download my entire OneDrive account to my local computer? Yuck.).
This was my first golang project and has kind of taken on a life of its own
with several tens of thousands of users.
I attribute a lot of this success to it being easy to use and install compared to
the existing alternatives.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Github repository:&lt;/strong> &lt;a href="https://github.com/jstaf/onedriver">https://github.com/jstaf/onedriver&lt;/a>&lt;/li>
&lt;li>&lt;strong>Fedora COPR .rpm repos:&lt;/strong> &lt;a href="https://copr.fedorainfracloud.org/coprs/jstaf/onedriver/">https://copr.fedorainfracloud.org/coprs/jstaf/onedriver/&lt;/a>&lt;/li>
&lt;li>&lt;strong>OpenSUSE Build Service .deb repos:&lt;/strong> &lt;a href="https://software.opensuse.org/download.html?project=home%3Ajstaf&amp;package=onedriver">https://software.opensuse.org/download.html?project=home%3Ajstaf&amp;package=onedriver&lt;/a>&lt;/li>
&lt;li>&lt;strong>AUR:&lt;/strong> &lt;a href="https://aur.archlinux.org/packages/onedriver">https://aur.archlinux.org/packages/onedriver&lt;/a>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="mayorate">mayorate&lt;/h1>
&lt;p>&lt;img alt="Playing with nukes" src="https://jstaf.github.io/images/nukes.gif#floatright">&lt;/p>
&lt;p>This was probably my first real coding project.
I was &lt;em>really&lt;/em> into this videogame called Starfarer many years ago
(now &lt;a href="https://fractalsoftworks.com/">Starsector&lt;/a>),
and decided to make a mod for it.
Though I don&amp;rsquo;t really actively play Starsector anymore,
I occasionally like to update the mod to work with the latest version
of the game and boot things up again for old times&amp;rsquo; sake.&lt;/p>
&lt;p>This project was very important to me because it was what made me realize that
computing was easy and more importantly, &lt;em>I enjoyed it&lt;/em>.
It was basically the start of what would later be a career.
(This was in contrast to science, which I actually hated working on -
I really just liked playing with the computers.)&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Github repository:&lt;/strong> &lt;a href="https://github.com/jstaf/mayorate">https://github.com/jstaf/mayorate&lt;/a>&lt;/li>
&lt;li>&lt;strong>Forum thread:&lt;/strong> &lt;a href="https://fractalsoftworks.com/forum/index.php?topic=7372.0">https://fractalsoftworks.com/forum/index.php?topic=7372.0&lt;/a>&lt;/li>
&lt;li>&lt;strong>Starsector website&lt;/strong> (you need to buy a copy if you want to try this out): &lt;a href="https://fractalsoftworks.com/">https://fractalsoftworks.com/&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;img alt="MDSV Narayana" src="https://jstaf.github.io/images/nara.png#centre">
An early rigging of the largest capital ship in the mod,
along with several of its fighters.&lt;/p>
&lt;p>&lt;img alt="The surface of Inir" src="https://jstaf.github.io/images/inir_surface.png#centre">
The hellish surface of the Mayorate mining planet of Inir.
Probably my favorite digital painting I did for this project.&lt;/p>
&lt;hr>
&lt;h1 id="ezldap">ezldap&lt;/h1>
&lt;p>I used to work at a supercomputing facility that used OpenLDAP as an identity provider
for its compute clusters. If you&amp;rsquo;ve ever used OpenLDAP, you&amp;rsquo;ll know it&amp;rsquo;s a royal pain
to work with and you basically have to come up with your own tooling and
directory structure, which is a TON of work compared to Active Directory or FreeIPA.
ezldap was a set of Python scripts and clever templating used to make managing users,
groups, etc. much easier than it otherwise was using stuff like
&lt;code>ldapmodify&lt;/code> and &lt;code>ldapsearch&lt;/code>.&lt;/p>
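&lt;p>The gist of the templating approach was something like the following sketch
(a simplified, hypothetical example using Python&amp;rsquo;s built-in
&lt;code>string.Template&lt;/code>, not the actual ezldap code or API): render an LDIF
snippet from a template instead of hand-writing &lt;code>ldapadd&lt;/code> input every time.&lt;/p>

```python
# Simplified sketch of ezldap-style LDIF templating (hypothetical template,
# not the real ezldap internals): fill in an "add user" LDIF from a template
# rather than hand-writing ldapmodify/ldapadd input each time.
from string import Template

# A hypothetical LDIF template for a POSIX user entry.
USER_LDIF = Template("""\
dn: uid=$uid,ou=People,$basedn
objectClass: inetOrgPerson
objectClass: posixAccount
uid: $uid
cn: $cn
uidNumber: $uidnumber
gidNumber: $gidnumber
homeDirectory: /home/$uid
""")

def render_user(uid, cn, uidnumber, gidnumber, basedn="dc=example,dc=com"):
    """Render an LDIF snippet ready to pipe to ldapadd."""
    return USER_LDIF.substitute(
        uid=uid, cn=cn, uidnumber=uidnumber,
        gidnumber=gidnumber, basedn=basedn,
    )

print(render_user("jdoe", "Jane Doe", 10001, 10001))
```

&lt;p>The rendered output can then be fed straight to &lt;code>ldapadd&lt;/code>, which is
the part that made day-to-day OpenLDAP administration bearable.&lt;/p>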
&lt;p>I stopped working on this project after rolling out FreeIPA and/or IAM-based access
at subsequent places of employment. FreeIPA does everything ezldap did for
OpenLDAP, but better.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Official documentation:&lt;/strong> &lt;a href="https://ezldap.readthedocs.io/en/latest/">https://ezldap.readthedocs.io/en/latest/&lt;/a>&lt;/li>
&lt;li>&lt;strong>Github repository:&lt;/strong> &lt;a href="https://github.com/jstaf/ezldap">https://github.com/jstaf/ezldap&lt;/a>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="teaching-materials">Teaching materials&lt;/h1>
&lt;p>While working at the supercomputing facility, roughly half of my job was not
maintaining our compute clusters and running analyses, but teaching scientists
and doctors how to use our systems.
I wrote a lot of teaching materials and ran a ton of workshops on various computing
topics, including teaching as part of some for-credit graduate courses at
Queen&amp;rsquo;s University.
Though I don&amp;rsquo;t really teach these types of workshops anymore, a lot of my teaching
materials have been adopted by the community and now see widespread use across the
world (the &amp;ldquo;Intro to HPC&amp;rdquo; and Snakemake courses I wrote for Software Carpentry are
&lt;em>extremely&lt;/em> popular).&lt;/p>
&lt;blockquote>
&lt;p>&amp;ldquo;Write a genomics pipeline for a scientist and you can frustrate them for a day.&amp;rdquo;&lt;/p>
&lt;p>&amp;ldquo;Teach a scientist how to program and you can frustrate them for a lifetime.&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>&lt;strong>Intro to High-Performance Computing:&lt;/strong> &lt;a href="https://carpentries-incubator.github.io/hpc-intro/">https://carpentries-incubator.github.io/hpc-intro/&lt;/a>&lt;/li>
&lt;li>&lt;strong>Intro to High-Performance Computing in Python (Snakemake):&lt;/strong> &lt;a href="http://www.hpc-carpentry.org/hpc-python/">http://www.hpc-carpentry.org/hpc-python/&lt;/a>&lt;/li>
&lt;li>&lt;strong>Data Science with R:&lt;/strong> &lt;a href="https://jstaf.github.io/r-data-science/">https://jstaf.github.io/r-data-science/&lt;/a>&lt;/li>
&lt;li>&lt;strong>HPC R:&lt;/strong> &lt;a href="https://jstaf.github.io/hpc-r/">https://jstaf.github.io/hpc-r/&lt;/a>&lt;/li>
&lt;li>&lt;strong>R Package Development:&lt;/strong> &lt;a href="https://github.com/jstaf/r-package-devel">https://github.com/jstaf/r-package-devel&lt;/a>&lt;/li>
&lt;li>&lt;strong>BioMaRt:&lt;/strong> &lt;a href="https://github.com/jstaf/biomaRt_tutorial">https://github.com/jstaf/biomaRt_tutorial&lt;/a>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="music">Music&lt;/h1>
&lt;p>Before I discovered programming, my hobby was writing music. I was never particularly
good at it. I would spend all my time playing with synthesizers (all-time favorite
is Audjoo Helix) and mixing the tracks so they sounded perfect, but writing a melody,
and the actual composition in general, was always a massive struggle. It was agony
to write a song from start to finish and keep the style consistent the whole
way through.&lt;/p>
&lt;p>I was a giant purist who thought that using any kind of drum loops, sampling
from existing songs, or even using pre-existing synthesizer presets was cheating,
so most of my time was spent tweaking knobs to generate my instruments before I
ever even got started. Obviously, I got very little done. Most of the time I
would write something, come back the next morning, decide I hated it, and
discard what I had before starting over. Even though I wrote a lot of different
stuff, generally the only songs I could reliably complete were soundtracks:
I had a direct use for them (I was playing around with creating videogames at
the time) and they also required genuine recorded instruments like strings, so I
would spend less time tweaking knobs and more time on music. (There&amp;rsquo;s probably a
lesson to be learned here somewhere&amp;hellip;)&lt;/p>
&lt;p>I&amp;rsquo;ve mostly stopped writing music now (gave myself tinnitus and decided I had to stop
if I didn&amp;rsquo;t want it to get worse), but these are a selection of the fully completed
tracks I like the most.&lt;/p>
&lt;hr>
&lt;p>&lt;iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay"
src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/265956876&amp;color=%231c1511&amp;auto_play=false&amp;hide_related=true&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false&amp;show_teaser=true">&lt;/iframe>
This track never really got a title beyond the original filename (&lt;code>nebula.flp&lt;/code>)
when I created it. It was supposed to be ambient background music
for while you&amp;rsquo;re flying around space in one of my videogame projects, but it
ended up as probably my best track and somewhat dominates the mood whenever this
song would play in-game. (This sounds like a failure for what&amp;rsquo;s supposed to be
&amp;ldquo;background music&amp;rdquo;, but it actually works really well.)
I was actively trying to avoid using drums here, as at the time I didn&amp;rsquo;t think I was
capable of writing a song without heavy drums all over the place.
Fortunately, I was wrong. This is probably the only track I am 100% pleased with
from start to finish.&lt;/p>
&lt;hr>
&lt;p>&lt;iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay"
src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/164824479&amp;color=%231c1511&amp;auto_play=false&amp;hide_related=true&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false&amp;show_teaser=true">&lt;/iframe>
One of very few tracks I finished that qualifies as anything close to normal music.
I wasted a ton of time with synthesizers here; every non-piano/drum instrument
in this one was again synthesized by hand. The middle bit beginning at 2:00
is probably the closest I&amp;rsquo;ve gotten to releasing a professional-grade trance track.&lt;/p>
&lt;hr>
&lt;p>&lt;iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay"
src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/136247231&amp;color=%231c1511&amp;auto_play=false&amp;hide_related=true&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false&amp;show_teaser=true">&lt;/iframe>
This is a weird one. It doesn&amp;rsquo;t lend itself well to use in soundtracks (the
whole song is a giant crescendo) or as normal music and there is a lot of odd
stuff going on. The weird pulsing sound at 1:37 is actually a kick drum
resampled in a horrible way and an ungodly amount of FX plugins mashing up what
remains. I synthesized virtually all sounds in this track from scratch aside
from the opening pad and guitars (even those got pretty warped though).&lt;/p>
&lt;hr>
&lt;p>&lt;iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay"
src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/132964843&amp;color=%231c1511&amp;auto_play=false&amp;hide_related=true&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false&amp;show_teaser=true">&lt;/iframe>
Steel Rain was probably my most popular song because it works perfectly for what
it was meant to be (a videogame battle soundtrack) and also evokes the same feeling
as the classic game Homeworld (notable for its very atmospheric,
space-y sound). This was the easiest of the tracks to write. I found
some exceptional taiko samples on a random corner of the internet, added in an African
drum set, and then just went to town with them and a couple of very menacing soundfonts
and pads. It basically wrote itself once I had the right sounds to start with.&lt;/p>
&lt;hr>
&lt;h1 id="actmon">actmon&lt;/h1>
&lt;p>This was an R package I wrote during grad school to compute statistics on
&lt;em>Drosophila melanogaster&lt;/em> behavior as measured by TriKinetics&amp;rsquo;
&lt;em>Drosophila&lt;/em> Activity Monitor. It&amp;rsquo;s the one and only R package I&amp;rsquo;ve released to date.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Github repository:&lt;/strong> &lt;a href="https://github.com/jstaf/actmon">https://github.com/jstaf/actmon&lt;/a>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="gcamp4d">gcamp4d&lt;/h1>
&lt;p>This was a MATLAB GUI application I wrote to better measure changes in neuronal
activity over time when using GCaMP (a fluorescent protein used to measure
neuron activity) and a very specific imaging setup on a confocal laser microscope.
I can&amp;rsquo;t imagine anyone outside of my old lab using this, but it made some pretty
sweet 3D images of neurons over time.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Github repository:&lt;/strong> &lt;a href="https://github.com/jstaf/GCaMP_4D">https://github.com/jstaf/GCaMP_4D&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;img alt="Neurons" src="https://jstaf.github.io/images/gcamp4d.png#centre">
Two neurons imaged &lt;em>in vivo&lt;/em> activating in response to a stimulus.&lt;/p></content></item></channel></rss>