Introduction to Snakemake

🚧 Work in Progress

This repository is under active development.
Expected completion: 20th of May 2026

Who this workshop is for

Researchers and research software engineers who are comfortable with Python and bash and want to move beyond shell scripts and `for` loops. No prior Snakemake experience is assumed.

What you will build

By the end of Episode 5, you will have a complete, cluster-ready RNA-seq quantification pipeline:

Paired-end FASTQs
       │
       ▼
    FastQC          ← quality assessment (per sample, per read)
       │
       ▼
     fastp          ← adapter trimming (intermediate outputs auto-deleted)
       │
       ▼
    HISAT2          ← splice-aware alignment (BAMs write-protected)
       │
       ▼
    Subread         ← quantification across all samples in one call
       │
       ▼
  counts matrix
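
The fan-out in the diagram can be sketched in plain Python. This is not Snakemake code, and the sample names, read labels, and file paths are hypothetical; the layout used in the episodes may differ. The sketch only shows which stages produce one file per sample and read, which produce one file per sample, and which produce a single file overall. The sample-by-read product is the same kind of combination that `expand()` builds by default.

```python
from itertools import product

# Hypothetical sample names and read labels, for illustration only.
SAMPLES = ["sampleA", "sampleB"]
READS = ["R1", "R2"]


def pipeline_targets(samples, reads):
    """Return the files each stage of the diagram would produce.

    All directory and file names here are assumptions, not the
    workshop's actual layout.
    """
    pairs = list(product(samples, reads))
    return {
        # FastQC: one report per sample and per read
        "fastqc": [f"qc/{s}_{r}_fastqc.html" for s, r in pairs],
        # fastp: one trimmed FASTQ per sample and per read
        "fastp": [f"trimmed/{s}_{r}.fastq.gz" for s, r in pairs],
        # HISAT2: one BAM per sample
        "hisat2": [f"aligned/{s}.bam" for s in samples],
        # Subread: a single counts matrix covering all samples
        "subread": ["counts/counts.tsv"],
    }


if __name__ == "__main__":
    for stage, files in pipeline_targets(SAMPLES, READS).items():
        print(f"{stage}: {len(files)} file(s)")
        for path in files:
            print(f"  {path}")
```

Adding a sample grows the first three stages but never the last one, which is why the quantification step runs as a single call over all BAMs.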

Episodes

| Episode | Topic | Key concepts |
|---|---|---|
| Episode 1 | From shell scripts to Snakemake | Rules, inputs/outputs, `rule all`, dry runs |
| Episode 2 | Wildcards, `expand()`, and the DAG | `{wildcards}`, `expand()`, `--dag`, `--rulegraph` |
| Episode 3 | A real Snakefile: RNA-seq from scratch | `configfile:`, `params:`, `log:`, `temp()`, `protected()` |
| Episode 4 | Scaling to the cluster: Slurm via DRMAA | `threads:`, `resources:`, `--executor drmaa`, `--drmaa-args`, profiles |
| Episode 5 | Robustness and best practices | `benchmark:`, `--rerun-incomplete`, `wildcard_constraints:`, conda envs |

Before you start

This workshop assumes Snakemake 9, snakemake-executor-plugin-drmaa, and Python DRMAA bindings are already installed. See the installation guide for instructions specific to this cluster.
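
A quick way to confirm these prerequisites from the active Python environment is sketched below. The distribution names are assumptions: `drmaa` is the usual PyPI name for the Python DRMAA bindings, but this cluster may package them differently. The check reads installed package metadata only; it does not verify that the DRMAA C library itself is reachable.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; adjust if this cluster uses different ones.
REQUIRED = ["snakemake", "snakemake-executor-plugin-drmaa", "drmaa"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line reports `NOT INSTALLED`, or the Snakemake version does not start with 9, follow the installation guide before starting Episode 1.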

A note on the examples

Episodes 1–2 use toy data (text files, word counts). Episodes 3–5 switch to a realistic but deliberately simplified RNA-seq skeleton. The pipeline is designed to illustrate Snakemake concepts cleanly, not to serve as a production workflow. For a production-grade RNA-seq workflow ready to run out of the box, see the Snakemake wrappers and community workflow catalogues.