A bioinformatics workflow for identifying and characterizing Archaea from extreme environmental samples using whole-genome shotgun (WGS) sequencing.
This pipeline integrates current bioinformatics tools (2023–2025) to process raw Illumina sequencing reads into:
- High-quality Metagenome-Assembled Genomes (MAGs) with completeness and contamination metrics
- Taxonomic classification using GTDB-Tk (Genome Taxonomy Database)
- Functional annotation with metabolic pathway analysis
- Secondary metabolite detection via antiSMASH
- Phylogenomic reconstruction with IQ-TREE2
- NCBI submission-ready files and metadata
Best suited for:
- Extreme environment samples (hydrothermal vents, acid mine drainage, salt lakes, subsurface)
- Environmental metagenomic studies
- Novel archaeal isolate characterization
- Genome-centric metagenomics workflows
| Requirement | Minimum | Recommended (Production) | Why? |
|---|---|---|---|
| RAM | 128 GB | 512 GB – 1 TB | MetaSPAdes and GTDB-Tk are memory-intensive |
| CPU | 16 cores | 64+ cores | Parallelization reduces runtime by days |
| Storage | 2 TB SSD | 10 TB NVMe | Assembly graphs and BAM files are massive |
| OS | Linux (Ubuntu 20.04+) or WSL2 on Windows | Linux (Ubuntu 22.04 LTS) | — |
All tools are installed via Conda/Mamba. See installation instructions below.
If you don't have Conda installed, download Miniconda:
# Download Miniconda for Linux
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3
# Activate conda
source ~/miniconda3/bin/activate

# Clone the repository and make the scripts executable
git clone https://github.com/axp-knickei/X-ARCH.git
cd X-ARCH
chmod +x archaea_pipeline.sh setup_environment.sh

# Run the setup script
./setup_environment.sh

This creates a Conda environment with all required tools and downloads the necessary databases (GTDB, DRAM, antiSMASH — ~50 GB total).
The pipeline uses a config.yaml file for all parameters. You can edit this file directly or override specific parameters via command-line arguments.
Default Configuration (config.yaml):
# General pipeline settings
pipeline:
  threads: 32
  max_memory_gb: 500  # in GB
  sample_name: "archaea_sample"
  work_dir: "./archaea_analysis"
  data_dir: "./input_data"
  # ... other settings for databases, tools, sample metadata, submitter info

For more details, see config.yaml.
You can run the pipeline using the original bash script, Snakemake, or via Docker.
This is the traditional way to run the pipeline. Ensure your archaea_env conda environment is activated.
# Activate the environment
mamba activate archaea_env
# Run with your data (parameters override config.yaml)
./archaea_pipeline.sh \
-1 /path/to/sample_R1.fastq.gz \
-2 /path/to/sample_R2.fastq.gz \
-s MyArchaeaSample \
-t 64 \
-m 1000 \
-w /path/to/my_analysis_output

For more details on bash script usage, refer to ./archaea_pipeline.sh --help.
Snakemake provides robust workflow management, automatic parallelization, and resume capabilities.
- Configure: Ensure config.yaml is updated with your input read paths and desired parameters.

  reads:
    r1: "/path/to/sample_R1.fq.gz"
    r2: "/path/to/sample_R2.fq.gz"
  # ... other pipeline settings

- Run:

  # Activate environment
  mamba activate archaea_env
  # Run pipeline locally with 32 cores, using the conda environment
  snakemake --cores 32 --use-conda

For advanced Snakemake usage (cluster execution, dry-runs), see SNAKEMAKE_README.md.
Containerization ensures maximum reproducibility and portability.
- Build the Image:

  docker build -t x-arch:v1.0 .

- Run the Pipeline:

  docker run --rm -it \
    -v /path/to/your/data:/data \
    -v /path/to/your/output:/output \
    -v /path/to/local/databases:/databases \
    x-arch:v1.0 \
    -1 /data/sample_R1.fq.gz \
    -2 /data/sample_R2.fq.gz \
    -s MySample \
    -t 16 \
    -m 64

Note: Paths passed to -1, -2, etc. should be relative to the container's mounted volumes (e.g., /data/sample_R1.fq.gz). For more details on Docker usage, see DOCKER_README.md.
The pipeline includes unit tests for helper scripts and a mocked integration test for the main workflow.
# Activate environment
mamba activate archaea_env
# Run unit tests
pytest tests/test_python_scripts.py
# Run mocked integration test
./tests/run_integration_test.sh

For more detailed testing instructions and troubleshooting, see TESTING.md.
- Tool: fastp v0.23.4
- Input: Raw FASTQ files
- Output: Cleaned FASTQ + HTML QC report
- Runtime: ~1–5 minutes
- Tool: MetaSPAdes v3.15.5
- Input: Clean paired-end reads
- Output: Assembly contigs (FASTA)
- Runtime: 24–72 hours (depends on coverage & complexity)
- Tools:
- Bowtie2: Map reads back to assembly for coverage
- SemiBin2: Deep learning-based genome binning (primary)
- MetaBAT2: Coverage/composition-based binning (secondary)
- Output: Genome bins (FASTA files)
- Runtime: 6–12 hours
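Both binners consume per-contig coverage derived from the Bowtie2 mapping (typically via `samtools depth` or MetaBAT2's `jgi_summarize_bam_contig_depths`). As a minimal sketch of what that coverage profile looks like, here is mean depth per contig computed from a mocked `samtools depth`-style table (assumed column layout: contig, position, depth):

```shell
# Mocked samtools-depth-style output: contig<TAB>position<TAB>depth
cat > depth.tsv <<'EOF'
contig_1	1	10
contig_1	2	20
contig_2	1	5
EOF

# Mean depth per contig = sum of per-position depths / number of covered positions
awk -F'\t' '
  { sum[$1] += $3; n[$1]++ }
  END { for (c in sum) printf "%s\t%.1f\n", c, sum[c] / n[c] }
' depth.tsv | sort > mean_depth.tsv
cat mean_depth.tsv
```

In the real pipeline this table would come from the sorted BAM produced by Bowtie2, not a heredoc.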
- Tool: CheckM2 v1.0.1 (Machine Learning–based, lineage-agnostic)
- Metrics: Completeness (%), Contamination (%), Strain heterogeneity
- Filter: High-quality bins: >90% complete, <5% contamination
- Runtime: 2–4 hours
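The >90% completeness / <5% contamination filter can be applied directly to CheckM2's quality_report.tsv. A minimal sketch using a mocked report (assumes Completeness and Contamination are the second and third tab-separated columns, as in CheckM2 v1.x; real reports carry additional columns):

```shell
# Mocked CheckM2 quality report (real reports have more columns)
cat > quality_report.tsv <<'EOF'
Name	Completeness	Contamination
bin_001	97.4	1.2
bin_002	88.0	0.9
bin_003	95.1	6.3
EOF

# Keep high-quality bins: >90% complete, <5% contamination
awk -F'\t' 'NR > 1 && $2 > 90 && $3 < 5 {print $1}' quality_report.tsv > hq_bins.txt
cat hq_bins.txt
```

Here only bin_001 passes: bin_002 is too incomplete, bin_003 too contaminated.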
- Tool: GTDB-Tk v2.3.2 (Genome Taxonomy Database)
- Output: Archaeal lineage assignments (phylum, class, order, family, genus, species)
- Runtime: 3–6 hours (first run downloads ~27 GB database)
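GTDB-Tk writes its archaeal assignments to a tab-separated summary (the ar53 summary referenced in the output layout below). A sketch of pulling genus-level calls out of the semicolon-delimited classification string, using a mocked two-column summary (user_genome, classification; real summaries carry many more columns):

```shell
# Mocked GTDB-Tk summary: user_genome<TAB>classification
cat > ar53.summary.tsv <<'EOF'
user_genome	classification
bin_001	d__Archaea;p__Halobacteriota;c__Halobacteria;o__Halobacteriales;f__Haloferacaceae;g__Haloferax;s__
EOF

# Extract the genus field (6th rank in the d__;p__;c__;o__;f__;g__;s__ string)
awk -F'\t' 'NR > 1 {split($2, ranks, ";"); print $1 "\t" ranks[6]}' ar53.summary.tsv > genus_calls.tsv
cat genus_calls.tsv
```

An empty `s__` field, as in this mocked row, indicates no species-level assignment — a common outcome for novel archaeal lineages.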
- Tools:
- DRAM: Gene annotation + metabolic pathway reconstruction
- antiSMASH 7.0: Secondary metabolite/Biosynthetic Gene Cluster detection
- Output: Heatmaps, metabolic profiles, BGC visualizations
- Runtime: 8–16 hours
- Tool: IQ-TREE2 v2.3+
- Input: 53 concatenated archaeal marker genes (GTDB ar53 set, from GTDB-Tk)
- Output: Maximum likelihood phylogenetic tree (Newick format)
- Runtime: 2–4 hours
- Summary analysis report
- Automated figure generation
- Runtime: <1 hour
- Filters contigs <200 bp (NCBI requirement)
- Runs FCS-GX contamination screening
- Generates submission metadata template
- Output: NCBI-ready genome files + BioSample/BioProject metadata
- Runtime: 1–2 hours
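The <200 bp contig filter can be expressed in plain awk. A minimal, self-contained sketch (generates a toy FASTA; a real submission would additionally need the FCS-GX screening noted above):

```shell
# Toy FASTA: one contig below and one above the 200 bp NCBI minimum
{ printf '>short_contig\nACGTACGT\n'
  printf '>long_contig\n'; printf 'A%.0s' $(seq 1 250); printf '\n'
} > contigs.fasta

# Drop records shorter than 200 bp (handles multi-line sequences)
awk -v min=200 '
  /^>/ { if (seq != "" && length(seq) >= min) print hdr "\n" seq
         hdr = $0; seq = ""; next }
  { seq = seq $0 }
  END { if (seq != "" && length(seq) >= min) print hdr "\n" seq }
' contigs.fasta > contigs_ncbi_ready.fasta
grep '^>' contigs_ncbi_ready.fasta
```

In practice a dedicated tool such as seqkit (`seqkit seq -m 200`) does the same job with less ceremony; the awk version just shows the logic.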
archaea_analysis/
├── logs/
│ └── pipeline_YYYYMMDD_HHMMSS.log
├── results/
│ ├── qc/
│ │ ├── sample_R1_clean.fq.gz
│ │ ├── sample_R2_clean.fq.gz
│ │ └── sample_fastp.html
│ ├── assembly/
│ │ ├── sample_contigs.fasta
│ │ └── quast_results/
│ ├── binning/
│ │ ├── semibin2_results/output_bins/
│ │ ├── metabat2_results/
│ │ └── checkm2_results/quality_report.tsv
│ ├── annotation/
│ │ ├── gtdbtk_results/ar53.summary.tsv
│ │ ├── dram_results/
│ │ ├── dram_distillation/product.html
│ │ └── antismash_results/
│ ├── phylogeny/
│ │ ├── sample_tree.treefile
│ │ └── sample_tree.svg
│ ├── submission/
│ │ ├── *_ncbi_ready.fasta
│ │ └── SUBMISSION_METADATA_TEMPLATE.txt
│ └── sample_ANALYSIS_REPORT.txt
└── temp/
└── [intermediate files]
# Minimal run with defaults from config.yaml
./archaea_pipeline.sh -1 reads_R1.fq.gz -2 reads_R2.fq.gz

# Full example (hydrothermal vent sample on an HPC node)
./archaea_pipeline.sh \
-1 /data/hydrothermal_R1.fq.gz \
-2 /data/hydrothermal_R2.fq.gz \
-s HydroVent_Deep_Sea_01 \
-t 128 \
-m 1000 \
-w /mnt/hpc_storage/analysis

# For long runs, use a screen session
screen -S archaea_analysis
mamba activate archaea_env
./archaea_pipeline.sh -1 R1.fq.gz -2 R2.fq.gz -s Sample_001 -t 64 -m 512
# Detach: Ctrl+A, then D
# Reattach: screen -r archaea_analysis

| Stage | Tool | Version | Reference | Link |
|---|---|---|---|---|
| QC | fastp | v0.23.4 | Chen et al. (2023) | GitHub |
| Assembly | MetaSPAdes | v3.15.5 | Nurk et al. (2017) | SourceForge |
| QC | QUAST | v5.2+ | Gurevich et al. (2013) | SourceForge |
| Mapping | Bowtie2 | v2.5+ | Langmead & Salzberg (2012) | SourceForge |
| Binning | SemiBin2 | v1.4+ | Pan et al. (2023) | GitHub |
| Binning | MetaBAT2 | v2.16+ | Kang et al. (2019) | BitBucket |
| QC | CheckM2 | v1.0+ | Chklovski et al. (2023) | GitHub |
| Taxonomy | GTDB-Tk | v2.3+ | Rinke et al. (2021) | GitHub |
| Annotation | DRAM | v1.3+ | Shaffer et al. (2020) | GitHub |
| BGCs | antiSMASH | v7.0+ | Blin et al. (2023) | Web |
| Phylogeny | IQ-TREE2 | v2.3+ | Minh et al. (2020) | GitHub |
If you use this pipeline in your research, please cite:
@software{archaea_pipeline_2025,
  author = {Alex Prima},
  title  = {Archaea Genomics Pipeline: A workflow for extreme environment metagenomics},
  year   = {2025},
  url    = {https://github.com/axp-knickei/X-ARCH},
  doi    = {10.XXXX/zenodo.XXXXXXX} % Optional: add Zenodo DOI if available
}

Also cite the individual tools (see the References section below).
mamba activate archaea_env
# Re-run setup if environment was not properly created
./setup_environment.sh

- Reduce the -m parameter to match available RAM
- Reduce -t (threads) to free up memory
- Example: ./archaea_pipeline.sh ... -m 250 -t 32
- First run downloads the GTDB database (~27 GB)
- Check internet connection and disk space
- Manual database download: gtdbtk download-db --release 220
- Download models manually: checkm2 database --download --path /path/to/models
- Set environment variable: export CHECKM2_DB=/path/to/models
- Check logs/pipeline_*.log for detailed error messages
- Ensure input files are not corrupted: gunzip -t reads_R1.fq.gz
- Verify disk space: df -h
For your manuscript, include:
The quality of raw sequencing reads was assessed and trimmed using fastp (v0.23.4)
with parameters: [specify your parameters]. Assembly was performed with MetaSPAdes
(v3.15.5) with a maximum memory limit of [X] GB. Metagenome-assembled genomes (MAGs)
were recovered using SemiBin2 (v1.4+) and MetaBAT2 (v2.16+), with genome quality
assessed using CheckM2 (v1.0+). Taxonomy was assigned using GTDB-Tk (v2.3+) against
the GTDB database (Release 220). Functional annotation was performed with DRAM (v1.3+),
and biosynthetic gene clusters were identified using antiSMASH (v7.0+). Phylogenomic
reconstruction was conducted using IQ-TREE2 (v2.3+) with 1000 ultrafast bootstraps.
We welcome contributions! Please:
- Fork this repository
- Create a feature branch (git checkout -b feature/improvement)
- Commit your changes (git commit -am 'Add improvement')
- Push to the branch (git push origin feature/improvement)
- Open a Pull Request
This project is licensed under the MIT License – see the LICENSE file for details.
- GitHub Issues: Report bugs or request features
- Discussions: Start a discussion via GitHub Discussions
- Email: alex.prima@tu-dortmund.de
- Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5), 824–834.
- Kang, D. D., Li, F., Kirton, E. S., et al. (2019). MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenomic data. PeerJ, 7, e7359.
- Chklovski, A., Parks, D. H., Woodcroft, B. J., & Tyson, G. W. (2023). CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nature Methods, 20, 1203–1212.
- Rinke, C., Chuvochina, M., Mussig, A. J., et al. (2021). A standardized archaeal taxonomy for the Genome Taxonomy Database. Nature Microbiology, 6, 946–959.
- Shaffer, M., Borton, M. A., McGivern, B. B., et al. (2020). DRAM for distilled and refined annotation of metabolism. Nucleic Acids Research, 48(15), 8883–8894.
- Blin, K., Shaw, S., Kautsar, S. A., et al. (2023). antiSMASH 7.0: New and improved predictions of biosynthetic gene clusters. Nucleic Acids Research, 51(W1), W46–W50.
- Minh, B. Q., Schmidt, H. A., Chernomor, O., et al. (2020). IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Molecular Biology and Evolution, 37(5), 1530–1534.
This pipeline was developed with inspiration from:
- NMDC Metagenome Assembled Genome Workflow
- Anvi'o Metagenomics Workflows
- Genome Taxonomy Database (GTDB)
Last Updated: December 2025
Maintained by: Alex Prima (Universitas Brawijaya)
Status: Active
