A Snakemake workflow for the analysis of bacterial riboseq data.
The usage of this workflow is described in the Snakemake Workflow Catalog.
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and its DOI (see above).
This workflow is a best-practice workflow for the analysis of ribosome footprint sequencing (Ribo-Seq) data. The workflow is built using Snakemake and consists of the following steps:
- Obtain genome database in fasta and gff format (python, NCBI Datasets)
  - Using automatic download from NCBI with a RefSeq ID
  - Using user-supplied files
- Check quality of input sequencing data (FastQC)
- Cut adapters and filter by length and/or sequencing quality score (cutadapt)
- Deduplicate reads by unique molecular identifier (UMI, umi_tools)
- Map reads to the reference genome (STAR aligner)
- Sort and index for aligned seq data (samtools)
- Filter reads by feature type (bedtools)
- Generate summary report for all processing steps (MultiQC)
- Shift ribo-seq reads according to the ribosome's P-site alignment (R, ORFik)
- Calculate basic gene-wise statistics such as RPKM (R, ORFik)
- Return report as HTML and PDF files (R markdown, weasyprint)
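As a reminder of what the RPKM statistic in the step list means, here is a minimal Python sketch of the standard definition (reads per kilobase of gene per million mapped reads). The workflow itself computes these statistics in R with ORFik; the function name `rpkm` and the example numbers below are hypothetical and only illustrate the arithmetic.

```python
def rpkm(read_count, gene_length_nt, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads (standard definition)."""
    if gene_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length in kb) / (library size in millions)
    return read_count * 1e9 / (gene_length_nt * total_mapped_reads)


# hypothetical example: 500 reads on a 1000 nt gene, 2 million mapped reads in total
example = rpkm(500, 1000, 2_000_000)
print(example)  # 250.0
```

Normalizing by both gene length and library size makes values comparable between genes and between samples of different sequencing depth.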
If you want to contribute, report issues, or suggest features, please get in touch on GitHub.
Step 1: Clone this repository
git clone https://github.com/MPUSP/snakemake-bacterial-riboseq.git
cd snakemake-bacterial-riboseq
Step 2: Install dependencies
It is recommended to install snakemake and run the workflow with conda, mamba or micromamba.
# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba
Step 3: Create snakemake environment
This step creates a new conda environment called snakemake-bacterial-riboseq.
# create new environment with dependencies & activate it
mamba create -c conda-forge -c bioconda -n snakemake-bacterial-riboseq snakemake pandas
conda activate snakemake-bacterial-riboseq
Important note:
All other dependencies for the workflow are automatically pulled as conda environments by snakemake when running the workflow with the --use-conda parameter (recommended).
An NCBI RefSeq ID, e.g. GCF_000006945.2. Find your genome assembly and corresponding ID on NCBI genomes. Alternatively, use a custom pair of *.fasta file and *.gff file that describe the genome of choice.
Important requirements when using custom *.fasta and *.gff files:
- *.gff genome annotation must have the same chromosome/region name as the *.fasta file (example: NC_003197.2)
- *.gff genome annotation must have gene and CDS type annotation that is automatically parsed to extract transcripts
- all chromosomes/regions in the *.gff genome annotation must be present in the *.fasta sequence
- but not all sequences in the *.fasta file need to have annotated genes in the *.gff file
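The requirements above can be checked before starting a run. The following is a minimal sketch in plain Python, not part of the workflow: the function names (`fasta_ids`, `gff_summary`, `check_genome_pair`) are hypothetical, and the tiny in-line FASTA and GFF records are made up for illustration. It assumes the region name is the first whitespace-delimited token of each FASTA header and the first column of each GFF line.

```python
def fasta_ids(fasta_text):
    """Collect sequence names: the first token after '>' in each header line."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].split():
            ids.add(line[1:].split()[0])
    return ids


def gff_summary(gff_text):
    """Return the set of region names (column 1) and feature types (column 3)."""
    seqids, types = set(), set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a 9-column feature line
        seqids.add(fields[0])
        types.add(fields[2])
    return seqids, types


def check_genome_pair(fasta_text, gff_text):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    seqids, types = gff_summary(gff_text)
    missing = sorted(seqids - fasta_ids(fasta_text))
    if missing:
        problems.append(f"regions in GFF but not in FASTA: {missing}")
    for required in ("gene", "CDS"):
        if required not in types:
            problems.append(f"no '{required}' features in GFF")
    return problems


# made-up two-feature example using the region name from the text
fasta = ">NC_003197.2 example chromosome\nACGTACGTACGT\n>extra_plasmid\nACGT\n"
gff = (
    "##gff-version 3\n"
    "NC_003197.2\tRefSeq\tgene\t1\t9\t.\t+\t.\tID=gene-x\n"
    "NC_003197.2\tRefSeq\tCDS\t1\t9\t.\t+\t0\tID=cds-x;Parent=gene-x\n"
)
print(check_genome_pair(fasta, gff))  # []
```

Note that the extra FASTA record without annotation does not raise a problem, in line with the last requirement.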
Ribosome footprint sequencing data in *.fastq.gz format. The currently supported input data are single-end, strand-specific reads. Input data files are supplied via a mandatory table, whose location is indicated in the config.yml file (default: samples.tsv). The sample sheet has the following layout:
| sample | condition | replicate | lib_prep | data_folder | fq1 |
| --- | --- | --- | --- | --- | --- |
| RPF-RTP1 | RPF-RTP | 1 | McGlincy | data | RPF-RTP1_R1_001.fastq.gz |
| RPF-RTP2 | RPF-RTP | 2 | McGlincy | data | RPF-RTP2_R1_001.fastq.gz |
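A sample sheet with this layout can be sanity-checked with a few lines of Python before running the workflow. This is a sketch under assumptions, not the workflow's own validation: it assumes the file is tab-separated (suggested by the .tsv extension), that sample names should be unique, and that `fq1` entries end in .fastq.gz. The function name `read_sample_sheet` is hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "condition", "replicate", "lib_prep", "data_folder", "fq1"]


def read_sample_sheet(handle):
    """Parse a tab-separated sample sheet and run basic consistency checks."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    names = [row["sample"] for row in rows]
    if len(names) != len(set(names)):
        # assumption: each row describes a distinct sample
        raise ValueError("sample names must be unique")
    for row in rows:
        if not row["fq1"].endswith(".fastq.gz"):
            raise ValueError(f"{row['sample']}: fq1 is not a *.fastq.gz file")
    return rows


# the two example rows from the table, written as tab-separated text
sheet = io.StringIO(
    "sample\tcondition\treplicate\tlib_prep\tdata_folder\tfq1\n"
    "RPF-RTP1\tRPF-RTP\t1\tMcGlincy\tdata\tRPF-RTP1_R1_001.fastq.gz\n"
    "RPF-RTP2\tRPF-RTP\t2\tMcGlincy\tdata\tRPF-RTP2_R1_001.fastq.gz\n"
)
rows = read_sample_sheet(sheet)
print(len(rows), rows[0]["sample"])  # 2 RPF-RTP1
```

For a file on disk, pass an open handle instead, for example `open("samples.tsv", newline="")`.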
Some configuration parameters of the pipeline may be specific for your data and library preparation protocol. The options should be adjusted in the config.yml file. For example:
- Minimum and maximum read length after adapter removal (see option cutadapt: default). Here, the test data has a minimum read length of 15 + 7 = 22 (2 nt on 5' end + 5 nt on 3' end), and a maximum of 45 + 7 = 52.
- Unique molecular identifiers (UMIs). For example, the protocol by McGlincy & Ingolia, 2017 creates a UMI that is located on both the 5'-end (2 nt) and the 3'-end (5 nt). These UMIs are extracted with umi_tools (see options umi_extraction: method and pattern).
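The length arithmetic and the split UMI layout described above can be illustrated with a short Python sketch. In the workflow the extraction is done by umi_tools using the configured method and pattern; the code below is not that implementation, and the names `split_umi` and `length_window` are hypothetical. It only shows how a 2 nt 5' part and a 5 nt 3' part add 7 nt to the footprint length.

```python
def split_umi(read, umi5=2, umi3=5):
    """Split a trimmed read into (UMI, insert) for a UMI on both read ends."""
    if len(read) <= umi5 + umi3:
        raise ValueError("read is too short to contain UMI and insert")
    end = len(read) - umi3
    umi = read[:umi5] + read[end:]  # 5' part followed by 3' part
    insert = read[umi5:end]         # the ribosome footprint itself
    return umi, insert


def length_window(min_insert=15, max_insert=45, umi5=2, umi3=5):
    """Read length limits after adapter removal, with the UMI still attached."""
    extra = umi5 + umi3
    return min_insert + extra, max_insert + extra


# made-up read: 2 nt UMI + 20 nt footprint + 5 nt UMI
read = "AC" + "G" * 20 + "TTTTT"
print(split_umi(read))   # ('ACTTTTT', 'GGGGGGGGGGGGGGGGGGGG')
print(length_window())   # (22, 52)
```

For a library protocol without UMIs, or with a different layout, the length limits in the config would need to be adjusted accordingly.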
To run the workflow from the command line, change to the workflow directory.
cd path/to/snakemake-bacterial-riboseq
Adjust the global and module-specific options in the default config file config/config.yml.
Before running the entire workflow, you can perform a dry run using:
snakemake --dry-run
To run the complete workflow with test files using conda, execute the following command. Specifying the number of compute cores is mandatory.
snakemake --cores 10 --use-conda --directory .test
- Dr. Rina Ahmed-Begrich
  - Affiliation: Max-Planck-Unit for the Science of Pathogens (MPUSP), Berlin, Germany
  - ORCID profile: https://orcid.org/0000-0002-0656-1795
- Dr. Michael Jahn
  - Affiliation: Max-Planck-Unit for the Science of Pathogens (MPUSP), Berlin, Germany
  - ORCID profile: https://orcid.org/0000-0002-3913-153X
  - GitHub page: https://github.com/m-jahn
Visit the MPUSP GitHub page at https://github.com/MPUSP for more info on this workflow and other projects.
- Essential tools are linked in the top section of this document
- The sequencing library preparation is based on the publication:
McGlincy, N. J., & Ingolia, N. T. Transcriptome-wide measurement of translation by ribosome profiling. Methods, 126, 112–129, 2017. https://doi.org/10.1016/J.YMETH.2017.05.028.