snakemake-bacterial-riboseq

A Snakemake workflow for the analysis of bacterial riboseq data.

Usage

The usage of this workflow is described in the Snakemake Workflow Catalog.

If you use this workflow in a paper, please credit the authors by citing the URL of this (original) repository and its DOI (see above).

Workflow overview

This is a best-practice workflow for the analysis of ribosome footprint sequencing (Ribo-Seq) data. It is built with Snakemake and consists of the following steps:

1. Obtain genome database in fasta and gff format (python, NCBI Datasets)
   1. Using automatic download from NCBI with a RefSeq ID
   2. Using user-supplied files
2. Check quality of input sequencing data (FastQC)
3. Cut adapters and filter by length and/or sequencing quality score (cutadapt)
4. Deduplicate reads by unique molecular identifier (UMI, umi_tools)
5. Map reads to the reference genome (STAR aligner)
6. Sort and index aligned sequencing data (samtools)
7. Filter reads by feature type (bedtools)
8. Generate summary report for all processing steps (MultiQC)
9. Shift Ribo-Seq reads according to the ribosome's P-site alignment (R, ORFik)
10. Calculate basic gene-wise statistics such as RPKM (R, ORFik)
11. Return report as HTML and PDF files (R markdown, weasyprint)
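
For readers unfamiliar with the RPKM statistic mentioned in the steps above, the following is a minimal Python sketch of the standard definition (reads per kilobase of feature per million mapped reads). It is only an illustration: the workflow itself computes these statistics in R with ORFik, and the function name and example numbers here are hypothetical.

```python
def rpkm(read_count: int, gene_length_nt: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of feature per million mapped reads.

    Standard textbook definition; not taken from the workflow's R code.
    """
    if gene_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million reads)
    return read_count * 1e9 / (gene_length_nt * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 1000 nt gene, 2 million mapped reads in total
print(rpkm(500, 1000, 2_000_000))  # 250.0
```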

If you want to contribute, report issues, or suggest features, please get in touch on GitHub.

Installation

Step 1: Clone this repository

git clone https://github.com/MPUSP/snakemake-bacterial-riboseq.git
cd snakemake-bacterial-riboseq

Step 2: Install dependencies

It is recommended to install snakemake and run the workflow with conda, mamba or micromamba.

# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba

Step 3: Create snakemake environment

This step creates a new conda environment called snakemake-bacterial-riboseq.

# create new environment with dependencies & activate it
mamba create -c conda-forge -c bioconda -n snakemake-bacterial-riboseq snakemake pandas
conda activate snakemake-bacterial-riboseq

Additional tools

Important note:

Snakemake automatically installs all other workflow dependencies as conda environments when the workflow is run with the --use-conda parameter (recommended).

Running the workflow

Input data

Reference genome

The reference genome is specified by an NCBI RefSeq ID, e.g. GCF_000006945.2. Find your genome assembly and the corresponding ID on NCBI genomes. Alternatively, supply a custom pair of *.fasta and *.gff files that describe the genome of choice.

Important requirements when using custom *.fasta and *.gff files:

- *.gff genome annotation must have the same chromosome/region name as the *.fasta file (example: NC_003197.2)
- *.gff genome annotation must have gene and CDS type annotation that is automatically parsed to extract transcripts
- all chromosomes/regions in the *.gff genome annotation must be present in the *.fasta sequence
- but not all sequences in the *.fasta file need to have annotated genes in the *.gff file
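
The requirements above can be checked before starting a run. The following is a minimal Python sketch, not part of the workflow: the function names are hypothetical, and it assumes a plain-text GFF3 file with nine tab-separated columns and FASTA headers whose first word is the sequence ID.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def gff_summary(lines):
    """Return the sets of sequence IDs (column 1) and feature types (column 3)."""
    seqids, types = set(), set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a feature line
        seqids.add(cols[0])
        types.add(cols[2])
    return seqids, types


def check_genome_pair(fasta_lines, gff_lines):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seqids, types = gff_summary(gff_lines)
    missing = sorted(seqids - fasta_ids(fasta_lines))
    if missing:
        problems.append(f"regions in gff but not in fasta: {missing}")
    for required in ("gene", "CDS"):
        if required not in types:
            problems.append(f"no '{required}' features found in gff")
    return problems


# Tiny in-memory example; real use would pass open file handles instead
fasta = [">NC_003197.2 example chromosome", "ATGC"]
gff = [
    "##gff-version 3",
    "NC_003197.2\tRefSeq\tgene\t1\t4\t.\t+\t.\tID=gene-1",
    "NC_003197.2\tRefSeq\tCDS\t1\t4\t.\t+\t0\tID=cds-1;Parent=gene-1",
]
print(check_genome_pair(fasta, gff))
```

Sequences that appear only in the *.fasta file are deliberately not reported, since unannotated sequences are allowed.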

Read data

Ribosome footprint sequencing data in *.fastq.gz format. The currently supported input data are single-end, strand-specific reads. Input data files are supplied via a mandatory table, whose location is indicated in the config.yml file (default: samples.tsv). The sample sheet has the following layout:

sample	condition	replicate	lib_prep	data_folder	fq1
RPF-RTP1	RPF-RTP	1	McGlincy	data	RPF-RTP1_R1_001.fastq.gz
RPF-RTP2	RPF-RTP	2	McGlincy	data	RPF-RTP2_R1_001.fastq.gz
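
A sample sheet with this layout can be sanity-checked before a run. The sketch below is a hypothetical helper, not part of the workflow. It assumes the sheet is tab-separated (as the default name samples.tsv suggests) and that sample names should be unique, which is an assumption rather than a documented rule.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "condition", "replicate", "lib_prep", "data_folder", "fq1"]


def read_sample_sheet(handle):
    """Parse a tab-separated sample sheet and run basic consistency checks."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    names = [row["sample"] for row in rows]
    if len(names) != len(set(names)):
        raise ValueError("sample names are not unique")  # assumed requirement
    for row in rows:
        if not row["fq1"].endswith(".fastq.gz"):
            raise ValueError(f"{row['sample']}: fq1 is not a *.fastq.gz file")
    return rows


# The two example rows from the table above, as an in-memory file
sheet = io.StringIO(
    "sample\tcondition\treplicate\tlib_prep\tdata_folder\tfq1\n"
    "RPF-RTP1\tRPF-RTP\t1\tMcGlincy\tdata\tRPF-RTP1_R1_001.fastq.gz\n"
    "RPF-RTP2\tRPF-RTP\t2\tMcGlincy\tdata\tRPF-RTP2_R1_001.fastq.gz\n"
)
samples = read_sample_sheet(sheet)
print([s["sample"] for s in samples])  # ['RPF-RTP1', 'RPF-RTP2']
```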

Some configuration parameters of the pipeline may be specific to your data and library preparation protocol. Adjust these options in the config.yml file. For example:

- Minimum and maximum read length after adapter removal (see option cutadapt: default). Here, the test data has a minimum read length of 15 + 7 = 22 nt and a maximum of 45 + 7 = 52 nt, where the additional 7 nt are the UMI (2 nt on the 5'-end + 5 nt on the 3'-end).
- Unique molecular identifiers (UMIs). For example, the protocol by McGlincy & Ingolia, 2017 creates a UMI that is located on both the 5'-end (2 nt) and the 3'-end (5 nt). These UMIs are extracted with umi_tools (see options umi_extraction: method and pattern).
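
To make the read length arithmetic and the split UMI layout concrete, here is a minimal Python sketch. It only illustrates the idea with the standard re module: the regular expression, its group names, and the function names are hypothetical and are not the pattern configured for umi_tools in this workflow.

```python
import re

# 2 nt UMI at the 5'-end, footprint in the middle, 5 nt UMI at the 3'-end
# (illustrative pattern; group names are made up for this example)
UMI_PATTERN = re.compile(r"^(?P<umi_5p>.{2})(?P<insert>.+)(?P<umi_3p>.{5})$")


def split_umi(read: str):
    """Return (combined UMI, footprint sequence) for an adapter-trimmed read."""
    match = UMI_PATTERN.match(read)
    if match is None:
        raise ValueError("read is too short to contain both UMI parts")
    return match["umi_5p"] + match["umi_3p"], match["insert"]


def length_window(min_insert: int, max_insert: int, umi_len: int = 2 + 5):
    """Read length limits after adapter removal, with the UMI still attached."""
    return min_insert + umi_len, max_insert + umi_len


print(length_window(15, 45))  # (22, 52)
print(split_umi("AC" + "GGGTTTAAACCCGGG" + "TTGCA"))  # ('ACTTGCA', 'GGGTTTAAACCCGGG')
```

The length filter is applied while the 7 nt of UMI are still part of the read, which is why the limits are 7 nt larger than the footprint lengths of interest.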

Execution

To run the workflow from the command line, change the working directory to the cloned repository.

cd path/to/snakemake-bacterial-riboseq

Adjust the global and module-specific options in the default config file config/config.yml. Before running the entire workflow, you can perform a dry run using:

snakemake --dry-run

To run the complete workflow with test files using conda, execute the following command. Specifying the number of compute cores is mandatory.

snakemake --cores 10 --use-conda --directory .test

Parameters

Authors

Dr. Rina Ahmed-Begrich
- Affiliation: Max-Planck-Unit for the Science of Pathogens (MPUSP), Berlin, Germany
- ORCID profile: https://orcid.org/0000-0002-0656-1795
Dr. Michael Jahn
- Affiliation: Max-Planck-Unit for the Science of Pathogens (MPUSP), Berlin, Germany
- ORCID profile: https://orcid.org/0000-0002-3913-153X
- github page: https://github.com/m-jahn

Visit the MPUSP github page at https://github.com/MPUSP for more info on this workflow and other projects.

References

Essential tools are linked in the top section of this document
The sequencing library preparation is based on the publication:

McGlincy, N. J., & Ingolia, N. T. Transcriptome-wide measurement of translation by ribosome profiling. Methods, 126, 112–129, 2017. https://doi.org/10.1016/J.YMETH.2017.05.028.
