Merge pull request #26 from ghga-de/25-alternate-annovar-with-vep
VEP is added as an alternative to ANNOVAR
kubranarci committed Dec 7, 2023
2 parents 470575c + 0104b01 commit 1829f29
Showing 26 changed files with 686 additions and 93 deletions.
34 changes: 28 additions & 6 deletions README.md
@@ -14,7 +14,9 @@

It calls SNVs from both germline and somatic samples using bcftools mpileup, then compares and filters out germline-specific ones with samtools mpileup compare. This workflow uses various annotations from publicly available databases such as 1000G variants, dbSNP, and gnomAD. The functional effect of the mutations is annotated using Annovar, and the variants are assessed for their consequence and split into somatic and non-somatic calls. In addition, extensive QC plots support the prioritization of functionally relevant somatic mutations.

For now, this workflow is only optimized to work on the ODCF cluster. The config file (conf/dkfz_cluster.config) can be used as an example. The Annotation, DeepAnnotation, and Filter steps are optional and can be turned off using the [runsnvAnnotation, runSNVDeepAnnotation, runSNVVCFFilter] parameters, respectively.
This workflow is optimized to work on the ODCF cluster. The config file (conf/dkfz_cluster.config) can be used as an example. The Annotation, DeepAnnotation, and Filter steps are optional and can be turned off using the [runsnvAnnotation, runSNVDeepAnnotation, runSNVVCFFilter] parameters, respectively.

Users outside of the ODCF cluster should prepare annotation files accordingly (see seq2_testdata_snv/annotations for an example) to use the full functionality of this workflow.
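
As an illustration, the optional steps might be toggled on the command line like this (a sketch: only the `runSNV*` parameters come from this README; the repository handle and profile are assumptions):

```console
# Hypothetical invocation; adjust the pipeline handle and profile to your setup.
nextflow run ghga-de/nf-snvcalling \
    -profile singularity \
    --input samplesheet.csv \
    --runSNVDeepAnnotation false \
    --runSNVVCFFilter false
```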

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

@@ -33,9 +35,13 @@ The pipeline has main steps: SNV calling using mpileup, basic annotations, deep

In-house scripts to annotate with several databases like gnomAD and dbSNP.

--annotation_tool: annovar or vep
ANNOVAR ([`Annovar`](https://annovar.openbioinformatics.org/en/latest/))
: annotate_variation.pl is used to annotate variants. The tool classifies variants into categories such as intergenic, intronic, nonsynonymous SNV, frameshift deletion, or large-scale duplication regions.

ENSEMBL VEP ([`ENSEMBL VEP`](https://www.ensembl.org/info/docs/tools/vep/index.html))
: can be used as an alternative to ANNOVAR. Gene annotations will be extracted.

Reliability and confidence annotations: an optional step that checks mappability, HiSeq depth, self-chain, and repeat regions to assess the reliability and confidence of the calls.

Sequence and sequencing-based error plots: provide insights into predicted somatic SNVs.
@@ -64,12 +70,28 @@ The pipeline has main steps: SNV calling using mpileup, basic annotations, deep

2. Install any of [`Docker`](https://docs.docker.com/engine/installation/) or [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/) (you can follow [this tutorial](https://singularity-tutorial.github.io/01-installation/))

3. Download [Annovar](https://annovar.openbioinformatics.org/en/latest/user-guide/download/) and set up a suitable annotation table directory to perform annotation. Example:
3. To use Annovar:

Download [Annovar](https://annovar.openbioinformatics.org/en/latest/user-guide/download/) and set up a suitable annotation table directory to perform annotation. Example:

```console
annotate_variation.pl -downdb wgEncodeGencodeBasicV19 humandb/ -build hg19
```
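
Once the annotation tables are in place, the corresponding parameters can be passed to the pipeline. A sketch using parameter names that appear in conf/test.config (the handle, paths, and values are placeholders):

```console
# Hypothetical run; --annovar_path must point at your local Annovar installation.
nextflow run ghga-de/nf-snvcalling \
    --annotation_tool annovar \
    --annovar_path /path/to/annovar \
    --buildver hg19 \
    --dbtype wgEncodeGencodeBasicV19
```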

Gene annotation is also possible with the ENSEMBL VEP tool. For test purposes only, it can be used online; for large analyses, it is recommended to either download the cache file manually or use the --download_cache flag.

Follow the documentation [here](https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#cache)

Example:

Download [cache](https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/)

```console
cd $HOME/.vep
curl -O https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_vep_110_GRCh38.tar.gz
tar xzf homo_sapiens_vep_110_GRCh38.tar.gz
```
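
To sanity-check that the cache works offline before a large run, VEP can also be invoked directly; an illustrative command (input.vcf is a placeholder):

```console
# Uses the cache downloaded above; all flags are standard VEP options.
vep --offline --cache --dir_cache $HOME/.vep \
    --species homo_sapiens --cache_version 110 --assembly GRCh38 \
    --vcf -i input.vcf -o annotated.vcf
```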

4. Download the pipeline and test it on a minimal dataset with a single command:

@@ -162,7 +184,7 @@ The nf-snvcalling pipeline comes with documentation about the pipeline [usage](h

## Credits

nf-snvcalling was originally translated from the Roddy-based pipeline by Kuebra Narci kuebra.narci@dkfz-heidelberg.de.
nf-snvcalling was originally translated from the Roddy-based pipeline by Kuebra Narci [kuebra.narci@dkfz-heidelberg.de](mailto:kuebra.narci@dkfz.de).

The pipeline was originally written in the workflow management language Roddy; see the [original GitHub page](https://github.com/DKFZ-ODCF/SNVCallingWorkflow).

@@ -171,11 +193,11 @@ The SNV workflow was in the pan-cancer analysis of whole genomes (PCAWG) and can
Pan-cancer analysis of whole genomes.
The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium.
Nature volume 578, pages 82–93 (2020).
DOI 10.1038/s41586-020-1969-6:
DOI 10.1038/s41586-020-1969-6

**TODO**
We thank the following people for their extensive assistance in the development of this pipeline:

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
Nagarajan Paramasivam (@NagaComBio) [n.paramasivam@dkfz.de](mailto:n.paramasivam@dkfz.de)

## Contributions and Support

2 changes: 1 addition & 1 deletion bin/check_samplesheet.py
@@ -93,7 +93,7 @@ def check_samplesheet(file_in, file_out):
if sample and tumor and control: ## iscontrol true
sample_info = [sample, tumor, tumor_index, control, control_index , "1"]
elif sample and tumor and not control: ## iscontrol false
sample_info = [sample, tumor,tumor_index,"dummy.bam","dummy.bai","0"]
sample_info = [sample, tumor,tumor_index,"","","0"]
else:
print_error("Invalid combination of columns provided!", "Line", line)
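
For illustration, a tumor-only run corresponds to empty control columns in the samplesheet; a hypothetical example (the header names are assumptions inferred from the variables above, not taken from this diff):

```console
# Hypothetical samplesheet; row 2 is tumor/control, row 3 is tumor-only.
sample,tumor,tumor_index,control,control_index
patient1,patient1_tumor.bam,patient1_tumor.bai,patient1_control.bam,patient1_control.bai
patient2,patient2_tumor.bam,patient2_tumor.bai,,
```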

42 changes: 26 additions & 16 deletions bin/final_plots_and_json.sh
@@ -1,6 +1,6 @@
set -euo pipefail

usage() { echo "Usage: $0 [-p prefix] [-i filenameSomaticSnvs] [-s filenameSomaticSnvsIndbSNP] [-t mincov] [-v minconfidencescore] [-b thatreshold] [-r rerun]" 1>&2; exit 1; }
usage() { echo "Usage: $0 [-p prefix] [-i filenameSomaticSnvs] [-s filenameSomaticSnvsIndbSNP] [-t mincov] [-v minconfidencescore] [-b thatreshold] [-r rerun] [-j json_report]" 1>&2; exit 1; }

while [[ $# -gt 0 ]]
do
@@ -41,6 +41,11 @@ do
shift # past argument
shift # past value
;;
-j)
json_report=$2
shift
shift
;;
esac
done

@@ -87,18 +92,23 @@ echo $THA_SCORE
[[ $(echo "${THA_SCORE} > ${THA_SCORE_THRESHOLD}" | bc -l) -eq 1 ]] && echo -e "THA score\t${THA_SCORE}\n" >${prefix}_is_THA_affected.txt

echo "check 2"
# determine fraction of SNVs called as "synonymous SNV" among all exonic SNVs (QC value)
EXONIC_CLASSIFICATION_COLUMN_INDEX=`cat ${filenameSomaticSnvs} | grep -v '^##' | grep '^#' | perl -ne 'use List::Util qw(first); chomp; my @colnames = split(/\t/, $_); my $columnIndex = first { $colnames[$_] eq "EXONIC_CLASSIFICATION"} 0..$#colnames; $columnIndex += 1; print "$columnIndex\n";'`
export EXONIC_CLASSIFICATION_COLUMN_INDEX=$((${EXONIC_CLASSIFICATION_COLUMN_INDEX}-1))
ANNOVAR_FUNCTION_COLUMN_INDEX=`cat ${filenameSomaticSnvs} | grep -v '^##' | grep '^#' | perl -ne 'use List::Util qw(first); chomp; my @colnames = split(/\t/, $_); my $columnIndex = first { $colnames[$_] eq "ANNOVAR_FUNCTION"} 0..$#colnames; $columnIndex += 1; print "$columnIndex\n";'`
export ANNOVAR_FUNCTION_COLUMN_INDEX=$((${ANNOVAR_FUNCTION_COLUMN_INDEX}-1))
SYNONYMOUS_RATIO=`grep -v '^#' ${filenameSomaticSnvs} | perl -F'\t' -ae 'BEGIN { my $total=0; my $synonymous=0; } if ($F[$ENV{"ANNOVAR_FUNCTION_COLUMN_INDEX"}] eq "exonic" ) {$total++; if ($F[$ENV{"EXONIC_CLASSIFICATION_COLUMN_INDEX"}] eq "synonymous SNV") {$synonymous++;}} END { print $synonymous/$total; }'`
echo "check 3"
echo -e "{" >${prefix}_QC_values${rerun}.json
echo -e "\t\"snvnum\": ${snvnum:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"snvindbSNP\": ${snvindbSNP:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"snvInDbsnpRatio\": ${SNV_IN_DBSNP_RATIO:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"synonymousRatio\": ${SYNONYMOUS_RATIO:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"thaScore\": ${THA_SCORE:-NA}" >>${prefix}_QC_values${rerun}.json
echo -e "}" >>${prefix}_QC_values${rerun}.json
echo "done"

if [ "$json_report" = true ]; then
# determine fraction of SNVs called as "synonymous SNV" among all exonic SNVs (QC value)
EXONIC_CLASSIFICATION_COLUMN_INDEX=`cat ${filenameSomaticSnvs} | grep -v '^##' | grep '^#' | perl -ne 'use List::Util qw(first); chomp; my @colnames = split(/\t/, $_); my $columnIndex = first { $colnames[$_] eq "EXONIC_CLASSIFICATION"} 0..$#colnames; $columnIndex += 1; print "$columnIndex\n";'`
export EXONIC_CLASSIFICATION_COLUMN_INDEX=$((${EXONIC_CLASSIFICATION_COLUMN_INDEX}-1))
ANNOVAR_FUNCTION_COLUMN_INDEX=`cat ${filenameSomaticSnvs} | grep -v '^##' | grep '^#' | perl -ne 'use List::Util qw(first); chomp; my @colnames = split(/\t/, $_); my $columnIndex = first { $colnames[$_] eq "ANNOVAR_FUNCTION"} 0..$#colnames; $columnIndex += 1; print "$columnIndex\n";'`
export ANNOVAR_FUNCTION_COLUMN_INDEX=$((${ANNOVAR_FUNCTION_COLUMN_INDEX}-1))
SYNONYMOUS_RATIO=`grep -v '^#' ${filenameSomaticSnvs} | perl -F'\t' -ae 'BEGIN { my $total=0; my $synonymous=0; } if ($F[$ENV{"ANNOVAR_FUNCTION_COLUMN_INDEX"}] eq "exonic" ) {$total++; if ($F[$ENV{"EXONIC_CLASSIFICATION_COLUMN_INDEX"}] eq "synonymous SNV") {$synonymous++;}} END { print $synonymous/$total; }'`
echo "JSON REPORT WILL BE WRITTEN"
echo -e "{" >${prefix}_QC_values${rerun}.json
echo -e "\t\"snvnum\": ${snvnum:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"snvindbSNP\": ${snvindbSNP:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"snvInDbsnpRatio\": ${SNV_IN_DBSNP_RATIO:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"synonymousRatio\": ${SYNONYMOUS_RATIO:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"thaScore\": ${THA_SCORE:-NA}" >>${prefix}_QC_values${rerun}.json
echo -e "}" >>${prefix}_QC_values${rerun}.json
echo "done"
else
echo "JSON REPORT WONT BE WRITTEN"
fi
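
An illustrative invocation of the updated script (sample names and threshold values are placeholders; the flags follow the usage string above):

```console
# -j true enables the JSON QC report; -r '' runs without a rerun suffix.
bash final_plots_and_json.sh \
    -p sample1 \
    -i sample1_somatic_snvs.vcf \
    -s sample1_somatic_snvs_indbSNP.vcf \
    -t 10 -v 8 -b 0.15 -r '' -j true
```
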
17 changes: 17 additions & 0 deletions conf/modules.config
@@ -158,6 +158,23 @@ process {
mode: params.publish_dir_mode
]
}
withName: 'ENSEMBLVEP_DOWNLOAD' {
ext.args = { '--AUTO c --CONVERT --NO_BIOPERL --NO_HTSLIB --NO_TEST --NO_UPDATE' }
publishDir = [
mode: params.publish_dir_mode,
path: { params.outdir_cache ? "${params.outdir_cache}/": "${params.outdir}/cache/" }
]
}
withName: 'ENSEMBLVEP_VEP' {
//ext.args ='--everything --filter_common --per_gene --total_length --offline'
publishDir = [
[
mode: params.publish_dir_mode,
path: { "${params.outdir}/${meta.id}/" },
pattern: "*{gz,tbi,html}"
]
]
}
}
//
// Don't publish results for these processes
20 changes: 12 additions & 8 deletions conf/test.config
@@ -28,17 +28,17 @@ params {

// Post Processes
runcontigs = "NONE"
report_json = false

// Reference Files //
genome = "GRCh38"

// Annovar
buildver = "hg38"
dbtype = "wgEncodeGencodeCompV39"
segdupcol = "SEGDUP"
cytobandcol = "CYTOBAND"
geneannocols = '"ANNOVAR_FUNCTION,GENE,EXONIC_CLASSIFICATION,ANNOVAR_TRANSCRIPTS"'
annovar_path = "/Users/w620-admin/Desktop/Workflows/Annovar/annovar_Sept2022"
// Annotation with vep
annotation_tool = "vep"
vep_cache_version = 110
vep_genome = 'GRCh38'
vep_cache = null
download_cache = false // DO NOT Download annotation cache

k_genome ="${projectDir}/seq2_testdata_snv/annotations/kgenome_integrated_snvindels_v2a.GRCh38.27022019.sites.test.vcf.gz"
dbsnp_snv ="${projectDir}/seq2_testdata_snv/annotations/dbsnp_v151_GRCh38.SNV.sorted.test.vcf.gz"
@@ -90,9 +90,13 @@ process {
cpus = { check_max( 2 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
}
withName:'FILTER_PEOVERLAP|TRIPLET_PLOTTER'{
withName:'FILTER_PEOVERLAP|TRIPLET_PLOTTER|ANNOTATE_VCF|ANNOTATION_PIPES|CONFIDENCE_ANNOTATION|CONTEXT_FREQUENCIES|CONTEXT_PLOT|ERROR_PLOTS|FILTER_BY_CRIT|FLAG_BIAS|INTERMUTATION_DISTANCE|MUTATION_DISTANCE|PER_CHROM_PLOT|PLOT_BASESCORE_BIAS|PLOT_BASESCORE_DISTRIBUTION|PURITY_RELOADED|SNV_EXTRACTOR|ENSEMBLVEP_VEP|DBSNP_COUNTER'{
cpus = { check_max( 4 * task.attempt, 'cpus' ) }
memory = { check_max( 12.GB * task.attempt, 'memory' ) }
}
// using vep online is only recommended for test purposes for a minimal set of variants!
withName: 'ENSEMBLVEP_VEP' {
ext.args ='--per_gene --total_length'
}
}

28 changes: 24 additions & 4 deletions docs/usage.md
@@ -19,6 +19,7 @@ Main Switches:
- --runplots: Generate extra rainfall and MAF plots.
- --runpurest: Run purityReloaded script.
- --skip_multiqc: Skip MultiQC.
- --annotation_tool: Select either "annovar" or "vep"

Reference Files:

@@ -55,7 +56,26 @@ annotate_variation.pl -downdb wgEncodeGencodeBasicV19 humandb/ -build hg19
- --cytobandcol : "CYTOBAND_COL"
- --geneannocols : '"ANNOVAR_FUNCTION,GENE,EXONIC_CLASSIFICATION,ANNOVAR_TRANSCRIPTS"'

**3. SNV Reliability Options:**
**3. VEP Options:**

The ENSEMBL VEP tool requires a cache directory to work offline. The cache can either be downloaded manually or fetched by the workflow using the --download_cache true flag. The related parameters are listed below, and combined in an example after the list.

Example:

Download [cache](https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/)

```console
cd $HOME/.vep
curl -O https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_vep_110_GRCh38.tar.gz
tar xzf homo_sapiens_vep_110_GRCh38.tar.gz
```
- --outdir_cache : if --download_cache is true, the downloaded cache is saved into this directory for later reuse.
- --vep_cache : cache input directory; if this path is set, use --download_cache false.
- --download_cache : if true, downloads the cache directory through the workflow.
- --vep_cache_version : VEP cache version to use (example: 110)
- --vep_genome : VEP genome to use (example: 'GRCh38')
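
Taken together, an offline VEP run might be configured like this (a sketch; the repository handle is an assumption, the parameters are those listed above):

```console
# Hypothetical invocation using a pre-downloaded cache.
nextflow run ghga-de/nf-snvcalling \
    --annotation_tool vep \
    --vep_cache $HOME/.vep \
    --vep_cache_version 110 \
    --vep_genome GRCh38 \
    --download_cache false
```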

**4. SNV Reliability Options:**

- --repeat_masker : UCSC Repeat Masker file (bed.gz). Column Name added: REPEAT_MASKER
- --dac_blacklist : UCSC DAC Black List (bed.gz) (Optional). Column Name added: DAC_BLACKLIST
@@ -65,7 +85,7 @@
- --mapability_file : UCSC Mappability regions (bed.gz). Column Name added: MAPABILITY
- --simple_tandemrepeats: UCSC Simple tandem repeats (bed.gz). Column Name added: SIMPLE_TANDEMREPEATS

**4. Deep Annotation Options:**
**5. Deep Annotation Options:**

If --runSNVDeepAnnotation is true, at least one of the following files must be defined (with corresponding indexes):

@@ -82,7 +102,7 @@ If --runSNVDeepAnnotation is true, at least one of the following files must be d
- --phastconselem_file : UCSC Phast Cons Elements (bed.gz) (Optional). Column Name added: phastConsElem20bp
- --encode_tfbs_file : UCSC Encode TFBS (bed.gz) (Optional). Column Name added: ENCODE_TFBS

**5. Confidence Annotation Options:**
**6. Confidence Annotation Options:**

- --confidenceoptions : confidence options should be listed here (string)

@@ -136,7 +156,7 @@ You will need to create a samplesheet with information about the samples you wou

```console
--input '[path to samplesheet file]'
```

### Full samplesheet

13 changes: 9 additions & 4 deletions modules/local/annovar.nf
@@ -16,13 +16,13 @@ process ANNOVAR {
val(chrprefix)

output:
tuple val(meta),path('*.temp.vcf') , emit: vcf
tuple val(meta), path('*.log') , emit: log
tuple val(meta), path('*.temp.vcf.gz'),path('*.temp.vcf.gz.tbi') , emit: vcf
tuple val(meta), path('*.log') , emit: log
tuple val(meta), path('*_genomicSuperDups')
tuple val(meta), path('*_cytoBand')
tuple val(meta), path('*variant_function')
tuple val(meta), path('*exonic_variant_function')
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when
@@ -64,12 +64,17 @@ process ANNOVAR {
--chrSuffix="" \\
--reportColumns="1" --bChrPosEnd="2,7,8" > ${prefix}.temp.vcf
bgzip -c ${prefix}.temp.vcf > ${prefix}.temp.vcf.gz
tabix ${prefix}.temp.vcf.gz
cat <<-END_VERSIONS > versions.yml
"${task.process}":
annovar_table: ${annovar_table}
annovar_path: ${params.annovar_path}
annovar_buildver: ${params.buildver}
perl: \$(echo \$(perl --version 2>&1) | sed 's/.*v\\(.*\\)) built.*/\\1/')
gzip: \$(echo \$(gzip --version 2>&1) | sed 's/^.*gzip //; s/ .*\$//')
tabix: \$(echo \$(tabix -h 2>&1) | sed 's/^.*Version: //; s/ .*\$//')
END_VERSIONS
"""
}
7 changes: 4 additions & 3 deletions modules/local/json_report.nf
@@ -2,7 +2,7 @@
// THA calculation may end with NA
process JSON_REPORT {
tag "$meta.id"
label 'process_medium'
label 'process_single'

conda (params.enable_conda ? "" : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
@@ -13,7 +13,7 @@ process JSON_REPORT {

output:
tuple val(meta), path('*.txt') , emit: txt
tuple val(meta), path('*.json') , emit: json
tuple val(meta), path('*.json') , emit: json, optional: true
tuple val(meta), path('*.pdf') , emit: plot
path "versions.yml" , emit: versions

@@ -31,7 +31,8 @@
-t $params.min_cov \\
-v $params.min_confidence_score \\
-b $params.tha_score_threshold \\
-r ''
-r '' \\
-j $params.report_json
cat <<-END_VERSIONS > versions.yml
"${task.process}":
7 changes: 3 additions & 4 deletions modules/local/snv_reliability_pipe.nf
@@ -1,15 +1,14 @@
//snv_reliability_pipe

process SNV_RELIABILITY_PIPE {
tag "$meta.id"
label 'process_medium'
label 'process_single'

conda (params.enable_conda ? "" : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'docker://kubran/odcf_mpileupsnvcalling:v0':'kubran/odcf_mpileupsnvcalling:v0' }"

input:
tuple val(meta), file(ch_vcf)
tuple val(meta), file(ch_vcf), file(ch_vcf_i)
tuple file(repeatmasker), file(repeatmasker_i)
tuple file(dacblacklist), file(dacblacklist_i)
tuple file(dukeexcluded), file(dukeexcluded_i)
@@ -38,7 +37,7 @@ process SNV_RELIABILITY_PIPE {
].join(' ').trim()

"""
cat < $ch_vcf $pipe > ${prefix}.annotated.vcf
zcat < $ch_vcf $pipe > ${prefix}.annotated.vcf
cat <<-END_VERSIONS > versions.yml
"${task.process}":