Merge pull request #26 from ghga-de/25-alternate-annovar-with-vep
VEP is added as an alternative to ANNOVAR
kubranarci committed Dec 7, 2023
2 parents 470575c + 0104b01 commit 1829f29
Showing 26 changed files with 686 additions and 93 deletions.
34 changes: 28 additions & 6 deletions README.md
@@ -14,7 +14,9 @@

It calls SNVs from both germline and somatic samples using bcftools mpileup, then compares and filters out germline-specific ones with samtools mpileup compare. This workflow uses various annotations from publicly available databases such as 1000G variants, dbSNP, and gnomAD. The functional effect of the mutations is annotated using Annovar, and the variants are assessed for their consequence and split into somatic and non-somatic calls. In addition, extensive QC plots support the prioritization of functionally relevant somatic mutations.

For now, this workflow is only optimized to work on the ODCF cluster. The config file (conf/dkfz_cluster.config) can be used as an example. The Annotation, DeepAnnotation, and Filter steps are optional and can be turned off using the [runsnvAnnotation, runSNVDeepAnnotation, runSNVVCFFilter] parameters, respectively.
This workflow is optimized to work on the ODCF cluster. The config file (conf/dkfz_cluster.config) can be used as an example. The Annotation, DeepAnnotation, and Filter steps are optional and can be turned off using the [runsnvAnnotation, runSNVDeepAnnotation, runSNVVCFFilter] parameters, respectively.

Users outside of the ODCF cluster should prepare annotation files accordingly (see seq2_testdata_snv/annotations for an example) to use the full functionality of this workflow.
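
As an illustration, the optional steps might be toggled on the command line like this (a sketch: only the `runSNV*` parameters come from this README; the repository handle and profile are assumptions):

```console
# Hypothetical invocation; adjust the pipeline handle and profile to your setup.
nextflow run ghga-de/nf-snvcalling \
    -profile singularity \
    --input samplesheet.csv \
    --runSNVDeepAnnotation false \
    --runSNVVCFFilter false
```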

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

@@ -33,9 +35,13 @@ The pipeline has main steps: SNV calling using mpileup, basic annotations, deep

In-house scripts to annotate with several databases like gnomAD and dbSNP.

--annotation_tool: annovar or vep
ANNOVAR ([`Annovar`](https://annovar.openbioinformatics.org/en/latest/))
: annotate_variation.pl is used to annotate variants. The tool classifies variants into categories such as intergenic, intronic, nonsynonymous SNV, frameshift deletion, or large-scale duplication regions.

ENSEMBL VEP ([`ENSEMBL VEP`](https://www.ensembl.org/info/docs/tools/vep/index.html))
: can be used as an alternative to ANNOVAR. Gene annotations will be extracted.

Reliability and confidence annotations: an optional step that checks mappability, HiSeq depth, self-chain, and repeat regions to assess the reliability and confidence of the calls.

Sequence and sequencing-based error plots: provide insights into predicted somatic SNVs.
@@ -64,12 +70,28 @@ The pipeline has main steps: SNV calling using mpileup, basic annotations, deep

2. Install any of [`Docker`](https://docs.docker.com/engine/installation/) or [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/) (you can follow [this tutorial](https://singularity-tutorial.github.io/01-installation/))

3. Download [Annovar](https://annovar.openbioinformatics.org/en/latest/user-guide/download/) and set up a suitable annotation table directory to perform annotation. Example:
3. To use Annovar:

Download [Annovar](https://annovar.openbioinformatics.org/en/latest/user-guide/download/) and set up a suitable annotation table directory to perform annotation. Example:

```console
annotate_variation.pl -downdb wgEncodeGencodeBasicV19 humandb/ -build hg19
```
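
Once the annotation tables are in place, the corresponding parameters can be passed to the pipeline. A sketch using parameter names that appear in conf/test.config (the handle, paths, and values are placeholders):

```console
# Hypothetical run; --annovar_path must point at your local Annovar installation.
nextflow run ghga-de/nf-snvcalling \
    --annotation_tool annovar \
    --annovar_path /path/to/annovar \
    --buildver hg19 \
    --dbtype wgEncodeGencodeBasicV19
```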

Gene annotation is also possible with the ENSEMBL VEP tool. For test purposes only, it can be used online; for large analyses, it is recommended to either download the cache file manually or use the --download_cache flag.

Follow the documentation [here](https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#cache)

Example:

Download [cache](https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/)

```console
cd $HOME/.vep
curl -O https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_vep_110_GRCh38.tar.gz
tar xzf homo_sapiens_vep_110_GRCh38.tar.gz
```
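
To sanity-check that the cache works offline before a large run, VEP can also be invoked directly; an illustrative command (input.vcf is a placeholder):

```console
# Uses the cache downloaded above; all flags are standard VEP options.
vep --offline --cache --dir_cache $HOME/.vep \
    --species homo_sapiens --cache_version 110 --assembly GRCh38 \
    --vcf -i input.vcf -o annotated.vcf
```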

4. Download the pipeline and test it on a minimal dataset with a single command:

@@ -162,7 +184,7 @@ The nf-snvcalling pipeline comes with documentation about the pipeline [usage](h

## Credits

nf-snvcalling was originally translated from the Roddy-based pipeline by Kuebra Narci kuebra.narci@dkfz-heidelberg.de.
nf-snvcalling was originally translated from the Roddy-based pipeline by Kuebra Narci [kuebra.narci@dkfz-heidelberg.de](mailto:kuebra.narci@dkfz.de).

The pipeline was originally written in the workflow management language Roddy; see the [original GitHub page](https://github.com/DKFZ-ODCF/SNVCallingWorkflow).

@@ -171,11 +193,11 @@ The SNV workflow was in the pan-cancer analysis of whole genomes (PCAWG) and can
Pan-cancer analysis of whole genomes.
The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium.
Nature volume 578, pages 82–93 (2020).
DOI 10.1038/s41586-020-1969-6:
DOI 10.1038/s41586-020-1969-6

**TODO**
We thank the following people for their extensive assistance in the development of this pipeline:

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
Nagarajan Paramasivam (@NagaComBio) [n.paramasivam@dkfz.de](mailto:n.paramasivam@dkfz.de)

## Contributions and Support

2 changes: 1 addition & 1 deletion bin/check_samplesheet.py
@@ -93,7 +93,7 @@ def check_samplesheet(file_in, file_out):
if sample and tumor and control: ## iscontrol true
sample_info = [sample, tumor, tumor_index, control, control_index , "1"]
elif sample and tumor and not control: ## iscontrol false
sample_info = [sample, tumor,tumor_index,"dummy.bam","dummy.bai","0"]
sample_info = [sample, tumor,tumor_index,"","","0"]
else:
print_error("Invalid combination of columns provided!", "Line", line)
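
For illustration, a tumor-only run corresponds to empty control columns in the samplesheet; a hypothetical example (the header names are assumptions inferred from the variables above, not taken from this diff):

```console
# Hypothetical samplesheet; row 2 is tumor/control, row 3 is tumor-only.
sample,tumor,tumor_index,control,control_index
patient1,patient1_tumor.bam,patient1_tumor.bai,patient1_control.bam,patient1_control.bai
patient2,patient2_tumor.bam,patient2_tumor.bai,,
```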

42 changes: 26 additions & 16 deletions bin/final_plots_and_json.sh
@@ -1,6 +1,6 @@
set -euo pipefail

usage() { echo "Usage: $0 [-p prefix] [-i filenameSomaticSnvs] [-s filenameSomaticSnvsIndbSNP] [-t mincov] [-v minconfidencescore] [-b thatreshold] [-r rerun]" 1>&2; exit 1; }
usage() { echo "Usage: $0 [-p prefix] [-i filenameSomaticSnvs] [-s filenameSomaticSnvsIndbSNP] [-t mincov] [-v minconfidencescore] [-b thatreshold] [-r rerun] [-j json_report]" 1>&2; exit 1; }

while [[ $# -gt 0 ]]
do
@@ -41,6 +41,11 @@ do
shift # past argument
shift # past value
;;
-j)
json_report=$2
shift
shift
;;
esac
done

@@ -87,18 +92,23 @@ echo $THA_SCORE
[[ $(echo "${THA_SCORE} > ${THA_SCORE_THRESHOLD}" | bc -l) -eq 1 ]] && echo -e "THA score\t${THA_SCORE}\n" >${prefix}_is_THA_affected.txt

echo "check 2"
# determine fraction of SNVs called as "synonymous SNV" among all exonic SNVs (QC value)
EXONIC_CLASSIFICATION_COLUMN_INDEX=`cat ${filenameSomaticSnvs} | grep -v '^##' | grep '^#' | perl -ne 'use List::Util qw(first); chomp; my @colnames = split(/\t/, $_); my $columnIndex = first { $colnames[$_] eq "EXONIC_CLASSIFICATION"} 0..$#colnames; $columnIndex += 1; print "$columnIndex\n";'`
export EXONIC_CLASSIFICATION_COLUMN_INDEX=$((${EXONIC_CLASSIFICATION_COLUMN_INDEX}-1))
ANNOVAR_FUNCTION_COLUMN_INDEX=`cat ${filenameSomaticSnvs} | grep -v '^##' | grep '^#' | perl -ne 'use List::Util qw(first); chomp; my @colnames = split(/\t/, $_); my $columnIndex = first { $colnames[$_] eq "ANNOVAR_FUNCTION"} 0..$#colnames; $columnIndex += 1; print "$columnIndex\n";'`
export ANNOVAR_FUNCTION_COLUMN_INDEX=$((${ANNOVAR_FUNCTION_COLUMN_INDEX}-1))
SYNONYMOUS_RATIO=`grep -v '^#' ${filenameSomaticSnvs} | perl -F'\t' -ae 'BEGIN { my $total=0; my $synonymous=0; } if ($F[$ENV{"ANNOVAR_FUNCTION_COLUMN_INDEX"}] eq "exonic" ) {$total++; if ($F[$ENV{"EXONIC_CLASSIFICATION_COLUMN_INDEX"}] eq "synonymous SNV") {$synonymous++;}} END { print $synonymous/$total; }'`
echo "check 3"
echo -e "{" >${prefix}_QC_values${rerun}.json
echo -e "\t\"snvnum\": ${snvnum:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"snvindbSNP\": ${snvindbSNP:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"snvInDbsnpRatio\": ${SNV_IN_DBSNP_RATIO:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"synonymousRatio\": ${SYNONYMOUS_RATIO:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"thaScore\": ${THA_SCORE:-NA}" >>${prefix}_QC_values${rerun}.json
echo -e "}" >>${prefix}_QC_values${rerun}.json
echo "done"

if [ "$json_report" = true ]; then
# determine fraction of SNVs called as "synonymous SNV" among all exonic SNVs (QC value)
EXONIC_CLASSIFICATION_COLUMN_INDEX=`cat ${filenameSomaticSnvs} | grep -v '^##' | grep '^#' | perl -ne 'use List::Util qw(first); chomp; my @colnames = split(/\t/, $_); my $columnIndex = first { $colnames[$_] eq "EXONIC_CLASSIFICATION"} 0..$#colnames; $columnIndex += 1; print "$columnIndex\n";'`
export EXONIC_CLASSIFICATION_COLUMN_INDEX=$((${EXONIC_CLASSIFICATION_COLUMN_INDEX}-1))
ANNOVAR_FUNCTION_COLUMN_INDEX=`cat ${filenameSomaticSnvs} | grep -v '^##' | grep '^#' | perl -ne 'use List::Util qw(first); chomp; my @colnames = split(/\t/, $_); my $columnIndex = first { $colnames[$_] eq "ANNOVAR_FUNCTION"} 0..$#colnames; $columnIndex += 1; print "$columnIndex\n";'`
export ANNOVAR_FUNCTION_COLUMN_INDEX=$((${ANNOVAR_FUNCTION_COLUMN_INDEX}-1))
SYNONYMOUS_RATIO=`grep -v '^#' ${filenameSomaticSnvs} | perl -F'\t' -ae 'BEGIN { my $total=0; my $synonymous=0; } if ($F[$ENV{"ANNOVAR_FUNCTION_COLUMN_INDEX"}] eq "exonic" ) {$total++; if ($F[$ENV{"EXONIC_CLASSIFICATION_COLUMN_INDEX"}] eq "synonymous SNV") {$synonymous++;}} END { print $synonymous/$total; }'`
echo "JSON REPORT WILL BE WRITTEN"
echo -e "{" >${prefix}_QC_values${rerun}.json
echo -e "\t\"snvnum\": ${snvnum:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"snvindbSNP\": ${snvindbSNP:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"snvInDbsnpRatio\": ${SNV_IN_DBSNP_RATIO:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"synonymousRatio\": ${SYNONYMOUS_RATIO:-NA}," >>${prefix}_QC_values${rerun}.json
echo -e "\t\"thaScore\": ${THA_SCORE:-NA}" >>${prefix}_QC_values${rerun}.json
echo -e "}" >>${prefix}_QC_values${rerun}.json
echo "done"
else
echo "JSON REPORT WONT BE WRITTEN"
fi
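
An illustrative invocation of the updated script (sample names and threshold values are placeholders; the flags follow the usage string above):

```console
# -j true enables the JSON QC report; -r '' runs without a rerun suffix.
bash final_plots_and_json.sh \
    -p sample1 \
    -i sample1_somatic_snvs.vcf \
    -s sample1_somatic_snvs_indbSNP.vcf \
    -t 10 -v 8 -b 0.15 -r '' -j true
```
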
17 changes: 17 additions & 0 deletions conf/modules.config
@@ -158,6 +158,23 @@ process {
mode: params.publish_dir_mode
]
}
withName: 'ENSEMBLVEP_DOWNLOAD' {
ext.args = { '--AUTO c --CONVERT --NO_BIOPERL --NO_HTSLIB --NO_TEST --NO_UPDATE' }
publishDir = [
mode: params.publish_dir_mode,
path: { params.outdir_cache ? "${params.outdir_cache}/": "${params.outdir}/cache/" }
]
}
withName: 'ENSEMBLVEP_VEP' {
//ext.args ='--everything --filter_common --per_gene --total_length --offline'
publishDir = [
[
mode: params.publish_dir_mode,
path: { "${params.outdir}/${meta.id}/" },
pattern: "*{gz,tbi,html}"
]
]
}
}
//
// Don't publish results for these processes
20 changes: 12 additions & 8 deletions conf/test.config
@@ -28,17 +28,17 @@ params {

// Post Processes
runcontigs = "NONE"
report_json = false

// Reference Files //
genome = "GRCh38"

// Annovar
buildver = "hg38"
dbtype = "wgEncodeGencodeCompV39"
segdupcol = "SEGDUP"
cytobandcol = "CYTOBAND"
geneannocols = '"ANNOVAR_FUNCTION,GENE,EXONIC_CLASSIFICATION,ANNOVAR_TRANSCRIPTS"'
annovar_path = "/Users/w620-admin/Desktop/Workflows/Annovar/annovar_Sept2022"
// Annotation with vep
annotation_tool = "vep"
vep_cache_version = 110
vep_genome = 'GRCh38'
vep_cache = null
download_cache = false // DO NOT Download annotation cache

k_genome ="${projectDir}/seq2_testdata_snv/annotations/kgenome_integrated_snvindels_v2a.GRCh38.27022019.sites.test.vcf.gz"
dbsnp_snv ="${projectDir}/seq2_testdata_snv/annotations/dbsnp_v151_GRCh38.SNV.sorted.test.vcf.gz"
@@ -90,9 +90,13 @@ process {
cpus = { check_max( 2 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
}
withName:'FILTER_PEOVERLAP|TRIPLET_PLOTTER'{
withName:'FILTER_PEOVERLAP|TRIPLET_PLOTTER|ANNOTATE_VCF|ANNOTATION_PIPES|CONFIDENCE_ANNOTATION|CONTEXT_FREQUENCIES|CONTEXT_PLOT|ERROR_PLOTS|FILTER_BY_CRIT|FLAG_BIAS|INTERMUTATION_DISTANCE|MUTATION_DISTANCE|PER_CHROM_PLOT|PLOT_BASESCORE_BIAS|PLOT_BASESCORE_DISTRIBUTION|PURITY_RELOADED|SNV_EXTRACTOR|ENSEMBLVEP_VEP|DBSNP_COUNTER'{
cpus = { check_max( 4 * task.attempt, 'cpus' ) }
memory = { check_max( 12.GB * task.attempt, 'memory' ) }
}
// using vep online is only recommended for test purposes for a minimal set of variants!
withName: 'ENSEMBLVEP_VEP' {
ext.args ='--per_gene --total_length'
}
}

28 changes: 24 additions & 4 deletions docs/usage.md
@@ -19,6 +19,7 @@ Main Switches:
- --runplots: Generate extra rainfall and MAF plots.
- --runpurest: Run purityReloaded script.
- --skip_multiqc: Skip MultiQC.
- --annotation_tool: Select either "annovar" or "vep"

Reference Files:

@@ -55,7 +56,26 @@ annotate_variation.pl -downdb wgEncodeGencodeBasicV19 humandb/ -build hg19
- --cytobandcol : "CYTOBAND_COL"
- --geneannocols : '"ANNOVAR_FUNCTION,GENE,EXONIC_CLASSIFICATION,ANNOVAR_TRANSCRIPTS"'

**3. SNV Reliability Options:**
**3. VEP Options:**

The ENSEMBL VEP tool requires a cache directory to work offline. The cache can either be downloaded manually or fetched by the workflow using the --download_cache true flag. The related parameters are listed below, and combined in an example after the list.

Example:

Download [cache](https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/)

```console
cd $HOME/.vep
curl -O https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_vep_110_GRCh38.tar.gz
tar xzf homo_sapiens_vep_110_GRCh38.tar.gz
```
- --outdir_cache : if --download_cache is true, the downloaded cache is saved into this directory for later reuse.
- --vep_cache : cache input directory; if this path is set, use --download_cache false.
- --download_cache : if true, downloads the cache directory through the workflow.
- --vep_cache_version : VEP cache version to use (example: 110)
- --vep_genome : VEP genome to use (example: 'GRCh38')
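
Taken together, an offline VEP run might be configured like this (a sketch; the repository handle is an assumption, the parameters are those listed above):

```console
# Hypothetical invocation using a pre-downloaded cache.
nextflow run ghga-de/nf-snvcalling \
    --annotation_tool vep \
    --vep_cache $HOME/.vep \
    --vep_cache_version 110 \
    --vep_genome GRCh38 \
    --download_cache false
```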

**4. SNV Reliability Options:**

- --repeat_masker : UCSC Repeat Masker file (bed.gz). Column Name added: REPEAT_MASKER
- --dac_blacklist : UCSC DAC Black List (bed.gz) (Optional). Column Name added: DAC_BLACKLIST
@@ -65,7 +85,7 @@
- --mapability_file : UCSC Mappability regions (bed.gz). Column Name added: MAPABILITY
- --simple_tandemrepeats: UCSC Simple tandem repeats (bed.gz). Column Name added: SIMPLE_TANDEMREPEATS

**4. Deep Annotation Options:**
**5. Deep Annotation Options:**

If --runSNVDeepAnnotation is true, at least one of the following files must be defined (with corresponding indexes):

@@ -82,7 +102,7 @@ If --runSNVDeepAnnotation is true, at least one of the following files must be d
- --phastconselem_file : UCSC Phast Cons Elements (bed.gz) (Optional). Column Name added: phastConsElem20bp
- --encode_tfbs_file : UCSC Encode TFBS (bed.gz) (Optional). Column Name added: ENCODE_TFBS

**5. Confidence Annotation Options:**
**6. Confidence Annotation Options:**

- --confidenceoptions : confidence options should be listed here (string)

@@ -136,7 +156,7 @@ You will need to create a samplesheet with information about the samples you wou

```console
--input '[path to samplesheet file]'
```

### Full samplesheet

13 changes: 9 additions & 4 deletions modules/local/annovar.nf
@@ -16,13 +16,13 @@ process ANNOVAR {
val(chrprefix)

output:
tuple val(meta),path('*.temp.vcf') , emit: vcf
tuple val(meta), path('*.log') , emit: log
tuple val(meta), path('*.temp.vcf.gz'),path('*.temp.vcf.gz.tbi') , emit: vcf
tuple val(meta), path('*.log') , emit: log
tuple val(meta), path('*_genomicSuperDups')
tuple val(meta), path('*_cytoBand')
tuple val(meta), path('*variant_function')
tuple val(meta), path('*exonic_variant_function')
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when
@@ -64,12 +64,17 @@ process ANNOVAR {
--chrSuffix="" \\
--reportColumns="1" --bChrPosEnd="2,7,8" > ${prefix}.temp.vcf
bgzip -c ${prefix}.temp.vcf > ${prefix}.temp.vcf.gz
tabix ${prefix}.temp.vcf.gz
cat <<-END_VERSIONS > versions.yml
"${task.process}":
annovar_table: ${annovar_table}
annovar_path: ${params.annovar_path}
annovar_buildver: ${params.buildver}
perl: \$(echo \$(perl --version 2>&1) | sed 's/.*v\\(.*\\)) built.*/\\1/')
gzip: \$(echo \$(gzip --version 2>&1) | sed 's/^.*gzip //; s/ .*\$//')
tabix: \$(echo \$(tabix -h 2>&1) | sed 's/^.*Version: //; s/ .*\$//')
END_VERSIONS
"""
}
7 changes: 4 additions & 3 deletions modules/local/json_report.nf
@@ -2,7 +2,7 @@
// THA calculation may end with NA
process JSON_REPORT {
tag "$meta.id"
label 'process_medium'
label 'process_single'

conda (params.enable_conda ? "" : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
@@ -13,7 +13,7 @@ process JSON_REPORT {

output:
tuple val(meta), path('*.txt') , emit: txt
tuple val(meta), path('*.json') , emit: json
tuple val(meta), path('*.json') , emit: json, optional: true
tuple val(meta), path('*.pdf') , emit: plot
path "versions.yml" , emit: versions

@@ -31,7 +31,8 @@
-t $params.min_cov \\
-v $params.min_confidence_score \\
-b $params.tha_score_threshold \\
-r ''
-r '' \\
-j $params.report_json
cat <<-END_VERSIONS > versions.yml
"${task.process}":
7 changes: 3 additions & 4 deletions modules/local/snv_reliability_pipe.nf
@@ -1,15 +1,14 @@
//snv_reliability_pipe

process SNV_RELIABILITY_PIPE {
tag "$meta.id"
label 'process_medium'
label 'process_single'

conda (params.enable_conda ? "" : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'docker://kubran/odcf_mpileupsnvcalling:v0':'kubran/odcf_mpileupsnvcalling:v0' }"

input:
tuple val(meta), file(ch_vcf)
tuple val(meta), file(ch_vcf), file(ch_vcf_i)
tuple file(repeatmasker), file(repeatmasker_i)
tuple file(dacblacklist), file(dacblacklist_i)
tuple file(dukeexcluded), file(dukeexcluded_i)
@@ -38,7 +37,7 @@ process SNV_RELIABILITY_PIPE {
].join(' ').trim()

"""
cat < $ch_vcf $pipe > ${prefix}.annotated.vcf
zcat < $ch_vcf $pipe > ${prefix}.annotated.vcf
cat <<-END_VERSIONS > versions.yml
"${task.process}":