Kids First and Sentieon collaboration to bring a relatively easy and affordable joint calling workflow to CAVATICA

kids-first/Kids-First-Sentieon-Joint-Cohort-Genotyping-Workflow

 
 


Kids First-Sentieon Joint Cohort Calling

This is a beta workflow. It is not currently used in production, but its output is more than serviceable. It is an efficient, fast, and cost-effective workflow for joint calling cohorts of up to ~4000 individuals on the CAVATICA platform. You will likely need to install sbpack to push the necessary apps into your project, and we recommend installing the Python packages listed in python_pip_requirements.txt. Both workflows use limited implementations of the GVCFtyper algorithm, which is part of the Sentieon driver library. VQSR is not currently part of this workflow, but it can be run afterwards; see the Kids First-Sentieon VQSR Equivalent Workflow for more information. For questions, comments, or support requests, please email support@kidsfirstdrc.org

This workflow runs on the CAVATICA platform and needs only a few parameters. The platform has an internal workflow scatter limit of ~2200 files, so a cohort larger than this will not run. To load this app into your project, follow the sbpack install instructions to push workflow/kf-joint-cohort-call-by-chr-wf.cwl

This notebook can be run locally or copied to a CAVATICA Data Studio Analysis session. Use it if your cohort has more than 2200 samples but is projected to use less than 4 TB per split job and per chromosome call job. To estimate the disk space required, take the average file size of your cohort and, assuming file size is distributed evenly by chromosome, multiply it by the proportion of the genome that the largest chromosome represents, then double the result to leave room for the joint call output for that chromosome. For example, suppose the average file size is 7.2 GB and the total genome is chr 1-22, X, and Y. chr1 is the largest chromosome and takes up about 0.08 of total bases, so doubling gives a factor of 0.16:

x = 4000 GB/( 7.2 GB * 0.16 )
x ~ 3,472 files

Therefore, in this scenario, you can use this method if your cohort has no more than about 3,472 files. Similar to the Preferred method, you push apps to your project, this time two of them: workflow/split_vcf_mini_wf.cwl and tools/bcftools_shard_vcf.cwl
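The estimate above can be reproduced for other cohorts with a few lines of Python. This is only a sketch of the arithmetic described in this section, not code from the repository: the function names `max_cohort_size` and `suggest_method` and the constant names are hypothetical, and the limits (~2200 scatter limit, 4 TB of disk, 0.08 for chr1) are the approximate figures quoted in the text.

```python
import math

# Approximate figures quoted in this README; adjust them for your own project.
SCATTER_LIMIT = 2200    # approximate platform scatter limit (files)
DISK_LIMIT_GB = 4000    # 4 TB per split job and per chromosome call job
CHR1_FRACTION = 0.08    # approximate share of chr1-22,X,Y bases on chr1
RESULT_HEADROOM = 2     # double the input size to leave room for the joint call result


def max_cohort_size(avg_file_gb, largest_chr_fraction=CHR1_FRACTION,
                    disk_limit_gb=DISK_LIMIT_GB, headroom=RESULT_HEADROOM):
    """Largest number of files that fits the disk budget for the largest chromosome."""
    per_file_gb = avg_file_gb * largest_chr_fraction * headroom
    return math.floor(disk_limit_gb / per_file_gb)


def suggest_method(cohort_size, avg_file_gb):
    """Hypothetical helper: pick an approach from the two limits described above."""
    if cohort_size <= SCATTER_LIMIT:
        return "by-chromosome workflow"
    if cohort_size <= max_cohort_size(avg_file_gb):
        return "split-then-call method"
    return "exceeds estimated disk budget"


if __name__ == "__main__":
    print(max_cohort_size(7.2))        # worked example from the text
    print(suggest_method(2303, 7.2))   # cohort size used in the cost estimates
```

The even-distribution assumption is a simplification, so treat the result as a rough upper bound and leave some margin.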

Loading the Workflows and Apps

Since this is a beta project, it is not yet in the CAVATICA apps. We recommend pushing the workflows and apps to the platform from git releases only. Git releases have typically undergone some testing and are more likely to work than main or any other branch.

Cost and run time estimates

Based on using the Advanced Cohort Calling Workflow on a cohort of 2303 samples:

Split: cost $60.96, run time ~1.5 hours
Sentieon GVCFtyper: cost $231.34, call run times varied by chromosome size from 1 to 7 hours, with a median of ~3.5 hours
