zeyang-shen/spacing_pipeline

Spacing pipeline

The scripts provided in this repository are used to compute and characterize the spacing relationships of transcription factors.

[Figure: overview of the method]

Dependencies

Quick Usage

identify_motif.py finds motifs given a peak file, a FASTA file of the peak sequences, and a motif file. The recommended parameters below keep motifs that pass a false positive rate below 0.1% (--cutoff) and lie less than 50 bp from the peak center (-d 50):

python identify_motif.py ../ENCODE_processed_files/CTCF_idr.fa CTCF --motif_path ../motifs/ --cutoff -d 50
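
The -d 50 filter can be pictured with a minimal sketch. It assumes the distance is measured from the motif midpoint to the peak midpoint, which the text above does not specify; the function name within_center and all coordinates are hypothetical and are not taken from identify_motif.py.

```python
def within_center(motif_start, motif_end, peak_length, max_dist=50):
    """Return True if the motif midpoint is less than max_dist bp from the peak center.

    Coordinates are relative to the start of the peak sequence (assumption).
    """
    peak_center = peak_length / 2
    motif_center = (motif_start + motif_end) / 2
    return abs(motif_center - peak_center) < max_dist


# Made-up motif hits (start, end) in a 200 bp peak whose center is at 100.
hits = [(10, 20), (95, 105), (140, 150), (180, 190)]
kept = [h for h in hits if within_center(h[0], h[1], peak_length=200)]
print(kept)
```

Only the two hits whose midpoints fall within 50 bp of position 100 are kept.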

To identify motifs and, at the same time, separate peaks into those in repetitive and nonrepetitive DNA regions, first download the repeat annotations, then run identify_motif.py with the --repeat option:

wget https://homer.ucsd.edu/zeyang/hg38_repeats.tar.gz
tar -zxvf hg38_repeats.tar.gz
python identify_motif.py ../ENCODE_processed_files/CTCF_idr.fa CTCF --motif_path ../motifs/ --cutoff -d 50 --repeat hg38_repeats/hg38_repeats_merged.nodup.all.txt
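
Separating peaks by repeat overlap amounts to an interval lookup. The sketch below is one possible way to do it, not the script's actual code: it assumes the repeat annotations have been loaded as non-overlapping (chrom, start, end) half-open intervals and that a peak is classified by its center position. The names build_index and in_repeat and all coordinates are hypothetical.

```python
from bisect import bisect_right


def build_index(repeats):
    """Group repeat intervals by chromosome and sort them for binary search."""
    index = {}
    for chrom, start, end in repeats:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def in_repeat(index, chrom, pos):
    """Return True if pos lies inside a repeat interval [start, end) on chrom."""
    intervals = index.get(chrom, [])
    # Last interval whose start is <= pos (valid because intervals do not overlap).
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


# Made-up repeat intervals and peak centers.
repeat_index = build_index([("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)])
peak_centers = [("chr1", 150), ("chr1", 300), ("chr2", 60), ("chr3", 10)]

repetitive = [p for p in peak_centers if in_repeat(repeat_index, *p)]
nonrepetitive = [p for p in peak_centers if not in_repeat(repeat_index, *p)]
print(repetitive, nonrepetitive)
```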

characterize_spacing.py takes the two processed files produced by identify_motif.py for a pair of transcription factors and outputs their spacing relationships. The basic usage is as follows:

python characterize_spacing.py ../ENCODE_processed_files/ GATA1 TAL1 --motif_path ../motifs/
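
To illustrate what a spacing relationship means, here is a minimal sketch that tallies the signed distance between motif positions of two factors on the same chromosome. It is not the implementation in characterize_spacing.py and leaves out any statistical characterization; the function name spacing_counts, the max_spacing window, and the coordinates are all hypothetical.

```python
from collections import Counter


def spacing_counts(motifs_a, motifs_b, max_spacing=100):
    """Count signed spacings (b minus a) between motif positions on the same chromosome.

    motifs_a and motifs_b are lists of (chrom, position) tuples.
    Pairs farther apart than max_spacing are ignored.
    """
    counts = Counter()
    for chrom_a, pos_a in motifs_a:
        for chrom_b, pos_b in motifs_b:
            if chrom_a != chrom_b:
                continue
            spacing = pos_b - pos_a
            if abs(spacing) <= max_spacing:
                counts[spacing] += 1
    return counts


# Made-up motif positions for two factors.
motifs_a = [("chr1", 1000), ("chr1", 5000), ("chr2", 300)]
motifs_b = [("chr1", 1009), ("chr1", 5009), ("chr2", 900), ("chr3", 300)]

counts = spacing_counts(motifs_a, motifs_b)
print(counts.most_common(1))
```

A spacing that recurs across many sites, like the value 9 in this toy input, is the kind of pattern a spacing analysis looks for. The nested loop is quadratic and only suitable for small inputs.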

Citation

If you use our findings or scripts, please cite our paper: https://doi.org/10.7554/eLife.70878.

Data

The motifs/ folder stores the PWM files used in the paper, in JASPAR format.
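
For readers unfamiliar with the format, the sketch below parses a JASPAR-style count matrix and converts it to per-position probabilities. It assumes the common layout with a ">" header line followed by one bracketed row per base; the files in motifs/ may differ in detail. The helper names, the pseudocount, and the toy matrix are hypothetical, and scientific-notation numbers are not handled.

```python
import re


def parse_jaspar(text):
    """Parse one JASPAR-style matrix into (name, {base: [counts per position]})."""
    name = None
    counts = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
        else:
            base = line[0].upper()
            numbers = re.findall(r"-?\d+(?:\.\d+)?", line[1:])
            counts[base] = [float(x) for x in numbers]
    return name, counts


def to_probabilities(counts, pseudocount=1.0):
    """Convert counts to a list of per-position base probabilities."""
    length = len(counts["A"])
    probs = []
    for i in range(length):
        column = {b: counts[b][i] + pseudocount for b in "ACGT"}
        total = sum(column.values())
        probs.append({b: v / total for b, v in column.items()})
    return probs


# Toy matrix, not a real motif.
toy = """
>toy_motif
A [ 8 0 1 ]
C [ 0 9 1 ]
G [ 1 0 7 ]
T [ 1 1 1 ]
"""
name, counts = parse_jaspar(toy)
probs = to_probabilities(counts)
print(name, len(probs))
```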

The ENCODE_processed_files/ folder contains the processed data from the paper, derived from ENCODE ChIP-seq data:

  • _idr.tsv -- ChIP-seq peaks in HOMER peak file format after running IDR
  • _idr.fa -- sequences of ChIP-seq peaks in _idr.tsv
  • _idr_cutoff.tsv -- ChIP-seq peaks that have been identified to have valid motifs
  • _idr_cutoff_inmask.tsv -- Peaks in _idr_cutoff.tsv that fall into repetitive regions
  • _idr_cutoff_masked.tsv -- Peaks in _idr_cutoff.tsv that fall into nonrepetitive regions
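
The suffixes above can be collected in a small lookup to locate the files for one factor. This is a convenience sketch that assumes every file is named with the factor name followed by the suffix, as in CTCF_idr.fa in the usage examples; the function name processed_files is hypothetical.

```python
from pathlib import PurePosixPath

# Suffix -> description, copied from the list above.
SUFFIXES = {
    "_idr.tsv": "ChIP-seq peaks in HOMER peak file format after running IDR",
    "_idr.fa": "sequences of ChIP-seq peaks in _idr.tsv",
    "_idr_cutoff.tsv": "peaks identified to have valid motifs",
    "_idr_cutoff_inmask.tsv": "peaks in _idr_cutoff.tsv in repetitive regions",
    "_idr_cutoff_masked.tsv": "peaks in _idr_cutoff.tsv in nonrepetitive regions",
}


def processed_files(data_dir, tf_name):
    """Return {suffix: path} for one factor, assuming <tf_name><suffix> naming."""
    return {s: PurePosixPath(data_dir) / f"{tf_name}{s}" for s in SUFFIXES}


paths = processed_files("../ENCODE_processed_files", "CTCF")
for suffix, path in paths.items():
    print(path, "--", SUFFIXES[suffix])
```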

Contact

If you encounter a problem when using the scripts, you can

  1. post an issue in the Issues section
  2. or email Zeyang Shen at zes017@ucsd.edu

License

This project is licensed under the GNU GPL v3.

Contributors

The scripts were developed primarily by Zeyang Shen and Rick Zhenzhi Li. Supervision for the project was provided by Christopher K. Glass.
