Let's try to generate NetTIME training data using ChIP-seq data from 5 conditions: JUN.A549, JUNB.K562, JUND.A549, JUND.GM12878 and JUND.K562. Data generation for training a NetTIME model requires the following metadata files:
- A text file specifying target TF ChIP-seq data locations.
- A list of text files specifying locations of feature data files.
- A list of bed files specifying the genomic regions from which to generate train, validation and test examples (i.e., a list of blacklist filtered genomic regions).
- A Python pickle file containing the indices for TF and cell type labels.
Example metadata files are provided in data/metadata/training_example. The example pickle index file is provided in data/embeddings.
data/metadata/training_example/
|-- ct_feature.txt
|-- target.txt
|-- test_regions.bed
|-- tf_feature.txt
|-- training_regions.bed
`-- validation_regions.bed
data/embeddings/
`-- example.pkl
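The index pickle maps TF and cell type names to integer embedding indices. Its exact layout is not documented here, so the sketch below is a hypothetical illustration only (the dictionary structure and key names are assumptions, not the format example.pkl actually uses); the TF and cell type names are taken from the 5 training conditions above:

```shell
# Hypothetical sketch only: build an index pickle mapping TF and cell
# type names to integer embedding indices. The real example.pkl layout
# may differ.
python - <<'EOF'
import pickle

index = {
    "TF": {"JUN": 0, "JUNB": 1, "JUND": 2},
    "CT": {"A549": 0, "GM12878": 1, "K562": 2},
}
with open("toy_index.pkl", "wb") as fh:
    pickle.dump(index, fh)

# Read it back to confirm the round trip
with open("toy_index.pkl", "rb") as fh:
    loaded = pickle.load(fh)
print(loaded["TF"]["JUND"], loaded["CT"]["K562"])
EOF
```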
In addition to the DNA sequence and cell-type-specific features used to train the main NetTIME model, TF-specific motif features can optionally be added during training. In the following example, we demonstrate the NetTIME data generation procedure using DNase-seq as the cell-type-specific feature and HOCOMOCO TF motif enrichment as the TF-specific feature.
Make sure to also download the genome FASTA file and chromosome sizes file, and save them in data/annotations:
mkdir data/annotations
cd data/annotations/
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/latest/hg19.fa.gz
gzip -d hg19.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/latest/hg19.chrom.sizes
preprocess_v1-1_generate_example_sequences.sh provides one way to generate example sequences from a group of TF ChIP-seq peaks. A combined peak set is first generated by concatenating multiple TF ChIP-seq peak bed files. Peaks longer than SEQUENCE_LENGTH are removed. Two overlapping peaks are merged if
- the two peaks overlap by more than MIN_OVERLAP base pairs, and
- the merged peak is shorter than MAX_UNION base pairs.
Each interval in the resulting merged peak set is used to create one example sequence of length SEQUENCE_LENGTH, with the midpoints of the example and the interval aligned.
The above procedure can be achieved by running the following:
cd preprocess
./preprocess_v1-1_generate_example_sequences.sh \
--region_bed ../data/metadata/training_example/training_regions.bed,../data/metadata/training_example/validation_regions.bed,../data/metadata/training_example/test_regions.bed \
--output_prefix training,validation,test \
--chip_metadata ../data/metadata/training_example/target.txt \
--embedding_index ../data/embeddings/example.pkl \
--output_dir ../data/datasets/training_example \
--set_path_extra_args --ct_feature,DNase,--tf_feature,hocomoco
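As an aside, the centering step described above (one fixed-length example per merged interval, sharing the interval's midpoint) can be sketched in pure awk. SEQUENCE_LENGTH=1000 and the toy intervals are made-up values for illustration, not what the script actually uses:

```shell
# Illustration only: emit one SEQUENCE_LENGTH window per interval,
# centered on the interval's midpoint (clipped at position 0).
SEQUENCE_LENGTH=1000
printf 'chr1\t500\t900\nchr1\t2000\t2600\n' > merged_peaks.bed
awk -v len="$SEQUENCE_LENGTH" 'BEGIN { OFS = "\t" } {
    mid = int(($2 + $3) / 2)      # interval midpoint
    start = mid - int(len / 2)    # center the example on it
    if (start < 0) start = 0      # clip at the chromosome start
    print $1, start, start + len
}' merged_peaks.bed > example_windows.bed
cat example_windows.bed
```

For the two toy intervals this prints chr1 200-1200 and chr1 1800-2800, both 1000 bp wide and centered on the original midpoints.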
The main goal of this step is to create a _seq_feature.h5 file containing the example DNA sequence feature, and a _metadata.pkl file necessary for the downstream data collection steps. The generated _metadata.pkl file specifies the genomic location of each example sequence, and where the retrieved feature signal tracks and target labels will be saved.
Alternatively, you can provide your own peak set stored in bed file format and use preprocess_v1-1-2_generate_custom_examples.sh to generate the _seq_feature.h5 and _metadata.pkl files. See example here.
For each dataset (training, validation and test), we need to retrieve feature signals (DNase-seq signal and HOCOMOCO motif enrichment) as well as target labels (from conserved and relaxed peak sets). preprocess_v1-2_retrieve_signal.sh retrieves the necessary data and performs z-score normalization on the retrieved feature signals. For small datasets, we can do this sequentially by running the following:
declare -a datasets=("training" "validation" "test")
for dset in "${datasets[@]}"; do
EXAMPLE_PICKLE="../data/datasets/training_example/${dset}/${dset}_minOverlap200_maxUnion600_metadata.pkl"
REGION_BED="../data/metadata/training_example/${dset}_regions.bed"
ZSCORE_DIR="../data/datasets/training_example/${dset}/zscore_params"
# Retrieve target labels
# Process 2nd to 6th entries in ../data/metadata/training_example/target.txt
for job_id in {2..6}; do
# Conserved peaks
./preprocess_v1-2_retrieve_signal.sh --job_id $job_id \
--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
--target_type "output_conserved"
# Relaxed peaks
./preprocess_v1-2_retrieve_signal.sh --job_id $job_id \
--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
--target_type "output_relaxed"
done
# Retrieve DNase signal
# Process 2nd to 4th entries in ../data/metadata/training_example/ct_feature.txt
for job_id in {2..4}; do
./preprocess_v1-2_retrieve_signal.sh --job_id $job_id \
--metadata "../data/metadata/training_example/ct_feature.txt" \
--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
--zscore_dir $ZSCORE_DIR
done
# Retrieve hocomoco motif enrichment scores
# Process 2nd to 4th entries in ../data/metadata/training_example/tf_feature.txt
for job_id in {2..4}; do
./preprocess_v1-2_retrieve_signal.sh --job_id $job_id \
--metadata "../data/metadata/training_example/tf_feature.txt" \
--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
--zscore_dir $ZSCORE_DIR
done
done
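For reference, the z-score step amounts to subtracting the mean and dividing by the standard deviation of each feature track (the script stores these parameters under zscore_params/). A toy illustration on a made-up signal column, not the script's actual implementation:

```shell
# Illustration only: z-score a single column of toy signal values by
# reading the file twice: first to accumulate mean/sd, then to normalize.
printf '1\n2\n3\n4\n' > toy_signal.txt
awk 'NR == FNR { s += $1; ss += $1 * $1; n++; next }
     FNR == 1  { mean = s / n; sd = sqrt(ss / n - mean * mean) }
     { printf "%.4f\n", ($1 - mean) / sd }' toy_signal.txt toy_signal.txt > toy_zscores.txt
cat toy_zscores.txt
```

The toy column has mean 2.5 and standard deviation sqrt(1.25), so the output is symmetric around zero (-1.3416, -0.4472, 0.4472, 1.3416).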
I recommend running this step in a high-performance computing environment if you have a large number of example sequences and conditions to retrieve data for. For example, if you're working under Slurm, you can submit array jobs similar to this:
declare -a datasets=("training" "validation" "test")
for dset in "${datasets[@]}"; do
EXAMPLE_PICKLE="../data/datasets/training_example/${dset}/${dset}_minOverlap200_maxUnion600_metadata.pkl"
REGION_BED="../data/metadata/training_example/${dset}_regions.bed"
ZSCORE_DIR="../data/datasets/training_example/${dset}/zscore_params"
# Retrieve target labels
# Process 2nd to 6th entries in ../data/metadata/training_example/target.txt
sbatch --array=2-6 preprocess_v1-2_retrieve_signal.sh \
--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
--target_type "output_conserved" --tmp_dir $SLURM_JOBTMP
sbatch --array=2-6 preprocess_v1-2_retrieve_signal.sh \
--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
--target_type "output_relaxed" --tmp_dir $SLURM_JOBTMP
# Retrieve DNase signal
# Process 2nd to 4th entries in ../data/metadata/training_example/ct_feature.txt
sbatch --array=2-4 preprocess_v1-2_retrieve_signal.sh \
--metadata "../data/metadata/training_example/ct_feature.txt" \
--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
--zscore_dir $ZSCORE_DIR --tmp_dir $SLURM_JOBTMP
# Retrieve hocomoco motif enrichment scores
# Process 2nd to 4th entries in ../data/metadata/training_example/tf_feature.txt
sbatch --array=2-4 preprocess_v1-2_retrieve_signal.sh \
--metadata "../data/metadata/training_example/tf_feature.txt" \
--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
--zscore_dir $ZSCORE_DIR --tmp_dir $SLURM_JOBTMP
done
The last step is to combine the retrieved feature signals as well as target labels into one HDF5 file, and to compute class weights for the target labels.
declare -a datasets=("training" "validation" "test")
for dset in "${datasets[@]}"; do
EXAMPLE_PICKLE="../data/datasets/training_example/${dset}/${dset}_minOverlap200_maxUnion600_metadata.pkl"
SEQ_FEATURE="../data/datasets/training_example/${dset}/${dset}_minOverlap200_maxUnion600_seq_feature.h5"
OUTPUT_HDF5="../data/datasets/training_example/${dset}_minOverlap200_maxUnion600.h5"
./preprocess_v1-3_merge_dataset.sh --example_pickle $EXAMPLE_PICKLE \
--seq_feature_hdf5 $SEQ_FEATURE --output_hdf5 $OUTPUT_HDF5 \
--generate_hdf5_extra_args --ct_feature,DNase,--tf_feature,hocomoco,--compression
done
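The class weight mentioned above compensates for bound examples being much rarer than unbound ones. A common formulation (an assumption here, not necessarily the exact formula the script uses) weights the positive class by the negative-to-positive ratio:

```shell
# Toy illustration only: compute a positive-class weight as
# (number of negative labels) / (number of positive labels).
printf '0\n0\n0\n1\n' > toy_labels.txt
awk '{ if ($1 == 1) pos++; else neg++ }
     END { printf "%.2f\n", neg / pos }' toy_labels.txt > toy_weight.txt
cat toy_weight.txt   # 3 negatives, 1 positive
```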
Here is a list of files you will be able to generate after successfully running the above three steps.
In this demo, we use the pretrained model to make predictions for 2 conditions: JUN.K562 and JUNB.GM12878. This particular model was trained on the DNA sequence as well as 4 types of cell-type-specific features (DNase, H3K4me1, H3K4me3, H3K27ac).
Data generation for NetTIME prediction requires the following metadata files:
- A text file specifying the conditions to make predictions for.
- A list of text files specifying locations of feature data files.
- A bed file specifying the genomic regions from which to generate prediction examples.
- A Python pickle file containing the indices for TF and cell type labels.
Example metadata files are provided in data/metadata/prediction_example. The example pickle index file is provided in data/embeddings.
data/metadata/prediction_example/
|-- conditions.txt
|-- ct_feature.txt
|-- predict.bed
`-- tf_feature.txt
data/embeddings/
`-- pretrained.pkl
Generate example sequences given a peak set bed file. Each interval in the bed file is used to create one example sequence of length SEQUENCE_LENGTH, with the midpoints of the example and the interval aligned. If intervals in your bed file are much longer than SEQUENCE_LENGTH, we recommend binning the intervals before running this script. Turn on the --skip_target flag when target labels are not available.
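Binning can be done with bedtools makewindows, or with a small awk one-off like the sketch below (SEQUENCE_LENGTH=1000 and the toy interval are assumed values for illustration):

```shell
# Illustration only: split long intervals into SEQUENCE_LENGTH-sized
# bins (the last bin is clipped at the interval end).
SEQUENCE_LENGTH=1000
printf 'chr1\t0\t2500\n' > toy_long.bed
awk -v w="$SEQUENCE_LENGTH" 'BEGIN { OFS = "\t" } {
    for (s = $2; s < $3; s += w) {
        e = (s + w < $3) ? s + w : $3   # clip the final bin
        print $1, s, e
    }
}' toy_long.bed > toy_binned.bed
cat toy_binned.bed
```

The 2500 bp toy interval becomes three bins: 0-1000, 1000-2000 and 2000-2500.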
cd preprocess
./preprocess_v1-1-2_generate_custom_examples.sh \
--input_bed ../data/metadata/prediction_example/predict.bed \
--condition_metadata ../data/metadata/prediction_example/conditions.txt \
--embedding_index ../data/embeddings/pretrained.pkl \
--output_dir ../data/datasets/prediction_example \
--set_path_extra_args --ct_feature,DNase,H3K4me1,H3K4me3,H3K27ac,--skip_target
Retrieve feature signals (DNase-seq, and H3K4me1, H3K4me3 and H3K27ac ChIP-seq) for the prediction dataset. Run the following to retrieve data sequentially. See the similar example above for instructions on retrieving data in a high performance computing environment such as Slurm.
# Process 2nd to 9th entries in ../data/metadata/prediction_example/ct_feature.txt
for job_id in {2..9}; do
./preprocess_v1-2_retrieve_signal.sh --job_id $job_id \
--metadata "../data/metadata/prediction_example/ct_feature.txt" \
--example_pickle "../data/datasets/prediction_example/predict_metadata.pkl" \
--zscore_dir "../data/datasets/prediction_example/zscore_params"
done
Combine the retrieved feature signals into one HDF5 file. Turn on the --skip_target flag when target labels are not available.
./preprocess_v1-3_merge_dataset.sh \
--example_pickle "../data/datasets/prediction_example/predict_metadata.pkl" \
--seq_feature_hdf5 "../data/datasets/prediction_example/predict_seq_feature.h5" \
--output_hdf5 "../data/datasets/prediction_example/predict.h5" \
--generate_hdf5_extra_args --ct_feature,DNase,H3K4me1,H3K4me3,H3K27ac,--compression,--skip_target,--condition_metadata,../data/metadata/prediction_example/conditions.txt
Here is a list of files you will be able to generate after successfully running the above three steps.