# Data generation for NetTIME

## Generating data to train a NetTIME model from scratch

Let's generate NetTIME training data using ChIP-seq data from 5 conditions: JUN.A549, JUNB.K562, JUND.A549, JUND.GM12878, and JUND.K562. Data generation for training a NetTIME model requires the following metadata files:

  1. A text file specifying target TF ChIP-seq data locations.
  2. A list of text files specifying locations of feature data files.
  3. A list of bed files specifying the genomic regions from which to generate train, validation and test examples (i.e., a list of blacklist filtered genomic regions).
  4. A Python pickle file containing the indices for TF and cell type labels.

Example metadata files are provided in `data/metadata/training_example`. The example pickle index file is provided in `data/embeddings`.

```
data/metadata/training_example/
|-- ct_feature.txt
|-- target.txt
|-- test_regions.bed
|-- tf_feature.txt
|-- training_regions.bed
`-- validation_regions.bed
data/embeddings/
`-- example.pkl
```
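
If you want a quick look at what is inside the index pickle, the snippet below loads and prints it. This is just a sanity check; it assumes only that the file is a standard Python pickle (the exact structure is whatever NetTIME stores).

```bash
# Minimal sanity check: load and print the TF / cell type label index.
python - <<'EOF'
import pickle

with open("data/embeddings/example.pkl", "rb") as f:
    index = pickle.load(f)

print(type(index))
print(index)
EOF
```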

In addition to DNA sequence and cell-type-specific features used to train the main NetTIME model, TF-specific motif features can optionally be added during training. In the following example, we demonstrate the NetTIME data generation procedure using DNase-seq as the cell-type-specific feature and HOCOMOCO TF motif enrichment as the TF-specific feature.

Make sure to also download the genome FASTA file and chromosome size file, and save them in `data/annotations`:

```bash
mkdir data/annotations
cd data/annotations/
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/latest/hg19.fa.gz
gzip -d hg19.fa.gz
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/latest/hg19.chrom.sizes
```
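
Optionally, sanity-check the downloads before moving on. The checks below assume samtools is installed; the pipeline itself may not require it.

```bash
# Optional sanity checks on the downloaded annotation files.
samtools faidx hg19.fa        # builds hg19.fa.fai; fails if the FASTA is corrupt
head -n 3 hg19.chrom.sizes    # expect lines like "chr1<TAB>249250621"
```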

### Step 1: generate example sequences

`preprocess_v1-1_generate_example_sequences.sh` provides one way to generate example sequences from a group of TF ChIP-seq peaks. A combined peak set is first generated by concatenating multiple TF ChIP-seq peak bed files. Peaks longer than `SEQUENCE_LENGTH` are removed. Two overlapping peaks are merged if

  1. the two peaks overlap by more than `MIN_OVERLAP` base pairs, and
  2. the merged peak is shorter than `MAX_UNION` base pairs.

Each interval in the resulting merged peak set is used to create one example sequence of length `SEQUENCE_LENGTH`, centered on the interval's midpoint.

The above procedure can be achieved by running the following:

```bash
cd preprocess
./preprocess_v1-1_generate_example_sequences.sh \
	--region_bed ../data/metadata/training_example/training_regions.bed,../data/metadata/training_example/validation_regions.bed,../data/metadata/training_example/test_regions.bed \
	--output_prefix training,validation,test \
	--chip_metadata ../data/metadata/training_example/target.txt \
	--embedding_index ../data/embeddings/example.pkl \
	--output_dir ../data/datasets/training_example \
	--set_path_extra_args --ct_feature,DNase,--tf_feature,hocomoco
```
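
For intuition, the concatenate/filter/merge procedure corresponds roughly to the standalone bedtools sketch below. This is an illustration, not the script's actual implementation: `SEQUENCE_LENGTH=1000` is a hypothetical value, the `MIN_OVERLAP` and `MAX_UNION` values mirror the `minOverlap200_maxUnion600` tags that appear in the generated file names, and the sketch drops oversized unions instead of leaving those peaks unmerged.

```bash
# Illustrative sketch of the peak-merging logic (not the actual script).
SEQUENCE_LENGTH=1000   # hypothetical value
MIN_OVERLAP=200
MAX_UNION=600

cat peaks_*.bed |
	# Drop peaks longer than SEQUENCE_LENGTH.
	awk -v L="$SEQUENCE_LENGTH" '($3 - $2) <= L' |
	sort -k1,1 -k2,2n |
	# A negative -d makes bedtools merge require >= MIN_OVERLAP bp of overlap.
	bedtools merge -d "-$MIN_OVERLAP" |
	# Keep merged intervals shorter than MAX_UNION.
	awk -v U="$MAX_UNION" '($3 - $2) < U' |
	# Center a SEQUENCE_LENGTH window on each interval's midpoint.
	awk -v L="$SEQUENCE_LENGTH" 'BEGIN { OFS = "\t" } {
		m = int(($2 + $3) / 2)
		print $1, m - int(L / 2), m + int(L / 2)
	}' > examples.bed
```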

The main goal of this step is to create a `_seq_feature.h5` file containing the example DNA sequence feature, and a `_metadata.pkl` file needed by the downstream data collection steps. The generated `_metadata.pkl` file specifies the genomic location of each example sequence, as well as where the retrieved feature signal tracks and target labels will be saved.

Alternatively, you can provide your own peak set in bed format and use `preprocess_v1-1-2_generate_custom_examples.sh` to generate the `_seq_feature.h5` and `_metadata.pkl` files. See the example here.

### Step 2: retrieve feature signal tracks and target labels

For each dataset (training, validation, and test), we need to retrieve feature signals (DNase-seq signal and HOCOMOCO motif enrichment) as well as target labels (from conserved and relaxed peak sets). `preprocess_v1-2_retrieve_signal.sh` retrieves the necessary data and performs z-score normalization on the retrieved feature signals. For small datasets, we can do this sequentially by running the following:

```bash
declare -a datasets=("training" "validation" "test")
for dset in "${datasets[@]}"; do
	EXAMPLE_PICKLE="../data/datasets/training_example/${dset}/${dset}_minOverlap200_maxUnion600_metadata.pkl"
	REGION_BED="../data/metadata/training_example/${dset}_regions.bed"
	ZSCORE_DIR="../data/datasets/training_example/${dset}/zscore_params"
	# Retrieve target labels
	# Process 2nd to 6th entries in ../data/metadata/training_example/target.txt
	for job_id in {2..6}; do
		# Conserved peaks
		./preprocess_v1-2_retrieve_signal.sh --job_id $job_id \
		--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
		--target_type "output_conserved"
		# Relaxed peaks
		./preprocess_v1-2_retrieve_signal.sh --job_id $job_id \
		--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
		--target_type "output_relaxed"
	done

	# Retrieve DNase signal
	# Process 2nd to 4th entries in ../data/metadata/training_example/ct_feature.txt
	for job_id in {2..4}; do
		./preprocess_v1-2_retrieve_signal.sh --job_id $job_id \
		--metadata "../data/metadata/training_example/ct_feature.txt" \
		--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
		--zscore_dir $ZSCORE_DIR
	done

	# Retrieve HOCOMOCO motif enrichment scores
	# Process 2nd to 4th entries in ../data/metadata/training_example/tf_feature.txt
	for job_id in {2..4}; do
		./preprocess_v1-2_retrieve_signal.sh --job_id $job_id \
		--metadata "../data/metadata/training_example/tf_feature.txt" \
		--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
		--zscore_dir $ZSCORE_DIR
	done
done
```
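
On a single multi-core machine, the label-retrieval loop above can also be run concurrently, e.g. with GNU parallel. This is a sketch; it assumes GNU parallel is installed and that `EXAMPLE_PICKLE` and `REGION_BED` are set for the dataset being processed, as in the loop above.

```bash
# Run all (job_id, target_type) combinations, 4 jobs at a time.
parallel -j 4 ./preprocess_v1-2_retrieve_signal.sh --job_id {1} \
	--example_pickle "$EXAMPLE_PICKLE" --region_bed "$REGION_BED" \
	--target_type {2} ::: {2..6} ::: output_conserved output_relaxed
```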

We recommend running this step in a high-performance computing environment if you have a large number of example sequences and conditions to retrieve data for. For example, if you work under Slurm, you can submit array jobs similar to the following:

```bash
declare -a datasets=("training" "validation" "test")
for dset in "${datasets[@]}"; do
	EXAMPLE_PICKLE="../data/datasets/training_example/${dset}/${dset}_minOverlap200_maxUnion600_metadata.pkl"
	REGION_BED="../data/metadata/training_example/${dset}_regions.bed"
	ZSCORE_DIR="../data/datasets/training_example/${dset}/zscore_params"
	# Retrieve target labels
	# Process 2nd to 6th entries in ../data/metadata/training_example/target.txt
	sbatch --array=2-6 preprocess_v1-2_retrieve_signal.sh \
	--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
	--target_type "output_conserved" --tmp_dir $SLURM_JOBTMP
	sbatch --array=2-6 preprocess_v1-2_retrieve_signal.sh \
	--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
	--target_type "output_relaxed" --tmp_dir $SLURM_JOBTMP

	# Retrieve DNase signal
	# Process 2nd to 4th entries in ../data/metadata/training_example/ct_feature.txt
	sbatch --array=2-4 preprocess_v1-2_retrieve_signal.sh \
	--metadata "../data/metadata/training_example/ct_feature.txt" \
	--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
	--zscore_dir $ZSCORE_DIR --tmp_dir $SLURM_JOBTMP

	# Retrieve HOCOMOCO motif enrichment scores
	# Process 2nd to 4th entries in ../data/metadata/training_example/tf_feature.txt
	sbatch --array=2-4 preprocess_v1-2_retrieve_signal.sh \
	--metadata "../data/metadata/training_example/tf_feature.txt" \
	--example_pickle $EXAMPLE_PICKLE --region_bed $REGION_BED \
	--zscore_dir $ZSCORE_DIR --tmp_dir $SLURM_JOBTMP
done
```
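
Wait for all array jobs to finish before moving on to Step 3; standard Slurm commands are enough to monitor them:

```bash
squeue -u "$USER"    # array tasks still queued or running
sacct -X --starttime today --format=JobID,JobName%35,State    # state of today's jobs
```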

### Step 3: merge data files

The last step combines the retrieved feature signals and target labels into one HDF5 file per dataset, and computes class weights for the target labels.

```bash
declare -a datasets=("training" "validation" "test")
for dset in "${datasets[@]}"; do
	EXAMPLE_PICKLE="../data/datasets/training_example/${dset}/${dset}_minOverlap200_maxUnion600_metadata.pkl"
	SEQ_FEATURE="../data/datasets/training_example/${dset}/${dset}_minOverlap200_maxUnion600_seq_feature.h5"
	OUTPUT_HDF5="../data/datasets/training_example/${dset}_minOverlap200_maxUnion600.h5"
	./preprocess_v1-3_merge_dataset.sh --example_pickle $EXAMPLE_PICKLE \
	--seq_feature_hdf5 $SEQ_FEATURE --output_hdf5 $OUTPUT_HDF5 \
	--generate_hdf5_extra_args --ct_feature,DNase,--tf_feature,hocomoco,--compression
done
```
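
To confirm the merge worked, you can list the contents of one of the resulting HDF5 files (requires the HDF5 command-line tools; the dataset names inside are whatever the merge script wrote):

```bash
h5ls -r ../data/datasets/training_example/training_minOverlap200_maxUnion600.h5
```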

Here is a list of files you will be able to generate after successfully running the above three steps.

## Generating data to make predictions from a trained model

In this demo, we use the pretrained model to make predictions for 2 conditions: JUN.K562 and JUNB.GM12878. This particular model was trained on DNA sequence as well as 4 types of cell-type-specific features (DNase, H3K4me1, H3K4me3, H3K27ac).

Data generation for NetTIME prediction requires the following metadata files:

  1. A text file specifying the conditions to make predictions for.
  2. A list of text files specifying locations of feature data files.
  3. A bed file specifying the genomic regions from which to generate prediction examples.
  4. A Python pickle file containing the indices for TF and cell type labels.

Example metadata files are provided in `data/metadata/prediction_example`. The example pickle index file is provided in `data/embeddings`.

```
data/metadata/prediction_example/
|-- conditions.txt
|-- ct_feature.txt
|-- predict.bed
`-- tf_feature.txt
data/embeddings/
`-- pretrained.pkl
```

### Generate example sequences

Generate example sequences given a peak set bed file. Each interval in the bed file is used to create one example sequence of length `SEQUENCE_LENGTH`, centered on the interval's midpoint. If intervals in your bed file are much longer than `SEQUENCE_LENGTH`, we recommend binning the intervals before running this script (one option is sketched below). Turn on the `--skip_target` flag when target labels are not available.
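
One way to do the recommended binning is with bedtools makewindows; the 1000 bp bin width below is a hypothetical value that should match your `SEQUENCE_LENGTH`.

```bash
# Split long intervals into fixed-width bins before generating examples.
bedtools makewindows -b long_intervals.bed -w 1000 > binned_intervals.bed
```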

```bash
cd preprocess
./preprocess_v1-1-2_generate_custom_examples.sh \
	--input_bed ../data/metadata/prediction_example/predict.bed \
	--condition_metadata ../data/metadata/prediction_example/conditions.txt \
	--embedding_index ../data/embeddings/pretrained.pkl \
	--output_dir ../data/datasets/prediction_example \
	--set_path_extra_args --ct_feature,DNase,H3K4me1,H3K4me3,H3K27ac,--skip_target
```
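
This step should leave behind the metadata pickle and sequence feature files consumed by the retrieval and merge commands below:

```bash
# Both files are referenced by the commands in the next two steps.
ls ../data/datasets/prediction_example/predict_metadata.pkl \
	../data/datasets/prediction_example/predict_seq_feature.h5
```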

### Retrieve feature signal tracks

Retrieve feature signals (DNase-seq, as well as H3K4me1, H3K4me3, and H3K27ac ChIP-seq) for the prediction dataset. Run the following to retrieve data sequentially. See the similar example above for instructions on retrieving data in a high-performance computing environment such as Slurm.

```bash
# Process 2nd to 9th entries in ../data/metadata/prediction_example/ct_feature.txt
for job_id in {2..9}; do
	./preprocess_v1-2_retrieve_signal.sh --job_id $job_id \
	--metadata "../data/metadata/prediction_example/ct_feature.txt" \
	--example_pickle "../data/datasets/prediction_example/predict_metadata.pkl" \
	--zscore_dir "../data/datasets/prediction_example/zscore_params"
done
```

### Merge data files

Combine the retrieved feature signals into one HDF5 file. Turn on the `--skip_target` flag when target labels are not available.

```bash
./preprocess_v1-3_merge_dataset.sh \
	--example_pickle "../data/datasets/prediction_example/predict_metadata.pkl" \
	--seq_feature_hdf5 "../data/datasets/prediction_example/predict_seq_feature.h5" \
	--output_hdf5 "../data/datasets/prediction_example/predict.h5" \
	--generate_hdf5_extra_args --ct_feature,DNase,H3K4me1,H3K4me3,H3K27ac,--compression,--skip_target,--condition_metadata,../data/metadata/prediction_example/conditions.txt
```

Here is a list of files you will be able to generate after successfully running the above three steps.