
Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks

✨ This repository contains the datasets, models, and code used in our study! ✨

🛠️ Installation and environment set-up

First, please clone this repository and create a corresponding conda environment 🐍.
❗ NOTE: For the PyTorch installation, please install the version appropriate for your hardware (see the official PyTorch installation instructions).

conda create -n tplm python=3.10
conda activate tplm
conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install scikit-learn==1.3.1
pip install -U "huggingface_hub[cli]"

We provide the environment.yml but recommend running the commands above instead of installing from the yml file.
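After installation, a quick sanity check can confirm that PyTorch sees your GPU. This is a minimal sketch; the reported versions depend on the PyTorch/CUDA build you chose above.

import torch
import sklearn

# Report installed versions and confirm the GPU is visible to PyTorch.
print("PyTorch:", torch.__version__)
print("scikit-learn:", sklearn.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))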

💻 Reproducing Results

❗ NOTE: These experiments were performed on an NVIDIA A6000 GPU with CUDA 12.3. Please note that exact reproducibility is not guaranteed across devices (see the PyTorch notes on reproducibility).

To reproduce the results from our study in sequential order, please follow the steps listed below.

  1. download_data_embs.sh
  2. run_tplm_benchmarks.sh
  3. run_embedding_fusion_benchmarks.sh
  4. run_ppi.sh
  5. run_cath.sh

1️⃣ Downloading Data and Embeddings

The data and embeddings are hosted on HuggingFace, and download_data_embs.sh uses huggingface-cli to download the necessary files.

❗ NOTE: Before running download_data_embs.sh, please add your HuggingFace token after the --token flag. Once added, run download_data_embs.sh.

huggingface-cli login --add-to-git-credential --token # Add your Huggingface token here 
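If you prefer the Python API over the CLI, huggingface_hub's snapshot_download can fetch the same files. This is only a sketch: the token, repo_id, and local_dir below are placeholders, and the actual repository IDs and paths are those referenced in download_data_embs.sh.

from huggingface_hub import login, snapshot_download

login(token="hf_...")  # placeholder; use your own HuggingFace token

# Placeholder repo_id/local_dir; see download_data_embs.sh for the actual repositories and paths.
snapshot_download(
    repo_id="<org>/<dataset-repo>",
    repo_type="dataset",
    local_dir="data/",
)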
Dataset Details

The datasets used in this study were created by the following authors:

Generating New Embeddings

We have provided sample scripts for generating embeddings for each protein language model (pLM) in the embedding_generation/ directory. To generate your own embeddings using the pLMs from this study, follow these steps:

  1. Clone the Repository:

    • Clone the repository of the pLM you intend to use, and follow the setup and environment instructions detailed in that pLM's repository.
  2. Generate Embeddings:

    • Copy the embedding generation script we provided in embedding_generation/ into the cloned pLM's directory. Each pLM has a different embedding generation script, so please make sure you use the appropriate one.
    • Execute these scripts within the pLM's environment and directory to generate new embeddings, and ensure that the outputs are directed to the appropriate location. A minimal example of this pattern is sketched below.
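As a concrete illustration of the pattern above (assuming a pLM that is loadable through HuggingFace Transformers, such as ESM2), the sketch below mean-pools per-residue representations into one vector per protein. It is not a copy of the scripts in embedding_generation/; each script follows its own pLM's API, and the pooling strategy may differ.

import torch
from transformers import AutoTokenizer, AutoModel

# Example checkpoint; the actual scripts in embedding_generation/ target each pLM's own codebase.
model_name = "facebook/esm2_t36_3B_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval().cuda()

sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # example protein sequence

with torch.no_grad():
    inputs = tokenizer(sequence, return_tensors="pt").to("cuda")
    hidden = model(**inputs).last_hidden_state       # (1, seq_len, hidden_dim)
    embedding = hidden.mean(dim=1).squeeze(0).cpu()  # mean-pool to a per-protein vector

torch.save(embedding, "example_embedding.pt")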

2️⃣ Benchmarking text-integrated protein language models against ESM2 3B

Run run_tplm_benchmarks.sh to train models for benchmarking tpLMs against ESM2 3B on AAV, GB1, GFP, Location, Meltome, and Stability.
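Each benchmark trains a predictor on frozen, precomputed embeddings for a given task. As a rough illustration of that protocol only (the actual architectures, splits, and metrics are defined by the code behind run_tplm_benchmarks.sh), a simple scikit-learn head on hypothetical embedding files would look like this:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Hypothetical files holding precomputed embeddings and labels for one task split.
X_train, y_train = np.load("train_embs.npy"), np.load("train_labels.npy")
X_test, y_test = np.load("test_embs.npy"), np.load("test_labels.npy")

# Fit a simple regression head on the frozen embeddings and score the held-out split.
head = Ridge(alpha=1.0).fit(X_train, y_train)
print("R^2:", r2_score(y_test, head.predict(X_test)))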


3️⃣ Evaluating embedding fusion

Run run_embedding_fusion_benchmarks.sh to train models for benchmarking embedding fusion with tpLMs on AAV, GB1, GFP, Location, Meltome, and Stability.
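Embedding fusion combines fixed embeddings from multiple pLMs into a single representation before training the downstream model. The simplest fusion is feature concatenation, sketched below on hypothetical embedding files; the exact fusion operation evaluated in the study is defined by the code behind run_embedding_fusion_benchmarks.sh.

import numpy as np

# Hypothetical per-protein embedding matrices from two pLMs, rows aligned to the same proteins.
esm2_embs = np.load("esm2_embs.npy")   # shape (n_proteins, d1)
tplm_embs = np.load("tplm_embs.npy")   # shape (n_proteins, d2)

# Concatenate along the feature axis to obtain fused (n_proteins, d1 + d2) representations.
fused = np.concatenate([esm2_embs, tplm_embs], axis=1)
np.save("fused_embs.npy", fused)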


4️⃣ Identifying optimal combinations and evaluating performance on protein-protein interaction prediction

Run run_ppi.sh to use the greedy heuristic to identify a promising combination of embeddings, then train models with all possible combinations of embeddings to identify the true best combination.
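A typical greedy heuristic for this kind of search is forward selection: start from the best single embedding and repeatedly add whichever remaining embedding most improves the validation score, stopping when nothing helps. The sketch below is a generic version of that idea; evaluate is a hypothetical callback standing in for training and scoring a PPI model on a fused embedding set, and the exact heuristic used in the study is implemented by the code behind run_ppi.sh.

def greedy_select(candidates, evaluate):
    # Generic greedy forward selection over embedding sources.
    # evaluate(subset) is a hypothetical callback returning a validation score for that fusion.
    selected, best_score, remaining = [], float("-inf"), list(candidates)
    while remaining:
        score, choice = max((evaluate(selected + [c]), c) for c in remaining)
        if score <= best_score:
            break  # no remaining embedding improves the validation score
        selected.append(choice)
        remaining.remove(choice)
        best_score = score
    return selected, best_score

# Toy usage with made-up scores keyed by embedding subset.
toy = {frozenset(k): v for k, v in [
    (("esm2",), 0.70), (("tplm_a",), 0.68), (("tplm_b",), 0.60),
    (("esm2", "tplm_a"), 0.74), (("esm2", "tplm_b"), 0.71),
    (("tplm_a", "tplm_b"), 0.69), (("esm2", "tplm_a", "tplm_b"), 0.73),
]}
print(greedy_select(["esm2", "tplm_a", "tplm_b"], lambda s: toy[frozenset(s)]))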


5️⃣ Identifying optimal combinations and evaluating performance on homologous sequence recovery

Run run_cath.sh to use the greedy heuristic to identify a promising combination of embeddings, then evaluate all possible combinations of embeddings to identify the true best combination.
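One common way to frame homologous sequence recovery is as nearest-neighbor retrieval in embedding space: for each protein, find its most similar neighbor and check whether it shares the query's CATH label. The sketch below does this with cosine similarity on hypothetical input files; the exact metric and protocol evaluated in the study are defined by the code behind run_cath.sh.

import numpy as np

# Hypothetical inputs: one (fused) embedding and one CATH label per protein, rows aligned.
embs = np.load("cath_embs.npy")                          # shape (n_proteins, d)
labels = np.load("cath_labels.npy", allow_pickle=True)   # CATH label per protein

# Cosine similarity between every pair of proteins.
normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
sims = normed @ normed.T
np.fill_diagonal(sims, -np.inf)  # exclude self-matches from the neighbor search

# Fraction of proteins whose nearest neighbor shares their CATH label.
nearest = sims.argmax(axis=1)
print("Nearest-neighbor recovery:", (labels[nearest] == labels).mean())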

