A minimal working setup for training ML models with PyTorch using DDP across multiple nodes.

DDP Starter

This is a starter project for distributed deep learning with PyTorch and Slurm.
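Under Slurm, each launched task becomes one DDP process, and every process needs a unique global rank. The sketch below shows how that rank is typically derived from standard Slurm environment variables (`SLURM_NODEID`, `SLURM_LOCALID`, `SLURM_NTASKS_PER_NODE`, `SLURM_NTASKS`); the helper function is illustrative, not part of this repository:

```python
def slurm_rank_info(env: dict) -> dict:
    """Derive DDP rank info from Slurm environment variables.

    In multi-node DDP every process needs a unique global rank:
    global_rank = node_id * tasks_per_node + local_id.
    """
    node_id = int(env["SLURM_NODEID"])          # which node this task runs on
    local_id = int(env["SLURM_LOCALID"])        # task index within the node
    tasks_per_node = int(env["SLURM_NTASKS_PER_NODE"])
    world_size = int(env["SLURM_NTASKS"])       # total number of processes
    return {
        "global_rank": node_id * tasks_per_node + local_id,
        "local_rank": local_id,
        "world_size": world_size,
    }

# Example: second task on the second node of a 2-node x 2-task job
info = slurm_rank_info({
    "SLURM_NODEID": "1",
    "SLURM_LOCALID": "1",
    "SLURM_NTASKS_PER_NODE": "2",
    "SLURM_NTASKS": "4",
})
print(info)  # global_rank 3 of world_size 4
```

Launchers such as PyTorch Lightning read these variables for you, so the script itself does not need this logic; it only shows how tasks map to ranks.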

Prerequisites

Installation

Clone the latest code from GitHub:

git clone git@github.com:sadamov/ddp_starter.git
cd ddp_starter

Create a new conda environment and install dependencies:

mamba env create -f environment.yml

Usage

Submit the test job to Slurm:

sbatch test_slurm.sh

Then inspect the logs in ./lightning_logs to verify that the run succeeded. The metrics.csv file contains the training and validation losses for every epoch.
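PyTorch Lightning's CSV logger writes one row per logged step or epoch, so the final losses can be pulled out with the standard library. A minimal sketch, using an in-memory stand-in for the real file (the column names `train_loss` and `val_loss` depend on what the model logs, so treat them as assumptions):

```python
import csv
import io

# Stand-in for a file like ./lightning_logs/version_0/metrics.csv;
# the actual column names depend on what the LightningModule logs.
sample = io.StringIO(
    "epoch,train_loss,val_loss\n"
    "0,1.02,0.98\n"
    "1,0.61,0.55\n"
    "2,0.34,0.31\n"
)

rows = list(csv.DictReader(sample))
last = rows[-1]  # final epoch row
print(f"epoch {last['epoch']}: "
      f"train_loss={last['train_loss']} val_loss={last['val_loss']}")
```

Swapping the `StringIO` for `open("lightning_logs/version_0/metrics.csv")` reads the real log.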

A final log message of

`Trainer.fit` stopped: `max_epochs=10` reached.

indicates that the run completed successfully.

For real training runs, you will need to adjust batch_size and num_workers in the dataloaders of the IrisDataModule class to make the best use of the available GPU and CPU resources.
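When tuning batch_size, keep in mind that under DDP each process draws its own batches, so the effective global batch is batch_size times the world size, and a DistributedSampler splits the dataset across processes. A quick stdlib sketch of the resulting steps per epoch (the 150-sample figure is the full Iris dataset; the 4-process job is an assumed example):

```python
import math

def steps_per_epoch(num_samples: int, batch_size: int, world_size: int) -> int:
    """Number of batches each DDP process runs per epoch.

    A DistributedSampler partitions the dataset across world_size
    processes, so each one sees roughly num_samples / world_size examples.
    """
    per_process = math.ceil(num_samples / world_size)
    return math.ceil(per_process / batch_size)

# Iris has 150 samples; assume a 4-process job (e.g. 2 nodes x 2 GPUs).
print(steps_per_epoch(150, batch_size=16, world_size=4))  # 3 steps per epoch
```

With very small datasets like Iris, large batch sizes combined with many processes can leave each worker with only a handful of steps per epoch, which is one reason these values need tuning for real workloads.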
