Walkthrough: Training MNIST on HPC
An end-to-end example of setting up and running a model training job with pytorch via slurm.
Specifically, this quickstart shows how to set up and run a job that trains a simple model on the ubiquitous MNIST dataset on one of HPC’s H100 nodes.
Before you begin…
Make sure you have an account with HPC. See the official documentation for details.
Logging on to HPC
First, log in to the HPC cluster. Details are provided in the official documentation, with one caveat: to access the H100s you must use one of two specific login nodes, login3 or login4:
$ ssh <your-username>@login3.hpc.caltech.edu
Follow the instructions to complete the login with two-factor authentication.
After successful login, you will be on the login node inside the home directory for your account:
[<your-username>@login3 ~]$ pwd
/home/<your-username>
Testing GPU access
The login node is not intended for computationally-intensive tasks, thus you will not have direct access to GPU resources there. To test GPU access, you’ll need to use one of the compute nodes.
One way to do so is to log onto one of the compute nodes interactively using
srun:
$ srun --pty -t 00:00:30 --partition=gpu --gres=gpu:h100:1 -N 1 -n 1 /bin/bash -l
As soon as your job is allocated, you should notice a change in the hostname
(the thing after the @ in your bash prompt) indicating that you are now on
a compute node, e.g.
[<your-username>@hpc-33-13 ~]$
The compute node will have access to the GPU resources requested via the --gres
flag: a single H100 card in this example.
You can run familiar commands to query the state of the GPU from the terminal
session on the compute node:
[<your-username>@hpc-33-13 ~]$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:E3:00.0 Off | 0 |
| N/A 24C P0 67W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Type exit to exit the interactive terminal session on the compute node.
Notice that the hostname reverts back to login3, reflecting that you are
back on the login node.
[<your-username>@hpc-33-13 ~]$ exit
logout
[<your-username>@login3 ~]$
Setting up a training job
The ability to start interactive sessions on compute nodes via srun is handy,
but it is not ideal for launching jobs.
Batch processing is a much better fit for real computational experiments. The
remainder of the quickstart shows how to set up, launch, and evaluate a model
training job with slurm.
Downloading the example
We’ll use the mnist example from a fork of the pytorch/examples repository
for this demo.
The reason we’re using a forked version is to better illustrate best practices when working on shared HPC systems: data and other components that require a lot of storage (e.g. saved models) should be kept separate from source code.
The original pytorch mnist example has hard-coded paths that save data to the source directory. We’ve modified the example to add user flags to the run script, allowing us to specify separate locations for storing data and training outputs.
Begin by downloading the source code (if you haven’t already):
Caution
Be sure you’re on the login node for this part!
$ mkdir -p ~/repos && cd $_
$ git clone https://github.com/rossbar/pytorch-examples.git
Setting up a virtual environment
For this workflow, we’ll use the Python built-in venv module to manage
environments.
First we’ll create a centralized location to keep our virtual environments:
$ mkdir -p ~/venvs
Then create the environment for this specific experiment. We’ll call it
mnist-example:
$ python3 -m venv ~/venvs/mnist-example
Enter the environment you’ve just created.
It should be empty, give-or-take libraries for packaging such as pip,
setuptools and/or wheel:
$ source ~/venvs/mnist-example/bin/activate
$ pip list
Package Version
---------- -------
pip 21.2.3
setuptools 53.0.0
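As an extra sanity check, Python itself can report whether it is running inside a virtual environment. The sketch below is not part of the original example: in_virtualenv is a hypothetical helper name, and the check relies on the standard-library behavior that sys.prefix differs from sys.base_prefix inside an environment created with venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was built from.
    return sys.prefix != sys.base_prefix


print("prefix:", sys.prefix)
print("inside a virtual environment:", in_virtualenv())
```

Run with the mnist-example environment activated, the reported prefix should be the ~/venvs/mnist-example directory.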
Installing dependencies
Caution
This step requires careful attention when using libraries that interface with
GPUs, such as pytorch.
In order to fully support the hardware, GPUs must be discoverable at
installation time, i.e. when you run pip install.
This means that you must install dependencies with device support (like pytorch)
on the compute nodes with allocated GPU(s).
There are multiple ways you can do so: one option is to include the dependency
installation step in your job script.
For illustrative purposes, we will instead create the environment interactively
using srun.
Note
Once the necessary dependencies have been properly installed in an environment, it can be reused without any additional installation by simply activating it again.
For clarity’s sake, exit the mnist-example environment on the login node:
$ deactivate
Now, request an interactive terminal on a compute node with at least one GPU allocated:
$ srun --pty -t 00:30:00 --partition=gpu --gres=gpu:h100:1 -N 1 -n 1 /bin/bash -l
Notice the increased wall time (-t 00:30:00) - it’s good to give yourself
some leeway here in case the downloading/installation takes longer than expected.
Once the interactive session on the compute node starts:
$ source ~/venvs/mnist-example/bin/activate # Enter the virtual environment
$ cd ~/repos/pytorch-examples/mnist # Go to the source repo where deps are specified
$ pip install -r requirements.txt ipython # Install the deps (and ipython)
Once this has completed, you can test the successful installation:
$ ipython
In [1]: import torch.nn as nn
If the installation completed correctly, you shouldn’t see any exceptions at
import time.
If you get an ImportError or ModuleNotFoundError, it means something went
wrong while installing pytorch.
Consider opening an issue in this repo!
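A successful import shows that the package is installed, but not that it can see the GPU. While the interactive session is still open, a slightly stronger check is possible. The following is a minimal sketch, not part of the original example: describe_gpu is a hypothetical helper name, and the torch import is done lazily so that the script also gives a readable answer in an environment where pytorch is missing.

```python
import importlib.util


def describe_gpu() -> str:
    """Summarize whether pytorch is importable and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"

    import torch  # imported lazily so a missing install is reported above

    if not torch.cuda.is_available():
        return "torch imported, but no CUDA device is visible"
    # Device index 0 is the first GPU allocated to this session.
    return f"torch {torch.__version__} sees: {torch.cuda.get_device_name(0)}"


print(describe_gpu())
```

On the login node the same script is expected to report that no CUDA device is visible, since GPUs are only allocated on compute nodes.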
Once the installation is complete, you can close your interactive session on
the compute node with exit.
Preparing for the run
Now that the virtual environment is set up with all the packages needed to run our job, the focus shifts to making final preparations for the run. In this case, this means ensuring that we have the dataset that we’ll be training on, and that all of the paths for loading/saving data have been set up.
First, ensure you’re on the login node, and activate the virtual environment:
Note
Remember, once the environment is successfully created on the compute node, it can be entered from anywhere (though GPU-specific features will only work on nodes with GPU resources allocated to them!)
$ source ~/venvs/mnist-example/bin/activate
The mnist example provides a command-line interface using argparse, therefore
we can learn about the options by passing in the --help flag:
$ cd ~/repos/pytorch-examples/mnist
$ python main.py --help
usage: main.py [-h] [--data-path DATA_PATH] [--batch-size N] [--test-batch-size N]
[--epochs N] [--lr LR] [--gamma M] [--no-cuda] [--no-mps] [--dry-run]
[--seed S] [--log-interval N] [--save-model]
[--model-save-path MODEL_SAVE_PATH]
PyTorch MNIST Example
optional arguments:
-h, --help show this help message and exit
--data-path DATA_PATH
path to MNIST data (default: "../data")
--batch-size N input batch size for training (default: 64)
--test-batch-size N input batch size for testing (default: 1000)
--epochs N number of epochs to train (default: 14)
--lr LR learning rate (default: 1.0)
--gamma M Learning rate step gamma (default: 0.7)
--no-cuda disables CUDA training
--no-mps disables macOS GPU training
--dry-run quickly check a single pass
--seed S random seed (default: 1)
--log-interval N how many batches to wait before logging training status
--save-model For Saving the current Model
--model-save-path MODEL_SAVE_PATH
path to which model will be saved (default: cwd)
The most important options are the --data-path and --model-save-path flags,
which determine where the training data and training output will be stored.
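For readers curious how such path flags are typically wired up, here is a minimal argparse sketch. It is an illustration only and not the code from the forked repository: the flag names and defaults are copied from the help output above, build_parser is a hypothetical name, and /tmp/mnist_example is a placeholder path.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the storage-related flags."""
    parser = argparse.ArgumentParser(description="MNIST path flags (sketch)")
    parser.add_argument("--data-path", type=Path, default=Path("../data"),
                        help="path to MNIST data")
    parser.add_argument("--save-model", action="store_true",
                        help="save the trained model")
    parser.add_argument("--model-save-path", type=Path, default=Path.cwd(),
                        help="path to which the model will be saved")
    return parser


# Placeholder locations; on the cluster these would live on group storage.
args = build_parser().parse_args([
    "--data-path", "/tmp/mnist_example",
    "--save-model",
    "--model-save-path", "/tmp/mnist_example/model",
])
print(args.data_path, args.save_model, args.model_save_path)
```

Keeping both locations as command-line arguments means the same source tree can be pointed at different storage without editing any code.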
As a general rule of thumb, you should never store data in your home directory.
Your home directory on HPC is capped at 50GB by default, and is not intended
for storage/access of large data.
Use it only for source code and virtual environments.
For more details on storage allocations on HPC, including central storage and
scratch space, see the official documentation.
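To get a rough idea of how much space a directory occupies, the file sizes below it can be summed. This is a minimal sketch under stated assumptions: directory_size_bytes is a hypothetical helper, it reports apparent file sizes, and the number may differ from what the cluster’s own quota accounting reports.

```python
import os
import tempfile
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files below root, skipping symlinks."""
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue  # do not count the target of a file symlink
            try:
                total += path.stat().st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "weights.bin").write_bytes(b"\0" * 2048)
    demo_size = directory_size_bytes(Path(tmp))
print(f"{demo_size} bytes")
```

Calling directory_size_bytes(Path.home()) gives a figure to compare against the 50GB cap, though walking a large home directory can take a while on a shared filesystem.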
For this simple example, the downloading of the training data is handled for
us by torchvision.datasets, so all we have to do is create a directory where
data will be stored.
Since the MNIST dataset is so small (~60MB) we don’t have to worry about
storage utilization.
Go ahead and create a directory on your research group’s central storage
partition (if you haven’t already):
$ mkdir -p /central/groups/<your-research-group-name>/<your-username>/mnist_example
If you’re not sure what <your-research-group-name> is, try ls /central/groups
and see if there are any obvious candidates (e.g. your professor’s last name).
Otherwise, ask the person from your group who gave you HPC access!
Now we have everything we need to run the job.
Submitting the job
We’ll use slurm’s sbatch command to submit our job to the scheduler.
To do so, we first need to write the script that describes our job.
We have two main steps:
1. Enter the virtual environment
2. Run main.py
Create a new bash script somewhere in your home directory called train-mnist.sh
and paste the following into it:
#! /bin/bash
VENV=$HOME/venvs/mnist-example
SRCDIR=$HOME/repos/pytorch-examples/mnist
EXPERIMENT_DIR=/central/groups/<your-research-group-name>/<your-username>/mnist_example
# Activate Python virtual environment
source $VENV/bin/activate
# Sanity check: list out Python packages in current environment
pip list
# Run the job
python $SRCDIR/main.py \
--data-path $EXPERIMENT_DIR \
--save-model \
--model-save-path $EXPERIMENT_DIR/model
Be sure to replace <your-research-group-name> and <your-username>.
sbatch should be smart enough to assume that whatever job script you pass it
is executable, but it’s always good to be explicit:
$ chmod u+x train-mnist.sh
Now we’re ready to submit the job. This is a very small experiment, so we only set the wall time to 10 minutes and only request a single GPU.
$ sbatch -t 00:10:00 --partition=gpu --gres=gpu:h100:1 -N 1 -n 8 train-mnist.sh
Submitted batch job <job-id>
Note
The sbatch options can be included in the job script directly with #SBATCH.
See the official HPC docs
for details.
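As a rough illustration of what that looks like, the sketch below renders a job script whose header carries the same options as the sbatch command line above. It makes several assumptions: render_job_script, SBATCH_OPTIONS and JOB_BODY are hypothetical names, the long option names (--time, --nodes, --ntasks) are the standard long forms of -t, -N and -n, and the body is abbreviated rather than a full copy of train-mnist.sh.

```python
# Long-form equivalents of the command-line options used in this walkthrough.
SBATCH_OPTIONS = {
    "time": "00:10:00",
    "partition": "gpu",
    "gres": "gpu:h100:1",
    "nodes": "1",
    "ntasks": "8",
}

# Abbreviated body; the real script continues as shown earlier.
JOB_BODY = """\
source $HOME/venvs/mnist-example/bin/activate
# ... rest of train-mnist.sh ...
"""


def render_job_script(options: dict, body: str) -> str:
    """Return a bash job script with one #SBATCH line per option."""
    # The directives go right after the shebang: slurm only reads
    # #SBATCH lines that appear before the first executable command.
    header = ["#!/bin/bash"]
    header += [f"#SBATCH --{name}={value}" for name, value in options.items()]
    return "\n".join(header) + "\n\n" + body.rstrip("\n") + "\n"


script = render_job_script(SBATCH_OPTIONS, JOB_BODY)
print(script)
```

With the directives stored in the file, submission reduces to sbatch train-mnist.sh, and any option given on the command line still takes precedence over the one in the script.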
sbatch returns the <job-id> of the submitted job, which can be used to look
up the status of the job:
$ scontrol show job <job-id>
The current status of the job (e.g. PENDING, RUNNING, FAILED, COMPLETED, etc.)
is indicated in the JobState field.
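If you would rather check the state from a script than read the full scontrol output, the relevant fields can be extracted with a regular expression. This is a minimal sketch: the helper names are hypothetical, and SAMPLE is a trimmed, made-up snippet in the key=value style that scontrol prints, not real output from HPC.

```python
import re
import subprocess
from typing import Optional


def parse_job_id(sbatch_output: str) -> Optional[int]:
    """Extract the job id from the line that sbatch prints on submission."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


def parse_job_state(scontrol_output: str) -> Optional[str]:
    """Extract the value of the JobState field, e.g. PENDING or RUNNING."""
    match = re.search(r"\bJobState=(\w+)", scontrol_output)
    return match.group(1) if match else None


def query_job_state(job_id: int) -> Optional[str]:
    """Ask scontrol for the job's state; only works where slurm is installed."""
    result = subprocess.run(
        ["scontrol", "show", "job", str(job_id)],
        capture_output=True, text=True, check=True,
    )
    return parse_job_state(result.stdout)


# Illustrative text only (made-up job id and fields).
SAMPLE = "JobId=12345 JobName=train-mnist.sh\n   JobState=RUNNING Reason=None\n"
print(parse_job_id("Submitted batch job 12345"))
print(parse_job_state(SAMPLE))
```

On the cluster, query_job_state(<job-id>) would wrap the scontrol call shown above; the parsing functions can be tried anywhere since they only operate on text.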
stdout and stderr for batch jobs are written to a text file named slurm-<job-id>.out
in the directory from which you submitted the job (by default).
You can monitor the output of your job with a file pager:
$ less slurm-<job-id>.out
Checking the results
Assuming the job completed successfully[1], there should be a model saved to
the path specified by --model-save-path.
We can verify this from the login node:
$ source ~/venvs/mnist-example/bin/activate
$ ipython
In [1]: import torch
In [2]: model_path = "/central/groups/<your-research-group-name>/<your-username>/mnist_example/model/mnist_cnn.pt"
In [3]: model = torch.load(model_path, map_location=torch.device("cpu"))
In [4]: model.keys()
Out[4]: odict_keys(['conv1.weight', 'conv1.bias', 'conv2.weight', 'conv2.bias', 'fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias'])
Of course, if you actually wanted to use the model for inference, you’d want to set up another slurm job and run it on the compute nodes!