Accessing GPUs on an HPC Cluster (Paramganga)
To start a GPU session from the terminal right away:
srun -p gpu --ntasks=1 --gres=gpu:2 --time=1-00:00:00 --pty bash -i
Basics
- Use tmux sessions and remember which login node you are on.
- Allocate a GPU inside a tmux session.
- Then, in other tmux windows, ssh into the allotted GPU node (sketched below) to get multiple bash instances open, e.g.
ssh gpu020
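A minimal sketch of this pattern, assuming the allotted node is gpu020 and the session is named my_session:
# window 1: start a tmux session on the login node, then request GPUs
tmux new -s my_session
srun -p gpu --gres=gpu:2 --ntasks=1 --time=1-00:00:00 --pty bash -i
hostname            # note the allotted node, e.g. gpu020
# window 2 (Ctrl-b c): ssh into the same node for a second shell
ssh gpu020
nvidia-smi          # confirm the GPUs are visible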
Transfer Files
Use rsync instead of scp; it's much faster for repeated transfers since it only sends files that have changed.
rsync -avz ~/Downloads/hwd_df.csv <ssh_hostname>:/home/testuser/amz_ml_2024/
rsync -avz <ssh_hostname>:/home/testuser/workspace/project1/files ~/Downloads/
scp
To copy files from local to HPC or from HPC to local:
scp -r /home/testuser testuser@<local IP>:/dir/dir/file
copies the files to your system. Alternatively, run
scp <ssh_hostname>:/path/to/remote/file /path/to/local/file
in your local terminal (given that an SSH key is set up; if not, use remote_user@remote_host and enter the password).
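When connecting from outside the IITR network (where port 4422 is used, see the SSH setup below) and no ~/.ssh/config alias is in place, you may need to pass the port explicitly; a hedged sketch:
# rsync takes the SSH port via -e
rsync -avz -e "ssh -p 4422" ~/Downloads/hwd_df.csv testuser@paramganga.iitr.ac.in:/home/testuser/amz_ml_2024/
# scp uses a capital -P for the port
scp -P 4422 -r testuser@paramganga.iitr.ac.in:/home/testuser/workspace/project1/files ~/Downloads/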
Using terminal
- Run
ssh {username}@paramganga.iitr.ac.in
Add the port with -p 4422 if you are not on the IITR network, or use VS Code. Optionally add -o UserKnownHostsFile=/dev/null.
- Enter your password.
- Start a tmux session:
tmux new -s my_session
or simply tmux.
- Get a GPU node allotted, inside the tmux session, with
srun --nodes=1 --gres=gpu:2 --partition=gpu --ntasks-per-node=16 --time=1-00:00:00 --pty bash -i
You may add --exclusive.
- Detach from the tmux session and remember the login node.
- After the process has run, go back to that login node and attach to that tmux session. Use
squeue
to list the running jobs (see the sketch after this list).
- Enter
exit
to log out.
- Make sure to remove stale refs from ~/.ssh/known_hosts using
ssh-keygen -R paramganga.iitr.ac.in
or by specifying -o UserKnownHostsFile=/dev/null, or by modifying the config file.
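A minimal sketch of getting back to a running session after logging in again (the session name my_session is carried over from above):
ssh -p 4422 {username}@paramganga.iitr.ac.in   # land on the same login node you started on
squeue -u $USER                                # confirm the allocation/job is still running
tmux ls                                        # list tmux sessions on this login node
tmux attach -t my_session                      # reattach to the session holding the srun shell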
GPU
The nodes have old NVIDIA drivers with CUDA toolkit 11.6.
Install PyTorch and related CUDA libraries built for cu118 (CUDA 11.x minor-version compatibility means the cu118 builds still run on the 11.6 driver).
Pin that version in conda; don't keep tampering with your conda env.
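For example, a minimal sketch of setting up such an env (the env name and Python/PyTorch versions are placeholders; pin whatever you settle on):
conda create -n torch-cu118 python=3.10
conda activate torch-cu118
# cu118 wheels ship their own CUDA runtime and only need a compatible driver
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"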
Shell Script runs
run.sh
#!/bin/bash
#SBATCH --job-name=j1
#SBATCH --nodes=1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00:00
#SBATCH --mail-type=BEGIN
#SBATCH --mail-user=<your_email>    # note: variables like $USER are not expanded inside #SBATCH directives
#SBATCH --output=%x-%j.out
# sleep 100
# papermill notebook.ipynb notebook_o.ipynb
# CUDA_VISIBLE_DEVICES=0,1 python main.py
sbatch run.sh
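Once sbatch run.sh prints the job id, you can roughly monitor it like this (<jobid> is a placeholder for that id):
squeue -u $USER               # check queue position and the allotted node
tail -f j1-<jobid>.out        # follow the output file (--output=%x-%j.out expands to jobname-jobid)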
module
module avail
to list all available packages/modules, including those provided via Spack
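A typical module workflow looks roughly like this (the module name cuda/11.6 is only illustrative; pick an actual one from module avail):
module avail            # list everything installed, including Spack-provided modules
module load cuda/11.6   # load a module by name/version (illustrative name)
module list             # show what is currently loaded
module purge            # unload everything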
slurm and other account mgmt
passwd
to change the password
sinfo -a
to view all partitions and their node states
squeue -u $USER --start
to list your jobs with the approximate time at which each would start
squeue -p gpu
to view all GPU jobs
sacct
to view your job status / history
scancel <jobid>
to terminate a queued or running job
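For example, a quick way to inspect and clean up your own jobs (<jobid> is a placeholder):
squeue -u $USER                                          # find the job id and its state
sacct -j <jobid> --format=JobID,JobName,State,Elapsed    # accounting/history for that job
scancel <jobid>                                          # cancel it if it is stuck or no longer needed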
TMUX
- On the command prompt, type
tmux
(or tmux new -s my_session for a named session).
- Run the desired program.
- Use the key sequence
Ctrl-b + d
to detach from the session.
- Use
tmux ls
to list all sessions.
- Reattach to the tmux session by typing
tmux attach-session -t 0
- Use
tmux kill-session -t 0
to kill that session.
Passwordless Login
Follow the setup steps below to log in without a password. Use passwd to change your password.
1st time setup hpc
Add the following script to /home/testuser/.bashrc.
Follow the Conda and Spack Installation Guide to set up Miniconda in /home/$USER/miniconda3.
Set up SSH using an RSA key for passwordless login.
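A minimal sketch of getting the public key onto the cluster, assuming the ~/.ssh/mykey pair generated in the local setup below and port 4422:
# run on your local machine; appends the public key to ~/.ssh/authorized_keys on the cluster
ssh-copy-id -i ~/.ssh/mykey.pub -p 4422 {username}@paramganga.iitr.ac.in
# then test the passwordless login
ssh -i ~/.ssh/mykey -p 4422 {username}@paramganga.iitr.ac.in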
1st time setup local
Add this to your .bashrc:
ssh-keygen -f "/c/Users/msing/.ssh/known_hosts" -R "paramganga.iitr.ac.in" &> /dev/null
ssh-keygen -R '[paramganga.iitr.ac.in]:4422'
Add this to your ~/.ssh/config file:
Port 4422
You might need to manually remove the old host key from the ~/.ssh/known_hosts file in some situations. A fuller host entry is sketched below.
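For example, a host entry along these lines (the alias paramganga, the username, and the IdentityFile path are assumptions; adjust to your setup):
Host paramganga
    HostName paramganga.iitr.ac.in
    Port 4422
    User testuser
    IdentityFile ~/.ssh/mykey
With this in place, ssh paramganga works directly, and it can serve as the <ssh_hostname> alias in the rsync/scp examples above.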
Generate an SSH key on your PC and copy the public key to the cluster.
Make sure to use a custom comment (-C) so the name of your laptop stays hidden, and set no passphrase:
ssh-keygen -m PEM -f ~/.ssh/mykey -C "local@default"
For .pem files (e.g. AWS), run chmod 400 file.pem.
Check out more details and configurations in my dotfiles repo.
More References
- Conda and Spack Installation Guide (ORNL)
- UVA Deep Learning Cluster Tutorial
- Northeastern HPC Spack Documentation
- NMSU Discovery Cluster: SLURM GPU Jobs
- How to Mount a Remote File System Locally (Stack Overflow)
- iTerm2 tmux Integration Documentation
- NYU HPC SLURM Tutorial
FairShare / Priority
Regular use decreases your FairShare score, which increases the wait time for your next allocation. Hence, use the GPUs responsibly.
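You can roughly check where you stand with standard Slurm tools (availability and output columns depend on the site's configuration):
sshare -u $USER    # show your FairShare value within your account
sprio -u $USER     # show the priority components of your pending jobs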