Accessing GPUs on an HPC Cluster (Paramganga)
To start a GPU session from the terminal right away:
srun -p gpu --ntasks=1 --gres=gpu:2 --time=1-00:00:00 --pty bash -i
Basics
- Use tmux sessions and remember which login node you are on.
- Allocate a GPU inside a tmux session.
- Then, in other tmux windows, ssh into the allotted GPU node (sketched below) to get multiple bash instances open, e.g.
ssh gpu020
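A minimal sketch of this pattern, assuming the allotted node is gpu020 and the session is named my_session:
# window 1: start a tmux session on the login node, then request GPUs
tmux new -s my_session
srun -p gpu --gres=gpu:2 --ntasks=1 --time=1-00:00:00 --pty bash -i
hostname            # note the allotted node, e.g. gpu020
# window 2 (Ctrl-b c): ssh into the same node for a second shell
ssh gpu020
nvidia-smi          # confirm the GPUs are visible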
Transfer Files
Use rsync instead of scp; it's much faster for repeated transfers since it only sends files that have changed.
rsync -avz ~/Downloads/hwd_df.csv <ssh_hostname>:/home/testuser/amz_ml_2024/
rsync -avz <ssh_hostname>:/home/testuser/workspace/project1/files ~/Downloads/
scp
To copy files from local to HPC or from HPC to local:
scp -r /home/testuser testuser@<local IP>:/dir/dir/file
copies the files to your system. Alternatively, run
scp <ssh_hostname>:/path/to/remote/file /path/to/local/file
in your local terminal (given that an SSH key is set up; if not, use remote_user@remote_host and enter the password).
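When connecting from outside the IITR network (where port 4422 is used, see the SSH setup below) and no ~/.ssh/config alias is in place, you may need to pass the port explicitly; a hedged sketch:
# rsync takes the SSH port via -e
rsync -avz -e "ssh -p 4422" ~/Downloads/hwd_df.csv testuser@paramganga.iitr.ac.in:/home/testuser/amz_ml_2024/
# scp uses a capital -P for the port
scp -P 4422 -r testuser@paramganga.iitr.ac.in:/home/testuser/workspace/project1/files ~/Downloads/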
Using terminal
- Run
ssh {username}@paramganga.iitr.ac.in
Add the port with -p 4422 if you are not on the IITR network, or use VS Code. Optionally add -o UserKnownHostsFile=/dev/null.
- Enter your password.
- Start a tmux session:
tmux new -s my_session
or simply tmux.
- Get a GPU node allotted, inside the tmux session, with
srun --nodes=1 --gres=gpu:2 --partition=gpu --ntasks-per-node=16 --time=1-00:00:00 --pty bash -i
You may add --exclusive.
- Detach from the tmux session and remember the login node.
- After the process has run, go back to that login node and attach to that tmux session. Use
squeue
to list the running jobs (see the sketch after this list).
- Enter
exit
to log out.
- Make sure to remove stale refs from ~/.ssh/known_hosts using
ssh-keygen -R paramganga.iitr.ac.in
or by specifying -o UserKnownHostsFile=/dev/null, or by modifying the config file.
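A minimal sketch of getting back to a running session after logging in again (the session name my_session is carried over from above):
ssh -p 4422 {username}@paramganga.iitr.ac.in   # land on the same login node you started on
squeue -u $USER                                # confirm the allocation/job is still running
tmux ls                                        # list tmux sessions on this login node
tmux attach -t my_session                      # reattach to the session holding the srun shell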
GPU
The nodes have old NVIDIA drivers with CUDA toolkit 11.6.
Install PyTorch and related CUDA libraries built for cu118 (CUDA 11.x minor-version compatibility means the cu118 builds still run on the 11.6 driver).
Pin that version in conda; don't keep tampering with your conda env.
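For example, a minimal sketch of setting up such an env (the env name and Python/PyTorch versions are placeholders; pin whatever you settle on):
conda create -n torch-cu118 python=3.10
conda activate torch-cu118
# cu118 wheels ship their own CUDA runtime and only need a compatible driver
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"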
Shell Script runs
run.sh
#!/bin/bash
#SBATCH --job-name=j1
#SBATCH --nodes=1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00:00
#SBATCH --mail-type=BEGIN
#SBATCH --mail-user=<your_email>    # note: variables like $USER are not expanded inside #SBATCH directives
#SBATCH --output=%x-%j.out
# sleep 100
# papermill notebook.ipynb notebook_o.ipynb
# CUDA_VISIBLE_DEVICES=0,1 python main.py
sbatch run.sh
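Once sbatch run.sh prints the job id, you can roughly monitor it like this (<jobid> is a placeholder for that id):
squeue -u $USER               # check queue position and the allotted node
tail -f j1-<jobid>.out        # follow the output file (--output=%x-%j.out expands to jobname-jobid)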
module
module avail
to list all available packages/modules, including those provided via Spack
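A typical module workflow looks roughly like this (the module name cuda/11.6 is only illustrative; pick an actual one from module avail):
module avail            # list everything installed, including Spack-provided modules
module load cuda/11.6   # load a module by name/version (illustrative name)
module list             # show what is currently loaded
module purge            # unload everything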
slurm and other account mgmt
passwd
to change the password
sinfo -a
to view all partitions and their node states
squeue -u $USER --start
to list your jobs with the approximate time at which each would start
squeue -p gpu
to view all GPU jobs
sacct
to view your job status / history
scancel <jobid>
to terminate a queued or running job
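For example, a quick way to inspect and clean up your own jobs (<jobid> is a placeholder):
squeue -u $USER                                          # find the job id and its state
sacct -j <jobid> --format=JobID,JobName,State,Elapsed    # accounting/history for that job
scancel <jobid>                                          # cancel it if it is stuck or no longer needed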
TMUX
- On the command prompt, type
tmux
(or tmux new -s my_session for a named session).
- Run the desired program.
- Use the key sequence
Ctrl-b + d
to detach from the session.
- Use
tmux ls
to list all sessions.
- Reattach to the tmux session by typing
tmux attach-session -t 0
- Use
tmux kill-session -t 0
to kill that session.
Passwordless Login
Follow the setup steps below to log in without a password. Use passwd to change your password.
1st time setup hpc
Add the following script to /home/testuser/.bashrc.
Follow the Conda and Spack Installation Guide to set up Miniconda in /home/$USER/miniconda3.
Set up SSH using an RSA key for passwordless login.
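A minimal sketch of getting the public key onto the cluster, assuming the ~/.ssh/mykey pair generated in the local setup below and port 4422:
# run on your local machine; appends the public key to ~/.ssh/authorized_keys on the cluster
ssh-copy-id -i ~/.ssh/mykey.pub -p 4422 {username}@paramganga.iitr.ac.in
# then test the passwordless login
ssh -i ~/.ssh/mykey -p 4422 {username}@paramganga.iitr.ac.in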
1st time setup local
Add this to your .bashrc:
ssh-keygen -f "/c/Users/msing/.ssh/known_hosts" -R "paramganga.iitr.ac.in" &> /dev/null
ssh-keygen -R '[paramganga.iitr.ac.in]:4422'
Add this to your ~/.ssh/config file:
Port 4422
You might need to manually remove the old host key from the ~/.ssh/known_hosts file in some situations. A fuller host entry is sketched below.
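For example, a host entry along these lines (the alias paramganga, the username, and the IdentityFile path are assumptions; adjust to your setup):
Host paramganga
    HostName paramganga.iitr.ac.in
    Port 4422
    User testuser
    IdentityFile ~/.ssh/mykey
With this in place, ssh paramganga works directly, and it can serve as the <ssh_hostname> alias in the rsync/scp examples above.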
Generate an SSH key on your PC and copy the public key to the cluster.
Make sure to use a custom comment (-C) so the name of your laptop stays hidden, and set no passphrase:
ssh-keygen -m PEM -f ~/.ssh/mykey -C "local@default"
For .pem files (e.g. AWS), run chmod 400 file.pem.
Check out more details and configurations in my dotfiles repo.
More References
- Conda and Spack Installation Guide (ORNL)
- UVA Deep Learning Cluster Tutorial
- Northeastern HPC Spack Documentation
- NMSU Discovery Cluster: SLURM GPU Jobs
- How to Mount a Remote File System Locally (Stack Overflow)
- iTerm2 tmux Integration Documentation
- NYU HPC SLURM Tutorial
FairShare / Priority
Regular use decreases your FairShare score, which increases the wait time for your next allocation. Hence, use the GPUs responsibly.
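You can roughly check where you stand with standard Slurm tools (availability and output columns depend on the site's configuration):
sshare -u $USER    # show your FairShare value within your account
sprio -u $USER     # show the priority components of your pending jobs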