Hyperparameter Grid-Search on SLURM Clusters

About: Why should you use a SLURM cluster? You should use it if you:

  • Need to run code for many hours, and it’s too annoying to run it on your laptop
  • Need to run a bunch of code in parallel (e.g. fit a model using 100 different hyper-parameters)
  • Need access to a variety of computing hardware (e.g. GPUs)

Since resources on computing clusters are shared, it’s important to learn how to use the resources respectfully.

Overview

Running Code on a SLURM Cluster

Broadly, SLURM allows you to run two types of jobs: interactive shells and batch jobs. An “interactive shell” allows you to run your code and interact with it (e.g. you can see its output printed to your screen, you can interrupt the execution using Control-C, etc.). A batch job, on the other hand, is a job that you do not directly interact with: you tell the job manager (SLURM) what to execute and where to store its output, and the job manager allocates a machine for you and runs the job on it. You can still kill a batch job, but only by interacting with the job manager.

Typically, you would want to use an interactive job in the following cases:

  • You only need to run a few things
  • You want to interact with your job (e.g. use the debugger)

On the other hand, you would likely want a batch job when you need to run many things simultaneously. We’ll describe below how to run each type of job.

Running an “Interactive Shell” Job

To run an interactive shell, simply use the command srun -p PARTITION --pty --mem MEMORY -t TIME bash, where PARTITION specifies which group of machines (partition) your shell will run on, MEMORY (in MB) specifies how much memory to give your shell, and TIME (in D-HH:MM) specifies how long your job can run before it is automatically killed. For example, you can run srun -p test --pty --mem 5000 -t 0-02:00 bash.

Once you’ve executed the above command, you’ll be logged into a new node that you can run whatever you like on. When you are done using the interactive shell, you can type exit.

Note: if you lose network connection or close your terminal, your interactive shell will immediately be killed (which is super inconvenient!). To deal with this, we recommend you use tmux (described below).

Running Batch Jobs

To run a batch job, you first need to write a SLURM script. This script tells the job manager how to run your job and how many resources to allocate to it. Here’s an example script:

#!/bin/bash
#SBATCH -n 1 # Number of cores
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 0-10:00 # Runtime in D-HH:MM
#SBATCH -p serial_requeue # Partition to submit to
#SBATCH --mem=5000 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o OUTPUTDIR/out_%j.txt # File to which STDOUT will be written
#SBATCH -e OUTPUTDIR/err_%j.txt # File to which STDERR will be written

python your_code.py # how to run your job

Notice that it contains nearly the same information as the command used to spawn an interactive shell. To actually submit the job to the job manager, run sbatch PATH-TO-SLURM-SCRIPT. Once you have done that, you can check on your job’s progress by looking at the files to which STDOUT and STDERR are written, or by using SLURM commands (see the useful commands below).

Now, if you wanted to submit many jobs, each (for instance) training a different model on a different dataset using different hyper-parameters, you would need to generate these SLURM scripts programmatically. Since this is annoying to write, we’ve provided code that takes care of the annoying bits for you (see below).

Some Useful SLURM Commands

If you have a job running (as an interactive shell or as a batch job), here are some useful commands:

  • Check on job status: sacct
  • Cancel all jobs you submitted: scancel -u USERNAME
  • Cancel a specific job: scancel JOBID
  • List all jobs on a specific partition: showq-slurm -l -o -p PARTITION (you can then grep for your username).
  • Alias LIST_JOBS for listing jobs in a user-friendly manner: alias LIST_JOBS='sacct -X --format=JobId,JobName%64,Partition,Start,Elapsed,Timelimit,State'
  • Alias STOP_RUNNING for cancelling all your running jobs: alias STOP_RUNNING='sacct -X --format=JobId,State | grep RUNNING | cut -c -8 | xargs scancel'
  • Alias STOP_PENDING for cancelling all your pending jobs: alias STOP_PENDING='sacct -X --format=JobId,State | grep PENDING | cut -c -8 | xargs scancel'

Additional Tips

  1. Separate your code into two parts: (1) a script that trains models (on different datasets, with different parameters) and stores the resulting models to disk (e.g. using a pickle file), and (2) a script that reads the trained models from file, computes all evaluation metrics, and produces all visualizations. This way, if you need to make a quick change to a metric or a visualization, you do not need to re-train your models. You can have the first script automatically run the second.
  2. If your visualizations / metrics script takes a long time to run, consider parallelizing it as well.
  3. In each results file, store the commit version of your repository in case you ever need to figure out which version of the code something was computed with. You can do this in Python as shown in the first sketch after this list.
  4. Once your visualization / metrics script computes the necessary info for every run of your model / data, you will likely need to do some model selection. For this, we recommend you load all the metrics you computed for every model / dataset / hyper-parameter setting into a pandas dataframe, and then write your model selection code in terms of operations on that dataframe. Your code will be cleaner and easier to modify this way. Additionally, dataframes can be exported to LaTeX so that you can plop them straight into your paper (see the second sketch after this list).
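
For tip 3, here is a minimal sketch of how you might record the commit hash from Python. It assumes git is available on the cluster and that your script is run from inside the repository; adapt it to however you store your results.

import subprocess

def get_git_commit():
    # Return the hash of the currently checked-out commit.
    # Assumes `git` is on the PATH and the working directory is inside the repo.
    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()

You can then add get_git_commit() to whatever dictionary of results you pickle to disk.

For tip 4, here is a minimal sketch of the dataframe-based model selection workflow. The metric name val_loss and the numbers below are placeholders for illustration only; in practice you would build the records list by reading your results files.

import pandas as pd

# One row per model / dataset / hyper-parameter combination (placeholder values).
records = [
    {'model': 'neural_network', 'dataset': 'mnist', 'learning_rate': 0.01,  'val_loss': 0.25},
    {'model': 'neural_network', 'dataset': 'mnist', 'learning_rate': 0.001, 'val_loss': 0.21},
    {'model': 'neural_network', 'dataset': 'uci',   'learning_rate': 0.01,  'val_loss': 0.48},
    {'model': 'neural_network', 'dataset': 'uci',   'learning_rate': 0.001, 'val_loss': 0.43},
]
df = pd.DataFrame(records)

# Model selection: keep the best hyper-parameter setting per (model, dataset).
best = df.loc[df.groupby(['model', 'dataset'])['val_loss'].idxmin()]

# Export the selected rows to LaTeX so they can go straight into your paper.
print(best.to_latex(index=False))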

TMUX (Terminal Multiplexer)

tmux is useful for several reasons:

  1. If you lose connection / close your terminal when logged into a remote server, it will keep alive any interactive shells you have.
  2. It allows you to split your screen into parts (have multiple terminals) without having to log in multiple times.

To use tmux, you need to:

  1. Make sure you always log into the same login node, since your tmux session will be stored there.
  2. Copy the .tmux.conf script (see below) into your home directory on the cluster. While this script is not necessary, it makes tmux significantly more convenient to use.

Once on the cluster, type tmux to start a new session. Now if you close your terminal / lose connection, after logging back into the cluster (into the same login node), type tmux attach, and your session will re-appear just as you left it. Sometimes the nodes are restarted for maintenance, and in these cases you may lose the session.

In addition to keeping your session alive, tmux also has a number of other handy features:

  • Control-b \ will open up a vertical split.
  • Control-b - will open up a horizontal split.
  • Control-b o will switch between the splits.
  • Control-b c will create a new window (window numbers are listed in the bottom right).
  • Control-b n and Control-b p will move between the different windows (forwards and backwards, respectively).
  • Control-b [ will go into scroll mode inside a split:
    • Use the arrow keys to scroll.
    • Use Control-r to search up (keep hitting Control-r for more results)
    • Use Control-s to search down

There are more things you can do with tmux, but hopefully this is enough to get you started.

Sample .tmux.conf script

############################################################################
# Global Options
############################################################################
# large history
set-option -g history-limit 10000

# utf8 support
set-window-option -g utf8 on

# Automatically set window title
setw -g automatic-rename on

# Fast typing!
set-option -sg escape-time 1

# Key settings
set-window-option -g xterm-keys on # for vim
set-window-option -g mode-keys emacs
set-option -g status-keys emacs

# show activity
set-window-option -g monitor-activity on
set -g visual-activity on

# window title
set-option -g set-titles on
set-option -g set-titles-string '#H:#W'

# global colors
set-option -g default-terminal "screen-256color" #"xterm-256color" # "screen-256color"

# additional-menu colors
setw -g mode-bg black
setw -g mode-fg white

# pane border colors
set-option -g pane-active-border-fg white
set-option -g pane-active-border-bg white

# By default, all windows in a session are constrained to the size of the 
# smallest client connected to that session, 
# even if both clients are looking at different windows. 
# It seems that in this particular case, Screen has the better default 
# where a window is only constrained in size if a smaller client 
# is actively looking at it.
setw -g aggressive-resize on


############################################################################
# Status Bar
############################################################################
set-option -g status-utf8 on
set-option -g status-justify right
set-option -g status-bg black
set-option -g status-fg white
set-option -g status-interval 4
set-option -g status-left-length 130
set-option -g status-left '#[fg=cyan]#(date +"%-I:%M%p %b-%d")'
set-option -g status-right ''

set-window-option -g window-status-current-fg white
set-window-option -g window-status-current-bg black
#setw -g window-status-current-attr reverse
set-window-option -g window-status-format ' #I '
set-window-option -g window-status-current-format '#[fg=green][#[default]#I#[fg=green]]'

############################################################################
# Bindings & Unbindings
############################################################################

# reload tmux conf
bind-key r source-file ~/.tmux.conf \; display-message "Configuration reloaded"
 
# split horizontally
unbind '"'
bind-key - split-window -v

# split vertically
unbind %
bind-key \ split-window -h
 
 # Navigation: use the vim motion keys to move between panes
bind-key -r h select-pane -L
bind-key -r j select-pane -D
bind-key -r k select-pane -U
bind-key -r l select-pane -R
  
# Pane Resizing
bind-key -r H resize-pane -L
bind-key -r J resize-pane -D
bind-key -r K resize-pane -U
bind-key -r L resize-pane -R

# Pane Maximizing [Don't switch windows between using these]
unbind <
bind < new-window -d -n tmux-zoom 'clear && echo TMUX ZOOM && read' \; swap-pane -s tmux-zoom.0 \; select-window -t tmux-zoom
unbind >
bind > last-window \; swap-pane -s tmux-zoom.0 \; kill-window -t tmux-zoom

# in-scroll-mode commands
bind-key -t emacs-copy C-v page-down
bind-key -t emacs-copy C-f page-up

# Even out all panes in the current window
bind | select-layout "even-vertical"
bind _ select-layout "even-horizontal"

Automatic Job Submission Script

This script helps you automatically create a sequence of jobs by generating the appropriate SLURM scripts and submitting them. To use it, copy each of the scripts below into your code-base. You then need to modify two of the files:

  1. Modify the template.sh file so that your job is submitted to the right partition, given the right amount of memory, etc.
  2. Modify run.py as instructed inside the file.

To get a sense for what the script does, simply run it as is using python submit_batch.py.

run.py

import json
import os
import sys

# If this flag is set to True, the jobs won't be submitted to SLURM;
# they will instead be run one after another in your current terminal
# session. You can use this to either run a sequence of jobs locally
# on your machine, or to run a sequence of jobs one after another
# in an interactive shell.
DRYRUN = True

# This is the base directory where the results will be stored.
# On the cluster, you may not want this to be your home directory
# if you're storing lots of files (or a lot of data).
OUTPUT_DIR = 'output'

# This list contains the jobs and hyper-parameters to search over.
# The list consists of tuples, in which the first element is
# the name of the job (here it describes the method we are using)
# and the second is a dictionary of parameters that will be
# grid-searched over. That is, for the first item in the QUEUE below,
# the following jobs will be run:
# 1. dataset='mnist', learning_rate=0.01
# 2. dataset='mnist', learning_rate=0.001
# 3. dataset='uci', learning_rate=0.01
# 4. dataset='uci', learning_rate=0.001
# Note that the second parameter must be a dictionary in which each
# value is a list of options.
QUEUE = [
    ('neural_network', dict(dataset=['mnist', 'uci'], learning_rate=[0.01, 0.001])),
    ('gaussian_process', dict(dataset=['mnist', 'uci'], optimize_lenscale=[True, False])),
]


def run(exp_dir, exp_name, exp_kwargs):
    '''
    This is the function that will actually execute the job.
    To use it, here's what you need to do:
    1. Create the directory 'exp_dir' as a function of 'exp_kwargs'.
       This is so that each set of experiment + hyper-parameters gets its own directory.
    2. Get your experiment's parameters from 'exp_kwargs'
    3. Run your experiment
    4. Store the results however you see fit in 'exp_dir'
    '''
    print('Running experiment {}:'.format(exp_name))
    print('Results are stored in:', exp_dir)
    print('with hyperparameters', exp_kwargs)
    print('\n')


def main():
    assert len(sys.argv) > 3  # expects exp_dir, exp_name, and a JSON string of kwargs

    exp_dir = sys.argv[1]
    exp_name = sys.argv[2]
    exp_kwargs = json.loads(sys.argv[3])
    
    run(exp_dir, exp_name, exp_kwargs)


if __name__ == '__main__':
    main()

submit_batch.py

import os
import copy
import sys
import itertools
import json
import subprocess
import socket

from run import (
    run,
    QUEUE,
    DRYRUN,
    OUTPUT_DIR,
)


TMP_DIR = 'tmp'
TEMPLATE = 'template.sh'


def safe_zip(*args):
    if len(args) > 0:
        first = args[0]
        for a in args[1:]:
            assert(len(a) == len(first))

    return list(zip(*args))


def submit_job(output_dir, template, exp_name, exp_kwargs):
    exp_dir = os.path.join(output_dir, exp_name)
    if not os.path.exists(exp_dir):
        os.makedirs(exp_dir)

    script = copy.deepcopy(template)
    script = script.replace('EXPNAME', exp_name)
    script = script.replace('EXPDIR', exp_dir)
    script = script.replace('KWARGS', "'{}'".format(json.dumps(exp_kwargs)))
    
    sname = os.path.join(TMP_DIR, 'slurm_{}.sh'.format(exp_name))
    with open(sname, 'w') as f:
        f.write(script)
    
    if not DRYRUN:
        ret = subprocess.call('sbatch {}'.format(sname).split(' '))
        if ret != 0:
            print('Error code {} when submitting job for {}'.format(ret, sname))
    else:
        run(exp_dir, exp_name, exp_kwargs)
    

def main():
    # Create the directory where your results will go.
    # In this directory you can make a sub-directory for each experiment you run.
    # Experiments are listed in the 'QUEUE' variable imported from run.py
    if not os.path.exists(OUTPUT_DIR):
        os.makedirs(OUTPUT_DIR)

    # Create a temporary directory for the slurm scripts
    # that are automatically generated by this script
    if not os.path.exists(TMP_DIR):
        os.makedirs(TMP_DIR)

    # Read in the template so we can modify it programmatically
    with open(TEMPLATE, 'r') as f:
        template = f.read()

    # For each experiment, create a job for every combination of parameters
    for experiment_name, params in QUEUE:
        for vals in itertools.product(*list(params.values())):
            exp_kwargs = dict(safe_zip(params.keys(), vals))
            submit_job(OUTPUT_DIR, template, experiment_name, exp_kwargs)


if __name__ == '__main__':
    main()

template.sh

#!/bin/bash
#SBATCH -n 1 # Number of cores
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 0-10:00 # Runtime in D-HH:MM
#SBATCH -p test # partition
#SBATCH --mem=10000 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o EXPDIR/out_%j.txt # File to which STDOUT will be written
#SBATCH -e EXPDIR/err_%j.txt # File to which STDERR will be written

python -u run.py EXPDIR EXPNAME KWARGS