Batch processing guide (HPC)

Computing servers

Now that you are set up, you can start using our computing services. To do so, read the Research Group Specifications of the group you belong to and follow the steps to connect to its development environment (frontend server). Once connected to your Research Group environment, you will be on a server named <research_group>Exx.

The remaining question is: where should we actually run our experiments?

If everybody ran their processing-intensive experiments in the development environment, that would use a lot of RAM, and these servers (<research_group>Exx) would probably collapse, or become too slow for normal coding.

For this reason we provide a computing service that lets you run many experiments across all our computing servers, while keeping the access servers free for coding and other non-processing-intensive tasks.

The process is the following: when you submit a job, it is sent to a Queue Manager that looks for a free node in your partition with enough capacity to run it, and when it finds one available, it executes your job on that node, named <research_group>Cxx. Note that this page only gives a brief introduction to the Queue Manager and its monitoring tools; if you want more information, please visit the official Slurm documentation.

You can find a list of the available nodes in every partition from every research group in the inventory.

Monitoring the computing service

To see the resource information of the computing service and its partitions, you can use sinfo:

myuser@mygroupe01:~$ sinfo 
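If you prefer a per-node view, the standard Slurm options -N (node-oriented) and -l (long format) can be combined; this is plain sinfo functionality, not something specific to our service:

myuser@mygroupe01:~$ sinfo -N -l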

To see the current status of the computing service just type squeue or sview:

myuser@mygroupe01:~$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            240123  standard   python asalvado  R 2-01:59:39      1 mygroupc5
            264825  standard matrix_c etacchin  R    2:01:29      1 mygroupc5

In this example you can see that the user asalvado has been running (R) a python experiment for more than 2 days, and etacchin a matrix-related one for more than 2 hours, both on the mygroupc5 node.

If you run sinfo-usage <partition_name> you can see in detail the available and allocated resources on every compute node of that partition. Example:

myuser@mygroupe01:~$ sinfo-usage gpi.compute 
  NODE    NODE        Limit            Allocated           Available
  NAME  STATUS   CPU  GPU   RAM    CPU  GPU   RAM    CPU  GPU   RAM
gpic09     mix    32    8  251G      9    2   75G     23    6  176G
gpic10    idle    32    8  251G      0    0    0G     32    8  251G
gpic11    idle    40    8  376G      0    0    0G     40    8  376G
gpic12     mix    40    8  376G      1    1   16G     39    7  360G
gpic13     mix    40    8  251G      8    4   32G     32    4  219G
gpic14    idle    40    6  251G      0    0    0G     40    6  251G

Let's submit some jobs!

Submitting a job

Do you know the sleep command? Well, it just does nothing for the number of seconds you specify as a parameter... quite simple, right?

Like any other command, if you simply type it in the shell, it will run on the host you are logged in to. If, for example, we were on gpie01 (GPI research group, development server 01):

myuser@gpie01:~$ sleep 10 &
[1] 30329
myuser@gpie01:~$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            240123  standard   python asalvado  R 2-01:59:49      1 gpic5
            264825  standard matrix_c etacchin  R    2:01:39      1 gpic5

Please note that the final & of the sleep 10 & command just gives the terminal back after the call, so we do not have to wait until it finishes (more info about running shell commands).

Then, after calling sleep 10, we can see that nothing changed in the computing service... (because the sleep command is running on the development server). But if we simply prepend srun to it, it will run in the computing service:

myuser@gpie01:~$ srun sleep 10&
[1] 31262
myuser@gpie01:~$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            240123  standard   python asalvado  R 2-02:00:22      1 gpic5
            264825  standard matrix_c etacchin  R    2:02:12      1 gpic5
            264836  standard    sleep   myuser  R       0:03      1 gpic5

In the last line of the squeue output you can see that our sleep is now running, not on gpie01, but on the gpic5 node.

Submitting a job to a partition queue

The computing service is divided into several partition queues, like the checkout queues in a supermarket. If, for example, you belonged to the GPI Research Group, your possible partitions would include gpi.compute (for more information on research groups and partitions, please review the Research Groups Specifications page). In the GPI example, the gpi.compute partition allows you to run your command with a large time limit (24h), but you will probably wait in the queue until there are resources available for you. To use it:

myuser@gpie01:~$ srun -p gpi.compute sleep 10&
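If you know your job needs less than the partition's maximum, you can also request a shorter time limit with the standard --time option (the value below is just an illustration); depending on the scheduler configuration, this may help your job start sooner:

myuser@gpie01:~$ srun -p gpi.compute --time 04:00:00 sleep 10 &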

In any case, if you need to compute something, please submit it to the system and wait until it is computed; do not cancel your own jobs just because they are waiting. Cancelling jobs that you actually want to compute is bad for several reasons:

  • Of course, it won't be computed until you submit it again...
  • The more time your job spends in the queue, the more priority it gains to be the next one run.
  • The length of the queue allows us to determine the actual computing needs of the group, that is, your actual computing needs.
    • If you "wait" to submit (keeping no jobs in the queue) instead of waiting in the queue, the system does not register you as waiting at all.

Priorities of a job

Our queue manages jobs differently depending on its occupation: if it is not full, the queue works as a FIFO (first in, first out), which means that the server assigns its resources based solely on the order in which its clients asked for them.

When the queue is full, however, we switch to a multifactor priority that determines the positions in the queue based on three factors: your share (explained below), your usage of the server in the last 24 hours, and the time you have been waiting in the queue, with the share (fairshare factor) carrying the most weight.

The summarized process is the following:

We have three types of users, and each type has a different share (a percentage of 'ownership' of the cluster).

The categories, in order of share, are:

   1. Members: PhDs and professors <Highest share>
   2. Collaborators 
   3. Students <Lowest share>

The fairshare algorithm adjusts the priority of jobs for all users to ensure that they have their appropriate share.

Possible situations you might encounter: if you have a high share but are running many jobs, your priority is lowered to allow others to go ahead of you. On the other hand, if you have a low share but are barely using the system, your priority is increased so that you can use it more. Also, the longer you have been waiting in the queue, the higher your priority gets.

The queue manager we use is Slurm, an open-source workload manager and job scheduler designed for high-performance computing (HPC) clusters and supercomputers. You can visit the multifactor priority plugin and classic fairshare algorithm pages on its official website for more in-depth information on this topic.
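If you want to see these factors in action, the standard Slurm client tools sprio and sshare can show the priority breakdown of pending jobs and your fairshare usage. This is a minimal sketch, assuming these tools are available on the frontend (they are part of a normal Slurm installation):

>> sprio -l           # per-job breakdown of the multifactor priority of pending jobs
>> sshare -u myuser   # fairshare usage and effective share of your user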

Cancelling a job

To cancel a running job you have to get its jobid by running squeue, and then just:

>> scancel $JOBID
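To list only your own jobs (and their job IDs) instead of the whole queue, you can filter squeue by user with the standard -u option:

>> squeue -u $USER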

Asking for computing resources

CPUs

To ask for 2 CPUs:

>> srun -c 2 ./myapp

RAM

To ask for 4G of RAM:

>> srun --mem 4G ./myapp

Please note that, if you get the message srun: error: task: Killed, it is probably because your app is using more RAM than you reserved. Just ask for more RAM, or try to reduce the amount of RAM that you need.
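If job accounting is enabled on the cluster (an assumption; it usually is when fairshare is in use), you can check the peak memory a finished job actually used with the standard sacct command and size your next request accordingly:

>> sacct -j $JOBID --format=JobID,JobName,MaxRSS,Elapsed,State   # MaxRSS is the peak resident memory of each job step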

GPUs and their RAM

GPUs are part of the Generic Resources (gres), so to ask for 1 GPU:

>> srun --gres=gpu:1 ./myapp

If you want a GPU with at least 6GB of RAM:

>> srun --gres=gpu:1,gpumem:6G ./myapp

Usually you will need to check how much GPU RAM your job is actually using. nvidia-smi is the right command for this, but to integrate it into our computing service you should use:

>> srun-monitor-gpu $JOBID

Here $JOBID is the jobid of your running job. You can see it in the first lines of output when your job starts, or in the output of the squeue command.

Please note that you should always specify the -A (account) parameter when using the srun command to avoid error messages, and it must be the account your partition belongs to. All our accounts have the same names as our research groups, so for example if you are in the research group CSL, you should use the parameter -A csl, and then specify one of the two possible partitions in csl (csl and csl.develop) with the parameter -p.

Remember that you can always type srun -h to get more information about its options.
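Putting the account and partition parameters together for the CSL example above (the CPU and memory values here are only illustrative):

>> srun -A csl -p csl.develop -c 2 --mem 4G ./myapp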

Graphical Interfaces

If your job does any kind of graphical visualization (opens any window), then you need to pass an extra parameter, --x11:

>> srun --x11 ./myapp

Submit jobs with a batch file

You can also put your job in a batch script; in this example we ask for a GPU. Look at this example myscript.sh:

#!/bin/bash
#SBATCH -p veu             # Partition to submit to
#SBATCH --mem=1G           # Max RAM memory
#SBATCH --gres=gpu:1       # Ask for 1 GPU
python myprogram.py

And submit it with:

>> sbatch myscript.sh
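A slightly more complete batch script could also name the job, redirect its output and set a time limit. These are all standard #SBATCH directives; the values below are just an illustration:

#!/bin/bash
#SBATCH -J myexperiment            # Job name shown in squeue
#SBATCH -o myexperiment_%j.out     # Standard output file (%j expands to the jobid)
#SBATCH -p veu                     # Partition to submit to
#SBATCH -c 2                       # Number of CPUs
#SBATCH --mem=4G                   # Max RAM memory
#SBATCH --gres=gpu:1               # Ask for 1 GPU
#SBATCH --time=04:00:00            # Time limit
python myprogram.py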

Interactive Jobs

If, for example, you belong to the GPI research group and are currently working on the development environment gpie01: gpie01 is CPU- and memory-limited per user, so some daily operations can be slowed down or even "killed". When this occurs, you should work with an interactive shell executed inside a job. All the operations will be executed on a compute server but, at the same time, will be applied to your current directory. This is an example:

>>  srun --pty --mem 16000 -c2 --time 2:00:00 /bin/bash

Then you can interactively execute whatever you want, with 2 cores and 16GB of RAM, on a remote compute server for 2 hours.

Please note that this explanation applies to the development environments of every research group; gpie01 is only an example.
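If you also need a GPU in your interactive session, the same --gres syntax shown above applies. A minimal sketch (the resource values are illustrative):

>> srun --pty --gres=gpu:1 -c 2 --mem 16G --time 2:00:00 /bin/bash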

Job arrays

If we plan to run the same experiment hundreds or thousands of times, just changing some parameters, the Job Array is our friend.

The best way to understand them is to follow an example.
Imagine that we want to run these three commands:

>> srun  myapp --param 10
>> srun  myapp --param 15
>> srun  myapp --param 20

In this simple use case we could just do it like that: three srun commands.
But if we want to test myapp with hundreds or thousands of parameters, doing it manually is not an option; job arrays to the rescue!

To achieve these three commands using Job Arrays we always need to create a shell script, which we can call worker.sh. A first approach to solve the problem could be a worker.sh like this:

#!/bin/bash
myapp --param $SLURM_ARRAY_TASK_ID

And we should launch it with:

>> sbatch --array=10,15,20 worker.sh

Here we should note several things:

  • The sbatch command is almost like srun, but for shell scripts
  • The worker.sh script is run once for every value requested in the --array parameter of the sbatch command.
  • The values and ranges passed to --array are exposed to each execution through the $SLURM_ARRAY_TASK_ID variable
  • By default the standard output and error are saved in files named slurm-<jobid>_<taskid>.out in the current directory

For example, let's try our first hello job array with a simple worker that just prints out a hello world message, and waits a little bit for convenience:

imatge@nx2:~>> cat worker.sh 
#!/bin/bash
echo "Hello job array with parameter: " $SLURM_ARRAY_TASK_ID
sleep 10 # to be able to run squeue

We can launch 6 jobs (task IDs 0 to 5) in an array with:

>> sbatch --array=0-5 worker.sh
Submitted batch job 43702

If we monitor the queue we see our 6 executions:

>> squeue 
               JOBID PARTITION      NAME     USER ST       TIME CPUS MIN_MEM GRES            NODELIST(REASON)
             43702_0  standard worker.sh   imatge  R       0:02    1    256M (null)          v5
             43702_1  standard worker.sh   imatge  R       0:02    1    256M (null)          c4
             43702_2  standard worker.sh   imatge  R       0:02    1    256M (null)          c4
             43702_3  standard worker.sh   imatge  R       0:02    1    256M (null)          c4
             43702_4  standard worker.sh   imatge  R       0:02    1    256M (null)          c4
             43702_5  standard worker.sh   imatge  R       0:02    1    256M (null)          c4

Once they are done, we can see the output files in the current directory, following the slurm-<jobid>_<taskid>.out naming:

>> ll slurm-43702_*
-rw-r--r-- 1 imatge imatge 35 Nov  4 19:21 slurm-43702_0.out
-rw-r--r-- 1 imatge imatge 35 Nov  4 19:20 slurm-43702_1.out
-rw-r--r-- 1 imatge imatge 35 Nov  4 19:20 slurm-43702_2.out
-rw-r--r-- 1 imatge imatge 35 Nov  4 19:20 slurm-43702_3.out
-rw-r--r-- 1 imatge imatge 35 Nov  4 19:20 slurm-43702_4.out
-rw-r--r-- 1 imatge imatge 35 Nov  4 19:20 slurm-43702_5.out

And we can check the expected output contents:

>> cat slurm-43702_*
Hello job array with parameter:  0
Hello job array with parameter:  1
Hello job array with parameter:  2
Hello job array with parameter:  3
Hello job array with parameter:  4
Hello job array with parameter:  5

Note that you can combine ranges and values in the --array parameter of the sbatch command like this:

>> sbatch --array=1-3,5-7,100 worker.sh
Submitted batch job 43710

>> cat slurm-43710_*
Hello job array with parameter:  100
Hello job array with parameter:  1
Hello job array with parameter:  2
Hello job array with parameter:  3
Hello job array with parameter:  5
Hello job array with parameter:  6
Hello job array with parameter:  7

This flexibility with ranges and values can be very useful when, for example, we run thousands of workers but only some of them failed and we want to rerun only those.
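For instance, if only a handful of task IDs failed (the IDs below are hypothetical), we could resubmit just those. Slurm's --array syntax also accepts a % suffix that limits how many tasks run at the same time, which is handy for very large arrays:

>> sbatch --array=17,42,399 worker.sh   # rerun only the (hypothetical) failed task IDs
>> sbatch --array=0-999%20 worker.sh    # run 1000 tasks, at most 20 at the same time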

Note that we can also use a Python script (or any other scripting language) as a worker, like this:

>> cat worker.py 
#!/usr/bin/env python3
import os
# The task id is exposed to the worker as an environment variable
print("Hello job array with Python, parameter: ", os.environ['SLURM_ARRAY_TASK_ID'])

>> sbatch --array=1-3 worker.py
Submitted batch job 43720

>> cat slurm-43720_*
Hello job array with Python, parameter:  1
Hello job array with Python, parameter:  2
Hello job array with Python, parameter:  3

We can use arrays or dictionaries in our script to handle multiple parameters of our worker script:

>> cat worker.sh
#!/bin/bash

param1[0]="Bob"; param2[0]="Monday"
param1[1]="Sam"; param2[1]="Monday"
param1[2]="Bob"; param2[2]="Tuesday"
param1[3]="Sam"; param2[3]="Tuesday"

echo "My friend" ${param1[$SLURM_ARRAY_TASK_ID]} "will come on" ${param2[$SLURM_ARRAY_TASK_ID]}

>> sbatch --array=0-3 worker.sh
Submitted batch job 43734

>> cat slurm-43734_*
My friend Bob will come on Monday
My friend Sam will come on Monday
My friend Bob will come on Tuesday
My friend Sam will come on Tuesday

To handle parameters we can also use external files, using SLURM_ARRAY_TASK_ID as the line number to read the desired parameters, like this:

>> cat param1.txt 
Bob
Sam
Bob
Sam
>> cat param2.txt 
Monday
Monday
Tuesday
Tuesday

>> cat worker.sh
#!/bin/bash
param1=`sed "${SLURM_ARRAY_TASK_ID}q;d" param1.txt`
param2=`sed "${SLURM_ARRAY_TASK_ID}q;d" param2.txt`
echo "My friend" $param1 "will come on" $param2

>> sbatch --array=1-4 worker.sh
Submitted batch job 43740

>> cat slurm-43740_*
My friend Bob will come on Monday
My friend Sam will come on Monday
My friend Bob will come on Tuesday
My friend Sam will come on Tuesday

Please note that in this case we have to use the range 1-4 instead of the range 0-3 we used when the parameters were stored in bash arrays: sed counts lines starting at 1, while bash array indices start at 0.
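A possible variation (just a convenience, not something the service requires) is to keep both parameters on the same line of a single file, here called params.txt, and split the line inside the worker. A minimal sketch:

>> cat params.txt
Bob Monday
Sam Monday
Bob Tuesday
Sam Tuesday

>> cat worker.sh
#!/bin/bash
# Pick the line for this task (sed counts from 1) and split it into the two parameters
line=`sed "${SLURM_ARRAY_TASK_ID}q;d" params.txt`
param1=`echo $line | awk '{print $1}'`
param2=`echo $line | awk '{print $2}'`
echo "My friend" $param1 "will come on" $param2

>> sbatch --array=1-4 worker.sh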

Computing with graphical user interface support

This service is based on the open-source software Open OnDemand. The access web page is https://ondemand.tsc.upc.edu/ and you can log in with your UPC username and password.

The following image shows the environment shortcuts:

In "File" you will be able to explore the files that are in your user "home". There is also a basic file editor.
In "Interactive Apps" or the following icon, it will show you the applications available to your user. Depending on your research group, you'll find your related "Desktop APP".

Once you click on your Desktop App, you'll find a form like this:

Here you can configure a job as you would with the "srun" command: you can choose the number of CPU cores, the amount of RAM and, optionally, a number of GPUs.
Once configured, click "Launch" and a desktop will be opened for you on one of our compute servers. You will see it as "Queued" for a while; once it changes to "Running", raise the "Image Quality" to the maximum and you will be able to access your desktop.
When it opens you will see a terminal (a black window with your username); do not close it or you will end the session. In that terminal you can open any software installed on "Calcula". In the example figure, a Matlab app is running.

When you are done using your desktop you can end it by clicking "Delete".

Closing your web browser does not end the running session, so you can resume your remote connection at any time while the job is still running.