Running jobs using Apptainer containers
Overview
Teaching: 30 min
Exercises: 40 min
Questions
How do I set up and run a SLURM job from an Apptainer container?
How do I set up and run a parallel MPI job from an Apptainer container?
Objectives
Learn how MPI applications within Apptainer containers can be run on HPC platforms
Understand the challenges and related performance implications when running MPI jobs via Apptainer
Running SLURM jobs that use Apptainer containers
In the most basic case, including Apptainer in your SLURM job won’t look very different than what we have been doing interactively so far. For completeness, and so we can see it all at once, let’s look at a SLURM batch script that will run the nhmmer image we created earlier.
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --mem=10g
#SBATCH --tmp=10g
module load apptainer
apptainer exec ~/hmmerInUbuntu.sif nhmmer -h
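Submitting the script above might look like the following minimal sketch. It assumes the script was saved as nhmmer_job.sh, a hypothetical file name, and it only calls sbatch when the command and the file are both present.

```shell
# Hypothetical file name for the batch script shown above
job_script="nhmmer_job.sh"

if command -v sbatch >/dev/null 2>&1 && [ -f "$job_script" ]; then
    # --parsable makes sbatch print just the job ID
    # (followed by a cluster name on multi-cluster setups)
    job_id=$(sbatch --parsable "$job_script")
    echo "Submitted job $job_id"
    # Show the state of the job we just submitted
    squeue -j "$job_id"
else
    job_id=""
    echo "sbatch or $job_script not found; run this on a cluster login node"
fi
```

By default SLURM writes the job's output, here the nhmmer help text, to a file named slurm-<jobid>.out in the directory the job was submitted from.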
Things can get a little more complicated for workflows that deviate from this pattern, so let’s take a look at MPI parallel workflows below, and instance-based workflows in the next episode.
Running GPU-accelerated codes via CUDA with Apptainer containers
CUDA Overview
CUDA - Compute Unified Device Architecture - is a proprietary parallel computing platform developed by NVIDIA, primarily for use with their GPUs to allow acceleration of specific computing tasks. CUDA provides an interface that allows programmers to distribute eligible tasks in parallel to the GPU hardware, which can provide significant speedups over traditional CPU-based computing in certain applications.
CUDA codes with Apptainer containers
The primary change that needs to be made for CUDA-capable codes to run inside of a container is to make an appropriate GPU visible to the container. Luckily, this is such a common task for container-based workflows that Apptainer provides a single command-line option that works for us in most cases.
In order to use CUDA in this way, we first need to request a SLURM job with GPU resources. For an example job that wants to run on a single A100 GPU, the preamble to such a SLURM script will look like the following.
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --mem=10g
#SBATCH --tmp=10g
#SBATCH --partition=a100-4
#SBATCH --gres=gpu:a100:1
Note that we have added two lines here: one specifying the partition where the job will run, and a second specifying the type and number of GPUs we want for this job. You can read more about what kind of GPUs are available and how to access them at MSI’s page describing the available system partitions.
With the job script appropriately modified to request a GPU, we are now able to look at how to make Apptainer aware of the GPU. Let’s use a container for the GROMACS software, a molecular-dynamics code that can offload part of the simulation work to GPUs, as our example. NVIDIA provides a container for GROMACS, and a variety of other software, via their NGC Catalog.
apptainer pull docker://nvcr.io/hpc/gromacs:2023.2
This gives us an image file called gromacs_2023.2.sif. Using this inside of our GPU-enabled job would then look like:
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --mem=10g
#SBATCH --tmp=10g
#SBATCH --partition=a100-4
#SBATCH --gres=gpu:a100:1
module load apptainer
apptainer run --nv gromacs_2023.2.sif gmx mdrun -h
The key addition here is the use of the --nv option for apptainer run, which passes all available NVIDIA GPUs into the container. It works in the same way for exec, shell, and other commands that execute commands inside the container.
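One way to confirm that --nv is doing its job is to list the GPUs on the host and then again from inside the container. The sketch below assumes the gromacs_2023.2.sif image from above is in the current directory and that it is run inside a GPU job. The gpu_check variable is only there for illustration, and the whole check is skipped when nvidia-smi or apptainer is missing.

```shell
# Image pulled earlier in this episode; adjust the path if yours differs
image="gromacs_2023.2.sif"
gpu_check="skipped"   # illustrative status flag, not an Apptainer feature

if command -v nvidia-smi >/dev/null 2>&1; then
    echo "GPUs visible on the host:"
    nvidia-smi -L
    if command -v apptainer >/dev/null 2>&1 && [ -f "$image" ]; then
        echo "GPUs visible inside the container:"
        # --nv makes the host's NVIDIA driver tools available in the container
        apptainer exec --nv "$image" nvidia-smi -L
        gpu_check="done"
    fi
else
    echo "nvidia-smi not found; run this inside a GPU job"
fi
```

If both listings show the same devices, the container can see every GPU that was allocated to the job.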
In most cases that is all you need to do. Occasionally, you may need to hint or limit the GPU devices that are available to the container. Apptainer interacts with this information via environment variables. For instance, if you are running a complex pipeline where your job requests multiple GPUs but you want to run a specific step using a container on a single GPU, you can indicate which GPU should be used by its index. NVIDIA GPUs on a system are indexed starting from zero, and you can see the indices for the GPUs available in your current session by running nvidia-smi.
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --mem=10g
#SBATCH --tmp=10g
#SBATCH --partition=a100-4
#SBATCH --gres=gpu:a100:4
module load apptainer
APPTAINERENV_CUDA_VISIBLE_DEVICES=0 apptainer run --nv gromacs_2023.2.sif gmx mdrun -h
If you wanted to use two GPUs for this step, you could modify that final line to look like:
APPTAINERENV_CUDA_VISIBLE_DEVICES=0,1 apptainer run --nv gromacs_2023.2.sif gmx mdrun -h
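When the number of GPUs for a step is held in a variable, the comma-separated index list can be generated rather than typed by hand. The helper below is a hypothetical convenience function, not part of Apptainer or SLURM, and it assumes you want the first N devices starting from index zero.

```shell
# Hypothetical helper: print the indices 0..N-1 as a comma-separated list
gpu_index_list() {
    count="$1"
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Add a comma only when the list already has an entry
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

gpu_index_list 2
```

Here gpu_index_list 2 prints 0,1, which is the value used in the two-GPU line above.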
Running MPI parallel codes with Apptainer containers
MPI overview
MPI - Message Passing Interface - is a widely used standard for parallel programming. It is used for exchanging messages/data between processes in a parallel application. If you’ve been involved in developing or working with computational science software, you may already be familiar with MPI and running MPI applications.
When working with an MPI code on a large-scale cluster, a common approach is to compile the code yourself, within your own user directory on the cluster platform, building against the supported MPI implementation on the cluster. Alternatively, if the code is widely used on the cluster, the platform administrators may build and package the application as a module so that it is easily accessible by all users of the cluster.
MPI codes with Apptainer containers
If our target platform uses OpenMPI, one of the two widely used open source MPI implementations, we can build/install a compatible OpenMPI version within the image as part of the image build process. We can then build our code that requires MPI, either interactively in an image sandbox or via a definition file.
If the target platform uses a version of MPI based on MPICH, the other widely used open source MPI implementation, there is ABI compatibility between MPICH and several other MPI implementations. In this case, you can build MPICH and your code within an image sandbox or as part of the image build process via a definition file, and you should be able to successfully run containers based on this image on your target cluster platform.
MSI has both OpenMPI and MPICH options available, so the best choice here will depend on your specific workflow and the parallel code you want to run.
As described in Apptainer’s MPI documentation, support for both OpenMPI and MPICH is provided. Instructions are given for building the relevant MPI version from source via a definition file and we’ll see this used in an example below.
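Before choosing which MPI to install in the image, it can help to check which implementation the host's mpirun belongs to. The sketch below classifies the text printed by mpirun --version. The strings it matches ("Open MPI" for OpenMPI, "HYDRA" or "MPICH" for MPICH-based launchers) are assumptions based on typical version banners, and classify_mpi is a hypothetical helper name.

```shell
# Hypothetical helper: guess the MPI family from a version banner
classify_mpi() {
    case "$1" in
        *"Open MPI"*)    echo "openmpi" ;;
        *HYDRA*|*MPICH*) echo "mpich" ;;
        *)               echo "unknown" ;;
    esac
}

if command -v mpirun >/dev/null 2>&1; then
    # Some launchers print their banner on stderr, so capture both streams
    classify_mpi "$(mpirun --version 2>&1)"
else
    echo "mpirun not found; load an MPI module first"
fi
```

An "unknown" result just means the banner did not match these patterns, for example with a vendor MPI, in which case the site documentation is the better guide.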
Container portability and performance on HPC platforms
Building a container on one system for use on another, remote HPC platform does provide some level of portability, but it can present some issues if you’re after the best possible performance. The version of MPI in the container will need to be built and configured to support the hardware on your target platform if the best possible performance is to be achieved. Where a platform has specialist hardware with proprietary drivers, building on a different platform with different hardware present means that building with the right driver support for optimal performance is not likely to be possible. This is especially true if the version of MPI available is different (but compatible).
Apptainer’s MPI documentation highlights two different models for working with MPI codes. The hybrid model that we’ll be looking at here involves using the MPI executable from the MPI installation on the host system to launch apptainer and run the application within the container. The application in the container is linked against and uses the MPI installation within the container which, in turn, communicates with the MPI daemon process running on the host system.
In the following section we’ll look at building an Apptainer image containing a small MPI application that can then be run using the hybrid model.
Building and running an Apptainer image for an MPI code
Building and testing an image
We’ll build an image from a definition file. Containers based on this image will print a “Hello world” message based on the available parallel resources.
Begin by creating a file called mpitest.c in your current directory with the contents:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv) {
    int rc;
    int size;
    int myrank;

    rc = MPI_Init (&argc, &argv);
    if (rc != MPI_SUCCESS) {
        fprintf (stderr, "MPI_Init() failed");
        return EXIT_FAILURE;
    }

    rc = MPI_Comm_size (MPI_COMM_WORLD, &size);
    if (rc != MPI_SUCCESS) {
        fprintf (stderr, "MPI_Comm_size() failed");
        goto exit_with_error;
    }

    rc = MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
    if (rc != MPI_SUCCESS) {
        fprintf (stderr, "MPI_Comm_rank() failed");
        goto exit_with_error;
    }

    fprintf (stdout, "Hello, I am rank %d/%d\n", myrank, size);

    MPI_Finalize();
    return EXIT_SUCCESS;

exit_with_error:
    MPI_Finalize();
    return EXIT_FAILURE;
}
In the same directory, save the following definition file content to a .def file, e.g. ompi_example.def:
Bootstrap: docker
From: ubuntu:20.04

%files
    mpitest.c /opt

%environment
    # Point to OMPI binaries, libraries, man pages
    export OMPI_DIR=/opt/ompi
    export PATH="$OMPI_DIR/bin:$PATH"
    export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
    export MANPATH="$OMPI_DIR/share/man:$MANPATH"

%post
    echo "Installing required packages..."
    apt update && apt install -y wget rsh-client build-essential

    echo "Installing Open MPI"
    export OMPI_DIR=/opt/ompi
    export OMPI_VERSION=4.1.5
    export OMPI_URL="https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-$OMPI_VERSION.tar.bz2"
    mkdir -p /tmp/ompi
    mkdir -p /opt
    # Download
    cd /tmp/ompi && wget -O openmpi-$OMPI_VERSION.tar.bz2 $OMPI_URL && tar -xjf openmpi-$OMPI_VERSION.tar.bz2
    # Compile and install
    cd /tmp/ompi/openmpi-$OMPI_VERSION && ./configure --prefix=$OMPI_DIR && make -j8 install

    # Set env variables so we can compile our application
    export PATH=$OMPI_DIR/bin:$PATH
    export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH

    echo "Compiling the MPI application..."
    cd /opt && mpicc -o mpitest mpitest.c
    cp mpitest /bin
A quick overview of what the above definition file is doing:
- The image is being bootstrapped from the ubuntu:20.04 Docker image.
- In the %files section: The MPI “Hello world” test is copied from the current directory into the /opt directory within the image.
- In the %environment section: Set a couple of environment variables that will be available within all containers run from the generated image.
- In the %post section:
  - Ubuntu’s apt package manager is used to update the package directory and then install the compilers and other libraries required for the OMPI build.
  - The OMPI .tar.bz2 file is extracted and the configure, build and install steps are run.
  - The environment is set up to use the newly installed OMPI, and we build our “Hello world” example.
Build and test the image
Using the above definition file, build an Apptainer image named ompi_example.sif.
Once the image has finished building, test it by running the mpitest program that we built.
Solution
You should be able to build an image from the definition file as follows:
$ apptainer build ompi_example.sif ompi_example.def
Let’s begin with a single-process run of mpitest to ensure that we can run the container as expected. We’ll use the MPI installation within the container for this test. Note that when we run a parallel job on an HPC cluster platform, we use the MPI installation on the cluster to coordinate the run so things are a little different…
Start a shell in the Apptainer container based on your image and then run a single process job via mpirun:
$ apptainer shell ompi_example.sif
Apptainer> mpirun -np 1 mpitest
You should see output similar to the following:
Hello, I am rank 0/1
Running Apptainer containers via MPI
Assuming the above tests worked, we can now try undertaking a parallel run within our container image.
This is where things get interesting and we’ll begin by looking at how Apptainer containers are run within an MPI environment.
If you’re familiar with running MPI codes, you’ll know that you use mpirun (as we did in the previous example), mpiexec or a similar MPI executable to start your application. This executable may be run directly on the local system or cluster platform that you’re using, or you may need to run it through a job script submitted to a job scheduler. Your MPI-based application code, which will be linked against the MPI libraries, will make MPI API calls into these MPI libraries which in turn talk to the MPI daemon process running on the host system. This daemon process handles the communication between MPI processes, including talking to the daemons on other nodes to exchange information between processes running on different machines, as necessary.
When running code within an Apptainer container, we don’t use the MPI executables stored within the container (i.e. we DO NOT run apptainer exec mpirun -np <numprocs> /path/to/my/executable). Instead we use the MPI installation on the host system to run Apptainer and start an instance of our executable from within a container for each MPI process. Without Apptainer support in an MPI implementation, this results in starting a separate Apptainer container instance within each process. This can present some overhead if a large number of processes are being run on a host. Where Apptainer support is built into an MPI implementation this can address this potential issue and reduce the overhead of running code from within a container as part of an MPI job.
Ultimately, this means that our running MPI code is linking to the MPI libraries from the MPI install within our container and these are, in turn, communicating with the MPI daemon on the host system which is part of the host system’s MPI installation. In the case of OMPI, these two installations of MPI may be different but as long as there is ABI compatibility between the version of MPI installed in your container image and the version on the host system, your job should run successfully.
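A quick first sanity check on compatibility is to compare the OpenMPI version on the host with the one inside the image. The sketch below is a rough heuristic, not a guarantee of ABI compatibility. It assumes both sides report a banner of the form "mpirun (Open MPI) X.Y.Z", and ompi_version and same_major are hypothetical helper names.

```shell
# Hypothetical helper: pull "X.Y.Z" out of an Open MPI version banner
ompi_version() {
    printf '%s\n' "$1" | sed -n 's/.*(Open MPI) \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed when two versions share the same major number
same_major() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

image="ompi_example.sif"
if command -v mpirun >/dev/null 2>&1 && command -v apptainer >/dev/null 2>&1 && [ -f "$image" ]; then
    host_ver=$(ompi_version "$(mpirun --version 2>&1)")
    cont_ver=$(ompi_version "$(apptainer exec "$image" mpirun --version 2>&1)")
    echo "host: ${host_ver:-unknown}, container: ${cont_ver:-unknown}"
    if [ -n "$host_ver" ] && [ -n "$cont_ver" ] && same_major "$host_ver" "$cont_ver"; then
        echo "Same major Open MPI version on host and in container"
    else
        echo "Versions differ or could not be parsed; check compatibility before running"
    fi
else
    echo "mpirun, apptainer or $image not found; skipping version comparison"
fi
```

A matching major version is only a hint; the real test is whether a small job such as mpitest runs correctly across the host and container.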
We can now try running a 2-process MPI run of our test program.
Undertake a parallel run of mpitest (general example)
You should be able to run the example using a command similar to the one shown below. However, if you are not currently inside an interactive SLURM job, you may need to write and submit a job submission script at this point to run the test program.
Also note that due to a peculiarity of how we install OMPI at MSI, we will need to unset OPAL_PREFIX before running a hybrid OMPI+Apptainer job.
$ unset OPAL_PREFIX
$ mpirun -np 2 apptainer exec ompi_example.sif mpitest
Expected output and discussion
As you can see in the mpirun command shown above, we have called mpirun on the host system and are passing to MPI the apptainer executable, for which the parameters are the image file and the command to run inside the container, in this case the name of the executable.
Hello, I am rank 1/2
Hello, I am rank 0/2
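If you are not in an interactive job, the same run can be wrapped in a batch script. The sketch below writes one to mpitest_job.sh, a hypothetical file name. The time and memory values are placeholders, and the Open MPI module name (ompi) is an assumption that you should replace with the module your site provides.

```shell
# Write a batch script for the 2-process hybrid run (the file name is hypothetical)
cat > mpitest_job.sh <<'EOF'
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --ntasks=2
#SBATCH --mem=2g

module load apptainer
# Hypothetical module name; use your site's Open MPI module
module load ompi

# Needed at MSI because of how OMPI is installed there
unset OPAL_PREFIX

# The host mpirun starts one container per MPI rank
mpirun -np "$SLURM_NTASKS" apptainer exec ompi_example.sif mpitest
EOF

echo "Wrote mpitest_job.sh; submit it with: sbatch mpitest_job.sh"
```

Using $SLURM_NTASKS instead of a fixed number keeps the process count in step with the --ntasks request, so changing the job size only requires editing one line.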
Key Points
Apptainer images containing MPI applications can be built on one platform and then run on another (e.g. an HPC cluster) if the two platforms have compatible MPI implementations.
When running an MPI application within an Apptainer container, use the MPI executable on the host system to launch an Apptainer container for each process.
Think about your parallel application’s performance requirements and how the platform where you build and run your image may affect them.