HPC stands for High Performance Computing; an HPC cluster is a system for handling heavy computational workloads from many users, as is commonly needed in research environments. Single-board computers are revolutionizing the computer industry. These inexpensive computers can be used to accomplish previously unexplored tasks. One such task is cluster computing on the edge, where a cluster runs parallel jobs. A cluster is a collection of devices that, acting as a single system, delivers high performance by distributing tasks among its members. Within a cluster, one device functions as the head node, responsible for directing the others and supplying them with instructions to achieve a specific goal. Recently, I built a Beowulf Kubernetes cluster using Nvidia Jetson devices and a Raspberry Pi as an NFS node, specifically to run inference of large language models. Now, I'm rebuilding it to explore the SLURM workload manager and create an environment for computational science, where I can gain hands-on experience with the Simple Linux Utility for Resource Management (SLURM) and the Message Passing Interface (MPI).
Introduction to Simple Linux Utility for Resource Management
Slurm (Simple Linux Utility for Resource Management) is a free, open-source job scheduler for Linux, widely used by supercomputers and compute clusters around the world. Slurm uses optimal algorithms such as Hilbert curve scheduling or fat tree network topology to optimize task distribution on parallel computers.
Let’s first look at a diagram of the Slurm architecture below.
Basically, the two most important components are the Slurm controller (slurmctld) and the Slurm compute node daemon (slurmd). The controller allocates tasks: it manages all compute nodes and decides which task should go to which node for execution, while the compute node is the machine that actually executes the task.
Slurm Cluster's structure
An HPC cluster is made up of a number of compute nodes, each with a complement of processors, memory, and GPUs. The basic architecture of a compute cluster consists of a head node, which is the computer from which a user submits jobs to run, and compute nodes, which are a large number of computers on which the jobs can be run. It is also possible to log into a compute node and run jobs directly from there.
There are three compute nodes in my cluster, each with a different configuration:
- Compute Node 1: Nvidia Jetson Xavier NX with 16GB memory and Ubuntu 20.04.
- Compute Node 2: Nvidia Jetson Xavier NX with 8GB memory and Ubuntu 20.04.
- Compute Node 3: Nvidia Jetson Nano with 4GB memory. Follow the instructions here to upgrade the Jetson Nano's OS from Ubuntu 18.04 to 20.04.
I interact with the HPC cluster through the head node, which is a Raspberry Pi 4 Model B with 4GB RAM running Ubuntu 20.04. Basically, the head node manages job distribution among the compute nodes.
Below is an image of my cluster up and running.
An Orange Pi 5 Plus single-board computer with a 128GB SSD and 16GB of RAM is used for NFS (Network File System) storage. It provides a shared storage system that is accessible to all nodes in the cluster. The network switch acts as the central point of communication within the cluster, allowing data to be routed between the different devices. A router serves as the gateway to the public network, providing internet connectivity.
Setting Up NFS share on an Orange Pi 5 Plus board
HPC cluster computing requires a storage location that is shared across all of the nodes so they can work on the same files, so I will add an NFS sharing mechanism. I will use the Orange Pi 5 Plus with its 128GB SSD as the NFS server, and the compute nodes and head node as NFS clients.
Fortunately, setting up NFS sharing is a straightforward process. To begin, the NFS kernel server package must be installed on the master storage node using the following command:
sudo apt-get install -y nfs-kernel-server
Next, create a shared directory with:
sudo mkdir /mnt/mydrive/shared
Then, modify the permissions of the shared directory to 777:
sudo chmod -R 777 /mnt/mydrive/shared
Proceed to edit the NFS exports file (/etc/exports) by adding the following line:
/mnt/mydrive/shared *(rw,all_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000)
Afterward, update the NFS active exports with:
sudo exportfs -ra
This command updates the NFS server's list of exports and makes the shared directory available to the cluster nodes.
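You can optionally verify the active exports with:
sudo exportfs -v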
Installing the NFS Client and Mounting the Shared Directory
Install the NFS client on each node using the following command:
sudo apt install nfs-common -y
Create a directory for NFS storage:
sudo mkdir /nfs-storage
Set the ownership of the directory to nobody:nogroup and grant full permissions:
sudo chown nobody:nogroup -R /nfs-storage
sudo chmod 777 -R /nfs-storage
Edit /etc/fstab by adding a line in the format below, where IP_ADDRESS:/mnt/mydrive/shared refers to the IP address of the NFS server that exports the directory, and /nfs-storage is the local directory where we want to mount it.
sudo nano /etc/fstab
Add the line below:
192.168.0.104:/mnt/mydrive/shared /nfs-storage nfs defaults 0 2
Mount the storage to the directory using the command:
sudo mount -a
Add the hostname of every node in the cluster to the /etc/hosts file using the command:
sudo nano /etc/hosts
Then add the lines below to the file, mapping each IP address to its hostname.
192.168.0.103 node1
192.168.0.102 node2
192.168.0.101 node3
192.168.0.104 head
By following these steps, you'll successfully mount the NFS client and share directory on the specified machines in your cluster.
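A quick way to confirm the mount on each node is to check the filesystem listing:
df -h /nfs-storage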
Build Munge from Source
Munge is an authentication service for creating and validating credentials. It's crucial to ensure that the clocks, user accounts, and group IDs (UIDs and GIDs) are synchronized across all nodes in the cluster. Additionally, you need to establish passwordless SSH communication between the head and compute nodes. On the head node, execute the following commands:
Generate SSH keys
ssh-keygen
Copy the SSH key to the target node (replace "nodeX" with the actual node name)
ssh-copy-id jetson@nodeX
All services in the Slurm cluster need consistent UIDs and GIDs. To maintain consistency, you should create two users, namely slurm and munge, across all servers in the cluster. Munge should be installed on both the head node and the compute nodes.
You need to install essential development tools and dependencies:
sudo apt install build-essential libssl-dev ntp
Use the following commands to create a MUNGE user and a munge group on each node of the cluster:
export MUNGEUSER=1001
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
After completing these steps, you can proceed to build Munge from source using the instructions provided on GitHub. Clone the munge repository
git clone https://github.com/dun/munge.git
Configure the build with appropriate paths:
cd munge
./bootstrap
./configure \
--prefix=/usr \
--sysconfdir=/etc \
--localstatedir=/var \
--runstatedir=/run
Build the Munge software:
make
make check
sudo make install
On the head node, use the following command to create a new Munge key:
dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key
This command generates 1024 bytes of random data from /dev/random and saves it to the file /etc/munge/munge.key. Such operations are commonly employed in cryptography and security tasks to generate secure random keys or initialization vectors for various applications.
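The same munge.key must be present on every node in the cluster. One way to distribute it, assuming passwordless SSH and sudo rights for the jetson user (a sketch; adapt it to your setup), is to stage the key on the NFS share and copy it into place on each compute node:
sudo cp /etc/munge/munge.key /nfs-storage/munge.key
ssh jetson@node1 'sudo cp /nfs-storage/munge.key /etc/munge/munge.key'
Repeat the second command for node2 and node3, then remove the staged copy with sudo rm /nfs-storage/munge.key.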
Next, set the correct permissions for Munge on every server:
chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
chmod 0711 /run/munge/
To verify a successful Munge installation, execute the following commands:
munge -n | unmunge | grep STATUS
If the output shows "STATUS: Success (0)", the installation was successful. We can test MUNGE on a specific node (e.g., node1) over SSH:
ssh jetson@node1 munge -n | unmunge
Expected output (example):
ssh jetson@node1 munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: node1 (192.168.0.101)
ENCODE_TIME: 2023-09-08 06:37:53 +0000 (1694155073)
DECODE_TIME: 2023-09-08 06:37:53 +0000 (1694155073)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: ubuntu (1000)
GID: ubuntu (1000)
LENGTH: 0
If you encounter errors during testing, verify that the /etc/munge/munge.key file is identical across all nodes, then reboot all nodes and try the test again.
Installing MariaDB on Head node
I will now guide you through the process of installing MariaDB and configuring it to work with the Slurmdbd accounting tool. Slurmdbd is a service that collects usage statistics and requires a database management system such as MariaDB to store this data. MariaDB is a popular, open-source relational database management system that is often used as an alternative to MySQL.
To begin the installation process, you will first need to update the package index on your server using the following command:
sudo apt update
Then install the package:
sudo apt install mariadb-server
Once the installation is complete, it is important to run the security script to ensure that your database is secure:
sudo mysql_secure_installation
Next, you will need to access the MariaDB command line interface by running:
sudo mysql -u root -p
Once you are logged in, you can create the necessary database and user for Slurmdbd by running the following commands:
create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
exit
It is recommended that you change the password 'slurmdbpass' to something more secure.
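Slurmdbd itself reads its database connection settings from /etc/slurm/slurmdbd.conf. Below is a minimal sketch of that file, assuming the database and user created above; adjust DbdHost, the paths, and the password to your setup, and make the file readable only by the slurm user (chmod 600).
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
DebugLevel=info
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=slurmdbpass
StorageLoc=slurm_acct_db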
Building and installing OpenPMIx from source
OpenPMIx (Process Management Interface for Exascale) is used in conjunction with Slurm. The integration of OpenPMIx with Slurm enhances its capabilities, especially in HPC environments, by providing better process management, scalability, and support for advanced features.
To install OpenPMIx, I compile it from the source code and then install it into the shared NFS folder (/nfs-storage). The first step is to create the necessary directories within this shared folder:
mkdir -p ./git/pmix/build/4.2.8
mkdir -p ./git/pmix/install/4.2.8
The build directory will house the intermediate build files, while the install directory will hold the final installation.
Change to the pmix directory
cd ./git/pmix
Clone the PMIx source code repository from GitHub:
git clone https://github.com/pmix/pmix.git source
Change to the directory containing the source code:
cd source
Check out the specific branch you want to build (in this case, v4.2.8):
git checkout v4.2.8
Fetch the latest updates from the remote repository:
git pull
Initialize and update any submodules within the repository:
git submodule update --init --recursive
Generate configuration files using the autogen.pl script:
./autogen.pl
Change to the build directory created earlier (./git/pmix/build/4.2.8) and configure PMIx with the desired installation prefix:
/nfs-storage/git/pmix/source/configure --prefix=/nfs-storage/git/pmix/install/4.2.8
Build and install PMIx, redirecting any output to /dev/null:
sudo make -j install >/dev/null
Navigate to the installation directory. Then, check the installed PMIx version using the pmix_info command:
./bin/pmix_info
You should see output similar to:
Package: PMIx jetson@node3 Distribution
PMIX: 4.2.8rc1
PMIX repo revision: v4.2.8
PMIX release date: Unreleased developer copy
PMIX Standard: 4.2
PMIX Standard ABI: Stable (0.0), Provisional (0.0)
Prefix: /nfs-storage/git/pmix/install/4.2.8
Configured architecture: pmix.arch
Configure host: node3
Configured by: jetson
Configured on: Tue Jan 2 12:56:08 UTC 2024
Configure host: node3
Configure command line: '--prefix=/nfs-storage/git/pmix/install/4.2.8'
Built by: root
Built on: Tue Jan 2 13:02:46 UTC 2024
Built host: node3
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: "11" "." "4" "." "0"
Internal debug support: no
dl support: yes
Symbol vis. support: yes
Manpages built: yes
This command outputs information about the installed PMIx package, including its version, configuration details, and build environment.
Download and install SLURM on the head node
To set up SLURM on your head node, follow these steps:
Download SLURM from the official website using the wget command:
wget https://download.schedmd.com/slurm/slurm-23.02.5.tar.bz2
Unpack the downloaded archive:
tar xvjf slurm-23.02.5.tar.bz2
To build Slurm, ensure that you have the necessary dependencies installed. You can use the following command to install them:
sudo apt install libmysqlclient-dev libpam0g-dev libjson-c-dev libhttp-parser-dev libyaml-dev libreadline-dev libgtk-3-dev man2html libcurl4-openssl-dev
Additionally, install fpm by running:
sudo apt-get install ruby-dev build-essential && sudo gem i fpm -f
Configure SLURM with PMIx support and the desired settings. Replace the paths and options as needed. Here's an example configuration:
./configure --prefix=/nfs-storage/apps/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/usr/lib/aarch64-linux-gnu/security/ --without-shared-libslurm --with-pmix=/nfs-storage/git/pmix/install/4.2.8
Compile SLURM using the following commands:
make
make contrib
make install
cd ..
It normally takes about 10 minutes to complete. You can specify the number of CPU cores to use for the compilation process with the -j option.
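For example, to use all available cores:
make -j $(nproc)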
Create a DEB package from the SLURM installation:
fpm -s dir -t deb -v 1.0 -n slurm-23.02.5 --prefix=/usr -C /nfs-storage/apps/slurm-build .
Install the generated SLURM DEB package:
sudo dpkg -i /nfs-storage/slurm-23.02.5_1.0_arm64.deb
Set up a dedicated SLURM user
export SLURMUSER=1005
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
Follow similar steps to install SLURM on the compute nodes as well (covered below). Then copy the slurm.conf file from the head node to the compute nodes; the slurm.conf configuration file must be kept consistent across the entire cluster.
Create necessary directories and set permissions:
sudo mkdir -p /var/log/slurm /run/slurm /var/lib/slurm /etc/slurm /var/spool/slurm
Then set the corresponding ownership and permissions:
sudo chown slurm:slurm -R /var/log/slurm* /run/slurm* /var/lib/slurm* /etc/slurm* /var/spool/slurm*
An example slurm.conf configuration file is as follows:
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=head
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=gpu
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=0
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
#SelectType=select/cons_tres
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=localhost
AccountingStorageEnforce=limits
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreFlags=job_env,job_script
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node1 NodeAddr=IP_ADDRESS CPUs=4 State=UNKNOWN
NodeName=node2 NodeAddr=IP_ADDRESS CPUs=6 State=UNKNOWN
NodeName=node3 NodeAddr=IP_ADDRESS CPUs=6 State=UNKNOWN
PartitionName=jetsoncluster Nodes=node1,node2,node3 Default=YES MaxTime=INFINITE State=UP
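Since this configuration uses proctrack/cgroup, task/cgroup, and jobacct_gather/cgroup, slurmd also expects a cgroup.conf file in /etc/slurm on every node. A minimal sketch (the settings below are illustrative; tune them to your needs) could be:
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes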
Start the slurm services:
sudo systemctl daemon-reload
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
Because the services are enabled, systemd will start them automatically every time the node is turned on or rebooted.
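You can verify that the controller came up cleanly by checking the service status and the log file configured in slurm.conf:
sudo systemctl status slurmctld
sudo tail -n 20 /var/log/slurm/slurmctld.log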
Install SLURM on the compute nodes
Copy the slurm.conf file from the head node to each compute node. Now install SLURM on the worker nodes:
sudo dpkg -i /nfs-storage/slurm-23.02.5_1.0_arm64.deb
Reload systemd, enable, and start the slurmd daemon on the compute nodes:
sudo systemctl daemon-reload
sudo systemctl enable slurmd
sudo systemctl start slurmd
Complete this configuration on each of the compute nodes.
On a compute node, you can use the following command to view its hardware configuration:
slurmd -C
The command will display a comprehensive hardware configuration overview in the following format:
NodeName=node2 CPUs=6 Boards=1 SocketsPerBoard=3 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=6855
UpTime=0-00:21:57
Now we can run a test job by telling SLURM to give us 3 nodes, and run the hostname command on each of them:
sudo srun -N3 hostname
You should see output similar to:
node1
node2
node3
This should print out the hostname for all the nodes in the cluster. Also, use sinfo to check the status of the SLURM partitions and nodes.
sinfo
You should see something like:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
jetsoncluster* up infinite 3 idle node[1-3]
Instead, you might see something like this:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
jetsoncluster up infinite 1 down node1
jetsoncluster* up infinite 1 down* node3
jetsoncluster* up infinite 1 down node2
In case any node is down, use the scontrol command to update its state:
sudo scontrol update nodename=node3 state=resume
To verify available MPI plugin types on your system, you can execute the following command in your terminal:
sudo srun --mpi=list
This command will produce a list of compatible MPI plugin types
MPI plugin types are...
pmi2
pmix
cray_shasta
none
If there is ever a problem with Slurm, you can restart the Slurm services.
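For example, on the head node you would typically restart the controller and accounting daemons, and on a compute node the slurmd daemon:
sudo systemctl restart slurmctld slurmdbd
sudo systemctl restart slurmd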
This completes the installation and configuration of SLURM on the Nvidia Jetson cluster! We are now in a position to begin installing further libraries to allow parallel workloads.
Writing your first bash script for SLURM
There are two types of jobs in Slurm: interactive and batch. Interactive jobs enable you to type in commands while the job is running. Batch jobs, on the other hand, are self-contained sets of commands in a script that are submitted to the cluster for execution on the compute nodes.
Create a new file for your SLURM job script. In this example, we'll name it my_slurm_job.sh.
touch my_slurm_job.sh
Content of SLURM batch file
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --partition=jetsoncluster
echo 'Here starts the calculation'
Next, make the script executable by running the following command:
chmod +x my_slurm_job.sh
Submit the script to the SLURM scheduler
sbatch my_slurm_job.sh
The sbatch command will submit the job to the SLURM scheduler, which will then allocate resources and run the job.
To get information about running or waiting jobs, use squeue.
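For instance, to list only your own jobs:
squeue -u $USER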
Once the job is completed, you'll find the output file in the directory where you submitted the job, named slurm-<jobid>.out. It should contain:
Here starts the calculation
You've successfully written and submitted your first SLURM job script. Adjust the directives and commands based on your specific requirements and application.
Install Python
In this section, I'll provide a step-by-step guide to setting up Python on the head node, ensuring it's also accessible to the compute nodes via NFS.
Begin by switching to superuser mode on the head node. Execute the command:
sudo su -
To run an interactive job you need to use the srun command. To start an interactive bash session on node1, run:
srun --nodelist=node1 --pty bash
Obtain Python by downloading it from its official source:
wget https://www.python.org/ftp/python/3.8.15/Python-3.8.15.tgz
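Extract the archive and change into the source directory:
tar xzf Python-3.8.15.tgz
cd Python-3.8.15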
Next, configure the Python installation with the following settings:
./configure \
--enable-optimizations \
--prefix=/nfs-storage/usr \
--with-ensurepip=install
Compile Python using make:
make -j
Install the built version of Python by running:
make install
Verify the Python installation across three nodes with a simple print command:
sudo srun --nodes=3 /nfs-storage/usr/bin/python3 -c "print('Hello')"
This should output 'Hello' three times, once from each node.
Hello
Hello
Hello
Confirm the version of pip installed on these nodes:
sudo srun --nodes=3 /nfs-storage/usr/bin/pip3 --version
The output should display the pip version (e.g., pip 22.0.4) along with its directory path for each node.
pip 22.0.4 from /nfs-storage/usr/lib/python3.8/site-packages/pip (python 3.8)
pip 22.0.4 from /nfs-storage/usr/lib/python3.8/site-packages/pip (python 3.8)
pip 22.0.4 from /nfs-storage/usr/lib/python3.8/site-packages/pip (python 3.8)
By following these steps, you can successfully establish a shared Python environment across your head node and compute nodes via NFS.
Build MPICH for parallel processing
Message Passing Interface (MPI) is one of the core technologies that underlies HPC, and it allows you to perform distributed programming. It has many implementations. Here, we will see the installation of MPICH.
There is a one-line shell command that can install an MPI stack for Python from the package repositories. However, on this setup it will fail.
sudo apt-get install python-mpi4py
So the recommended way is to build it from source. Download the latest version of MPICH from this link:
wget https://www.mpich.org/static/downloads/4.1.2/mpich-4.1.2.tar.gz
Unpack it and change into that folder:
tar xf mpich-4.1.2.tar.gz
cd mpich-4.1.2
Now, we can build the package.
./configure --prefix=/nfs-storage/usr && make -j $(nproc) && make install
To confirm that MPICH was successfully installed, run:
/nfs-storage/usr/bin/mpiexec --version
This will print the version of MPI installed:
Version: 4.1.2
Release Date: Wed Jun 7 15:22:45 CDT 2023
The general format for launching a code on multiple processes is:
/nfs-storage/usr/bin/mpiexec -n <number_of_processes> <command_to_launch_code>
For this demonstration, we will create a python script called example.py that prints the message "Hello World!" to the console. This script will be executed 10 times simultaneously using MPI to demonstrate parallel processing.
First, let's create the Python script. In a text editor, enter the following code and save it as example.py:
if __name__ == "__main__":
    print("Hello World!")
Next, we will use the mpiexec command to run the script 10 times in parallel. The -n option specifies the number of processes to launch, and the path to the Python interpreter and the script are also specified. Here is the command to run:
/nfs-storage/usr/bin/mpiexec -n 10 /nfs-storage/usr/bin/python3 example.py
You should see output similar to the following:
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
When you execute the above command, mpiexec launches 10 instances of python example.py simultaneously, each of which prints "Hello World!".
For a more advanced example, let's run the script on multiple nodes using a hostfile. A hostfile specifies the list of nodes and the number of processes (slots) to run on each node; a sketch of one is shown after the command. With a hostfile in place, launch the script across the nodes:
/nfs-storage/usr/bin/mpiexec --hostfile ./hostfile -n 16 /nfs-storage/usr/bin/python3 example.py
The hostfile lists the cluster's compute nodes along with the number of slots (cores) available on each.
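A hostfile for this cluster might look like the following; the node names match the /etc/hosts entries above, and the slot counts are illustrative, so adjust them to your hardware:
node1:4
node2:4
node3:4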
Next, we will create a C program that uses MPI to print the message "Hello, world" from each process. Here is the code for the program, called "hello.c":
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);      /* initialize the MPI environment */
    printf("Hello, world!\n");   /* each process prints the message */
    MPI_Finalize();              /* clean up the MPI environment */
    return 0;
}
To compile the program, use the mpicc command:
mpicc hello.c -o hello
This will create an executable file called "hello".
Run the command below:
/nfs-storage/usr/bin/mpiexec --hostfile ./hostfile -n 12 ./hello
This will launch 12 processes in total, 4 on each of the 3 nodes. You should see output similar to the following:
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
This demonstrates that the program was executed in parallel on multiple nodes using MPI.
Install mpi4py - MPI for python
I will build the mpi4py package on the NFS server, which eliminates the need to build it individually on each node. By doing so, we can save time and resources, as well as ensure that all nodes have access to the same version of the package.
Download the mpi4py package:
wget https://github.com/mpi4py/mpi4py/releases/download/3.1.5/mpi4py-3.1.5.tar.gz
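Unpack the archive and change into the source directory:
tar xzf mpi4py-3.1.5.tar.gz
cd mpi4py-3.1.5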
Build the mpi4py package using the following command:
/nfs-storage/usr/bin/python3 setup.py build --mpicc=/nfs-storage/usr/bin/mpicc
/nfs-storage/usr/bin/mpicc is my build path for mpich. Replace it with the path on your device.
Finally, install mpi4py:
/nfs-storage/usr/bin/python3 setup.py install --prefix=/nfs-storage/usr
Check the versions of mpi4py in every compute node:
sudo srun --nodes=3 /nfs-storage/usr/bin/python3 -c "import mpi4py; print(mpi4py.__version__)"
You should be able to see the following:
3.1.5
3.1.5
3.1.5
Here is a sample batch script that runs an MPI program to calculate the value of pi in parallel:
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --nodelist=node1,node2,node3
#SBATCH --output=output_%j.txt
sudo srun -n 12 --mpi=pmix /nfs-storage/usr/bin/python3 calculate.py
Here, the script uses three nodes (node1, node2, and node3) and runs 12 processes in parallel using the PMIx plugin to calculate the value of pi with the Python script calculate.py:
from mpi4py import MPI
from math import pi as PI
from numpy import array

def get_n():
    # The number of intervals is hard-coded because the job runs non-interactively
    prompt = "Enter the number of intervals: (0 quits) "
    try:
        n = 100
        if n < 0:
            n = 0
    except:
        n = 0
    return n

def comp_pi(n, myrank=0, nprocs=1):
    # Each rank sums every nprocs-th term of the midpoint rule for 4/(1+x^2)
    h = 1.0 / n
    s = 0.0
    for i in range(myrank + 1, n + 1, nprocs):
        x = h * (i - 0.5)
        s += 4.0 / (1.0 + x**2)
    return s * h

def prn_pi(pi, PI):
    message = "pi is approximately %.16f, error is %.16f"
    print(message % (pi, abs(pi - PI)))

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()
myrank = comm.Get_rank()

n = array(0, dtype=int)
pi = array(0, dtype=float)
mypi = array(0, dtype=float)

while True:
    if myrank == 0:
        _n = get_n()
        n.fill(_n)
    comm.Bcast([n, MPI.INT], root=0)
    if n == 0:
        break
    _mypi = comp_pi(n, myrank, nprocs)
    mypi.fill(_mypi)
    comm.Reduce([mypi, MPI.DOUBLE], [pi, MPI.DOUBLE], op=MPI.SUM, root=0)
    if myrank == 0:
        prn_pi(pi, PI)
    break  # get_n() always returns a fixed n, so run a single iteration in batch mode
The output of the job is saved in a file named output_%j.txt, where %j is replaced by the job ID. Finally, you can watch the demo video to see the Slurm cluster in action.
Congratulations, you now have a functioning Slurm cluster. Now our system is ready to take any parallel computing application that you want to develop.
(optional) Running LLM on the SLURM cluster using Llama.cpp
I am currently running a Large Language Model (LLM) on the Slurm cluster using the Llama.cpp program, but I am encountering an issue when running it via MPI. It is important to note that using MPI for the inference of large language models generally results in slower performance compared to running without MPI, and for a single request such as this the benefits of parallel processing may not be apparent. However, a cluster lets us load large models into RAM that cannot fit on a single node.
To begin, you will need to clone the Llama.cpp repository from GitHub using the following command:
git clone https://github.com/ggerganov/llama.cpp.git
Next, navigate to the cloned directory using the command:
cd llama.cpp/
Once you are in the cloned directory, compile the code using the following command:
make CC=mpicc CXX=mpicxx LLAMA_MPI=1
Next, download the model and place it into the models subfolder. You can find various quantized models on the Hugging Face model hub.
Finally, run the program with the following command:
mpirun -hostfile hosts -n 12 /nfs-storage/llama.cpp/build/bin/main --color --model "/nfs-storage/models/mistral-7b-v0.1.Q4_K_M.gguf" -p "Who is Robert Oppenheimer" -n 128
I am experiencing an issue with running llama.cpp via MPI. It might be due to the llama.cpp package. Troubleshooting is ongoing [06/01/24].
I received an answer from Georgi Gerganov, the developer of the llama.cpp project.
You may want to run a large language model locally on your own cluster for many reasons. I'm doing it because I want to gain a deeper understanding of LLMs and clusters, including how to operate them more efficiently and quickly. And I am deeply curious about the process and love playing with it. You may have your own reasons for doing it.
It would be a great idea to consider the limitations of my Slurm cluster. Here are some of those limitations:
Limitations
- Hardware and Software Inconsistency: Due to the heterogeneous nature of my cluster, which consists of Nvidia Jetson boards that have fewer cores and possibly less computational power than standalone GPU/CPU servers, achieving even workload distribution may not be possible. MPI typically assumes identical hardware configurations and distributes work equally by default, which may not be optimal in this situation. Ideally, compute nodes should have similar computational characteristics and the same OS version. Additionally, precompiled software is not always fully optimized for the hardware of the HPC cluster. The Raspberry Pi 4B and NVIDIA Jetson boards are ARM-based, which may limit the availability of certain pre-built software.
- Network Overhead: When you distribute a task across multiple nodes, there is an inherent overhead associated with communication between these nodes. Data has to be sent back and forth, which can add significant time, especially if the network is not optimized for high throughput or low latency.
- No GPU support: Although Nvidia Jetson boards have integrated GPUs, they currently lack full integration with Slurm, the workload management system. This means Slurm cannot directly schedule tasks on the Jetson GPUs, which limits your ability to leverage their processing power efficiently.
- Amdahl's Law: Adding more computing nodes doesn't always guarantee proportional performance gains. Amdahl's Law states that the overall speedup of a system is limited by the sequential portion of the task, regardless of the amount of parallel processing power available. Simply adding more Jetson boards may not significantly improve performance if sequential bottlenecks exist in your workflow. Carefully analyze your workload to identify potential bottlenecks and address them before expanding the cluster further.
Thank you for reading!
References
- Installing SLURM on Jetson Nano
- MPI and UPC Users Guide
- KISS, DRY ‘n SOLID — Yet another Kubernetes System built with Ansible and observed with Metrics Server on arm64
- Building a Raspberry Pi HPC cluster
- Building a Slurm cluster
- Slurm Introduce
- SLURM-installation on Ubuntu
- Building a Raspberry Pi Cluster
- Building a Raspberry Pi Cluster - Part III
- Setup Slurm Cluster
- Installing Slurm on Ubuntu20.04 (AWS EC2 Instance)
- A beginner's guide to SLURM
- Conception, Construction and Evaluation of a Raspberry Pi Cluster
- Build it and you will learn: Building an RPi2 cluster
- Raspberry-Pi HPC Cluster
- Running Llama.cpp on the cluster we have made