Setting up a Beowulf Cluster Using Open MPI on Linux
I have been doing a lot of work recently on Linear Genetic Programming. This requires a great deal of processing power, and to meet this need I have been using Open MPI to create a Linux cluster. What follows is a quick guide to getting a cluster running. The basics really are very simple and, depending on the size, you can get a simple cluster running in less than half an hour, assuming you already have the machines networked and running Linux.
What is a Beowulf Cluster?
A Beowulf Cluster is a collection of privately networked computers which can be used for parallel computations. They consist of commodity hardware running open source software such as Linux or a BSD, often coupled with PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). A standard set-up will consist of one master node which controls a number of slave nodes. The slave nodes are typically headless and generally all access the same files from a server. They have been referred to as Beowulf Clusters since before 1998, when the original Beowulf HOWTO was released.
What is Open MPI?
Open MPI is one of the leading MPI-2 implementations, used by many of the TOP500 supercomputers. It is open source and is developed and maintained by a consortium of academic, research, and industry partners. The Open MPI project has the stated aim of building the best Message Passing Interface (MPI) library available, and judging by where it is used, I would say they are well on their way to doing so.
The main way that it is used is to start a program from the master node and specify how many processes you want to use; the program is then started for each of those processes, whether this is on the same machine or over multiple machines. The processes can communicate using the MPI library, and all data sent to STDOUT from the slave nodes is piped to STDOUT on the master node.
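As a minimal illustration of that model, a program along the following lines simply reports each process's rank and the total number of processes; every process runs the same code, and everything it prints ends up on the master node's STDOUT.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int numTasks;
  int rank;

  // Start up the MPI environment for this process
  MPI_Init(&argc, &argv);

  // Find out how many processes were started and which one this is
  MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // This line is printed once per process; output from the slave nodes
  // is piped back to STDOUT on the master node.
  printf("Process %d of %d reporting in\n", rank, numTasks);

  MPI_Finalize();
  return 0;
}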
Setting up the Machines
I am making several assumptions in this article:
- You have at least two machines connected via a TCP/IP network.
- Each machine is using the same processor architecture, to ease sharing your program executables.
- The machines all have Linux installed.
- Each machine has a common login user name, such as mpiuser. I will refer to the common login as mpiuser for the rest of this article.
- Each machine has mpiuser sharing the same home folder, usually through NFS, or has the relevant parts of the home folder synchronised, perhaps with rsync (a rough sketch of this follows the list).
- I will refer to each of the slave nodes as slave1, slave2, etc. You, however, can choose whatever host names you like.
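If you are synchronising the home folder rather than sharing it, a rough sketch of doing that with rsync from the master node might look like the following (the paths and options here are only an illustration):
$ rsync -av /home/mpiuser/ mpiuser@slave1:/home/mpiuser/
$ rsync -av /home/mpiuser/ mpiuser@slave2:/home/mpiuser/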
Installing and Configuring Open MPI on all Nodes
Under Debian Lenny you need to install the following packages:
- openmpi-bin
- openmpi-common
- libopenmpi1
- libopenmpi-dev (Not actually needed for a slave node.)
$ apt-get install openmpi-bin openmpi-common libopenmpi1 libopenmpi-dev
The names may vary on your distribution, but the most important thing to remember is that you should have the same version of Open MPI on all the machines for it to work properly.
Under Debian Lenny, the Open MPI executables install into /usr/bin. If they install into a different location, then you will need to ensure that this location is in the PATH for mpiuser. You may also need to ensure that LD_LIBRARY_PATH points to /usr/lib.
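For example, if Open MPI were installed under a different prefix such as /opt/openmpi (a hypothetical location, not where the Debian packages put it), you could add something like the following to mpiuser's ~/.bashrc on every node:
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH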
Slave Nodes
So that the master node can control the slave nodes, each slave node needs an SSH server installed. Under Debian, install the following package:
- openssh-server
$ apt-get install openssh-server
Master Node
Setting up SSH
Before continuing with setting up OpenSSH, please refer to their site; I am providing just enough information here to get the cluster working and make no guarantees as to how secure this method is.
To be able to control the slave nodes from the master node you need an SSH client installed. Under Debian, install the following package:
- openssh-client
$ apt-get install openssh-client
While logged in as mpiuser, create SSH public/private key pairs with the password h475k2!4553ffe (choose your own password), using the file /home/mpiuser/.ssh/id_dsa:
$ ssh-keygen -t dsa
Make sure that each machine knows that this user is authorized to log into them:
$ cp /home/mpiuser/.ssh/id_dsa.pub /home/mpiuser/.ssh/authorized_keys
If you are not sharing the /home/mpiuser folder, then make sure that each slave node knows that mpiuser on the master node is authorized to log into them. For slave1 use:
$ scp /home/mpiuser/.ssh/id_dsa.pub mpiuser@slave1:.ssh/authorized_keys
Correct the file permissions (this will also need to be done on the slave nodes if you are not sharing the home folder):
$ chmod 700 /home/mpiuser/.ssh
$ chmod 600 /home/mpiuser/.ssh/authorized_keys
To test that this has worked, try using SSH to connect to your slave nodes from the master, using the password you entered when creating the key pair above:
$ ssh slave1
Configuring Open MPI
To let Open MPI know which machines to run your programs on, you can create a file to store this. I will call this file /home/mpiuser/.mpi_hostfile and it could contain the following:
# The Hostfile for Open MPI
# The master node, 'slots=2' is used because it is a dual-processor machine.
localhost slots=2
# The following slave nodes are single processor machines:
slave1
slave2
slave3
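If any of your slave nodes have more than one processor or core, you can give them slots entries as well; a hypothetical variant of the file above, assuming slave1 is a quad-core machine, would be:
# The master node, a dual-processor machine
localhost slots=2
# A hypothetical quad-core slave
slave1 slots=4
# Single processor slaves
slave2
slave3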
Compiling Programs
Open MPI can be used with a variety of languages; two of the most popular are FORTRAN and 'C'. If your programs are written in 'C', then you can either use mpicc instead of your normal 'C' compiler, or you can pass the additional arguments directly to your 'C' compiler. With mpicc, the arguments you pass are passed on to your normal 'C' compiler.
If you want to use mpicc to compile a 'C' source file called testprogram.c:
$ mpicc testprogram.c
If you want to see what will be passed to your 'C' compiler when compiling and linking:
$ mpicc -showme
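Because the extra arguments are handed straight on to your 'C' compiler, you can also include your usual compiler flags; for example (the flags here are just an illustration):
$ mpicc -O2 -o testprogram testprogram.c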
Running Your Programs on the Cluster
Ensure That SSH Doesn't Ask For a Password
Because Open MPI will use SSH to connect to each of the machines and run your program, you need to ensure that the password doesn't have to be entered for each connection.
Use ssh-agent to remember the password while logged in as mpiuser:
$ eval `ssh-agent`
Tell ssh-agent the password for the SSH key:
$ ssh-add ~/.ssh/id_dsa
Test by logging into one of the slave nodes, while logged in as mpiuser; this shouldn't ask for a password:
$ ssh slave1
Starting a Program on the Cluster
You can start a program on just one machine and have it execute on multiple processors/cores. This is handy if you have a powerful machine or are debugging/testing a program. Alternatively, you can run the program over your cluster and request as many processes as you want to be run over it.
To run myprogram on two processes on the local machine:
$ mpirun -np 2 ./myprogram
To run myprogram over five processes on the cluster using the .mpi_hostfile created above (note that myprogram must be in the same location on each machine):
$ mpirun -np 5 --hostfile .mpi_hostfile ./myprogram
An Example Test Program
The following program will send a random number from the master node to all the slave nodes.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  const int MASTER = 0;
  const int TAG_GENERAL = 1;

  int numTasks;
  int rank;
  int source;
  int dest;
  int rc;
  int count;
  int dataWaitingFlag;

  char inMsg;
  char outMsg;

  MPI_Status Stat;

  // Initialize the MPI stack and pass 'argc' and 'argv' to each slave node
  MPI_Init(&argc, &argv);

  // Gets number of tasks/processes that this program is running on
  MPI_Comm_size(MPI_COMM_WORLD, &numTasks);

  // Gets the rank (process/task number) that this program is running on
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // If the master node
  if (rank == MASTER) {
    // Send out messages to all the sub-processes
    for (dest = 1; dest < numTasks; dest++) {
      outMsg = rand() % 256;  // Generate random message to send to slave nodes

      // Send a message to the destination
      rc = MPI_Send(&outMsg, 1, MPI_CHAR, dest, TAG_GENERAL, MPI_COMM_WORLD);
      printf("Task %d: Sent message %d to task %d with tag %d\n",
             rank, outMsg, dest, TAG_GENERAL);
    }
  }
  // Else a slave node
  else {
    // Wait until a message is there to be received
    do {
      MPI_Iprobe(MASTER, 1, MPI_COMM_WORLD, &dataWaitingFlag, MPI_STATUS_IGNORE);
      printf("Waiting\n");
    } while (!dataWaitingFlag);

    // Get the message and put it in 'inMsg'
    rc = MPI_Recv(&inMsg, 1, MPI_CHAR, MASTER, TAG_GENERAL, MPI_COMM_WORLD, &Stat);

    // Get how big the message is and put it in 'count'
    rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
    printf("Task %d: Received %d char(s) (%d) from task %d with tag %d\n",
           rank, count, inMsg, Stat.MPI_SOURCE, Stat.MPI_TAG);
  }

  MPI_Finalize();
}
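The MPI_Iprobe loop above busy-waits purely so that the slave nodes visibly print "Waiting"; in your own programs you would more likely block until the message arrives, for example by replacing that loop with a call to MPI_Probe (or by simply calling MPI_Recv straight away), along these lines (using the same variables as the program above):
    // Block until a message with the right tag arrives from the master
    MPI_Probe(MASTER, TAG_GENERAL, MPI_COMM_WORLD, &Stat);
    MPI_Recv(&inMsg, 1, MPI_CHAR, MASTER, TAG_GENERAL, MPI_COMM_WORLD, &Stat);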
If you save the above source code to a file called testprogram.c then compile it using:
$ mpicc testprogram.c
Then run it either over 20 processes on the local machine:
$ mpirun -np 20 ./a.out
Or over 5 processes on the cluster:
$ mpirun -np 5 --hostfile .mpi_hostfile ./a.out
You should see on the master node that the output to STDOUT from the slave nodes is redirected to STDOUT on the master node. The master node will output what it is sending, and each slave node will output what it has received or that it is waiting. Once each node has finished, the master node will exit and return you to the command prompt.