MPI/OpenMP hybrid parallelization in LIGGGHTS

Motivation

The MPI/OpenMP hybrid parallelization in LIGGGHTS is a new way to achieve better load balancing within simulations. Traditional MPI simulations can only be load-balanced using fix/balance. The following figure illustrates what a static 8x1 decomposition might look like. In this example, colors indicate areas which require a high amount of computational work (red) in contrast to areas which leave processors idle (blue). With a static decomposition, processes at the boundaries of the domain can become idle quickly. To avoid this, MPI dynamic load balancing was introduced with fix/balance. It adjusts the subdomain boundaries so that each subdomain contains an approximately equal number of particles.

images/1Ddecomposition.png

MPI load balancing works very well with simple decompositions. However, load imbalance remains a dominant issue as the number of cores grows. At some point, decomposing along one dimension alone no longer adds any benefit, and adding cuts in additional dimensions only works well if the load is symmetric. The following figure shows how the symmetric test case can easily be computed using 16 processes instead of 8 by adding another decomposition along the y-axis.

images/2Ddecomposition.png

Unsymmetric loads are much harder to decompose, and the penalty for getting it wrong is significant: simulations can take 2x-4x more time if the decomposition is chosen poorly and MPI load balancing is used improperly. The following figure illustrates a simulation which is harder to decompose because the cuts in space span the entire domain.

images/2Ddecomposition_unsymmetric.png

A hybrid parallelization which uses both MPI and OpenMP allows us to reduce the number of MPI processes and let automatic partitioning figure out a good decomposition inside each subdomain. The next figure shows how the hybrid version can use only two MPI processes to cut the domain in half along the y-axis and then partition each subdomain among 8 threads. In other words, the hybrid parallelization adds a second layer of parallelization to LIGGGHTS which is more flexible and performs automatic load balancing.

images/2Ddecomposition_hybrid.png

The main idea behind the hybrid parallelization is therefore to first split the domain into large, potentially MPI-load-balanced portions. Each MPI subdomain is then further partitioned for a given number of threads.

images/Hybrid_Overview.png

Installation

The hybrid parallelization is implemented using the OpenMP standard. This threading standard is supported by most compiler vendors by now, although some compilers still only support it in their development versions. To compile LIGGGHTS with hybrid parallelization we need additional compiler flags and additional implementation files which replace some core components of the usual LIGGGHTS integration loop with threaded versions.

Prerequisites

  • GCC >= 4.7

  • Zoltan Library 3.6

Compiling & Installing Zoltan

Zoltan is a library containing many useful partitioning and load-balancing algorithms. We utilize this library in our implementation. Before compiling LIGGGHTS with hybrid parallelization, we need to compile and install this library as follows:

cd LIGGGHTS-PFM/
git submodule init       # register the bundled Zoltan submodule
git submodule update     # fetch the Zoltan sources into lib/zoltan
cd lib/zoltan/
mkdir BUILD              # out-of-source build directory
cd BUILD
../configure
make everything
make install

Compiling LIGGGHTS with hybrid parallelization

Once all the prerequisites are met, we can compile a hybrid version of LIGGGHTS using the hybrid makefile.

cd LIGGGHTS-PFM/src/
make -j 4 hybrid
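
Once built, the binary is launched through MPI as usual; the number of threads per process is set later by the package omp command in the input script (see Basic Usage). A minimal launch sketch, assuming the makefile names the binary lmp_hybrid and the input script is called in.mixer (both names are illustrative):

# 2 MPI processes; threads per process are set by 'package omp' in the input script
mpirun -np 2 ./lmp_hybrid -in in.mixer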

Basic Usage

Using this new parallelization of LIGGGHTS requires a binary compiled as described in the previous section. This version of LIGGGHTS supports additional styles and fixes, which all carry an additional /omp suffix.

Package OMP

Prior to using any OpenMP styles or fixes, one must enable OpenMP by using the package command.

package omp 8 force/neigh thread-binding verbose

The thread-binding option forces each thread to be bound to a CPU core using SMP-style numbering, i.e. it assumes that cores which are next to each other are numbered consecutively.

Partitioning of Data

Pair styles and wall fixes require the particle data to be partitioned; each thread then operates on one of the partitions. Currently only a single partitioner is implemented, based on the Zoltan library. Key-value pairs passed as arguments to the partitioner_style command are forwarded 1:1 to the Zoltan library.

partitioner_style zoltan RCB_REUSE 1

By default, the partitioner uses the following options:

Option                   Default        Description
DEBUG_LEVEL              0              controls debug output
LB_METHOD                RCB            selects RCB as the partitioning algorithm
NUM_GID_ENTRIES          1              number of global IDs per element (DO NOT CHANGE)
NUM_LID_ENTRIES          1              number of local IDs per element (DO NOT CHANGE)
NUM_LOCAL_PARTS          NUM_THREADS    number of partitions generated; always equal to the number of threads, which is set by default (DO NOT CHANGE)
RETURN_LISTS             PARTS          tells Zoltan to return a list mapping particles to parts (DO NOT CHANGE)
KEEP_CUTS                1              Zoltan keeps information from past partitionings (DO NOT CHANGE)
RCB_REUSE                0              Zoltan reuses past partitioning information for repartitioning; recommended to set to 1
RCB_OUTPUT_LEVEL         0              RCB-specific debug output
RCB_RECTILINEAR_BLOCKS   0              RCB-specific option

Additional information on these options can be found in the Zoltan documentation.
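
Since the key-value pairs are forwarded 1:1 to Zoltan, several of the options above can be overridden in a single partitioner_style command. A sketch enabling cut reuse and RCB debug output (the values are chosen purely for illustration):

partitioner_style zoltan RCB_REUSE 1 RCB_OUTPUT_LEVEL 1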

Pair Styles

All granular pair styles have an OpenMP implementation. To select one, simply use gran/omp instead of gran as the pair_style.

pair_style gran/omp model hertz tangential history
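
As with the serial gran styles, the pair style declaration is followed by the usual coefficient setup. A minimal sketch, assuming the standard LIGGGHTS pair_coeff form:

pair_style gran/omp model hertz tangential history
pair_coeff * *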

Meshes

Meshes of type mesh/surface which are used by a wall fix must be replaced by their OpenMP version, mesh/surface/omp.

fix cadMix1 all mesh/surface/omp file meshes/Mixer.stl type 1

Walls

Walls of type wall/gran should be replaced by their OpenMP version, wall/gran/omp.

fix meshes all wall/gran/omp model hertz tangential history mesh n_meshes 12 meshes cadShaft cadBlade1 cadBlade2 cadBlade3 cadBlade4 cadMix1 cadMix2 cadMix3 cadMix4 cadMix5 cadMix6 cadDrum

Other Fixes

The implementation of the hybrid parallelization is not yet feature complete, and some fixes might not need an OpenMP version at all. The following fixes have been optimized so far.

Gravity

The gravity fix should be replaced by gravity/omp.

fix gravi all gravity/omp 9.81 vector 0.0 0.0 -1.0

Integration

The nve/sphere integration fix should be replaced by nve/sphere/omp.

fix integr nve_group nve/sphere/omp
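
Putting the pieces together, the OpenMP-related portion of an input script might look like the following skeleton. This is only a sketch assembled from the commands shown above; the group, mesh file and material model are placeholders taken from the earlier examples:

package omp 8 force/neigh thread-binding verbose
partitioner_style zoltan RCB_REUSE 1
pair_style gran/omp model hertz tangential history
pair_coeff * *
fix gravi all gravity/omp 9.81 vector 0.0 0.0 -1.0
fix cadMix1 all mesh/surface/omp file meshes/Mixer.stl type 1
fix meshes all wall/gran/omp model hertz tangential history mesh n_meshes 1 meshes cadMix1
fix integr all nve/sphere/omp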

Restrictions

The MPI/OpenMP hybrid implementation can only be used if LIGGGHTS was built with the USER-OMP and USER-ZOLTAN packages. See the Making LAMMPS section for more info.

Insertion of particles is currently not optimized with OpenMP.