Parallelization
Introduction
\texttt{COMBI} is parallelized using a hybrid \texttt{MPI}-\texttt{OpenMP} algorithm. The processes, which communicate using \texttt{MPI}, each contain one or multiple bunches and simultaneously apply different calculations to different bunches (MIMD). Each \texttt{MPI} process is additionally multithreaded using \texttt{OpenMP}, which parallelizes the application of the same action to all the macro-particles in a bunch (SIMD).
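A minimal sketch of this hybrid layout is given below. It is only illustrative and not taken from the \texttt{COMBI} source: the names \texttt{Bunch}, \texttt{rotate_phase_space} and \texttt{N_MACRO} are hypothetical. Each \texttt{MPI} rank holds its own bunch data, and an \texttt{OpenMP} loop applies the same action to every macro-particle of that bunch.
```c
/* Illustrative hybrid MPI-OpenMP sketch (not COMBI source code):
 * each MPI rank owns its own bunch(es) and may execute a different action
 * than the other ranks (MIMD), while OpenMP applies the same action to all
 * macro-particles of a bunch in parallel (SIMD-like). */
#include <mpi.h>
#include <omp.h>

#define N_MACRO 100000                    /* hypothetical number of macro-particles */

typedef struct {
    double x[N_MACRO];                    /* hypothetical transverse coordinate */
    double px[N_MACRO];                   /* hypothetical transverse momentum   */
} Bunch;

/* Example action: rotate every macro-particle in phase space */
static void rotate_phase_space(Bunch *b, double c, double s)
{
    #pragma omp parallel for              /* same action on all macro-particles */
    for (int i = 0; i < N_MACRO; ++i) {
        double x  = b->x[i];
        double px = b->px[i];
        b->x[i]   =  c * x + s * px;
        b->px[i]  = -s * x + c * px;
    }
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static Bunch bunch;                   /* each rank holds its own bunch data  */
    rotate_phase_space(&bunch, 0.8, 0.6); /* this rank's current action          */

    MPI_Finalize();
    return 0;
}
```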
Challenges caused by causality
The need to preserve causality can create a bottleneck in coherent multi-bunch simulations. This bottleneck can occur when there are both inter-beam and intra-beam bunch-bunch interactions (the difference is explained in the Physics section). The source of the bottleneck is explained in greater detail in [1]. In short, the problem is that:
- inter-beam bunch-bunch interactions (beam-beam) can be parallelized efficiently by having different bunches perform different actions simultaneously
- intra-beam bunch-bunch interactions (e.g. wakefields) can be parallelized efficiently by having all the bunches perform the same action simultaneously
When a simulation contains both, which is the purpose of \texttt{COMBI}, these contradictory strategies obviously cannot both be followed. The parallel \texttt{MPI} algorithm in \texttt{COMBI} is designed to be as efficient as possible by expanding and circumventing this bottleneck.
Parallel algorithm
The key points of the parallel algorithm in \texttt{COMBI} are:
- The actions for each bunch are put in pipelines: This controls the order and details of the actions per bunch.
- The bunches are autonomous: Each bunch knows which actions it has to do, in which order to do them, and with which bunches it has to communicate (a possible data layout is sketched after this list).
- The communication between bunches is minimal and asynchronous: A bunch only communicates with other bunches when necessary to preserve causality. Messages can be sent even if the recipient is not ready to receive.
- The actions requiring communication are split into two, first sending then receiving messages: The impact of an interaction on a bunch will not be calculated before the bunch has received the required messages from the other bunch(es). This preserves causality.
- There can be multiple bunches per process: Due to causality, a bunch may at times have to wait for messages from other bunches. With multiple bunches per process, the stalling of one bunch does not necessarily stall the CPU and cause a loss of efficiency.
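The state that each bunch could carry according to the list above is sketched below. The type and field names (\texttt{Action}, \texttt{BunchState}, etc.) are hypothetical and do not correspond to the actual \texttt{COMBI} data structures.
```c
/* Hypothetical sketch of per-bunch state; illustrative names only. */
#include <stdbool.h>

typedef struct {
    int  type;             /* what to do, e.g. transport, wakefield kick, beam-beam kick */
    int  partner_bunch;    /* bunch to communicate with, or -1 if no communication       */
    bool sends_message;    /* first half of a split interaction: send own data           */
    bool receives_message; /* second half: wait for the partner's data before the kick   */
} Action;

typedef struct {
    Action *pipeline;      /* ordered actions for this bunch (order and details per bunch) */
    int     n_actions;     /* number of actions in the pipeline                            */
    int     step;          /* index of the next action to execute                          */
    bool    bunch_is_done; /* true once all actions in all turns have been executed        */
} BunchState;
```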
How this works in practice can be understood from the following pseudo code for the calculations done by one \texttt{MPI} process:
```
Pipeline(rank)
    PARSE input_files
    CREATE bunches(rank)
    for bunch in bunches do
        INITIALIZE bunch.step to 0
        INITIALIZE bunch.bunch_is_done to false
        CREATE bunch.pipeline
    while (not all_bunches_are_done) do
        for bunch in bunches do
            // Skip the bunch if it is done (with all actions in all turns)
            if (bunch.bunch_is_done) then
                CONTINUE
            SET action to bunch.pipeline[bunch.step]
            // Skip the bunch for now if the action cannot be done yet
            if (need_to_send AND cannot_send) then
                CONTINUE
            else if (need_to_receive AND cannot_receive) then
                CONTINUE
            // Do the action
            EXECUTE action
            INCREMENT bunch.step
            UPDATE bunch.bunch_is_done
```
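The checks \texttt{cannot_send} and \texttt{cannot_receive} map naturally onto non-blocking \texttt{MPI}. The following is a minimal sketch of how such checks could be implemented; the helper functions \texttt{try_send} and \texttt{try_receive} are hypothetical and are not the actual \texttt{COMBI} routines.
```c
/* Sketch (assumed, not taken from COMBI): a message is posted with MPI_Isend
 * so the sender never blocks, and MPI_Iprobe tells the receiver whether the
 * partner's data has arrived without forcing it to wait. */
#include <mpi.h>
#include <stdbool.h>

/* Try to send this bunch's data (e.g. its moments) to a partner rank.
 * Returns false if the previous send has not completed yet ("cannot_send"). */
bool try_send(double *moments, int n, int partner, int tag,
              MPI_Request *req, bool *send_pending)
{
    if (*send_pending) {
        int done = 0;
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
        if (!done)
            return false;               /* previous message still in flight */
        *send_pending = false;
    }
    MPI_Isend(moments, n, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, req);
    *send_pending = true;
    return true;
}

/* Try to receive the partner's data. Returns false if the message has not
 * arrived yet ("cannot_receive"), so the loop can move on to another bunch. */
bool try_receive(double *moments, int n, int partner, int tag)
{
    int arrived = 0;
    MPI_Iprobe(partner, tag, MPI_COMM_WORLD, &arrived, MPI_STATUS_IGNORE);
    if (!arrived)
        return false;
    MPI_Recv(moments, n, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    return true;
}
```
With this pattern, a bunch that cannot yet communicate simply returns control to the loop over bunches, so the process can advance another bunch instead of idling.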
Bunch distribution on processes
The bunches are distributed over the processes to best expand (and circumvent) the bottleneck discussed above and in [1]. The exact algorithm can be extracted from the function \texttt{distributeOnProcs} in \texttt{src/c/pipehelper.c} in the Git repository, but some relevant examples are explained here.
If there is only one process, all bunches will be on the same process (rank=0).
If the constant \texttt{SEPARATE_BEAMS} is set to 1 (true) and there is more than one \texttt{MPI} process, the bunches of the two beams are kept on separate processes. This has become the default approach.
If \texttt{SEPARATE_BEAMS}==1, it is advised to simulate n_1 bunches in beam 1 and n_2 bunches in beam 2 on P processes such that (n_1+n_2)/P is an integer. As an example, imagine n_1=n_2=P=4, i.e. 4 bunches per beam and 4 processes. The bunches will then be organized on the ranks as follows:
| Bunch # | Beam 1 | Beam 2 |
|---------|--------|--------|
| 1       | rank 0 | rank 2 |
| 2       | rank 1 | rank 3 |
| 3       | rank 0 | rank 2 |
| 4       | rank 1 | rank 3 |
Nearest-neighbor bunches within a beam are kept on different processes to best distribute the beam-beam interaction calculations that can be done simultaneously.
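The mapping in the table can be reproduced by a simple round-robin rule, as sketched below. This is only an illustration of that rule with a hypothetical function name; the authoritative algorithm is the \texttt{distributeOnProcs} function mentioned above.
```c
/* Illustrative reconstruction (not the actual distributeOnProcs code) of a
 * round-robin distribution with SEPARATE_BEAMS == 1: the first half of the
 * ranks holds beam 1, the second half beam 2, and neighboring bunches
 * within a beam land on different ranks. */
#include <stdio.h>

/* beam: 1 or 2, bunch: 1-based index within the beam, nprocs: total ranks */
static int rank_of_bunch(int beam, int bunch, int nprocs)
{
    int per_beam = nprocs / 2;                 /* ranks available per beam   */
    int offset   = (beam == 1) ? 0 : per_beam; /* beam 2 starts after beam 1 */
    return offset + (bunch - 1) % per_beam;    /* alternate over those ranks */
}

int main(void)
{
    /* Reproduces the table above: n_1 = n_2 = 4 bunches, P = 4 processes */
    for (int bunch = 1; bunch <= 4; ++bunch)
        printf("Bunch %d: beam 1 -> rank %d, beam 2 -> rank %d\n",
               bunch, rank_of_bunch(1, bunch, 4), rank_of_bunch(2, bunch, 4));
    return 0;
}
```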
References
[1] S.V. Furuseth and X. Buffat, Computer Physics Communications 244, pp. 180-186 (2019)