
Parallelization

Introduction

\texttt{COMBI} is parallelized using a hybrid \texttt{MPI}-\texttt{OpenMP} algorithm. The processes, which communicate using \texttt{MPI}, each contain one or multiple bunches, and different processes perform different calculations on different bunches simultaneously (MIMD). Each \texttt{MPI} process is additionally multithreaded using \texttt{OpenMP}, parallelizing the application of the same action to all the macro-particles in a bunch (SIMD).
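
As a rough illustration of this hybrid structure, the sketch below (not code from the \texttt{COMBI} repository; the type \texttt{Bunch} and the function \texttt{apply_kick} are hypothetical) shows one MPI process applying the same action to all macro-particles of a bunch with an OpenMP loop.

#include <stdlib.h>
#include <mpi.h>

/* Hypothetical container: a bunch of n_macro macro-particles with coordinate x. */
typedef struct {
    int     n_macro;
    double *x;
} Bunch;

/* Apply the same action (here a constant kick) to every macro-particle.
 * The OpenMP loop provides the SIMD-style parallelism within one MPI process. */
static void apply_kick(Bunch *b, double dx)
{
    #pragma omp parallel for
    for (int i = 0; i < b->n_macro; ++i)
        b->x[i] += dx;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);            /* the MPI ranks provide the MIMD layer */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    Bunch b = { .n_macro = 1000000, .x = calloc(1000000, sizeof(double)) };
    apply_kick(&b, 1e-6);              /* the OpenMP threads provide the SIMD layer */

    free(b.x);
    MPI_Finalize();
    return 0;
}

Compiled with, e.g., \texttt{mpicc -fopenmp}, each MPI rank runs such code on its own bunches, independently of the other ranks.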

Challenges caused by causality

Preserving causality can create a bottleneck in coherent multi-bunch simulations. This bottleneck can occur when there are both inter-beam and intra-beam bunch-bunch interactions (the difference is explained in Physics). The source of the bottleneck is explained in greater detail in [1]. In short, the problem is that:

  • inter-beam bunch-bunch interactions (beam-beam) can be parallelized efficiently by having different bunches do different actions simultaneously
  • intra-beam bunch-bunch interactions (e.g. wakefields) can be parallelized efficiently by having all the bunches do the same action simultaneously

When a simulation contains both, which is the purpose of \texttt{COMBI}, these contradictory strategies cannot both be followed. The parallel \texttt{MPI} algorithm in \texttt{COMBI} is designed to be as efficient as possible by widening and circumventing this bottleneck.

Parallel algorithm

The key points of the parallel algorithm in \texttt{COMBI} are:

  • The actions for each bunch are put in pipelines: This controls the order and details of the actions per bunch.
  • The bunches are autonomous: Each bunch knows what actions it has to do, in which order it has to do them, and with which bunches it has to communicate.
  • The communication between bunches is minimal and asynchronous: A bunch only communicates with other bunches when necessary to preserve causality. Messages can be sent even if the recipient is not ready to receive.
  • The actions requiring communication are split into two, first sending then receiving messages: The impact of an interaction on a bunch will not be calculated before the bunch has received the required messages from the other bunch(es). This preserves causality.
  • There can be multiple bunches per process: Due to causality, a bunch may at times have to wait for messages from other bunches. By having multiple bunches per process, the stalling of a bunch does not necessarily lead to a stalling of a CPU and loss of efficiency.

How this works in practice in the code can be understood from the following pseudocode for the calculations done by one \texttt{MPI} process:

Pipeline(rank)
    PARSE input_files
    CREATE bunches(rank)
    for bunch in bunches do
        INITIALIZE bunch.step to 0
        INITIALIZE bunch.bunch_is_done to false
        CREATE bunch.pipeline

    while (not all_bunches_are_done) do
        for bunch in bunches do
            // Check if the bunch is done (with all actions in all turns)
            if (bunch.bunch_is_done) do
                CONTINUE

            SET action to bunch.pipeline[bunch.step]

            // Check if the action can be done
            if (need_to_send AND cannot_send) do
                CONTINUE
            else if (need_to_receive AND cannot_receive) do
                CONTINUE

            // Do the action
            EXECUTE action
            INCREMENT bunch.step
            UPDATE bunch.bunch_is_done
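
The checks behind \texttt{cannot_send} and \texttt{cannot_receive} in the pseudocode rely on asynchronous, non-blocking communication, so that a bunch waiting for a message does not stall the whole process. The sketch below (a rough illustration, not the actual \texttt{COMBI} implementation; the helper names and the \texttt{moments} buffer are hypothetical) shows how such checks and sends could look with \texttt{MPI}.

#include <mpi.h>

/* Non-blocking check whether the message required from 'partner_rank' with
 * 'tag' has arrived. The receiving half of the action is only executed once
 * this returns 1, which preserves causality. (Hypothetical helper.) */
static int can_receive(int partner_rank, int tag)
{
    int flag;
    MPI_Iprobe(partner_rank, tag, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
    return flag;
}

/* Send the data needed by the partner bunch without waiting for it to be
 * ready to receive; the request can be tested and freed later. (Hypothetical.) */
static void send_to_partner(double *moments, int n,
                            int partner_rank, int tag, MPI_Request *req)
{
    MPI_Isend(moments, n, MPI_DOUBLE, partner_rank, tag, MPI_COMM_WORLD, req);
}

If such a check fails, the process simply moves on to the next bunch in its loop, as in the pseudocode above.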

Bunch distribution on processes

The bunches are distributed on the processes to best widen (and circumvent) the bottleneck discussed above and in [1]. The exact algorithm can be extracted from the function \texttt{distributeOnProcs} in \texttt{src/c/pipehelper.c} in the Git repository, but some relevant examples are explained here.

If there is only one process, all bunches will be on the same process (rank=0).

If the constant \texttt{SEPARATE_BEAMS} is set to 1 (true) and there is more than one \texttt{MPI} process, the bunches of the two beams are kept separate on different processes. This has become the default approach.

If \texttt{SEPARATE_BEAMS}==1, it is advised to simulate n_1 bunches in beam 1 and n_2 bunches in beam 2 with P processes such that (n_1+n_2)/P is an integer. As an example, consider n_1=n_2=P=4, i.e. 4 bunches per beam and 4 processes. The bunches are then distributed over the ranks as follows:

Bunch #   Beam 1   Beam 2
   1      rank 0   rank 2
   2      rank 1   rank 3
   3      rank 0   rank 2
   4      rank 1   rank 3

The nearest neighbors within a beam are kept on different processes to best distribute the beam-beam interaction calculations that can be performed simultaneously.
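
A minimal sketch of how such an assignment could be computed (an illustration of the example above, assuming \texttt{SEPARATE_BEAMS}==1, an even number of processes and 0-based bunch indices; not the actual \texttt{distributeOnProcs} implementation) is given below.

/* Illustrative rank assignment for SEPARATE_BEAMS == 1: the first half of the
 * ranks hosts beam 1, the second half beam 2, and within each beam the bunches
 * are assigned round-robin so that nearest neighbors land on different ranks.
 * (Hypothetical helper, not the actual distributeOnProcs implementation.) */
static int rank_of_bunch(int beam, int bunch_index, int n_procs)
{
    int procs_per_beam = n_procs / 2;               /* assumes n_procs is even    */
    int offset = (beam == 1) ? 0 : procs_per_beam;  /* beam 2 uses the upper half */
    return offset + (bunch_index % procs_per_beam);
}

With n_procs=4, this reproduces the table above: the bunches of beam 1 alternate between ranks 0 and 1, and those of beam 2 between ranks 2 and 3.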

References

[1] S.V. Furuseth and X. Buffat, Computer Physics Communications 244, pp. 180-186 (2019)