Parallelization
Introduction
\texttt{COMBI} is parallelized using a hybrid \texttt{MPI}-\texttt{OpenMP} algorithm. The processes, which communicate using \texttt{MPI}, each contain one or multiple bunches and simultaneously apply different calculations to different bunches (MIMD). Each \texttt{MPI} process is additionally multithreaded using \texttt{OpenMP}, which parallelizes the application of the same action to all the macro-particles in a bunch (SIMD).
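A minimal sketch of this hybrid layout is given below. It is only illustrative and not taken from the \texttt{COMBI} source: the names \texttt{Bunch}, \texttt{rotate_phase_space} and \texttt{N_MACRO} are hypothetical. Each \texttt{MPI} rank holds its own bunch data, and an \texttt{OpenMP} loop applies the same action to every macro-particle of that bunch.
```c
/* Illustrative hybrid MPI-OpenMP sketch (not COMBI source code):
 * each MPI rank owns its own bunch(es) and may execute a different action
 * than the other ranks (MIMD), while OpenMP applies the same action to all
 * macro-particles of a bunch in parallel (SIMD-like). */
#include <mpi.h>
#include <omp.h>

#define N_MACRO 100000                    /* hypothetical number of macro-particles */

typedef struct {
    double x[N_MACRO];                    /* hypothetical transverse coordinate */
    double px[N_MACRO];                   /* hypothetical transverse momentum   */
} Bunch;

/* Example action: rotate every macro-particle in phase space */
static void rotate_phase_space(Bunch *b, double c, double s)
{
    #pragma omp parallel for              /* same action on all macro-particles */
    for (int i = 0; i < N_MACRO; ++i) {
        double x  = b->x[i];
        double px = b->px[i];
        b->x[i]   =  c * x + s * px;
        b->px[i]  = -s * x + c * px;
    }
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static Bunch bunch;                   /* each rank holds its own bunch data  */
    rotate_phase_space(&bunch, 0.8, 0.6); /* this rank's current action          */

    MPI_Finalize();
    return 0;
}
```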
Challenges caused by causality
The need to preserve causality can create a bottleneck in coherent multi-bunch simulations. This bottleneck can occur when there are both inter-beam and intra-beam bunch-bunch interactions (the difference is explained in the Physics section). The source of the bottleneck is explained in greater detail in [1]. In short, the problem is that:
- inter-beam bunch-bunch interactions (beam-beam) can be parallelized efficiently by having different bunches perform different actions simultaneously
- intra-beam bunch-bunch interactions (e.g. wakefields) can be parallelized efficiently by having all the bunches perform the same action simultaneously
When a simulation contains both, which is the purpose of \texttt{COMBI}, these contradictory strategies obviously cannot both be followed. The parallel \texttt{MPI} algorithm in \texttt{COMBI} is designed to be as efficient as possible by expanding and circumventing this bottleneck.
Parallel algorithm
The key points of the parallel algorithm in \texttt{COMBI} are:
- The actions for each bunch are put in pipelines: This controls the order and details of the actions per bunch.
- The bunches are autonomous: Each bunch knows which actions it has to do, in which order to do them, and with which bunches it has to communicate (a possible data layout is sketched after this list).
- The communication between bunches is minimal and asynchronous: A bunch only communicates with other bunches when necessary to preserve causality. Messages can be sent even if the recipient is not ready to receive.
- The actions requiring communication are split into two, first sending then receiving messages: The impact of an interaction on a bunch will not be calculated before the bunch has received the required messages from the other bunch(es). This preserves causality.
- There can be multiple bunches per process: Due to causality, a bunch may at times have to wait for messages from other bunches. With multiple bunches per process, the stalling of one bunch does not necessarily stall the CPU and cause a loss of efficiency.
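The state that each bunch could carry according to the list above is sketched below. The type and field names (\texttt{Action}, \texttt{BunchState}, etc.) are hypothetical and do not correspond to the actual \texttt{COMBI} data structures.
```c
/* Hypothetical sketch of per-bunch state; illustrative names only. */
#include <stdbool.h>

typedef struct {
    int  type;             /* what to do, e.g. transport, wakefield kick, beam-beam kick */
    int  partner_bunch;    /* bunch to communicate with, or -1 if no communication       */
    bool sends_message;    /* first half of a split interaction: send own data           */
    bool receives_message; /* second half: wait for the partner's data before the kick   */
} Action;

typedef struct {
    Action *pipeline;      /* ordered actions for this bunch (order and details per bunch) */
    int     n_actions;     /* number of actions in the pipeline                            */
    int     step;          /* index of the next action to execute                          */
    bool    bunch_is_done; /* true once all actions in all turns have been executed        */
} BunchState;
```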
How this works in practice can be understood from the following pseudo code for the calculations done by one \texttt{MPI} process:
```
Pipeline(rank)
    PARSE input_files
    CREATE bunches(rank)
    for bunch in bunches do
        INITIALIZE bunch.step to 0
        INITIALIZE bunch.bunch_is_done to false
        CREATE bunch.pipeline
    while (not all_bunches_are_done) do
        for bunch in bunches do
            // Skip the bunch if it is done (with all actions in all turns)
            if (bunch.bunch_is_done) then
                CONTINUE
            SET action to bunch.pipeline[bunch.step]
            // Skip the bunch for now if the action cannot be done yet
            if (need_to_send AND cannot_send) then
                CONTINUE
            else if (need_to_receive AND cannot_receive) then
                CONTINUE
            // Do the action
            EXECUTE action
            INCREMENT bunch.step
            UPDATE bunch.bunch_is_done
```
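The checks \texttt{cannot_send} and \texttt{cannot_receive} map naturally onto non-blocking \texttt{MPI}. The following is a minimal sketch of how such checks could be implemented; the helper functions \texttt{try_send} and \texttt{try_receive} are hypothetical and are not the actual \texttt{COMBI} routines.
```c
/* Sketch (assumed, not taken from COMBI): a message is posted with MPI_Isend
 * so the sender never blocks, and MPI_Iprobe tells the receiver whether the
 * partner's data has arrived without forcing it to wait. */
#include <mpi.h>
#include <stdbool.h>

/* Try to send this bunch's data (e.g. its moments) to a partner rank.
 * Returns false if the previous send has not completed yet ("cannot_send"). */
bool try_send(double *moments, int n, int partner, int tag,
              MPI_Request *req, bool *send_pending)
{
    if (*send_pending) {
        int done = 0;
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
        if (!done)
            return false;               /* previous message still in flight */
        *send_pending = false;
    }
    MPI_Isend(moments, n, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, req);
    *send_pending = true;
    return true;
}

/* Try to receive the partner's data. Returns false if the message has not
 * arrived yet ("cannot_receive"), so the loop can move on to another bunch. */
bool try_receive(double *moments, int n, int partner, int tag)
{
    int arrived = 0;
    MPI_Iprobe(partner, tag, MPI_COMM_WORLD, &arrived, MPI_STATUS_IGNORE);
    if (!arrived)
        return false;
    MPI_Recv(moments, n, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    return true;
}
```
With this pattern, a bunch that cannot yet communicate simply returns control to the loop over bunches, so the process can advance another bunch instead of idling.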
Bunch distribution on processes
The bunches are distributed over the processes to best expand (and circumvent) the bottleneck discussed above and in [1]. The exact algorithm can be extracted from the function \texttt{distributeOnProcs} in \texttt{src/c/pipehelper.c} in the Git repository, but some relevant examples are explained here.
If there is only one process, all bunches will be on the same process (rank=0).
If the constant \texttt{SEPARATE_BEAMS} is set to 1 (true) and there is more than one \texttt{MPI} process, the bunches of the two beams are kept on separate processes. This has become the default approach.
If \texttt{SEPARATE_BEAMS}==1, it is advised to simulate n_1 bunches in beam 1 and n_2 bunches in beam 2 on P processes such that (n_1+n_2)/P is an integer. As an example, imagine n_1=n_2=P=4, i.e. 4 bunches per beam and 4 processes. The bunches will then be organized on the ranks as follows:
| Bunch # | Beam 1 | Beam 2 |
|---------|--------|--------|
| 1       | rank 0 | rank 2 |
| 2       | rank 1 | rank 3 |
| 3       | rank 0 | rank 2 |
| 4       | rank 1 | rank 3 |
Nearest-neighbor bunches within a beam are kept on different processes to best distribute the beam-beam interaction calculations that can be done simultaneously.
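The mapping in the table can be reproduced by a simple round-robin rule, as sketched below. This is only an illustration of that rule with a hypothetical function name; the authoritative algorithm is the \texttt{distributeOnProcs} function mentioned above.
```c
/* Illustrative reconstruction (not the actual distributeOnProcs code) of a
 * round-robin distribution with SEPARATE_BEAMS == 1: the first half of the
 * ranks holds beam 1, the second half beam 2, and neighboring bunches
 * within a beam land on different ranks. */
#include <stdio.h>

/* beam: 1 or 2, bunch: 1-based index within the beam, nprocs: total ranks */
static int rank_of_bunch(int beam, int bunch, int nprocs)
{
    int per_beam = nprocs / 2;                 /* ranks available per beam   */
    int offset   = (beam == 1) ? 0 : per_beam; /* beam 2 starts after beam 1 */
    return offset + (bunch - 1) % per_beam;    /* alternate over those ranks */
}

int main(void)
{
    /* Reproduces the table above: n_1 = n_2 = 4 bunches, P = 4 processes */
    for (int bunch = 1; bunch <= 4; ++bunch)
        printf("Bunch %d: beam 1 -> rank %d, beam 2 -> rank %d\n",
               bunch, rank_of_bunch(1, bunch, 4), rank_of_bunch(2, bunch, 4));
    return 0;
}
```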
References
[1] S.V. Furuseth and X. Buffat, Computer Physics Communications 244, pp. 180-186 (2019)