Most modern processors (CPU, Central Processing Unit) are multi-core: they contain several independent physical cores that can work simultaneously. In addition, “Hyper-threading” technology typically doubles the number of logical cores seen by the operating system.

FDS can run in multithreaded mode with parallel data processing, in which the work is divided among several processor cores of one computer or among the processors of several computers connected in a network.

As the number of threads increases, usually, but not always, the performance increases and the simulation execution time decreases. An increase in performance can be observed only if the number of threads does not exceed the number of logical cores.

There are two technologies that can be used together to run FDS in multithreaded mode:

  1. OpenMP (Open Multi-Processing). It helps to use several processor cores of the computer on which the simulation is carried out.

  2. MPI (Message Passing Interface). It helps to use several processor cores of the computer on which the simulation is carried out, or the processors of several computers located in the local network.

Before analyzing the FDS launch strategy in Fenix+ 3, a few words about how simulation works in FDS.

Modeling in FDS is realized in the computational domain - the volume of space in which the development of the fire is of interest. The computational domain consists of one or several elementary volumes - meshes. A mesh is a volume of space in the form of a rectangular parallelepiped:

(image)

Each mesh is divided into rectangular cells. The cell size (and, accordingly, the number of cells) determines how accurately objects are represented in FDS and how accurate the simulation is. The smaller the cells, the more accurately objects are represented:

(image: cell size)

In the input file for FDS, each mesh is represented by a group of MESH parameters:

(image)

where the IJK parameter sets the number of cells into which the mesh is divided along each axis, and XB sets the coordinates of the mesh boundaries and therefore its linear dimensions. Accordingly, in the example above, the mesh is divided into 6,000 (10 x 20 x 30) cells.
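The cell count above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names are hypothetical, and the XB coordinates are invented for the example (only IJK = 10, 20, 30 comes from the text). XB is written in the order xmin, xmax, ymin, ymax, zmin, zmax.

```python
from math import prod


def cell_count(ijk):
    """Number of cells in a mesh: the product of the cell counts along x, y and z."""
    return prod(ijk)


def mesh_line(ijk, xb):
    """Format one MESH group; xb = (xmin, xmax, ymin, ymax, zmin, zmax) in metres."""
    ijk_text = ",".join(str(n) for n in ijk)
    xb_text = ",".join(f"{v:.1f}" for v in xb)
    return f"&MESH IJK={ijk_text}, XB={xb_text} /"


ijk = (10, 20, 30)                     # from the example in the text
xb = (0.0, 1.0, 0.0, 2.0, 0.0, 3.0)    # hypothetical coordinates giving 0.1 m cells

print(cell_count(ijk))                 # 6000
print(mesh_line(ijk, xb))
```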

The number of cells in the mesh is one of the most important parameters that determines the simulation run time. The larger the number of cells, the longer the simulation execution time.

From here on, the size of a mesh means the number of cells in it, not its linear dimensions. That is, a “large mesh” is a mesh with a large number of cells.

Multithreading with OpenMP

To use OpenMP technology for multithreaded modeling, it is enough to set the OMP_NUM_THREADS environment variable to the desired number of threads before starting FDS.
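As a minimal sketch of this step, the variable can be set for a single FDS run from Python. The executable name "fds", the input file name, and both helper functions are assumptions made for the illustration, not part of FDS or Fenix+ 3.

```python
import os
import subprocess


def openmp_env(threads, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads)
    return env


def run_fds_openmp(input_file, threads, fds_exe="fds"):
    """Start FDS as a single process with the given number of OpenMP threads."""
    # Assumption: an executable called "fds" is available on PATH.
    return subprocess.run([fds_exe, input_file], env=openmp_env(threads), check=True)


# Example call (not executed here; "scenario.fds" is a hypothetical file name):
# run_fds_openmp("scenario.fds", threads=4)
```

Passing a modified copy of the environment keeps the setting local to that one run instead of changing it for the whole session.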

After starting FDS in multithreaded mode using OpenMP technology in Windows OS, the Resource Monitor will show one process “fds” and several threads:

(image)

If the computational domain consists of several meshes, they will be sequentially processed within one process.

Running in multithreaded mode with MPI

To use MPI for multithreaded modeling, the entire computational domain must be split into multiple meshes. Each mesh is processed by the single MPI process assigned to it, and the same MPI process can be assigned to several meshes. This is useful when the computational domain consists of several large meshes and many small ones: each large mesh is processed by its own MPI process, while several small meshes share one MPI process. Processing multiple meshes with one MPI process reduces the amount of interaction between MPI processes. The MPI process that handles a particular mesh is set by the MPI_PROCESS parameter of the MESH parameter group:

(image)
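To make the assignment concrete, here is a hedged Python sketch that writes MESH groups with MPI_PROCESS and builds a matching launch command. The mesh sizes and coordinates are hypothetical, the helpers are invented for the illustration, numbering processes from 0 is an assumption to verify against the FDS documentation, and the "mpiexec -n" form may differ for your MPI installation.

```python
def mesh_lines(meshes):
    """meshes: list of (ijk, xb, mpi_process); returns MESH lines ordered by process."""
    lines = []
    for ijk, xb, proc in sorted(meshes, key=lambda m: m[2]):
        ijk_text = ",".join(str(n) for n in ijk)
        xb_text = ",".join(f"{v:.1f}" for v in xb)
        lines.append(f"&MESH IJK={ijk_text}, XB={xb_text}, MPI_PROCESS={proc} /")
    return lines


def mpi_command(meshes, input_file, fds_exe="fds"):
    """Launch command with one MPI process per distinct MPI_PROCESS value."""
    n_proc = len({proc for _, _, proc in meshes})
    return ["mpiexec", "-n", str(n_proc), fds_exe, input_file]


# Hypothetical layout: one large mesh on process 0, two small meshes sharing process 1.
meshes = [
    ((40, 40, 20), (0.0, 4.0, 0.0, 4.0, 0.0, 2.0), 0),
    ((10, 10, 20), (4.0, 5.0, 0.0, 1.0, 0.0, 2.0), 1),
    ((10, 10, 20), (4.0, 5.0, 1.0, 2.0, 0.0, 2.0), 1),
]

for line in mesh_lines(meshes):
    print(line)
print(" ".join(mpi_command(meshes, "scenario.fds")))
```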

After running FDS in multithreaded mode using MPI on one computer in Windows, you will see several “fds” processes in the Resource Monitor:

(image)

Comparison of Simulation Acceleration with OpenMP and MPI

Simulation acceleration (speedup) is how many times shorter the execution time in multithreaded mode is than the execution time without multithreaded processing.
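Expressed as a calculation, this definition is a simple ratio of run times. The sketch below uses hypothetical wall-clock times, not measurements from the experiment described here.

```python
def speedup(t_single, t_parallel):
    """How many times faster the multithreaded run is than the single-threaded run."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_single / t_parallel


# Hypothetical run times in minutes.
print(speedup(100.0, 40.0))   # 2.5
```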

To determine which multithreaded processing technology allows achieving the greatest acceleration of modeling, the following scenario was considered:

  - the computational domain is a rectangular parallelepiped measuring 50 x 50 x 5 m;
  - the mesh cell size is 0.125 m;
  - a fire load with an area of 16 sq. m is located in the center of the computational domain;
  - the simulated time is 10 s;
  - when simulating with MPI, the computational domain was divided into identical meshes, each of which was assigned its own MPI process.

The computational domain for this scenario contains 6,400,000 cells. Simulating this scenario requires approximately 6.6 GB of RAM.
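The cell count follows directly from the domain dimensions and the cell size. A short Python check, with a hypothetical helper name:

```python
def cells_along(length_m, cell_m):
    """Number of cells along one axis for a given domain length and cell size."""
    return round(length_m / cell_m)


dims = (50.0, 50.0, 5.0)   # domain dimensions in metres, from the scenario
cell = 0.125               # cell size in metres, from the scenario

ijk = tuple(cells_along(d, cell) for d in dims)
total = ijk[0] * ijk[1] * ijk[2]

print(ijk)     # (400, 400, 40)
print(total)   # 6400000
```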

(image)

The simulation was run on a computer with an 8-core AMD Ryzen 7 2700 processor (16 logical processors). Run times ranged from 30 to 100 minutes.

The graph below shows the change in simulation acceleration with an increase in the number of MPI processes or OpenMP threads.

(image)

The graph shows that increasing the number of threads at first speeds up the simulation. However, once the number of threads exceeds 8, the acceleration decreases for both MPI and OpenMP. This is primarily due to the growing volume of interaction between processes.

This result does not mean that 8 threads is optimal for every scenario on every processor. It shows only that 8 threads is optimal for the scenario considered on this particular processor.

The optimal number of threads differs from scenario to scenario and from processor to processor. It depends on the size of the meshes, their arrangement relative to each other, and the number of processor cores.

There is no point in setting the number of threads above the number of logical processor cores: in that case the acceleration will be less than the maximum possible.
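One way to apply this rule automatically is to cap the requested number of threads at the logical core count reported by the operating system. A minimal Python sketch, where clamp_threads is a hypothetical helper:

```python
import os


def clamp_threads(requested, logical_cores=None):
    """Limit the requested thread count to the range 1..logical_cores."""
    if logical_cores is None:
        # os.cpu_count() reports logical cores; it can return None on some systems.
        logical_cores = os.cpu_count() or 1
    return max(1, min(requested, logical_cores))


print(clamp_threads(32, logical_cores=16))   # 16
print(clamp_threads(4, logical_cores=16))    # 4
```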

The experiment shows that MPI is the more effective way to speed up the simulation. When using it, aim to keep the total size of the meshes processed by each MPI process approximately equal.

Running FDS, including in multithreaded mode with MPI and OpenMP, is discussed in detail in the FDS User's Guide. The guide also discusses the effectiveness of each approach and reaches a similar conclusion in favor of MPI. We recommend reading Chapter 3, “Running FDS”, of the guide for more information on this topic.

Running FDS in multithreaded mode in Fenix+ 3

As shown above, MPI is the preferred way to run simulations in multithreaded mode. Fenix+ 3 uses MPI to start the simulation with as many MPI processes as the user specified in the fire simulation parameters of the scenario. Before starting FDS, Fenix+ 3 prepares the FDS input file so that the computational domain is divided into the required number of meshes and each mesh is assigned an MPI process. The procedure for preparing the input file for multithreaded modeling is described in more detail below; understanding it helps in choosing a better number of threads.

MESH group formation algorithm

The “Calculation area” tool in Fenix+ 3 defines the volume in which the development of a fire is simulated. A scenario can contain multiple calculation areas. Their sizes and other parameters (cell size, state of open faces) can differ. The calculation areas can be positioned arbitrarily relative to each other: they can intersect, touch, or have no common points at all.

A calculation area in a Fenix+ 3 scenario corresponds roughly to a mesh (MESH group) in the FDS input file. In the simplest scenarios the correspondence is one-to-one. In general, however, it is only approximate: one calculation area can produce several MESH groups in the FDS input file, and, conversely, several calculation areas can be merged into one MESH group.

Converting calculation areas to meshes consists of the following main steps:

  1. “Union”. All calculation areas with the same cell size are combined into larger areas where possible. In doing so:

     - intersections of the calculation areas are eliminated, since overlapping areas may lead to incorrect results in the intersection area. If areas with different cell sizes intersect, the intersection is excluded from the area with the larger cell size, and the remainder of that area is divided into several rectangular areas;
     - calculation areas are enlarged in any direction where the number of cells is less than 3 (this case is very rare and in practice arises only from an error in placing the calculation area).

  2. “Split”. If the number of threads that the user wants to use for modeling is greater than the number of areas obtained at the first stage, the area with the most cells is split in half along the direction with the largest number of cells, and this is repeated. Splitting stops if:

     - the number of areas is equal to or greater than the desired number of threads;
     - the boundary produced by splitting the largest area would fall within one of the VENT groups in the FDS input file (VENT groups are located at the fire source or at a smoke exhaust valve);
     - there are no more meshes that can be split.

     After this stage, the areas correspond exactly to the meshes in the FDS input file. Since splitting can stop early, the number of meshes can be less than the desired number of threads.

  3. “Balancing”. Each resulting mesh is assigned the MPI process that will process it. The following steps are repeated until every mesh has a process:

     - the largest mesh that has not yet been assigned an MPI process is selected;
     - the MPI process whose assigned meshes have the smallest total size is selected;
     - the selected mesh is assigned to the selected MPI process.
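The splitting and balancing stages described above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the Fenix+ 3 implementation: regions are described only by their cell counts along each axis, the VENT check is omitted, splitting stops only when the longest axis has fewer than 2 cells, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    ijk: tuple  # number of cells along x, y, z

    @property
    def cells(self):
        i, j, k = self.ijk
        return i * j * k


def split_regions(regions, threads):
    """Halve the largest region along its longest axis until there are enough regions."""
    regions = list(regions)
    while regions and len(regions) < threads:
        largest = max(regions, key=lambda r: r.cells)
        axis = max(range(3), key=lambda a: largest.ijk[a])
        n = largest.ijk[axis]
        if n < 2:  # simplified stop condition: nothing left to split
            break
        regions.remove(largest)
        for part in (n // 2, n - n // 2):
            ijk = list(largest.ijk)
            ijk[axis] = part
            regions.append(Region(tuple(ijk)))
    return regions


def balance(regions, threads):
    """Greedy assignment: largest mesh first, to the least loaded MPI process."""
    n_proc = min(threads, len(regions))
    load = [0] * n_proc
    assignment = []
    for region in sorted(regions, key=lambda r: r.cells, reverse=True):
        proc = load.index(min(load))  # process with the smallest total size so far
        load[proc] += region.cells
        assignment.append((region, proc))
    return assignment, load


# Example: the 400 x 400 x 40 cell domain from the comparison above, 4 threads.
meshes = split_regions([Region((400, 400, 40))], threads=4)
assignment, load = balance(meshes, threads=4)
print([m.ijk for m in meshes])
print(load)
```

In this example the domain is halved three times, producing four meshes of 200 x 200 x 40 cells, so every MPI process receives the same number of cells.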

Examples of forming MESH groups

In the examples below, it is assumed that the scenario contains calculation areas whose cell sizes differ by a factor of two.

The examples below show the result of executing the “Union” stage when forming MESH groups for different relative positions of the calculation areas.

(image)

The following examples show the result of performing the “Split” and “Balancing” steps for a different number of modeling threads (the number above) if the number of meshes after the “Split” step does not exceed the number of threads.

Above each mesh is the MPI process number that is assigned to process that mesh.

The most balanced cases, in which each MPI process handles meshes of the same size (with the same number of cells), are circled in blue.

(image)

The following examples show the result of the “Balancing” step for two modeling threads in the case when the number of meshes after the “Split” step exceeds the number of threads.

Above each mesh is the MPI process number that is assigned to process that mesh.

The most balanced cases, in which each MPI process handles meshes of the same size (with the same number of cells), are circled in blue.

(image)

All the examples above illustrate the mesh formation algorithm in two dimensions. Keep in mind, however, that the algorithm works the same way in all three directions.