Most modern processors (CPUs, central processing units) are multi-core, meaning they contain several independent physical cores that can work simultaneously. In addition, enabling Hyper-Threading technology doubles the number of logical cores reported to the operating system.
FDS can be run in multithreaded mode (i.e., with parallel data processing), where the workload is divided between multiple processor cores of a single computer or between the processors of several computers connected in a network.
As the number of threads increases, performance generally (but not always) improves and the simulation time decreases. Performance gains can be observed only if the number of threads does not exceed the number of logical cores.
Two features, which can be used together, allow FDS to run in multithreaded mode:
- OpenMP (Open Multi-Processing): enables the use of multiple processor cores of the computer on which the simulation is being performed.
- MPI (Message Passing Interface): enables the use of multiple processor cores of a single computer or the processors of several computers on a local network.
Before discussing the strategy for launching FDS in Fenix+ 3, it is necessary to explain how simulation in FDS works.
Simulation in FDS occurs in the computational domain - the volume of space where fire development is of interest. The computational domain consists of one or more elementary volumes - meshes. A mesh is a volume of space in the shape of a rectangular parallelepiped:
Each mesh is split into rectangular cells. The cell size (and, consequently, the number of cells) determines how accurately objects are represented in FDS and how accurate the simulation is. The smaller the cell size, the more accurately objects are represented in FDS:
In the FDS input file, each mesh is represented by the MESH parameter group:
&MESH IJK=10,20,30 XB=0.0,1.0,0.0,2.0,0.0,3.0 /
where the IJK parameter defines the number of cells into which the mesh is split along each axis, and XB defines the mesh extents (the minimum and maximum coordinates along each axis). Accordingly, in the example above, the mesh is split into 6000 (10*20*30) cells, each measuring 0.1 m per side (for example, 1.0 m / 10 cells along the first axis).
The number of cells in the mesh is one of the most important parameters that determines the simulation time. The greater the number of cells, the longer the simulation time.
Hereafter, mesh size refers not to the linear dimensions of the mesh but to the number of cells in it. Thus, a “large mesh” is a mesh with a large number of cells.
Running in Multithreaded Mode with OpenMP
To use the OpenMP feature for multithreaded simulation, set the environment variable OMP_NUM_THREADS to the desired number of threads before starting FDS.
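For example, to run a simulation with four OpenMP threads from the Windows command line, you could enter the following (job.fds here is a placeholder input file name; adjust the file name and thread count to your case):

set OMP_NUM_THREADS=4
fds job.fds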
After starting FDS in multithreaded mode using the OpenMP feature, the Windows Task Manager shows a single “fds” process and several threads associated with it:
If the computational domain consists of several meshes, they are processed sequentially within a single process.
Running in Multithreaded Mode with MPI
To use MPI for multithreaded simulation, the entire computational domain must be split into several meshes, and each mesh is processed by the MPI process assigned to it. It is also possible to assign the same MPI process to several meshes. This is useful when the computational domain consists of several large meshes and many small ones: the large meshes are handled by their own MPI processes, while several small meshes share one MPI process. Processing multiple meshes with one MPI process helps reduce the amount of communication between MPI processes.

The MPI process that handles a specific mesh is determined by the MPI_PROCESS parameter in the MESH parameter group:
&MESH ID='mesh1', IJK=..., XB=..., MPI_PROCESS=0 /
&MESH ID='mesh2', IJK=..., XB=..., MPI_PROCESS=1 /
&MESH ID='mesh3', IJK=..., XB=..., MPI_PROCESS=1 /
&MESH ID='mesh4', IJK=..., XB=..., MPI_PROCESS=2 /
&MESH ID='mesh5', IJK=..., XB=..., MPI_PROCESS=3 /
&MESH ID='mesh6', IJK=..., XB=..., MPI_PROCESS=3 /
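With the meshes above distributed among four MPI processes (numbered 0 through 3), FDS is typically launched with a command like the following (the exact command depends on the installed MPI implementation; job.fds is a placeholder input file name):

mpiexec -n 4 fds job.fds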
After starting FDS in multithreaded mode using MPI on a single computer running Windows, the Task Manager shows multiple “fds” processes:
Comparison of Simulation Acceleration Using OpenMP and MPI
By “simulation acceleration,” we mean the factor by which the simulation time in multithreaded mode is shorter than the simulation time without multithreading.
To determine which multithreading technique provides the greatest simulation acceleration, we use the following example scenario:
- The computational domain is a rectangular parallelepiped measuring 50 × 50 × 5 meters.
- The mesh cell size is 0.125 meters.
- A fire load with an area of 16 square meters is located in the center of the computational domain.
- The simulated (physical) time is 10 seconds.
- When using MPI, the computational domain was split into equal meshes, each assigned its own MPI process.
Thus, the computational domain for this scenario contains 6,400,000 cells (50/0.125 × 50/0.125 × 5/0.125 = 400 × 400 × 40). Approximately 6.6 GB of RAM is required to simulate this scenario, i.e., roughly 1 KB of RAM per cell.
The simulation was conducted on a computer with an 8-core AMD Ryzen 7 2700 processor (16 logical processors). The wall-clock simulation time ranged from 30 to 100 minutes.
The graph below shows the change in simulation acceleration with an increasing number of MPI processes or OpenMP threads.
We can see that increasing the number of threads initially accelerates the simulation. However, beyond 8 threads (the number of physical cores of this processor), the acceleration decreases for both MPI and OpenMP. This is primarily due to the increased amount of communication between processes.
This result does not mean that 8 is the optimal number of threads for any scenario and any processor; we can only assert that 8 is optimal for the scenario in question on this particular processor.
The optimal number of threads differs from scenario to scenario and from processor to processor: it depends on the size and arrangement of the meshes and on the number of processor cores.
Setting the number of threads higher than the number of logical cores of the processor does not have the desired effect: in this case, the acceleration will never reach the maximum possible.
The conducted experiment shows that MPI is the more practical way to accelerate simulations. As a best practice, the total size of the meshes processed by each MPI process should be approximately equal.
Running FDS, including in multithreaded mode with MPI and OpenMP, is covered in detail in the FDS User Guide; it also discusses the efficiency of each approach and draws a similar conclusion in favor of MPI. For more information, see Chapter 3, “Running FDS,” of the FDS User Guide.
Launching FDS in Multithreaded Mode in Fenix+ 3
As shown above, MPI is the preferred method for running simulations in multithreaded mode. Fenix+ 3 runs simulations using MPI with the number of MPI processes specified by the user in the fire simulation parameters for the scenario. Before starting FDS, Fenix+ 3 prepares the FDS input file so that the computational domain is split into the required number of meshes and each mesh is assigned its own MPI process. Below we describe the procedure for preparing the input file for multithreaded simulation; understanding it will help you choose the optimal number of threads for a simulation.
MESH Group Formation Algorithm
Use the Calculation Area tool in Fenix+ 3 to define the volume in which you need to simulate the fire dynamics. You can place multiple calculation areas in a scenario, each with different sizes and parameters (cell size, state of open boundaries). The positioning of calculation areas relative to each other can be entirely arbitrary: they can overlap, touch, or have no intersection points at all.
In a sense, a calculation area in a Fenix+ 3 project scenario corresponds to a mesh (MESH group) in the FDS input file. In simple scenarios, this correspondence can be one-to-one. In general, however, it is only approximate: for example, one calculation area may produce several MESH groups in the FDS input file, or, conversely, several calculation areas may be transformed into one MESH group.
The transformation of calculation areas into meshes consists of the following main steps:
- Merging: All calculation areas with the same cell size are merged into larger areas where possible. During this process:
  - Intersections of calculation areas are eliminated (overlapping meshes can lead to incorrect results in the overlap region). If areas with different cell sizes intersect, the intersection is excluded from the area with the larger cell size, and the remaining part is split into several rectangular areas.
  - Calculation areas are expanded in directions where the number of cells is less than 3 (this case is very rare and can practically occur only due to an error in placing a calculation area).
- Splitting: If the number of threads the user wants to use for the simulation is greater than the number of areas obtained in the first stage, the areas with the most cells are split in half along the direction with the most cells. Splitting stops when any of the following occurs:
  - The number of areas becomes equal to or greater than the desired number of threads.
  - The boundaries of the areas that would result from splitting the largest area fall on one of the VENT groups in the FDS input file (VENT groups are located at the fire source or at a smoke removal valve).
  - There are no more areas that can be split.
  After this stage, the areas correspond exactly to the meshes in the FDS input file. Since splitting can be interrupted, the number of meshes may be less than the desired number of threads.
- Balancing: Each resulting mesh is assigned an MPI process to handle it (a minimal sketch of this step follows the list). For this:
  - The largest mesh not yet assigned an MPI process is selected.
  - The MPI process whose already-assigned meshes have the smallest total volume is chosen.
  - The selected mesh is assigned to the chosen MPI process.
  These steps are repeated until every mesh has been assigned an MPI process.
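The balancing step described above is a greedy “largest mesh first” assignment. The following minimal Python sketch illustrates the idea; the function and variable names are ours, not part of Fenix+ 3 or FDS:

# Sketch of the balancing step: meshes (given by their cell counts)
# are distributed among MPI processes so that the workload per process
# stays as even as possible. Illustrative only.
def balance(mesh_cell_counts, num_processes):
    load = [0] * num_processes  # cells assigned to each MPI process
    assignment = {}
    # Take meshes from largest to smallest...
    meshes = sorted(enumerate(mesh_cell_counts), key=lambda m: m[1], reverse=True)
    for mesh_id, cells in meshes:
        # ...and give each one to the currently least-loaded process.
        process = load.index(min(load))
        assignment[mesh_id] = process
        load[process] += cells
    return assignment

# Example: six meshes of different sizes, four MPI processes.
print(balance([8000, 6000, 6000, 4000, 2000, 2000], 4))
# {0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 1}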
Examples of MESH Group Formation
The examples below assume that the cell sizes in the scenario’s calculation areas differ by a factor of two.
The following examples show the result of the Merging stage when forming MESH groups for different relative positions of the calculation areas.
The following examples show the results of the Splitting and Balancing stages for different numbers of simulation threads (the number above each example) when the number of meshes after the Splitting stage does not exceed the number of threads.
Above each mesh, the number of the MPI process assigned to that mesh is shown.
The best-balanced cases, where each MPI process handles meshes of the same size (with the same number of cells), are highlighted in blue.
The following examples show the result of the Balancing step for 2 simulation threads when the number of meshes after the Splitting step exceeds the number of threads.
Above each mesh, the MPI process number assigned to that mesh is shown.
The best-balanced cases, where each MPI process handles meshes of the same size (with the same number of cells), are highlighted in blue.
In all the examples above, the mesh formation algorithm is illustrated in the two-dimensional case; however, the algorithm works the same way in all three dimensions.