Most modern processors (CPU, Central Processing Unit) are multi-core, containing several independent physical cores that can operate simultaneously. Additionally, Hyper-Threading technology allows the operating system to double the number of recognized logical cores.
You can run FDS in multithreaded mode with parallel data processing. In this mode, all work is distributed among several cores of a single computer or across the processors of multiple computers connected in a network.
Generally, but not always, increasing the number of threads improves performance and reduces simulation time. A gain in performance is only possible if the number of threads does not exceed the number of logical cores.
There are two technologies that you can use together to run FDS in multithreaded mode:
- OpenMP (Open Multi-Processing). This technology allows the use of several processor cores on the computer where the simulation is conducted.
- MPI (Message Passing Interface). This technology enables the use of multiple cores of a single processor or the processors of several computers linked in a local network.
Before analyzing the FDS launch strategy in Fenix+ 3, let's look at how a simulation is carried out in FDS.
The simulation in FDS takes place in a computational domain: a limited region of space in which you want to simulate fire development. The computational domain consists of one or several elementary volumes, called meshes. A mesh is a volume of space shaped like a rectangular parallelepiped:
Each mesh is divided into rectangular cells. The size of these cells (and thus their number) defines the precision level of how objects are represented in FDS and the accuracy of the simulation. The smaller the size of the cells, the more precisely the objects are depicted in FDS:
In the input file for FDS, each mesh is represented by a group of MESH parameters:
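A minimal example of such a group might look like this (the specific XB coordinates here are illustrative and simply describe a 1 x 2 x 3 m mesh):
&MESH IJK=10,20,30, XB=0.0,1.0,0.0,2.0,0.0,3.0 /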
where the IJK parameter specifies the number of cells into which the mesh is divided along each axis, and XB sets the coordinates of the mesh boundaries (and thus its linear dimensions). Consequently, in the example above, the mesh is divided into 6000 (10 x 20 x 30) cells.
The number of cells in a mesh is one of the most crucial parameters that determines the simulation runtime. The greater the number of cells, the longer the simulation takes to run.
Furthermore, when we refer to the ‘size of a mesh’, we mean not its linear dimensions but the number of cells it contains. Thus, a ‘large mesh’ is a mesh with a high number of cells.
Multithreading with OpenMP
To use OpenMP technology for multithreaded simulation, simply set the OMP_NUM_THREADS environment variable to the desired number of threads before starting FDS.
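For example, on Windows the variable can be set in the command prompt before launching FDS (the input file name job.fds is used here only as an illustration):
set OMP_NUM_THREADS=4
fds job.fds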
After starting FDS in multithreaded mode with OpenMP technology on Windows OS, the Resource Monitor will display one ‘fds’ process and several associated threads:
If the computational domain contains multiple meshes, they will be processed sequentially within a single process.
Running in Multithreaded Mode with MPI
To employ MPI for multithreaded simulation, the entire computational domain must be divided into several meshes, and each mesh is handled by an assigned MPI process. It is also possible to assign the same MPI process to several meshes. This approach is beneficial when the computational domain consists of a few large meshes along with many small ones: the large meshes are processed by individual MPI processes, while several small meshes can be handled by a single MPI process. Processing multiple meshes with one MPI process reduces the amount of interaction between MPI processes. The MPI process that handles a particular mesh is specified by the MPI_PROCESS parameter in the MESH parameter group:
&MESH ID='mesh1', IJK=..., XB=..., MPI_PROCESS=0 /
&MESH ID='mesh2', IJK=..., XB=..., MPI_PROCESS=1 /
&MESH ID='mesh3', IJK=..., XB=..., MPI_PROCESS=1 /
&MESH ID='mesh4', IJK=..., XB=..., MPI_PROCESS=2 /
&MESH ID='mesh5', IJK=..., XB=..., MPI_PROCESS=3 /
&MESH ID='mesh6', IJK=..., XB=..., MPI_PROCESS=3 /
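To start the simulation, FDS is launched under an MPI launcher with the required number of processes; for the example above, four MPI processes (numbered 0 to 3) are needed. A typical command line looks like the following (the input file name is illustrative):
mpiexec -n 4 fds job.fds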
After initiating FDS in multithreaded mode using MPI on a single computer running Windows, the Resource Monitor will display several ‘fds’ processes:
Comparison of Simulation Time Optimization Using OpenMP and MPI
By simulation time optimization we mean the speed-up: how many times faster the simulation runs in multithreaded mode compared to a run without multithreaded processing technologies.
To determine which multithreaded processing technology enables the greatest simulation speed-up, we use the following scenario:
- The computational domain is a rectangular parallelepiped measuring 50 x 50 x 5 meters.
- The mesh cell size is 0.125 m.
- At the center of the computational domain, there is a fire load covering an area of 16 square meters.
- The simulation duration is 10 seconds.
- When simulating with MPI, the computational domain was divided into identical meshes, each assigned its own MPI process.
For this scenario, the computational domain contains 6,400,000 cells. Approximately 6.6 GB of RAM is required to simulate this scenario.
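This number follows directly from the domain and cell dimensions: (50 / 0.125) x (50 / 0.125) x (5 / 0.125) = 400 x 400 x 40 = 6,400,000 cells.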
The simulation was conducted on a computer equipped with an 8-core AMD Ryzen 7 2700 processor (16 logical processors). The simulation runtime (not to be confused with the 10-second simulated duration) ranged from 30 to 100 minutes.
The graph below illustrates how the simulation speed-up varies with the number of MPI processes or OpenMP threads.
It is evident that, initially, increasing the number of threads speeds up the simulation. However, beyond eight threads the speed-up diminishes, regardless of whether MPI or OpenMP is used. This reduction is primarily due to the increased volume of interaction between processes.
This outcome does not imply that eight threads is optimal for every scenario and every processor. It can only be asserted that, for this scenario on this processor, the optimal number of threads is eight.
For each scenario and each processor the optimal number of threads will differ: it depends on the size and arrangement of the meshes and on the number of processor cores.
Setting the number of threads higher than the number of logical processor cores brings no particular benefit: you will not obtain the maximum achievable reduction in simulation time.
The experiment described above demonstrates that MPI is more advantageous for reducing simulation runtime. It is best practice to ensure that the total size of the meshes processed by each MPI process is approximately equal.
The use of FDS, including running it in multithreaded mode with MPI and OpenMP, is thoroughly described in the FDS User Guide. The manual also assesses the efficiency of each approach and concludes that MPI is preferable. Refer to Chapter 3, ‘Running FDS,’ of the FDS User Guide for further details on this topic.
Running FDS in Multithreaded Mode in Fenix+ 3
As demonstrated above, MPI is the preferred method for running simulations in multithreaded mode. Fenix+ 3 uses MPI to run simulations with as many MPI processes as the user has specified in the fire dynamics simulation scenario settings. Before launching FDS, Fenix+ 3 prepares the FDS input file so that the computational domain is divided into the necessary number of meshes, with each mesh assigned its own MPI process. Below, we explore in more detail how the input file is prepared for multithreaded simulation. This will help in selecting the optimal number of threads for a simulation.
MESH Group Formation Algorithm
To determine the space volume in which to simulate fire dynamics, Fenix+ 3 uses the ‘Calculation Area’ tool. Multiple calculation areas can be placed within a scenario. These areas may vary in size and other parameters, such as cell size and the state of open facets. The placement of calculation areas relative to each other can be completely arbitrary: they may overlap, touch, or have no common intersection points.
In some ways, a calculation area in a Fenix+ 3 project scenario corresponds to a mesh (MESH group) in an FDS input file. In the simplest scenarios, this matching may be complete. However, generally, the match is only approximate because, for example, one calculation area can form several MESH groups in the FDS input file, or alternatively, multiple calculation areas can transform into a single MESH group.
The conversion of calculation areas into FDS meshes involves the following main steps:
- Combining: All calculation areas with the same cell size are combined into larger areas where possible. During this:
  - Intersections of calculation areas are eliminated; this can lead to incorrect results in the intersection region. If areas with different cell sizes intersect, the intersection region is removed from the area with the larger cell size, and the remainder of that area is divided into several rectangular areas.
  - Calculation areas are expanded in directions where the number of cells is less than three; this situation is very rare and typically occurs only because of a placement error.
- Partitioning: If the number of threads you intend to use for the simulation exceeds the number of areas obtained during the Combining stage, the areas with the most cells are split in half along the direction with the largest number of cells. The splitting stops if:
  - the number of areas equals or exceeds the desired number of threads;
  - the boundary that would result from splitting the largest area would coincide with a VENT group in the FDS input file (VENT groups are located at fire sources and smoke exhaust valves);
  - there are no more meshes available to split.
  After this stage, the resulting calculation areas correspond exactly to meshes in the FDS input file. However, since splitting can be interrupted, the number of meshes may be smaller than the desired number of threads.
- Balancing: Each resulting mesh is assigned the MPI process that will handle its computations (a simplified sketch of the Partitioning and Balancing stages is given after this list). Meshes are assigned one at a time as follows:
  - the largest mesh that has not yet been assigned an MPI process is selected;
  - the MPI process that has been assigned the smallest total volume of meshes so far is chosen;
  - the chosen MPI process is assigned to the selected mesh.
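To make the Partitioning and Balancing stages more concrete, here is a simplified sketch in Python. It is not the Fenix+ 3 implementation: meshes are modelled only by their cell counts along each axis, and the VENT-related stopping condition is ignored.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Mesh:
    ijk: Tuple[int, int, int]   # number of cells along X, Y and Z

    @property
    def cells(self) -> int:
        i, j, k = self.ijk
        return i * j * k

def partition(meshes: List[Mesh], threads: int) -> List[Mesh]:
    # Partitioning: split the mesh with the most cells in half along the
    # axis with the most cells until there are at least `threads` meshes.
    meshes = list(meshes)
    while len(meshes) < threads:
        largest = max(meshes, key=lambda m: m.cells)
        axis = max(range(3), key=lambda a: largest.ijk[a])
        n = largest.ijk[axis]
        if n < 2:                      # nothing left to split
            break
        first, second = list(largest.ijk), list(largest.ijk)
        first[axis], second[axis] = n // 2, n - n // 2
        meshes.remove(largest)
        meshes += [Mesh(tuple(first)), Mesh(tuple(second))]
    return meshes

def balance(meshes: List[Mesh], processes: int) -> List[int]:
    # Balancing: take meshes from largest to smallest and give each one to
    # the MPI process that currently has the smallest total cell count.
    load = [0] * processes
    assignment = [0] * len(meshes)
    for i in sorted(range(len(meshes)), key=lambda j: meshes[j].cells, reverse=True):
        p = min(range(processes), key=lambda q: load[q])
        assignment[i] = p              # value written to MPI_PROCESS
        load[p] += meshes[i].cells
    return assignment

# Example: the 400 x 400 x 40 cell domain from the experiment above, 4 threads.
meshes = partition([Mesh((400, 400, 40))], threads=4)
print([m.ijk for m in meshes])         # four meshes of 200 x 200 x 40 cells
print(balance(meshes, processes=4))    # [0, 1, 2, 3]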
Examples of Forming MESH Groups
The examples below assume that the scenario has calculation areas with cell sizes differing by a factor of two.
The examples below demonstrate the outcome of the Combining stage in forming MESH groups for various relative placements of calculation areas.
The following examples illustrate the results of the Partitioning and Balancing stages for various numbers of simulation threads (the number of threads is shown above each example) when the number of meshes after the Partitioning stage does not exceed the number of threads.
Above each mesh, the number of the MPI process assigned to that mesh is displayed.
The most balanced cases, where each MPI process handles meshes of the same size (with an equal number of cells), are highlighted in blue.
The subsequent examples show the result of the Balancing stage for two simulation threads when the number of meshes after the Partitioning stage exceeds the number of threads.
Above each mesh, the number of the MPI process assigned to that mesh is shown.
The most balanced cases, where each MPI process handles meshes of equal size (with an equal number of cells), are highlighted in blue.
In all the above examples, the mesh formation algorithm is illustrated in two dimensions; however, it works identically in all three dimensions.