Parallel MxN Repartitioning
I. Overview
This note outlines the algorithm for performing MxN repartitioning of
mesh data as the time data is being loaded from a file. It's key
features include:
-
The algorithm presented is to be replicated in each process of a
parallelized program.
-
The algorithm is generalized to finite element or general
polyhedral mesh forms and can support load balanced or
h-adaptive mesh data. Structured mesh partitioning must be
based on rectangles or the resulting partitioned mesh will have to
be a finite element or general polyhedral form.
-
The algorithm is generalized to handle any partitioning
description, such as those generated for minimizing communications,
load balancing, or partitioning based on assembly parts.
-
The algorithm uses a partitioning form that can include ghost
regions or the algorithm implementation can be easily varied to
generate ghost regions.
-
By performing the repartitioning at file load time each time step
may be easily load-balanced for optimum performance by simply
adjusting the element partitioning description to reflect the
desired load-balanced configuration.
-
By performing the repartitioning at file load time the algorithm
may be easily adapted to use separate smart I/O processes which may
then minimize the data transfer volumes to and from the compute
processes with a minimal subset extraction being performed on the
I/O processor. This also minimizes the communications overhead for
streaming I/O from a running data source, such as from a running
simulation, or to a running data sink, such as to a visualizer or
other post-processing activity.
Variations on the MxN repartitioning have been implemented and are used
for different tasks, and these variations are listed here:
- Single process forms --
- A global mesh to N parallel partitions
- M partition mesh files (or file sets) to one global mesh (commonly
called a JOIN)
- M partitions to a single global mesh then to N partitions (JOIN,
then partition, the common form of MxN repartitioning)
- N process forms (one partition per process) --
- A global mesh to 1 partition
- M partitions to a single global mesh then to 1 partition
- Only m (much smaller than M) partitions to a global mesh subset
then to 1 partition (minimize JOIN sources per process)
- Only m (much smaller than M) partitions to a single partition
(minimized computations)
Once the last form is implemented in a well-designed set of routines
any of the other forms can be implemented using those routines.
The MxN repartitioning algorithm details are given, then the key
algorithm for determining the minimal collection of sets needed to
cover all the members of another set is presented.
Although references are made to structures and routines in the Data
Object Library (DOL) this algorithm is by no means limited to use
with the DOL.
II. MxN Repartitioning Algorithm for Each Process
This algorithm is broken into three parts. The first part is a
sequence of one-time metadata generation steps with references to the
DOL metadata structures involved. The second part is the collection
of steps required to actually compute the detail data for data
extraction, which may be done once for static mesh data, but must be
repeated each time step for dynamic mesh forms. The third part is the
data extraction that must be performed each time step.
A. Metadata Steps
Only one set of mesh metadata needs generated, and that is the metadata
for the desired partition.
-
Read in the overall mesh metadata, getting the logical structure of
sets, subsets, and relations. The DOL main structures included are:
- The mesh
- The sequence (time) set and its time field,
- the mesh blocks,
- the mesh cell sets and their connectivity, and
- any other sets and their relations.
This may be done by reading in a single partition file's metadata,
then cloning its metadata, as each partition should mirror the
overall mesh metadata structure of sets, subsets and relations.
-
Obtain the field metadata only for those fields which are to be
represented in the end product, including all coordinate fields.
This supports field subsetting.
-
Obtain the description of the desired time steps and build the time
steps mapping from new to old instance indices.
-
Generate the set partitioning and communications maps metadata. The
DOL main structures included are:
- The mesh partitioning structures,
- The partitioning structures for each mesh block, and
- The partitioning structures for each cell set.
This includes determining the subsets shared among the partitions,
and calculating the sharing patterns.
-
Generate the partitioning crosslink structures. The DOL main
structures included are:
- The partitioning descriptions crosslinks context structure,
- The crosslink base structure for each base cell set, and
- The crosslink structures for each cell set.
These structures are later used to store the set extraction and
insertion (join) mappings so they can be applied to the relevant
relations and fields.
B. Set and Relation Extraction
Each process must receive the data for one of the N
partition domains (Pi) from the desired partitioning of
B, where each Pi, for i=1,...,N is a
subset of B. The file partition domains (Fj) for
j=1,...,M reflect a preexisting partitioning of B as it
exists in the M files or sets of files, where each
Fj is a subset of B. Neither of the two
sets of partitions (Pi's and
Fj's) need be strict partitionings of B,
as they may have non-empty intersections between (overlap among)
their partition domain sets. However the set of all
Pi's and the set of all Fj's
each must include all the members of B (form a cover for
B).
-
For the element cell type either read or calculate the subset
descriptions for all of the desired partitions
(Pi) of B. If this partitioning does not
include the ghost regions to be included in each partition then
this is where the desired number of ghost levels should be
determined and the ghost level extensions to the partition extents
calculated and included in (Pi).
-
Read in or calculate the communications maps for the current
process's element cell set (Pi), including
any ghost regions.
-
Propagate the element partitioning (Pi) for the
current process to all the element cell set's subsets.
-
Propagate the element partitioning (Pi) to each
the other cell type present and to their subsets. This may be
complicated by convoluted connectivity storage approaches used for
the various mesh representations and the available file format
conventions:
-
All needed connectivity among the cell sets is in the
domain block, and between the domain cell sets (the simplest
form to partition).
-
The connectivity is from element subsets in element blocks
to the node set in the domain block (a common finite
elements form, and the domain cell set connectivity must
be generated from the union of the element block subset
connectivities for easy partitioning).
-
There is NO connectivity TO a given non-element cell set,
only FROM that non-element cell set to nodes or elements.
(This form is common for a set of external faces or edges
with connectivity to element sets and/or node sets, and
the easiest way to handle the deficiency is to calculate
the inverse of either of the other connectivities).
-
Indirect connectivity, such as zones to faces, faces to
edges, and edges to nodes, and missing the commonly needed
zones to edges, zones to nodes, and faces to nodes
connectivity (common in general polyhedral representations,
and it may be handled by merely propagating the partitioning
successively to each of the cell sets).
-
Implicit connectivity based on reference ordering, such as
zones to nodes, where the faces to nodes and edges to nodes
can be assumed because of a PATRAN ordering of each zone's
node references (common to finite element representations,
and must be handled by generating an "elements to faces"
and/or "elements to edges" relation to propagate the
partitioning to existing face sets or edge sets, other
difficulties arise in finding matching faces or edges of
adjacent elements).
-
Implicit connectivity based on index patterns, where adjacent
element indices imply adjacent elements and adjacent node
indices imply adjacent nodes, and where an element's indices
and the element's indices plus one are the indices of the
nodes for that element. Connectivity to faces or edges
is more complex but similarly derived (common for structured
mesh representations).
Read in the partition domain subset information for each file
(Fj). Each process must have this information for all
the files before the next step can be performed.
Obtain the minimal cover set of file indices (j) for which the file
subsets (Fj) cover the current process's domain partition
(Pi). Also obtain the extraction mapping
(MFi,j) and insertion (join) mapping (MPi,j)
for each file in the cover set. Each extraction set
(Ei,j) is a subset of B, of the corresponding
Pi, and of the corresponding Fj. Obtaining
the extraction subsets also gives the extraction mapping from
Ei,j to the members of Fj
(as MFi,j) for easy subset extraction, and the insertion
mapping from Ei,j to the members of Pi
(as MPi,j) for easy insertion of the extracted partition
subset. These are needed for performing the MxN repartitioning of
both relations and fields.
Each relation (subset, connectivity, or other association between
members of sets) whose domain set has been partitioned must have its
entries extracted from the file version of the relation via the
above calculated MFi,j and joined (inserted) into the
memory version of the relation via the above calculated
MPi,j for the domain set. Each relation whose range
set has been partitioned must have its reference indices remapped
to reflect the renumbering of the partitioned range set members.
Each relation whose domain and range sets are not partitioned may be
copied intact.
C. Extracting the Field Data Each Time Step
-
Each field whose domain set has been partitioned must have its data
values extracted from the file version of the field representation
via the above calculated MFi,j and joined (inserted)
into the memory version of the field representation via the above
calculated MPi,j for the domain set.
-
Each field whose domain set is not partitioned may be copied intact.
III. Getting the Minimal Cover for a Partition Domain (Pi)
A. Given:
| Pi
|
| The partition subset for partition i.
|
| Fj
|
| The partition subset for file j.
|
B. Define:
| Rk,i
|
| The remnant of partition Pi not yet covered after the
kth step.
|
| Dk,i,j
|
| The difference between Rk-1,i and Fj, or the
part of the k-1st step remnant not covered by Fj.
|
| Sk,i
|
| The set of values of j which resulted in Rk,i for
partition i.
|
| Ei,j
|
| The subset needing extracted from Fj to cover
Pi.
|
| MFi,j
|
| The mapping from the members of Ei,j to the members of
Fj.
|
| MPi,j
|
| The mapping from the members of Ei,j to the members of
Pi.
|
C. Initialize:
| R0,i = Pi
|
| The full partition is the initial remnant to cover.
|
| S0,i = {}
|
| The empty set, as no files have been selected yet.
|
D. Steps:
-
Set k = 1;
-
Calculate:
Dk,i,j = Rk-1,i -
Fj
for each j not in Sk-1,i.
-
Select the value of j where Dk,i,j has the smallest
number of members, indicating Fj is the best cover for
Rk-1,i (what is currently left to be covered in
Pi).
-
Calculate:
Sk,i = UNION({j}, Sk-1,i)
for the selected j.
-
Calculate:
Ei,j = INTERSECT(Rk-1,i,
Fj)
for the selected j. By using the call swIVSetIntersect() this
also give MFi,j.
-
Calculate:
E'i,j = INTERSECT(Ei,j,
Pi)
for the current partition i. By using the call swIVSetIntersect() this
also give MPi,j.
-
Calculate:
Rk,i = Dk,i,j
for the selected j.
-
Increment k and repeat until Rk,i has no members.