Parallel MxN Repartitioning

by
James R. Holten III, PhD
(SAIC)
March 18, 2003
jrholte@sandia.gov or james.r.holten.iii@saic.com

I. Overview

This note outlines the algorithm for performing MxN repartitioning of mesh data as the time data is being loaded from a file. It's key features include:

The algorithm presented is to be replicated in each process of a parallelized program.
The algorithm is generalized to finite element or general polyhedral mesh forms and can support load balanced or h-adaptive mesh data. Structured mesh partitioning must be based on rectangles or the resulting partitioned mesh will have to be a finite element or general polyhedral form.
The algorithm is generalized to handle any partitioning description, such as those generated for minimizing communications, load balancing, or partitioning based on assembly parts.
The algorithm uses a partitioning form that can include ghost regions or the algorithm implementation can be easily varied to generate ghost regions.
By performing the repartitioning at file load time each time step may be easily load-balanced for optimum performance by simply adjusting the element partitioning description to reflect the desired load-balanced configuration.
By performing the repartitioning at file load time the algorithm may be easily adapted to use separate smart I/O processes which may then minimize the data transfer volumes to and from the compute processes with a minimal subset extraction being performed on the I/O processor. This also minimizes the communications overhead for streaming I/O from a running data source, such as from a running simulation, or to a running data sink, such as to a visualizer or other post-processing activity.

Variations on the MxN repartitioning have been implemented and are used for different tasks, and these variations are listed here:

Single process forms --
1. A global mesh to N parallel partitions
2. M partition mesh files (or file sets) to one global mesh (commonly called a JOIN)
3. M partitions to a single global mesh then to N partitions (JOIN, then partition, the common form of MxN repartitioning)
N process forms (one partition per process) --
1. A global mesh to 1 partition
2. M partitions to a single global mesh then to 1 partition
3. Only m (much smaller than M) partitions to a global mesh subset then to 1 partition (minimize JOIN sources per process)
4. Only m (much smaller than M) partitions to a single partition (minimized computations)

Once the last form is implemented in a well-designed set of routines any of the other forms can be implemented using those routines.

The MxN repartitioning algorithm details are given, then the key algorithm for determining the minimal collection of sets needed to cover all the members of another set is presented.

Although references are made to structures and routines in the Data Object Library (DOL) this algorithm is by no means limited to use with the DOL.

II. MxN Repartitioning Algorithm for Each Process

This algorithm is broken into three parts. The first part is a sequence of one-time metadata generation steps with references to the DOL metadata structures involved. The second part is the collection of steps required to actually compute the detail data for data extraction, which may be done once for static mesh data, but must be repeated each time step for dynamic mesh forms. The third part is the data extraction that must be performed each time step.

A. Metadata Steps

Only one set of mesh metadata needs generated, and that is the metadata for the desired partition.

Read in the overall mesh metadata, getting the logical structure of sets, subsets, and relations. The DOL main structures included are:
- The mesh
- The sequence (time) set and its time field,
- the mesh blocks,
- the mesh cell sets and their connectivity, and
- any other sets and their relations.
This may be done by reading in a single partition file's metadata, then cloning its metadata, as each partition should mirror the overall mesh metadata structure of sets, subsets and relations.
Obtain the field metadata only for those fields which are to be represented in the end product, including all coordinate fields. This supports field subsetting.
Obtain the description of the desired time steps and build the time steps mapping from new to old instance indices.
Generate the set partitioning and communications maps metadata. The DOL main structures included are:
- The mesh partitioning structures,
- The partitioning structures for each mesh block, and
- The partitioning structures for each cell set.
This includes determining the subsets shared among the partitions, and calculating the sharing patterns.
Generate the partitioning crosslink structures. The DOL main structures included are:
- The partitioning descriptions crosslinks context structure,
- The crosslink base structure for each base cell set, and
- The crosslink structures for each cell set.
These structures are later used to store the set extraction and insertion (join) mappings so they can be applied to the relevant relations and fields.

B. Set and Relation Extraction

Each process must receive the data for one of the N partition domains (P_i) from the desired partitioning of B, where each P_i, for i=1,...,N is a subset of B. The file partition domains (F_j) for j=1,...,M reflect a preexisting partitioning of B as it exists in the M files or sets of files, where each F_j is a subset of B. Neither of the two sets of partitions (P_i's and F_j's) need be strict partitionings of B, as they may have non-empty intersections between (overlap among) their partition domain sets. However the set of all P_i's and the set of all F_j's each must include all the members of B (form a cover for B).

For the element cell type either read or calculate the subset descriptions for all of the desired partitions (P_i) of B. If this partitioning does not include the ghost regions to be included in each partition then this is where the desired number of ghost levels should be determined and the ghost level extensions to the partition extents calculated and included in (P_i).
Read in or calculate the communications maps for the current process's element cell set (P_i), including any ghost regions.
Propagate the element partitioning (P_i) for the current process to all the element cell set's subsets.
Propagate the element partitioning (P_i) to each the other cell type present and to their subsets. This may be complicated by convoluted connectivity storage approaches used for the various mesh representations and the available file format conventions:
- All needed connectivity among the cell sets is in the domain block, and between the domain cell sets (the simplest form to partition).
- The connectivity is from element subsets in element blocks to the node set in the domain block (a common finite elements form, and the domain cell set connectivity must be generated from the union of the element block subset connectivities for easy partitioning).
- There is NO connectivity TO a given non-element cell set, only FROM that non-element cell set to nodes or elements. (This form is common for a set of external faces or edges with connectivity to element sets and/or node sets, and the easiest way to handle the deficiency is to calculate the inverse of either of the other connectivities).
- Indirect connectivity, such as zones to faces, faces to edges, and edges to nodes, and missing the commonly needed zones to edges, zones to nodes, and faces to nodes connectivity (common in general polyhedral representations, and it may be handled by merely propagating the partitioning successively to each of the cell sets).
- Implicit connectivity based on reference ordering, such as zones to nodes, where the faces to nodes and edges to nodes can be assumed because of a PATRAN ordering of each zone's node references (common to finite element representations, and must be handled by generating an "elements to faces" and/or "elements to edges" relation to propagate the partitioning to existing face sets or edge sets, other difficulties arise in finding matching faces or edges of adjacent elements).
- Implicit connectivity based on index patterns, where adjacent element indices imply adjacent elements and adjacent node indices imply adjacent nodes, and where an element's indices and the element's indices plus one are the indices of the nodes for that element. Connectivity to faces or edges is more complex but similarly derived (common for structured mesh representations).

Read in the partition domain subset information for each file (F_j). Each process must have this information for all the files before the next step can be performed.

Obtain the minimal cover set of file indices (j) for which the file subsets (F_j) cover the current process's domain partition (P_i). Also obtain the extraction mapping (MF_i,j) and insertion (join) mapping (MP_i,j) for each file in the cover set. Each extraction set (E_i,j) is a subset of B, of the corresponding P_i, and of the corresponding F_j. Obtaining the extraction subsets also gives the extraction mapping from E_i,j to the members of F_j (as MF_i,j) for easy subset extraction, and the insertion mapping from E_i,j to the members of P_i (as MP_i,j) for easy insertion of the extracted partition subset. These are needed for performing the MxN repartitioning of both relations and fields.

Each relation (subset, connectivity, or other association between members of sets) whose domain set has been partitioned must have its entries extracted from the file version of the relation via the above calculated MF_i,j and joined (inserted) into the memory version of the relation via the above calculated MP_i,j for the domain set. Each relation whose range set has been partitioned must have its reference indices remapped to reflect the renumbering of the partitioned range set members.

Each relation whose domain and range sets are not partitioned may be copied intact.

C. Extracting the Field Data Each Time Step

Each field whose domain set has been partitioned must have its data values extracted from the file version of the field representation via the above calculated MF_i,j and joined (inserted) into the memory version of the field representation via the above calculated MP_i,j for the domain set.
Each field whose domain set is not partitioned may be copied intact.

III. Getting the Minimal Cover for a Partition Domain (P_i)

A. Given:

P_i The partition subset for partition i.
F_j The partition subset for file j.

P_i		The partition subset for partition i.
F_j		The partition subset for file j.

B. Define:

R_k,i The remnant of partition P_i not yet covered after the kth step.
D_k,i,j The difference between R_k-1,i and F_j, or the part of the k-1st step remnant not covered by F_j.
S_k,i The set of values of j which resulted in R_k,i for partition i.
E_i,j The subset needing extracted from F_j to cover P_i.
MF_i,j The mapping from the members of E_i,j to the members of F_j.
MP_i,j The mapping from the members of E_i,j to the members of P_i.

R_k,i		The remnant of partition P_i not yet covered after the kth step.
D_k,i,j		The difference between R_k-1,i and F_j, or the part of the k-1st step remnant not covered by F_j.
S_k,i		The set of values of j which resulted in R_k,i for partition i.
E_i,j		The subset needing extracted from F_j to cover P_i.
MF_i,j		The mapping from the members of E_i,j to the members of F_j.
MP_i,j		The mapping from the members of E_i,j to the members of P_i.

C. Initialize:

R_0,i = P_i The full partition is the initial remnant to cover.

S_0,i = {} The empty set, as no files have been selected yet.

R_0,i = P_i		The full partition is the initial remnant to cover.
S_0,i = {}		The empty set, as no files have been selected yet.

D. Steps:

Set k = 1;
Calculate:
D_k,i,j = R_k-1,i - F_j
for each j not in S_k-1,i.
Select the value of j where D_k,i,j has the smallest number of members, indicating F_j is the best cover for R_k-1,i (what is currently left to be covered in P_i).
Calculate:
S_k,i = UNION({j}, S_k-1,i)
for the selected j.
Calculate:
E_i,j = INTERSECT(R_k-1,i, F_j)
for the selected j. By using the call swIVSetIntersect() this also give MF_i,j.
Calculate:
E'_i,j = INTERSECT(E_i,j, P_i)
for the current partition i. By using the call swIVSetIntersect() this also give MP_i,j.
Calculate:
R_k,i = D_k,i,j
for the selected j.
Increment k and repeat until R_k,i has no members.