Parallel MxN Repartitioning

by
James R. Holten III, PhD
(SAIC)
March 18, 2003
jrholte@sandia.gov or james.r.holten.iii@saic.com


I. Overview

This note outlines the algorithm for performing MxN repartitioning of mesh data as the time data is being loaded from a file. It's key features include:

  1. The algorithm presented is to be replicated in each process of a parallelized program.
  2. The algorithm is generalized to finite element or general polyhedral mesh forms and can support load balanced or h-adaptive mesh data. Structured mesh partitioning must be based on rectangles or the resulting partitioned mesh will have to be a finite element or general polyhedral form.
  3. The algorithm is generalized to handle any partitioning description, such as those generated for minimizing communications, load balancing, or partitioning based on assembly parts.
  4. The algorithm uses a partitioning form that can include ghost regions or the algorithm implementation can be easily varied to generate ghost regions.
  5. By performing the repartitioning at file load time each time step may be easily load-balanced for optimum performance by simply adjusting the element partitioning description to reflect the desired load-balanced configuration.
  6. By performing the repartitioning at file load time the algorithm may be easily adapted to use separate smart I/O processes which may then minimize the data transfer volumes to and from the compute processes with a minimal subset extraction being performed on the I/O processor. This also minimizes the communications overhead for streaming I/O from a running data source, such as from a running simulation, or to a running data sink, such as to a visualizer or other post-processing activity.

Variations on the MxN repartitioning have been implemented and are used for different tasks, and these variations are listed here:

Once the last form is implemented in a well-designed set of routines any of the other forms can be implemented using those routines.

The MxN repartitioning algorithm details are given, then the key algorithm for determining the minimal collection of sets needed to cover all the members of another set is presented.

Although references are made to structures and routines in the Data Object Library (DOL) this algorithm is by no means limited to use with the DOL.

II. MxN Repartitioning Algorithm for Each Process

This algorithm is broken into three parts. The first part is a sequence of one-time metadata generation steps with references to the DOL metadata structures involved. The second part is the collection of steps required to actually compute the detail data for data extraction, which may be done once for static mesh data, but must be repeated each time step for dynamic mesh forms. The third part is the data extraction that must be performed each time step.

A. Metadata Steps

Only one set of mesh metadata needs generated, and that is the metadata for the desired partition.

  1. Read in the overall mesh metadata, getting the logical structure of sets, subsets, and relations. The DOL main structures included are: This may be done by reading in a single partition file's metadata, then cloning its metadata, as each partition should mirror the overall mesh metadata structure of sets, subsets and relations.

  2. Obtain the field metadata only for those fields which are to be represented in the end product, including all coordinate fields. This supports field subsetting.
  3. Obtain the description of the desired time steps and build the time steps mapping from new to old instance indices.

  4. Generate the set partitioning and communications maps metadata. The DOL main structures included are: This includes determining the subsets shared among the partitions, and calculating the sharing patterns.

  5. Generate the partitioning crosslink structures. The DOL main structures included are: These structures are later used to store the set extraction and insertion (join) mappings so they can be applied to the relevant relations and fields.

B. Set and Relation Extraction

Each process must receive the data for one of the N partition domains (Pi) from the desired partitioning of B, where each Pi, for i=1,...,N is a subset of B. The file partition domains (Fj) for j=1,...,M reflect a preexisting partitioning of B as it exists in the M files or sets of files, where each Fj is a subset of B. Neither of the two sets of partitions (Pi's and Fj's) need be strict partitionings of B, as they may have non-empty intersections between (overlap among) their partition domain sets. However the set of all Pi's and the set of all Fj's each must include all the members of B (form a cover for B).

  1. For the element cell type either read or calculate the subset descriptions for all of the desired partitions (Pi) of B. If this partitioning does not include the ghost regions to be included in each partition then this is where the desired number of ghost levels should be determined and the ghost level extensions to the partition extents calculated and included in (Pi).

  2. Read in or calculate the communications maps for the current process's element cell set (Pi), including any ghost regions.

  3. Propagate the element partitioning (Pi) for the current process to all the element cell set's subsets.

  4. Propagate the element partitioning (Pi) to each the other cell type present and to their subsets. This may be complicated by convoluted connectivity storage approaches used for the various mesh representations and the available file format conventions:

  • Read in the partition domain subset information for each file (Fj). Each process must have this information for all the files before the next step can be performed.

  • Obtain the minimal cover set of file indices (j) for which the file subsets (Fj) cover the current process's domain partition (Pi). Also obtain the extraction mapping (MFi,j) and insertion (join) mapping (MPi,j) for each file in the cover set. Each extraction set (Ei,j) is a subset of B, of the corresponding Pi, and of the corresponding Fj. Obtaining the extraction subsets also gives the extraction mapping from Ei,j to the members of Fj (as MFi,j) for easy subset extraction, and the insertion mapping from Ei,j to the members of Pi (as MPi,j) for easy insertion of the extracted partition subset. These are needed for performing the MxN repartitioning of both relations and fields.

  • Each relation (subset, connectivity, or other association between members of sets) whose domain set has been partitioned must have its entries extracted from the file version of the relation via the above calculated MFi,j and joined (inserted) into the memory version of the relation via the above calculated MPi,j for the domain set. Each relation whose range set has been partitioned must have its reference indices remapped to reflect the renumbering of the partitioned range set members.

  • Each relation whose domain and range sets are not partitioned may be copied intact.

    C. Extracting the Field Data Each Time Step

    1. Each field whose domain set has been partitioned must have its data values extracted from the file version of the field representation via the above calculated MFi,j and joined (inserted) into the memory version of the field representation via the above calculated MPi,j for the domain set.
    2. Each field whose domain set is not partitioned may be copied intact.

    III. Getting the Minimal Cover for a Partition Domain (Pi)

    A. Given:

    Pi The partition subset for partition i.
    Fj The partition subset for file j.

    B. Define:

    Rk,i The remnant of partition Pi not yet covered after the kth step.
    Dk,i,j The difference between Rk-1,i and Fj, or the part of the k-1st step remnant not covered by Fj.
    Sk,i The set of values of j which resulted in Rk,i for partition i.
    Ei,j The subset needing extracted from Fj to cover Pi.
    MFi,j The mapping from the members of Ei,j to the members of Fj.
    MPi,j The mapping from the members of Ei,j to the members of Pi.

    C. Initialize:

    R0,i = Pi The full partition is the initial remnant to cover.
    S0,i = {} The empty set, as no files have been selected yet.

    D. Steps:

    1. Set k = 1;

    2. Calculate:
      Dk,i,j = Rk-1,i - Fj
      for each j not in Sk-1,i.

    3. Select the value of j where Dk,i,j has the smallest number of members, indicating Fj is the best cover for Rk-1,i (what is currently left to be covered in Pi).

    4. Calculate:
      Sk,i = UNION({j}, Sk-1,i)
      for the selected j.

    5. Calculate:
      Ei,j = INTERSECT(Rk-1,i, Fj)
      for the selected j. By using the call swIVSetIntersect() this also give MFi,j.

    6. Calculate:
      E'i,j = INTERSECT(Ei,j, Pi)
      for the current partition i. By using the call swIVSetIntersect() this also give MPi,j.

    7. Calculate:
      Rk,i = Dk,i,j
      for the selected j.

    8. Increment k and repeat until Rk,i has no members.