# Tensor-Based Operator Assembly and Evaluation

## Overview

The high performance (HPC) versions of the example codes in the `miniapps/performance`

directory
use a set of templated classes to efficiently implement the inner-most portion
($B^T D B$) of the fundamental finite element (FE) operator decomposition:

We take advantage of the tensor-product structure of both the finite element
basis and the quadrature rule to efficiently apply the action of $B$ without
necessarily computing its entries. This is generally know as *sum
factorization*.
In the case where we pre-compute and store the $D$ matrix, we call the algorithm
*partial assembly*.

## Template implementation

Below is a short description of the header files that are part of the initial reference implementation of the tensor-based finite element assembly and evaluation algorithms.

### config/tconfig.hpp

Configuration macros including the specification of the template blocking size (currently used in the matrix-matrix multiply).

### general/tassign.hpp

Helper templated function to allow single implementation of multiple assignment operations on tensor entries.

### linalg/tlayout.hpp

Classes describing fixed size tensor layouts. Implemented are standard strided layouts for 1D/2D/3D/4D tensors. Layouts support reshape and sub-tensor operations independent of the actual data storage.

This header also contains another set of "vector layout" classes for
converting scalar data indices into multi-component (vector) data indices.
They are used to describe the layout of vector `GridFunctions`

on global
degrees of freedom (similar to the `Ordering`

class). In the FE operator
decomposition, these classes are used by the templated `*_FiniteElementSpace`

classes (see `fem/tfespace.hpp`

below) to implement the actions of $G$ and
$G^T$ in the case of vector (multi-component) input and/or output fields.

### linalg/tmatrix.hpp

Small matrix operations, defined by specializations: determinant, adjugate,
etc. Matrix-matrix multiply, `C = A.B`

, simple and blocked version.

### linalg/ttensor.hpp

Fixed-size container classes `TVector`

, `TMatrix`

, `TTensor3`

, `TTensor4`

for
1D/2D/3D/4D tensors stored in column-major layout.
Element-wise tensor operations: `A {=,+=,*=} scalar`

; `A {=,+=,*=} B`

. The
latter allows different input and output layouts. With suitable layouts this
can be used to permute (transpose) tensors, extract sub-tensors, etc.
The tensor contraction/product operations:

`Mult_1_2`

: $C_{i,j,k} = \sum_s A_{s,j} B_{i,s,k}$`Mult_2_1`

: $C_{i,j,k} = \sum_s A_{i,s} B_{s,j,k}$`TensorAssemble`

: $C_{i,k,j,l} = \sum_s A_{s,i} A_{s,j} B_{k,s,l}~~$ and $~~D_{i,k,j,l} = \sum_s A_{i,s} B_{s,j} C_{k,s,l}$`TensorProduct`

: $C_{i,j,k,l} = A_{i,j,k} B_{j,l}$

### mesh/tmesh.hpp

The Mesh object templated by the finite element space and layout of the
nodes. Provides `MatchesGeometry()`

and `MatchesNodes()`

functions to verify
if compiled and runtime mesh agree.

### fem/tintrules.hpp

Quadrature rules templated by geometry (triangles, quads, etc.) and integration order.

### fem/tfe.hpp

H1 and L2 finite elements templated by geometry and polynomial order.

### fem/tfespace.hpp

Template `*_FiniteElementSpace`

classes providing the mappings between global
and local (element) degrees of freedom for H1 continuous and L2 discontinuous
spaces. In the FE operator decomposition, these classes provide the element
local action of $G$ (`Extract`

methods) and $G^T$ (`Assemble`

methods).

### fem/tcoefficient.hpp

Templated versions of classes derived from the abstract class `Coefficient`

.
It encapsulates physical quantities like material properties, sources,
boundary/initial conditions, etc. Its main functionality is to evaluate the
coefficient at all quadrature points in an element, which is then used in the
evaluation of the $D$ matrix.

### fem/teltrans.hpp

Element transformation class, templated on a mesh type and an integration
rule. It is constructed from a mesh (e.g. class `TMesh`

) and shape evaluator
(e.g. class `ShapeEvaluator`

) objects. Allows computation of physical
coordinates and Jacobian matrices corresponding to the reference integration
points. The desired result (a combination of coordinates and/or Jacobian
matrices at quadrature points, element attribute and/or element index) is
specified through the template sub-class `Result`

and stored in an object of
the same type. The idea of this approach is to eliminate unnecessary
evaluations if they are not needed. The need is determined based on what the
particular "users" need. The "users" are the templated `Coefficient`

and
`Kernel`

(see `fem/tbilininteg.hpp`

below) classes which specify what they
need through static constant boolean variables, e.g. `uses_coordinates`

,
`uses_Jacobians`

, etc.

### fem/tevaluator.hpp

Classes for evaluating FE basis, `ShapeEvaluator`

, and finite element
functions, `FieldEvaluator`

, and their derivatives at quadrature points,
templated by a finite element class and an integration rule class. These
correspond to the $B$ and $BG$ matrices above. Quads and hexes use the
tensor-product structure for fast evaluation.

### fem/tbilininteg.hpp

`Kernel`

classes (e.g. mass, diffusion) that represent the matrix $D$ from
the above FE operator decomposition. These classes also specify the *type* of
the local operator that needs to be applied before and after the $D$ matrix -
these are the $B_{in}$ and $B^T_{out}$ matrices, respectively. The product
$B^T_{out} D B_{in}$ is the local element matrix, which is the result when
using the `BilinearFormIntegrator`

classes. This specifications of the types
are given by static constant boolean variables, e.g. `in_values`

and
`out_values`

. The `Kernel`

classes provide the following methods:

`Action`

: evaluate the action of $D$*without*explicitly storing the partially assembled data; this is needed for matrix-free action.`Assemble`

: evaluate the partially assembled data, $D$, which is kernel-specific: e.g., for mass, the data is one scalar per quadrature point; for diffusion, the data is one $d\times d$ matrix (in $d$-dimensions) per quadrature point.`MultAssembled`

: perform the action of $D$ using the pre-computed partially assembled data.

### fem/tbilinearform.hpp

Bilinear form operator, templated on the mesh, finite element space, integration rule and bilinear form integrator. Corresponds to the $A$ matrix above. Provides various assembly and evaluation schemes:

`MultUnassembled`

: matrix-free action using the mesh nodes and the input vector.`Assemble`

,`MultAssembled`

: partial assembly and operator action using the partially assembled data at quadrature points and the input vector.`AssembleMatrix(DenseTensor &)`

: assemble the local element matrices and store them as`DenseTensor`

.`AssembleMatrix(SparseMatrix &)`

: assemble the operator in a global (CSR)`SparseMatrix`

.`AssembleBilinearForm(BilinearForm &)`

: assemble element matrices and add them to the bilinear form.

### miniapps/performance/makefile

By default `make`

builds the example drivers with the compiler used to
compile MFEM. If `g++`

was used, a pseudo-code dump file with the optimized
code will be generated (option `-fdump-tree-optimized-blocks`

). The `g++`

option `--param max-completely-peel-times=3`

prevents the compiler from
unrolling innermost loops (of size greater than 3), allowing the compiler to
vectorize them. Some options for optimization/vectorization with the clang
compiler are also included.

### miniapps/performance/ex*.cpp

High-performance templated versions of the corresponding `examples/ex*.cpp`

example codes.