
2. Getting started

2.1. Installation

Download z-Pares from

http://zpares.cs.tsukuba.ac.jp/download/zpares_0.9.5.tar.gz

then extract the tar.gz file as follows:

% tar zxf zpares_0.9.5.tar.gz

In what follows, we refer to the extracted directory as $ZPARES_HOME. Then

% cd $ZPARES_HOME

Copy one of the make.inc samples in the directory $ZPARES_HOME/Makefile.inc/ to $ZPARES_HOME/make.inc. For example:

% cp Makefile.inc/make.inc.par make.inc

Then modify make.inc according to the user's preference. If the user needs the MPI version, specify

USE_MPI = 1

in make.inc. Otherwise (for the sequential version), specify

USE_MPI = 0

If the user needs the sparse CSR interface,

USE_MUMPS = 1

should also be set. When USE_MUMPS = 1, the user should, before running make, install MUMPS, which is available at

http://mumps.enseeiht.fr/

and set MUMPS_DIR to the directory where MUMPS is installed.
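For orientation, a minimal make.inc sketch for an MPI build with the MUMPS interface might look as follows; the MUMPS path is only a placeholder, and the compiler-related settings from the copied sample should be kept as they are.

# hypothetical make.inc fragment (compiler settings from the sample omitted)
USE_MPI = 1
USE_MUMPS = 1
MUMPS_DIR = /path/to/mumps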

Now run

% make

to build z-Pares.

Now you have libzpares.a in $ZPARES_HOME/lib/ and zpares.mod in $ZPARES_HOME/include/. If you set USE_MUMPS = 1 in make.inc, you also have libzpares_mumps.a in $ZPARES_HOME/lib/.

For advanced users

If the user installed MUMPS with external ordering packages (e.g. METIS, SCOTCH), specify the paths to the package libraries in the variable MUMPS_DEPEND_LIBS in make.inc in order to compile the examples.

2.2. Example

Here, we describe an example that briefly shows what a user code using z-Pares looks like. The problem in this example is a generalized eigenvalue problem with

\[\begin{split}A = \begin{pmatrix} 1 + \mathrm{i} & 1 & & & 0 \\ & 2 + \mathrm{i}& 1 & & \\ & & \ddots & \ddots & \\ & & & \ddots & 1 \\ 0 & & & & 100 + \mathrm{i} \end{pmatrix} \in \mathbb{C}^{100 \times 100}\end{split}\]

and

\[\begin{split}B = \begin{pmatrix} I_{50} & & & 0 \\ & 0 & & \\ & & \ddots & \\ 0 & & & 0 \end{pmatrix} \in \mathbb{R}^{100 \times 100},\end{split}\]

where \(\mathrm{i}\) is the imaginary unit and \(I_{50}\) is the identity matrix of order 50. The finite eigenvalues are \(\{ 1+\mathrm{i},2+\mathrm{i},\dots,50+\mathrm{i} \}\). This example uses z-Pares under the following conditions:

  • A generalized eigenproblem with complex non-Hermitian matrices is solved
  • Dense interface is used
  • Only the higher-level parallelism (described in Section 2.4) is used

We set a circular contour path so that it encloses the eigenvalues \(\{ 1+\mathrm{i},2+\mathrm{i},\dots, 12 + \mathrm{i} \}\).

program main
  use zpares
  implicit none
  include 'mpif.h'
  integer, parameter :: mat_size = 100
  integer :: num_ev, info, i, j, L, N, M, Lmax, ncv, ierr, myrank
  double precision :: right
  double precision, allocatable :: res(:)
  complex(kind(0d0)) :: left
  complex(kind(0d0)), allocatable :: A(:,:), B(:,:), eigval(:), X(:,:)
  type(zpares_prm) :: prm
  
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

  allocate(A(mat_size,mat_size), B(mat_size,mat_size)) 

  ! Setting the matrices
  A = (0d0, 0d0)
  B = (0d0, 0d0)
  do i = 1, mat_size
     A(i,i) = cmplx(i, 1d0, kind(0d0))
     if ( i /= 1 )  A(i-1,i) = 1d0
     if ( i <= 50 )  B(i,i) = 1d0
  end do
  
  ! Initialization
  call zpares_init(prm)

  ! Change parameters from default values
  prm%high_comm = MPI_COMM_WORLD
  prm%L = 16
  prm%M = 6
 
  ! Get the number of column vectors
  ncv = zpares_get_ncv(prm)
  ! Memory allocation
  allocate(eigval(ncv), X(mat_size, ncv), res(ncv))

  ! Setting the circle
  left = cmplx(0d0, 1d0, kind(0d0)) ! The position of the left edge of the circle
  right = 12.5d0 ! The real part of the position of the right edge of the circle

  ! Call a z-Pares main subroutine
  call zpares_zdnsgegv &
  (prm, mat_size, A, mat_size, B, mat_size, left, right, num_ev, eigval, X, res, info)

  ! Finalization
  call zpares_finalize(prm)

  ! Show the result
  if ( myrank == 0 ) then
     write(*,*) 'This is an example for a non-Hermitian complex generalized'
     write(*,*) 'eigenvalue problem using a dense interface.'; write(*,*)
     write(*,*) 'Result : '
     write(*,'(A6,1X,"|",A41,1X,"|",A10)') 'Index', 'Eigenvalue', 'Residual'
     do i = 1, num_ev
        write(*,'(I6,2X,F18.15,1X,"+",1X,F18.15,1X,"i",2X,1pe10.1)') &
             i, real(eigval(i)), aimag(eigval(i)), res(i)
     end do
  end if

  call MPI_FINALIZE(ierr)
  deallocate(A, B, eigval, X, res)
end program main

The result will be:

 This is an example for a non-Hermitian complex generalized
 eigenvalue problem using a dense interface.

 Result :
 Index |                               Eigenvalue |  Residual
     1   0.999999999999999 +  1.000000000000001 i     2.4E-13
     2   1.999999999999999 +  0.999999999999999 i     1.1E-13
     3   3.000000000000001 +  1.000000000000001 i     1.3E-13
     4   4.000000000000002 +  1.000000000000001 i     1.4E-13
     5   5.000000000000013 +  1.000000000000000 i     2.0E-13
     6   5.999999999999999 +  0.999999999999998 i     6.0E-14
     7   7.000000000000010 +  1.000000000000000 i     2.4E-14
     8   7.999999999999995 +  0.999999999999997 i     2.7E-14
     9   9.000000000000012 +  1.000000000000000 i     6.9E-14
    10  10.000000000000002 +  0.999999999999998 i     4.8E-14
    11  11.000000000000018 +  1.000000000000006 i     6.7E-14
    12  12.000000000000021 +  0.999999999999998 i     1.3E-13

2.3. Basic concepts in z-Pares

Figure 2.1: z-Pares computes eigenvalues (blue crosses) located inside a contour path on the complex plane.

z-Pares implements an eigensolver based on contour integration and complex moments. The eigensolver computes the eigenvalues located inside a contour path on the complex plane, together with the corresponding eigenvectors. As illustrated in Figure 2.1, a numerical quadrature with \(N\) quadrature points is used to approximate the contour integral. The basis of the subspace used for extracting eigenpairs is computed by solving linear systems with multiple right-hand sides

\[(z_j B-A) Y_j = BV \quad ( j = 1,2,\dots,N )\]

for \(Y_j\), where \(z_j\) is a quadrature point, and \(V\) and \(Y_j\) are \(n\)-by-\(L\) matrices. \(V\) is called the source matrix and its column vectors are called source vectors.
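As a concrete illustration (a common choice for contour-integral eigensolvers, not necessarily the exact rule used internally by z-Pares), the trapezoidal rule on a circle with center \(\gamma\) and radius \(\rho\) places the quadrature points at

\[z_j = \gamma + \rho \, e^{2 \pi \mathrm{i} (j - 1/2)/N}, \quad j = 1, 2, \dots, N.\]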

Since the linear systems can be solved independently of one another, the computation is embarrassingly parallel. Additionally, each linear system itself can be solved in parallel.

2.4. Two-level MPI parallelism

Above the parallelism over quadrature points, there is an independent level of parallelism if multiple contour paths are given. Here we define three levels of parallelism:

  • Top level : Parallelism of computations on contour paths
  • Middle level : Parallelism of computations on quadrature points
  • Bottom level : Parallelism of computations for solving linear systems

z-Pares exploits the parallelism at the middle level and the bottom level using a pair of MPI communicators. The MPI communicators that manage the middle level and the bottom level are referred to as the higher-level communicator and the lower-level communicator, respectively. Since the top-level parallelism can be implemented entirely without communication, we have not added an implementation of this level to z-Pares. The user should manage the top-level parallelism themselves if needed.

The above descriptions are illustrated in Figure 2.2.

Figure 2.2: Three levels of parallelism and two-level MPI communicator

For the Rayleigh-Ritz procedure (described in Section 3.4) and the residual calculations, matrix-vector multiplications (mat-vecs) of \(A\) and \(B\) with multiple vectors need to be performed. The higher-level communicator manages the parallelization of mat-vecs over different vectors. The lower-level communicator manages the parallelization within a single mat-vec.

2.5. zpares_prm derived type

The derived type zpares_prm plays a central role in the use of z-Pares. zpares_prm consists of components that represent several input and output parameters as well as internal variables. See Section 3.3 for more details. In the rest of this users' guide, an entity of zpares_prm is referred to as prm.

2.6. zpares module

z-Pares subroutines are accessed through the Fortran 90 MODULE feature. To use z-Pares, the user needs to insert use zpares at the beginning of a program unit, as in

subroutine user_sub
  use zpares
  implicit none
  ! user code
end subroutine

Note that the user needs to add $ZPARES_HOME/include/ to the include path when compiling the user program.
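For example, compiling and linking a user program might look like the following; the compiler name and the LAPACK/BLAS link flags are environment-dependent assumptions (the dense interface relies on LAPACK):

% mpif90 -I$ZPARES_HOME/include user_prog.f90 -L$ZPARES_HOME/lib -lzpares -llapack -lblas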

If the user wants to use the sparse CSR MUMPS routines, the user needs to add use zpares_mumps in their code.

2.7. Initialization and finalization

Before calling any z-Pares main routines,

zpares_init

should be called with a single argument of type zpares_prm, as in

call zpares_init(prm)

zpares_init initializes an entity of zpares_prm. This subroutine should be called before any components of zpares_prm are modified.

After a z-Pares main routine finishes, regardless of whether it succeeded or not,

zpares_finalize

should be called with a single argument of type zpares_prm, as in

call zpares_finalize(prm)

zpares_finalize deallocates the memory space for the internally managed variables.
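Putting the two calls together, the typical lifecycle of prm is sketched below; the solver call in the middle stands for any of the z-Pares main routines, e.g. zpares_zdnsgegv from Section 2.2.

type(zpares_prm) :: prm

call zpares_init(prm)
! ... set parameters in prm and allocate eigval, X, res ...
! ... call a z-Pares main routine here ...
call zpares_finalize(prm)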

2.8. Memory allocation

The arguments eigval, X, and res of a z-Pares routine should be allocated before the user passes them to the routine. The user needs to allocate them using the return value of zpares_get_ncv. zpares_get_ncv can be called after setting the parameters in prm. Here is an example for a real-valued problem in double precision.

subroutine user_sub
  use zpares
  implicit none

  type(zpares_prm) :: prm
  integer :: mat_size ! matrix size
  integer :: ncv
  double precision, allocatable :: eigval(:), X(:,:), res(:)
  ! Declare here

  call zpares_init(prm)
  ! The user can set parameters in prm here

  ncv = zpares_get_ncv(prm)
  ! The user should not change parameters in prm below
  allocate(eigval(ncv))
  allocate(X(mat_size, ncv))
  allocate(res(ncv))
  ! Now the user can call a z-Pares subroutine
end subroutine

zpares_get_ncv returns the value of prm%Lmax * prm%M in the current release. In the above code, mat_size is the matrix size \(n\). This size specifier for X is not the matrix size when the lower-level MPI parallelism is used. See Section 2.12.3 for more details.

2.9. Setting a circle

The default shape of the contour is an ellipse. To specify the position and shape of the ellipse on the complex plane, the left and right edges are set in the input arguments left and right. The aspect ratio of the ellipse can be set with prm%asp_ratio. The default value of prm%asp_ratio is 1.0 (a perfect circle). Figure 2.3 illustrates the relation between the ellipse and the parameters.

Figure 2.3: The ellipse and the parameters

The most frequent eigenvalue problems arise with real symmetric (or complex Hermitian) \(A\), and real symmetric (or complex Hermitian) positive definite \(B\). In these cases the eigenvalues lie on the real axis. We provide special subroutines for this sub-class of eigenvalue problems. For these subroutines, left and right are replaced by the real-typed input arguments emin and emax for simplicity. The ellipse enclosing the targeted eigenvalues is then simply regarded as the search interval \([\) emin \(,\) emax \(]\). In this case, a vertically compressed ellipse gives better accuracy, e.g. prm%asp_ratio == 0.1.
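As an illustration, the contour settings for the two cases might be written as follows; the numerical values are arbitrary, and emin and emax stand for the arguments of the special subroutines mentioned above.

! General case: ellipse specified by its left and right edges
left  = cmplx(0d0, 1d0, kind(0d0))   ! left edge of the contour
right = 12.5d0                       ! real part of the right edge
prm%asp_ratio = 1.0d0                ! default: a circle

! Real symmetric / Hermitian definite case: search interval [emin, emax]
emin = 0.5d0
emax = 12.5d0
prm%asp_ratio = 0.1d0                ! a flattened ellipse gives better accuracy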

2.10. Relative residual norm

In z-Pares routines, relative residual norms are returned in res. The relative residual norms are defined as

\[\mathrm{res(i)} = \frac{\| A \boldsymbol{x}_i - \lambda_i B \boldsymbol{x}_i \|_2 }{\|A \boldsymbol{x}_i\|_2 + | \lambda_i | \| B \boldsymbol{x}_i \|_2},\]

where \(\lambda_i\) and \(\boldsymbol{x}_i\) are eigval(i) and X(:,i), respectively.

If the user does not need the residual norms, specify prm%calc_res = .false.. In that case res will not be touched by the z-Pares routine, and the matrix-vector multiplications for computing the residuals are avoided.
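For reference, res(i) can be checked directly from the definition; a minimal sketch for a dense complex problem (reusing the names A, B, X, eigval and mat_size from the example in Section 2.2) is:

! Hypothetical check of the relative residual norm of the i-th eigenpair
complex(kind(0d0)) :: Ax(mat_size), Bx(mat_size)
double precision :: my_res

Ax = matmul(A, X(:,i))
Bx = matmul(B, X(:,i))
my_res = sqrt(sum(abs(Ax - eigval(i)*Bx)**2)) &
       / ( sqrt(sum(abs(Ax)**2)) + abs(eigval(i))*sqrt(sum(abs(Bx)**2)) )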

2.11. Reverse communication interface

z-Pares basically delegates tasks of

  • Solving linear systems with multiple right hand sides \((z_j B - A) Y_j = BV\)
  • Performing matrix-vector multiplications of \(A\) and \(B\)

to the user, since the efficient algorithm and matrix data structure are highly problem dependent.

z-Pares delegates these tasks by using a reverse communication interface (RCI) rather than modern procedure pointers or external subroutines.

When using RCI, the user code communicates with the z-Pares subroutine in the following manner:

  1. The reverse communication flag prm%itask is initialized by zpares_init before entering the loop of 2.
  2. The z-Pares subroutine is called repeatedly until prm%itask == ZPARES_TASK_FINISH
  3. Inside the loop of 2., the tasks indicated by prm%itask are carried out by the user's implementation

By using RCI, the user does not have to define global, COMMON, or module variables to share information (such as matrix data) with a subroutine passed to the package, in contrast to approaches based on procedure pointers or external subroutines. RCI is also used in eigensolver packages such as ARPACK and FEAST.

To briefly show what a user code using RCI looks like, a skeleton code for solving a complex non-Hermitian problem is given below.

  do while ( prm%itask /= ZPARES_TASK_FINISH )
     call zpares_zrcigegv &
          (prm, nrow_local, z, mwork, cwork, left, right, num_ev, eigval, X, res, info)

     select case (prm%itask)
     case(ZPARES_TASK_FACTO)

        ! Here user factorizes (z*B - A)
        ! At the next return from zpares_zrcigegv,
        ! prm%itask==ZPARES_TASK_SOLVE with the same z is returned
        
     case(ZPARES_TASK_SOLVE)

        ! i = prm%ws; j = prm%ws+prm%nc-1
        ! Here user solves (z*B - A) X = cwork(:,i:j) 
        ! The solution X should be stored in cwork(:,i:j)

     case(ZPARES_TASK_MULT_A)

        ! iw = prm%ws; jw = prm%ws+prm%nc-1
        ! ix = prm%xs; jx = prm%xs+prm%nc-1
        ! Here the user performs matrix-vector multiplications:
        ! mwork(:,iw:jw) = A*X(:,ix:jx)

     case(ZPARES_TASK_MULT_B)
        
        ! iw = prm%ws; jw = prm%ws+prm%nc-1
        ! ix = prm%xs; jx = prm%xs+prm%nc-1
        ! Here the user performs matrix-vector multiplications:
        ! mwork(:,iw:jw) = B*X(:,ix:jx)

     end select
  end do

ZPARES_TASK_FINISH, ZPARES_TASK_FACTO, ZPARES_TASK_SOLVE, ZPARES_TASK_MULT_A and ZPARES_TASK_MULT_B are defined as integer, parameter module variables of the zpares module. The tasks delegated to the user are indicated by these values.

Implementing a linear solver is a heavy task for users. To allow users to get started with z-Pares easily, we provide two interfaces for specific matrix data structures:

  • Dense interface using LAPACK
  • Sparse CSR interface using MUMPS

2.12. Reverse communication on two-level MPI parallelism

This section describes reverse communication on the two-level MPI parallelism through step-by-step instructions using a trivial toy problem.

Let \(A\) and \(B\) be,

\[\begin{split}A = \begin{pmatrix} 2 & & & & 0 \\ & 4 & & & \\ & & 6 & & \\ & & & \ddots & \\ 0 & & & & 1400 \end{pmatrix} \in \mathbb{R}^{700 \times 700}\end{split}\]

and

\[\begin{split}B = \begin{pmatrix} 2 & & & 0 \\ & 2 & & \\ & & \ddots & \\ 0 & & & 2 \end{pmatrix} \in \mathbb{R}^{700 \times 700}.\end{split}\]

Obviously, the eigenvalues of the generalized eigenvalue problem \(A \boldsymbol{x} = \lambda B \boldsymbol{x}\) are \(\{1,2,\dots,700\}\) and the eigenvectors are multiples of the unit vectors.

2.12.1. Sequential code

Now we start with a sequential program. Assume that the diagonal elements of \(A\) and \(B\) are stored in the 1D arrays A and B. An integer variable mat_size is set to the matrix size \(n\) of \(A\) and \(B\).

For a sequential code, X and the working arrays should be allocated as

allocate(X(mat_size,ncv), mwork(mat_size,prm%Lmax), cwork(mat_size,prm%Lmax))

The reverse communication loop can be written as follows.

  do while ( prm%itask /= ZPARES_TASK_FINISH )
     call zpares_zrcigegv &
          (prm, mat_size, z, mwork, cwork, left, right, num_ev, eigval, X, res, info)

     select case (prm%itask)
     case(ZPARES_TASK_FACTO)

        ! Do nothing
        
     case(ZPARES_TASK_SOLVE)

        ! i = prm%ws; j = prm%ws+prm%nc-1
        ! Here the user solves (z*B - A) X = cwork(:,i:j) 
        ! The solution X should be stored in cwork(:,i:j)
        do k = 1, mat_size           
           do jj = prm%ws, prm%ws+prm%nc-1
              cwork(k,jj) = cwork(k,jj) / (z*B(k) - A(k))
           end do
        end do

     case(ZPARES_TASK_MULT_A)

        ! iw = prm%ws; jw = prm%ws+prm%nc-1
        ! ix = prm%xs; jx = prm%xs+prm%nc-1
        ! Here the user performs matrix-vector multiplications:
        ! mwork(:,iw:jw) = A*X(:,ix:jx)
        do k = 1, mat_size
           do jj = 1, prm%nc
              jjw = prm%ws + jj - 1
              jjx = prm%xs + jj - 1
              mwork(k,jjw) = A(k)*X(k,jjx)
           end do
        end do

     case(ZPARES_TASK_MULT_B)
        
        ! iw = prm%ws; jw = prm%ws+prm%nc-1
        ! ix = prm%xs; jx = prm%xs+prm%nc-1
        ! Here the user performs matrix-vector multiplications: 
        ! mwork(:,iw:jw) = B*X(:,ix:jx)
        do k = 1, mat_size
           do jj = 1, prm%nc
              jjw = prm%ws + jj - 1
              jjx = prm%xs + jj - 1
              mwork(k,jjw) = B(k)*X(k,jjx)
           end do
        end do

     end select
  end do

2.12.2. Parallel code with the higher-level parallelism

For a parallel code with the higher-level parallelism, the reverse communication loop is the same as the sequential one. The user only has to set prm%high_comm to the global communicator (in most cases MPI_COMM_WORLD) and make sure that prm%low_comm == MPI_COMM_SELF (the default value).
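In code, this amounts to setting only the higher-level communicator and leaving the lower-level one at its default:

prm%high_comm = MPI_COMM_WORLD
! prm%low_comm keeps its default value, MPI_COMM_SELF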

2.12.3. Parallel code with the lower-level parallelism

Apart from the matrix data, the memory space for X, mwork and cwork is dominant (especially for X). Thus these arrays can be problematic for large-scale problems. To avoid this problem, z-Pares allows the user to distribute the arrays X, mwork and cwork row-wise over the lower-level communicator. Now assume that we have 2 MPI processes. Let the number of rows assigned to each process be nrow_local; for example, nrow_local == 400 for rank 0 and nrow_local == 300 for rank 1. The sum of nrow_local over all processes should equal the matrix size, but the matrix size does not have to be divisible by nrow_local. Memory space for X and the working arrays should be allocated as

allocate(X(nrow_local,ncv), mwork(nrow_local,prm%Lmax), cwork(nrow_local,prm%Lmax))

Additionally, assume that the arrays X, mwork and cwork are row-wise distributed with a block distribution: rank 0 manages the first 400 rows of the arrays and rank 1 manages the remaining 300 rows. For ease of coding, we also assume that rank 0 holds the first 400 diagonal elements of \(A\) and \(B\) and rank 1 holds the remaining diagonals. In particular, \(A_{i,i}\ (i=1,2,\dots,400)\) are stored in A(1:400) on rank 0 and \(A_{i,i}\ (i=401,402,\dots,700)\) are stored in A(1:300) on rank 1, and likewise for \(B\).

With the above settings, the reverse communication loop can be written as follows.

  if ( myrank == 0 ) then
     nrow_local = 400
  else if ( myrank == 1 ) then
     nrow_local = 300
  end if

  ! Here the user set the values of the distributed A and B
  ! and memory allocations for X, mwork and cwork

  do while ( prm%itask /= ZPARES_TASK_FINISH )
     call zpares_zrcigegv &
        (prm, nrow_local, z, mwork, cwork, left, right, num_ev, eigval, X, res, info)

     select case (prm%itask)
     case(ZPARES_TASK_FACTO)

        ! Do nothing
        
     case(ZPARES_TASK_SOLVE)

        ! i = prm%ws; j = prm%ws+prm%nc-1
        ! Here the user solves (z*B - A) X = cwork(:,i:j) 
        ! The solution X should be stored in cwork(:,i:j)
        do k = 1, nrow_local
           do jj = prm%ws, prm%ws+prm%nc-1
              cwork(k,jj) = cwork(k,jj) / (z*B(k) - A(k))
           end do
        end do

     case(ZPARES_TASK_MULT_A)

        ! iw = prm%ws; jw = prm%ws+prm%nc-1
        ! ix = prm%xs; jx = prm%xs+prm%nc-1
        ! Here the user performs matrix-vector multiplications:
        ! mwork(:,iw:jw) = A*X(:,ix:jx)
        do k = 1, nrow_local
           do jj = 1, prm%nc
              jjw = prm%ws + jj - 1
              jjx = prm%xs + jj - 1
              mwork(k,jjw) = A(k)*X(k,jjx)
           end do
        end do

     case(ZPARES_TASK_MULT_B)
        
        ! iw = prm%ws; jw = prm%ws+prm%nc-1
        ! ix = prm%xs; jx = prm%xs+prm%nc-1
        ! Here the user performs matrix-vector multiplications: 
        ! mwork(:,iw:jw) = B*X(:,ix:jx)
        do k = 1, nrow_local
           do jj = 1, prm%nc
              jjw = prm%ws + jj - 1
              jjx = prm%xs + jj - 1
              mwork(k,jjw) = B(k)*X(k,jjx)
           end do
        end do

     end select
  end do

In this case, nrow_local is passed as the second argument of the z-Pares routine instead of mat_size. Note that the row distribution of X, mwork and cwork does not have to be a block distribution. It can be cyclic, block-cyclic, or a much more complicated distribution, provided that the distributions of X, mwork and cwork agree with each other. To use the lower-level parallelism without the higher-level parallelism, the user has to set prm%low_comm to the global communicator (in most cases MPI_COMM_WORLD) and make sure that prm%high_comm == MPI_COMM_SELF (the default value).

Of course, for general matrices, solving the linear systems and performing the matrix-vector multiplications require communication. The user has to manage this communication and the distribution of the matrix data over the lower-level communicator.
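Analogously to Section 2.12.2, using only the lower-level parallelism amounts to:

prm%low_comm = MPI_COMM_WORLD
! prm%high_comm keeps its default value, MPI_COMM_SELF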

2.12.4. Parallel code with the two-level parallelism

Now we consider using both the higher-level and the lower-level parallelism. For a code with the two-level parallelism, the reverse communication loop is the same as that of the lower-level case.

The user needs to create two MPI communicators that manage each level of parallelism by splitting the global communicator (in most cases MPI_COMM_WORLD). An example of such a split is shown below.

  n_procs_low = 8
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, n_procs, ierr)
  high_color = mod(myrank, n_procs_low)
  high_key = myrank
  low_color = myrank / n_procs_low
  low_key = myrank
  call MPI_COMM_SPLIT(MPI_COMM_WORLD, high_color, high_key, prm%high_comm, ierr)
  call MPI_COMM_SPLIT(MPI_COMM_WORLD, low_color, low_key, prm%low_comm, ierr)

Note that this code only works if n_procs is divisible by n_procs_low. Here, n_procs_high and n_procs_low are obtained by

call MPI_COMM_SIZE(prm%high_comm, n_procs_high, ierr)
call MPI_COMM_SIZE(prm%low_comm, n_procs_low, ierr)

When the two-level parallelism is used, the following conditions must be satisfied:

  1. n_procs_high*n_procs_low == n_procs on all MPI processes (a sanity check is sketched after this list)
  2. the row distributions of X, mwork and cwork agree within the lower-level communicator
  3. the distribution in 2. also agrees across all processes of the higher-level communicator.
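A minimal sketch of such a check for condition 1. might be:

call MPI_COMM_SIZE(prm%high_comm, n_procs_high, ierr)
call MPI_COMM_SIZE(prm%low_comm, n_procs_low, ierr)
if ( n_procs_high * n_procs_low /= n_procs ) then
   if ( myrank == 0 ) write(*,*) 'Error: invalid communicator split'
   call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
end if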

2.13. Efficient implementations for specific problems

In the descriptions above, we have used the subroutines for complex non-Hermitian problems. z-Pares also provides efficient implementations that exploit specific features of the problem. The following features are taken into account:

  • Symmetry or hermiticity of matrix \(A\) and \(B\)
  • Positive definiteness of \(B\)
  • \(B = I\) (Standard eigenvalue problem)

We recommend that the user let z-Pares exploit these features by setting the appropriate parameters, in order to obtain maximum efficiency.