Download z-Pares from

http://zpares.cs.tsukuba.ac.jp/download/zpares_0.9.5.tar.gz

then unarchive the tar.gz file as

`% tar zxf zpares_0.9.5.tar.gz`

In what follows, we refer to the extracted directory as `$ZPARES_HOME`.
Then

`% cd $ZPARES_HOME`

Copy one of the sample `make.inc` files
from the directory `$ZPARES_HOME/Makefile.inc/` to `$ZPARES_HOME/make.inc`.
For example:

`% cp Makefile.inc/make.inc.par make.inc`

Then modify `make.inc` according to the user's preference.
If the user needs to build the MPI version, specify

```
USE_MPI = 1
```

in `make.inc`.
Otherwise (for the sequential version), specify

```
USE_MPI = 0
```

If the user needs the sparse CSR interface,

```
USE_MUMPS = 1
```

should be added.
When `USE_MUMPS = 1`,
the user must install MUMPS (available from its official distribution site) before typing the command `make`,
and set `MUMPS_DIR` in `make.inc` to the path of the MUMPS installation directory.
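
For reference, here is a minimal `make.inc` sketch for an MPI build with the MUMPS interface; the path is illustrative and the remaining compiler-related variables in the sample file are left unchanged:

```
# Illustrative settings; point MUMPS_DIR at your MUMPS installation directory
USE_MPI = 1
USE_MUMPS = 1
MUMPS_DIR = /path/to/MUMPS_install
```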

Now type

`% make`

to build z-Pares.

Now you have `libzpares.a` in `$ZPARES_HOME/lib/`
and `zpares.mod` in `$ZPARES_HOME/include/`.
If you set `USE_MUMPS = 1` in `make.inc`,
you also have `libzpares_mumps.a` in `$ZPARES_HOME/lib/`.

*For advanced users*

If the user installed MUMPS with external ordering packages (e.g. METIS, SCOTCH),
specify the paths to the package libraries in the variable `MUMPS_DEPEND_LIBS` in `make.inc`
in order to compile the examples.
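
For example, if MUMPS was built with METIS and SCOTCH, an entry along the following lines might be used (paths and library names are illustrative and depend on the installation):

```
MUMPS_DEPEND_LIBS = -L/path/to/metis/lib -lmetis -L/path/to/scotch/lib -lesmumps -lscotch -lscotcherr
```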

Here, we describe an example that briefly shows what a user code using z-Pares looks like. The problem in this example is a generalized eigenvalue problem with

\[\begin{split}A =
\begin{pmatrix}
1 + \mathrm{i} & 1 & & & 0 \\
& 2 + \mathrm{i}& 1 & & \\
& & \ddots & \ddots & \\
& & & \ddots & 1 \\
0 & & & & 100 + \mathrm{i}
\end{pmatrix}
\in \mathbb{C}^{100 \times 100}\end{split}\]

and

\[\begin{split}B =
\begin{pmatrix}
I_{50} & & & 0 \\
& 0 & & \\
& & \ddots & \\
0 & & & 0
\end{pmatrix}
\in \mathbb{R}^{100 \times 100},\end{split}\]

where \(\mathrm{i}\) is the imaginary unit and \(I_{50}\) is the identity matrix of order 50. The finite eigenvalues are \(\{ 1+\mathrm{i},2+\mathrm{i},\dots,50+\mathrm{i} \}\). This example uses z-Pares under the following conditions:

- A generalized eigenproblem with complex non-Hermitian matrices is solved
- Dense interface is used
- Only the higher-level parallelism (described in Section 2.4) is used

We set a circle contour path so that it encloses eigenvalues \(\{ 1+\mathrm{i},2+\mathrm{i},\dots, 12 + \mathrm{i} \}\).

```
program main
   use zpares
   implicit none
   include 'mpif.h'
   integer, parameter :: mat_size = 100
   integer :: num_ev, info, i, j, L, N, M, Lmax, ncv, ierr, myrank
   double precision :: right
   double precision, allocatable :: res(:)
   complex(kind(0d0)) :: left
   complex(kind(0d0)), allocatable :: A(:,:), B(:,:), eigval(:), X(:,:)
   type(zpares_prm) :: prm
   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
   allocate(A(mat_size,mat_size), B(mat_size,mat_size))
   ! Setting the matrices
   A = (0d0, 0d0)
   B = (0d0, 0d0)
   do i = 1, mat_size
      A(i,i) = cmplx(i, 1d0)
      if ( i /= 1 ) A(i-1,i) = 1d0
      if ( i <= 50 ) B(i,i) = 1d0
   end do
   ! Initialization
   call zpares_init(prm)
   ! Change parameters from default values
   prm%high_comm = MPI_COMM_WORLD
   prm%L = 16
   prm%M = 6
   ! Get the number of column vectors
   ncv = zpares_get_ncv(prm)
   ! Memory allocation
   allocate(eigval(ncv), X(mat_size, ncv), res(ncv))
   ! Setting the circle
   left = cmplx(0d0, 1d0) ! The position of the left edge of the circle
   right = 12.5d0 ! The real part of the position of the right edge of the circle
   ! Call a z-Pares main subroutine
   call zpares_zdnsgegv &
        (prm, mat_size, A, mat_size, B, mat_size, left, right, num_ev, eigval, X, res, info)
   ! Finalization
   call zpares_finalize(prm)
   ! Show the result
   if ( myrank == 0 ) then
      write(*,*) 'This is an example for a non-Hermitian complex generalized'
      write(*,*) 'eigenvalue problem using a dense interface.'; write(*,*)
      write(*,*) 'Result : '
      write(*,'(A6,1X,"|",A41,1X,"|",A10)') 'Index', 'Eigenvalue', 'Residual'
      do i = 1, num_ev
         write(*,'(I6,2X,F18.15,1X,"+",1X,F18.15,1X,"i",2X,1pe10.1)') &
              i, real(eigval(i)), aimag(eigval(i)), res(i)
      end do
   end if
   call MPI_FINALIZE(ierr)
   deallocate(A, B, eigval, X, res)
end program main
```

The result will be:

```
This is an example for a non-Hermitian complex generalized
eigenvalue problem using a dense interface.
Result :
Index | Eigenvalue | Residual
1 0.999999999999999 + 1.000000000000001 i 2.4E-13
2 1.999999999999999 + 0.999999999999999 i 1.1E-13
3 3.000000000000001 + 1.000000000000001 i 1.3E-13
4 4.000000000000002 + 1.000000000000001 i 1.4E-13
5 5.000000000000013 + 1.000000000000000 i 2.0E-13
6 5.999999999999999 + 0.999999999999998 i 6.0E-14
7 7.000000000000010 + 1.000000000000000 i 2.4E-14
8 7.999999999999995 + 0.999999999999997 i 2.7E-14
9 9.000000000000012 + 1.000000000000000 i 6.9E-14
10 10.000000000000002 + 0.999999999999998 i 4.8E-14
11 11.000000000000018 + 1.000000000000006 i 6.7E-14
12 12.000000000000021 + 0.999999999999998 i 1.3E-13
```

z-Pares implements an eigensolver based on contour integration and complex moments. The eigensolver computes the eigenvalues located inside a contour path on the complex plane, together with the corresponding eigenvectors. As described in Figure 2.1, a numerical quadrature with \(N\) quadrature points is used to approximate the contour integral. The basis of the subspace used for extracting eigenpairs is computed by solving linear systems with multiple right-hand sides

\[(z_j B-A) Y_j = BV \quad ( j = 1,2,\dots,N )\]

for \(Y_j\), where \(z_j\) is a quadrature point, and \(V\) and \(Y_j\) are \(n\)-by-\(L\) matrices. \(V\) is called the source matrix and its column vectors are called source vectors.
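
Schematically, the quadrature combines the solutions \(Y_j\) into approximate contour-integral quantities of the form (the weights \(w_j\) depend on the chosen quadrature rule; this is a generic sketch of the contour-integral eigensolver, not the exact formula used internally):

\[S_k = \frac{1}{2 \pi \mathrm{i}} \oint_\Gamma z^k (z B - A)^{-1} B V \, dz \approx \sum_{j=1}^{N} w_j z_j^k Y_j \quad ( k = 0, 1, \dots ),\]

and the eigenpairs inside the contour \(\Gamma\) are extracted from the subspace spanned by these quantities.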

Since the linear systems can be solved independently of each other, the computations can be embarrassingly parallelized. Additionally, each linear system itself can be solved in parallel.

Above the parallelism of quadrature points, there is independent parallelism if multiple contour paths are given. Here we define three levels of parallelism:

- **Top level**: Parallelism of computations over contour paths
- **Middle level**: Parallelism of computations over quadrature points
- **Bottom level**: Parallelism of computations for solving linear systems

z-Pares exploits the parallelism at the middle level and the bottom level using a pair of MPI communicators.
The MPI communicators which manage the middle level and the bottom level are referred to as
*the higher-level communicator* and *the lower-level communicator*, respectively.
Since the top-level parallelism can be implemented entirely without communication,
this level is not implemented as a feature of z-Pares.
The user should manage the top-level parallelism themselves if needed.

The above descriptions are illustrated in Figure 2.2.
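
As a minimal illustration of user-managed top-level parallelism (all names below, such as `group_size` and `path_comm`, are the user's own and not part of the z-Pares API), the global communicator can be split once per contour path and the resulting communicator used in place of `MPI_COMM_WORLD` when setting up z-Pares:

```
! Illustrative sketch: one contour path per group of group_size processes
integer :: myrank, group_size, path_id, path_comm, ierr
call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
group_size = 4                   ! illustrative group size
path_id = myrank / group_size    ! index of the contour path handled by this process
call MPI_COMM_SPLIT(MPI_COMM_WORLD, path_id, myrank, path_comm, ierr)
! Set the contour parameters for path number path_id, then use path_comm
! (instead of MPI_COMM_WORLD) when setting prm%high_comm and/or prm%low_comm.
```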

For the Rayleigh-Ritz procedure (described in Section 3.4) and the residual calculations, matrix-vector multiplications (mat-vecs) of \(A\) and \(B\) with multiple vectors need to be performed. The higher-level communicator manages the parallelization over mat-vecs with different vectors. The lower-level communicator manages the parallelization within a single mat-vec.

The derived type `zpares_prm` plays a central role in the use of z-Pares.
`zpares_prm` consists of components that represent several input and output parameters as well as internal variables.
See Section 3.3 for more details.
In the rest of this users' guide, an entity of type `zpares_prm` is referred to as `prm`.

z-Pares subroutines are accessed through a Fortran 90 module.
To use z-Pares, the user needs to insert `use zpares`
at the first line of a program unit, as in

```
subroutine user_sub
   use zpares
   implicit none
   ! user code
end subroutine
```

Note that the user needs to add `$ZPARES_HOME/include/`
to the include path when compiling the user program.
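
For example, a sequential user program might be compiled and linked as follows (the compiler name and library order are illustrative; z-Pares typically also needs to be linked with LAPACK/BLAS, and an MPI build would use the MPI compiler wrapper such as `mpif90`):

`% gfortran -I$ZPARES_HOME/include user_code.f90 -L$ZPARES_HOME/lib -lzpares -llapack -lblas -o user_prog`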

If the user wants to use the sparse CSR MUMPS routines,
the user needs to add `use zpares_mumps` in their code.

Before calling any z-Pares main routines,

**zpares_init**

should be called with a single argument of type `zpares_prm`, as in

```
call zpares_init(prm)
```

`zpares_init` initializes an entity of `zpares_prm`.
This subroutine should be called before any modifications of the components of `zpares_prm`.

After the z-Pares main routine finishes, regardless of whether it succeeded or not,

**zpares_finalize**

should be called with a single argument of type `zpares_prm`, as in

```
call zpares_finalize(prm)
```

`zpares_finalize` deallocates the memory space for the internally managed variables.

The arguments `eigval`, `X`, and `res` of a z-Pares routine should be allocated
before the user passes them to the routine.
The user needs to allocate them using the return value of `zpares_get_ncv`.
`zpares_get_ncv` can be called after setting the parameters in `prm`.
Here is an example for a real-valued problem in double precision.

```
subroutine user_sub
   use zpares
   implicit none
   type(zpares_prm) :: prm
   integer :: mat_size ! matrix size
   integer :: ncv
   double precision, allocatable :: eigval(:), X(:,:), res(:)
   ! Declare here
   call zpares_init(prm)
   ! The user can set parameters in prm here
   ncv = zpares_get_ncv(prm)
   ! The user should not change parameters in prm below
   allocate(eigval(ncv))
   allocate(X(mat_size, ncv))
   allocate(res(ncv))
   ! Now the user can call a z-Pares subroutine
```

`zpares_get_ncv` returns the value of `prm%Lmax * prm%M` in the current release.
In the above code, `mat_size` is the matrix size \(n\).
The first size specifier of `X` is not the matrix size when the lower-level MPI parallelism is used.
See Section 2.12.3 for more details.

The default shape of the contour is an ellipse.
To indicate the position and the shape of the ellipse on the complex plane,
the left and right edges are set via the input arguments `left` and `right`.
The aspect ratio of the ellipse can be set with `prm%asp_ratio`.
The default value of `prm%asp_ratio` is 1.0 (an exact circle).
Figure 2.3 illustrates the relations between the ellipse and the parameters.

The most frequent eigenvalue problems arise with
real symmetric (or complex Hermitian) \(A\), and
real symmetric (or complex Hermitian) and positive definite \(B\).
In these cases the eigenvalues lie on the real axis.
We provide special subroutines for this sub-class of eigenvalue problems.
For these subroutines, `left` and `right` are replaced by the real-typed
input arguments `emin` and `emax` for simplicity.
The ellipse enclosing the targeted eigenvalues is simply specified by
the search interval \([\) `emin` \(,\) `emax` \(]\).
In this case, a vertically compressed ellipse provides better accuracy, e.g. `prm%asp_ratio == 0.1`.
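
For example, to search the interval \([0.5, 12.5]\) on the real axis with a vertically compressed ellipse, the relevant settings might look like (values illustrative):

```
emin = 0.5d0            ! left end of the search interval
emax = 12.5d0           ! right end of the search interval
prm%asp_ratio = 0.1d0   ! vertically compressed ellipse
```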

In z-Pares routines, relative residual norms are returned in `res`.
The relative residual norms are defined as

\[\mathrm{res(i)} = \frac{\| A \boldsymbol{x}_i - \lambda_i B \boldsymbol{x}_i \|_2 }{\|A \boldsymbol{x}_i\|_2 + | \lambda_i | \| B \boldsymbol{x}_i \|_2},\]

where \(\lambda_i\) and \(\boldsymbol{x}_i\) are `eigval(i)` and `X(:,i)`, respectively.
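
As a sketch of how this quantity could be checked by the user for a dense problem (this mirrors the definition above and is not z-Pares' internal code; `A`, `B`, `X`, and `eigval` are as in the dense example):

```
! Relative residual of eigenpair i for a dense complex problem (illustrative check)
complex(kind(0d0)) :: rA(mat_size), rB(mat_size)
double precision :: res_i
rA = matmul(A, X(:,i))   ! A*x_i
rB = matmul(B, X(:,i))   ! B*x_i
res_i = sqrt(sum(abs(rA - eigval(i)*rB)**2)) &
      / ( sqrt(sum(abs(rA)**2)) + abs(eigval(i))*sqrt(sum(abs(rB)**2)) )
```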

If the user does not need the residual norms, specify `prm%calc_res = .false.`.
If `prm%calc_res = .false.`, `res` will not be touched by the z-Pares routine
and the matrix-vector multiplications for computing the residuals will be avoided.

z-Pares basically delegates tasks of

- Solving linear systems with multiple right hand sides \((z_j B - A) Y_j = BV\)
- Performing matrix-vector multiplications of \(A\) and \(B\)

to the user, since efficient algorithms and matrix data structures are strongly problem dependent.

z-Pares delegates these tasks by using a reverse communication interface (RCI) rather than procedure pointers or external subroutines.

When using the RCI, the user code communicates with the z-Pares subroutine in the following manner:

1. The reverse communication flag `prm%itask` is initialized by `zpares_init` before entering the loop of 2.
2. The z-Pares subroutine is repeatedly called until `prm%itask == ZPARES_TASK_FINISH`.
3. Inside the loop of 2., the tasks indicated by `prm%itask` are completed by the user's implementation.

By using the RCI, the user does not have to define global, COMMON, or module variables to share information (such as matrix data) with a subroutine given to the package, in contrast to approaches using procedure pointers or external subroutines. RCI is also used in eigensolver packages such as ARPACK and FEAST.

To briefly describe what a user code using the RCI looks like, a skeleton code for solving a complex non-Hermitian problem is given below.

```
do while ( prm%itask /= ZPARES_TASK_FINISH )
   call zpares_zrcigegv &
        (prm, nrow_local, z, mwork, cwork, left, right, num_ev, eigval, X, res, info)
   select case (prm%itask)
   case(ZPARES_TASK_FACTO)
      ! Here user factorizes (z*B - A)
      ! At the next return from zpares_zrcigegv,
      ! prm%itask==ZPARES_TASK_SOLVE with the same z is returned
   case(ZPARES_TASK_SOLVE)
      ! i = prm%ws; j = prm%ws+prm%nc-1
      ! Here user solves (z*B - A) X = cwork(:,i:j)
      ! The solution X should be stored in cwork(:,i:j)
   case(ZPARES_TASK_MULT_A)
      ! iw = prm%ws; jw = prm%ws+prm%nc-1
      ! ix = prm%xs; jx = prm%xs+prm%nc-1
      ! Here the user performs matrix-vector multiplications:
      ! mwork(:,iw:jw) = A*X(:,ix:jx)
   case(ZPARES_TASK_MULT_B)
      ! iw = prm%ws; jw = prm%ws+prm%nc-1
      ! ix = prm%xs; jx = prm%xs+prm%nc-1
      ! Here the user performs matrix-vector multiplications:
      ! mwork(:,iw:jw) = B*X(:,ix:jx)
   end select
end do
```

`ZPARES_TASK_FINISH`, `ZPARES_TASK_FACTO`, `ZPARES_TASK_SOLVE`, `ZPARES_TASK_MULT_A` and `ZPARES_TASK_MULT_B`
are defined as `integer, parameter` module constants in the `zpares` module.
The tasks delegated to the user are indicated by these values.

Implementing a linear solver is a heavy task for users. To allow users to easily get started with z-Pares, we provide two interfaces for specific matrix data structures:

- Dense interface using LAPACK
- Sparse CSR interface using MUMPS

This section describes the manner of reverse communication with the two-level MPI parallelism through step-by-step instructions using a trivial toy problem.

Let \(A\) and \(B\) be

\[\begin{split}A =
\begin{pmatrix}
2 & & & & 0 \\
& 4 & & & \\
& & 6 & & \\
& & & \ddots & \\
0 & & & & 1400
\end{pmatrix}
\in \mathbb{R}^{700 \times 700}\end{split}\]

and

\[\begin{split}B =
\begin{pmatrix}
2 & & & 0 \\
& 2 & & \\
& & \ddots & \\
0 & & & 2
\end{pmatrix}
\in \mathbb{R}^{700 \times 700}.\end{split}\]

Obviously, the eigenvalues of the generalized eigenvalue problem \(A \boldsymbol{x} = \lambda B \boldsymbol{x}\) are \(\{1,2,\dots,700\}\) and the eigenvectors are multiples of the unit vectors.

Now we start with a sequential program.
Assume that the diagonal elements of \(A\) and \(B\)
are stored in the 1D arrays `A` and `B`.
An integer variable `mat_size` is set to
the matrix size \(n\) of \(A\) and \(B\).
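
A minimal sketch of this setup, using the matrix definitions above, could be:

```
integer, parameter :: mat_size = 700
double precision :: A(mat_size), B(mat_size)
integer :: i
do i = 1, mat_size
   A(i) = 2d0*i   ! diagonal of A: 2, 4, ..., 1400
   B(i) = 2d0     ! diagonal of B: all entries equal to 2
end do
```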

For a sequential code,
`X` and working space should be allocated as

```
allocate(X(mat_size,ncv), mwork(mat_size,prm%Lmax), cwork(mat_size,prm%Lmax))
```

The reverse communication loop can be written as follows.

```
do while ( prm%itask /= ZPARES_TASK_FINISH )
   call zpares_zrcigegv &
        (prm, mat_size, z, mwork, cwork, left, right, num_ev, eigval, X, res, info)
   select case (prm%itask)
   case(ZPARES_TASK_FACTO)
      ! Do nothing
   case(ZPARES_TASK_SOLVE)
      ! i = prm%ws; j = prm%ws+prm%nc-1
      ! Here the user solves (z*B - A) X = cwork(:,i:j)
      ! The solution X should be stored in cwork(:,i:j)
      do k = 1, mat_size
         do jj = prm%ws, prm%ws+prm%nc-1
            cwork(k,jj) = cwork(k,jj) / (z*B(k) - A(k))
         end do
      end do
   case(ZPARES_TASK_MULT_A)
      ! iw = prm%ws; jw = prm%ws+prm%nc-1
      ! ix = prm%xs; jx = prm%xs+prm%nc-1
      ! Here the user performs matrix-vector multiplications:
      ! mwork(:,iw:jw) = A*X(:,ix:jx)
      do k = 1, mat_size
         do jj = 1, prm%nc
            jjw = prm%ws + jj - 1
            jjx = prm%xs + jj - 1
            mwork(k,jjw) = A(k)*X(k,jjx)
         end do
      end do
   case(ZPARES_TASK_MULT_B)
      ! iw = prm%ws; jw = prm%ws+prm%nc-1
      ! ix = prm%xs; jx = prm%xs+prm%nc-1
      ! Here the user performs matrix-vector multiplications:
      ! mwork(:,iw:jw) = B*X(:,ix:jx)
      do k = 1, mat_size
         do jj = 1, prm%nc
            jjw = prm%ws + jj - 1
            jjx = prm%xs + jj - 1
            mwork(k,jjw) = B(k)*X(k,jjx)
         end do
      end do
   end select
end do
```

For a parallel code with the higher-level parallelism,
the reverse communication loop is the same as the sequential one.
The user only has to set `prm%high_comm`
to the global communicator (in most cases `MPI_COMM_WORLD`),
and make sure that `prm%low_comm == MPI_COMM_SELF` (the default value).

Apart from the matrix data,
the memory spaces for `X`, `mwork` and `cwork` are dominant (especially for `X`).
Thus they can be problematic for large-scale problems.
To avoid this problem, z-Pares allows the user to let the arrays
`X`, `mwork` and `cwork` be row-wise distributed over the lower-level communicator.
Now assume that we have 2 MPI processes.
Let the number of the rows associated to each process be `nrow_local`.
For example, `nrow_local == 400` for rank 0 and `nrow_local == 300` for rank 1.
The sum of `nrow_local` over all processes should be the matrix size,
but the matrix size does not have to be divisible by `nrow_local`.
Memory spaces for `X` and the working arrays should be allocated as

```
allocate(X(nrow_local,ncv), mwork(nrow_local,prm%Lmax), cwork(nrow_local,prm%Lmax))
```

Additionally, assume that the arrays `X`, `mwork` and `cwork` are row-wise distributed
with a block distribution.
Rank 0 manages the first 400 rows of the arrays
and rank 1 manages the remaining 300 rows.
For ease of coding, we also assume that
rank 0 has the first 400 diagonal elements of \(A\) and \(B\)
and rank 1 has the remaining diagonals.
In particular, \(A_{i,i}\ (i=1,2,\dots,400)\) are stored in `A(1:400)` of rank 0 and
\(A_{i,i}\ (i=401,402,\dots,700)\) are stored in `A(1:300)` of rank 1.
Likewise for \(B\).

According to the above settings, the reverse communication loop can be written as follows.

```
if ( myrank == 0 ) then
   nrow_local = 400
else if ( myrank == 1 ) then
   nrow_local = 300
end if
! Here the user sets the values of the distributed A and B
! and allocates memory for X, mwork and cwork
do while ( prm%itask /= ZPARES_TASK_FINISH )
   call zpares_zrcigegv &
        (prm, nrow_local, z, mwork, cwork, left, right, num_ev, eigval, X, res, info)
   select case (prm%itask)
   case(ZPARES_TASK_FACTO)
      ! Do nothing
   case(ZPARES_TASK_SOLVE)
      ! i = prm%ws; j = prm%ws+prm%nc-1
      ! Here the user solves (z*B - A) X = cwork(:,i:j)
      ! The solution X should be stored in cwork(:,i:j)
      do k = 1, nrow_local
         do jj = prm%ws, prm%ws+prm%nc-1
            cwork(k,jj) = cwork(k,jj) / (z*B(k) - A(k))
         end do
      end do
   case(ZPARES_TASK_MULT_A)
      ! iw = prm%ws; jw = prm%ws+prm%nc-1
      ! ix = prm%xs; jx = prm%xs+prm%nc-1
      ! Here the user performs matrix-vector multiplications:
      ! mwork(:,iw:jw) = A*X(:,ix:jx)
      do k = 1, nrow_local
         do jj = 1, prm%nc
            jjw = prm%ws + jj - 1
            jjx = prm%xs + jj - 1
            mwork(k,jjw) = A(k)*X(k,jjx)
         end do
      end do
   case(ZPARES_TASK_MULT_B)
      ! iw = prm%ws; jw = prm%ws+prm%nc-1
      ! ix = prm%xs; jx = prm%xs+prm%nc-1
      ! Here the user performs matrix-vector multiplications:
      ! mwork(:,iw:jw) = B*X(:,ix:jx)
      do k = 1, nrow_local
         do jj = 1, prm%nc
            jjw = prm%ws + jj - 1
            jjx = prm%xs + jj - 1
            mwork(k,jjw) = B(k)*X(k,jjx)
         end do
      end do
   end select
end do
```

In this case, we can see that `nrow_local` is passed as the second argument of the z-Pares routine
instead of `mat_size`.
Note that the row distribution of `X`, `mwork` and `cwork`
does not have to be a block distribution.
It can be cyclic, block-cyclic, or a much more complicated distribution,
provided that the distributions of `X`, `mwork` and `cwork` agree.
To use the lower-level parallelism without the higher-level parallelism,
the user has to set `prm%low_comm` to the global communicator
(in most cases `MPI_COMM_WORLD`),
and make sure that `prm%high_comm == MPI_COMM_SELF` (the default value).

Of course, for general matrices, the computations for solving the linear systems and for the matrix-vector multiplications require communication. The user has to manage the communication and the matrix data distribution within the lower-level communicator.

Now we consider using both the higher-level and the lower-level parallelism. For a code with the two-level parallelism, the reverse communication loop is the same as that of the lower-level one.

The user needs to create two MPI-communicators
that manage each level of parallelism
by splitting the global communicator (in most cases `MPI_COMM_WORLD`).
An example of the splitting is shown below.

```
n_procs_low = 8
call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, n_procs, ierr)
high_color = mod(myrank, n_procs_low)
high_key = myrank
low_color = myrank / n_procs_low
low_key = myrank
call MPI_COMM_SPLIT(MPI_COMM_WORLD, high_color, high_key, prm%high_comm, ierr)
call MPI_COMM_SPLIT(MPI_COMM_WORLD, low_color, low_key, prm%low_comm, ierr)
```

Note that this code only works if `n_procs` is divisible by `n_procs_low`.
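
If desired, the user can guard against the non-divisible case with a check like the following (illustrative, not required by z-Pares):

```
if ( mod(n_procs, n_procs_low) /= 0 ) then
   if ( myrank == 0 ) write(*,*) 'Error: n_procs must be divisible by n_procs_low'
   call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
end if
```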
Here, let `n_procs_high` and `n_procs_low` be obtained as

```
call MPI_COMM_SIZE(prm%high_comm, n_procs_high, ierr)
call MPI_COMM_SIZE(prm%low_comm, n_procs_low, ierr)
```

When the two-level parallelism is used, the following conditions must be satisfied:

1. `n_procs_high*n_procs_low == n_procs` at all MPI processes
2. The manners of row-distribution of `X`, `mwork` and `cwork` agree within the lower-level communicator
3. The manners in 2. also agree across all processes of the higher-level communicator

In the above descriptions, we have used the subroutines for complex non-Hermitian problems. In z-Pares, efficient implementations are provided for exploiting specific features of the problem. The following features are taken into account:

- Symmetry or hermiticity of matrix \(A\) and \(B\)
- Positive definiteness of \(B\)
- \(B = I\) (Standard eigenvalue problem)

We recommend that the user let z-Pares exploit these features by setting the appropriate parameters to obtain maximum efficiency.