D. M. Mitnik and D. C. Griffin
Department of Physics, Rollins College, Winter Park, FL, USA
and N. R. Badnell
Department of Physics and Applied Physics, University of Strathclyde, Glasgow G4 0NG, UK
August 14, 2001
Three codes have been modified for operation on distributed-memory parallel computers: PSTG3R, PSTGF (the preliminary working version is called PF), and PSTGICF (the preliminary working version is called PIC). These codes have been tested on Cray T3E-900 and IBM-SP supercomputers, and we confine our description to their operation on those machines. However, they should run equally well on a Beowulf (Sun) cluster, and their use on such a machine should be more straightforward, since they can then be run in interactive mode.
The code pstg3r requires installation of the ScaLAPACK library, which is publicly available and can be downloaded from
http://www.netlib.org/scalapack/slug/. In this guide we explain how to use the parallel codes (and, very briefly, how to use the parallel machines) on the two machines mentioned above. The parallel codes (and some auxiliary files) can be downloaded from
http://vanadium.rollins.edu/codes , or from http://vanadium.rollins.edu/~dario/codes .
Use of the RMTRX-I codes on the IBM RS/6000 SP
Supercomputer at NERSC
The best place to find information about the use of the parallel computers at NERSC (and to learn more about parallel computers in general, including MPI, parallelization, batch files, etc.) is the web site of NERSC at
http://hpcf.nersc.gov/. Here we provide the basic information that you will need in order to begin working with the parallel ICFT R-matrix codes.
The IBM SP machine has 16 processors per node. It is recommended to work with a number of processors that is a multiple of 16. It is possible to work with an arbitrary number of processors, but this requires redefining some LoadLeveler keywords beforehand. Remember, you are charged for all 16 CPUs per node, no matter how many of them you are using.
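For example, the following LoadLeveler keywords (taken from the batch scripts shown later in this guide) request 4 nodes with 16 tasks each, i.e. a 64-processor job; you are charged for all 64 CPUs even if your code uses fewer:

#@ tasks_per_node = 16
#@ node = 4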
Since STG1 and STG2 run efficiently as serial codes, there are no parallel versions of them. The codes have comments in several places marking the Fortran statements that are machine dependent, so it is necessary to search for the words SUN, CRAY, and IBM and comment or uncomment the statements according to the machine you are working on. The only modifications required in these two codes are for timing. For the IBM-SP, the timing subroutine is called by:
c.......IBM TIMING
c.......at the beginning of the code:
        timei = rtc()
        .....
        .....
c.......at the end of the code:
        timef = rtc()
        time  = timef - timei
Beware that the IBM-SP does not recognize &END
to signal the end of
data in namelist input; instead it uses a /. Here is an example of an input
file:
S.S. 11-term, 20-level (no CI) R-matrix ICFT calculation of Ne5+ excitation
 &STG1A RELOP='MVD' /
 &STG1B MAXLA=2 MAXLT=14 MAXE=24 MAXC=25 /
When running these codes at NERSC in the interactive mode (non-batch), it is important to remember that there are strict limits on time, number of processors, and file size. Information on these limits can be obtained at:
http://hpcf.nersc.gov/running_jobs/ibm/#Resource. On the IBM-SP, the maximum time allowed for an interactive session is 30 minutes, the maximum number of processors is 64 (4 nodes), and the maximum size for open files is 2 GB. In order to make the maximum amount of memory available to your code, add these options to your compile lines (see http://hpcf.nersc.gov/computers/SP/#about):
-bmaxdata:0x80000000 -bmaxstack:0x10000000
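For example, a link line for pstg3.x might look as follows; the compiler wrapper name (mpxlf90), the optimization flag, and the object-file list are assumptions and should be adapted to your own makefile:

# illustrative link step only; adapt the compiler, flags, and file list
mpxlf90 -O3 -o pstg3.x *.o -bmaxdata:0x80000000 -bmaxstack:0x10000000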
It is better to work in a temporary directory (faster input/output, larger number of processors, and larger working files); the way to change to this directory is with
mkdir /$SCRATCH/namedir
cd /$SCRATCH/namedir
WARNING !! The environment variable $SCRATCH refers to a directory /scratch/scratchdirs/yourusername. $SCRATCH provides 204 GB of disk space and 12,000 inodes. The contents of $SCRATCH may be deleted at any time after the job finishes if the system's disks near capacity. In general, files in $SCRATCH will persist for at least 7 days, but users are "taking chances" by using $SCRATCH for storage after the job finishes and should not rely on it as "semi-permanent" file storage space.
It is good working practice to put copy statements (to your home directory) in the batch script file.
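For instance, a couple of copy statements of this kind at the end of the script (the file and directory names below follow the PSTG3 example later in this guide) ensure that the results survive a scratch purge:

# copy the main results back to a permanent location before the job ends
cp H.DAT $HOME/ne5+180s/.
cp rout3r $HOME/ne5+180s/.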
Your home directory can (and should) always be referred to by the environment variable $HOME. The absolute path to your home directory (e.g., /u2/dmitnik/) may change at any time without notice, but the value of $HOME will always be correct (the same applies to $SCRATCH).
STG1R will normally run well in your home directory in the interactive
mode.
For cases in which STG2 requires more than 30 minutes to run, the job has to be submitted in batch. Here is an example of a batch script that works in your own home directory:
EXAMPLE:
stg2r.batch
#!/usr/bin/csh
#@ job_name = stg2r
#@ output = stg2r.out
#@ error = stg2r.error
#@ job_type = serial
#@ notification = never
#@ class = regular
#
#
#@ wall_clock_limit = 04:00:00
#
#
#@ queue
../rcodes/stg2r.x

In order to submit this batch script file, type
llsubmit stg2r.batch

The status of the job can be monitored (it is also useful to know the batch job number) by typing
llqs | grep yourusername

If for some reason the job has to be cancelled, type
llcancel yourbatchjobnumber
The code PSTG3 uses the subroutine PDSYEVX from the ScaLAPACK library. The second letter, D, in the name indicates that it is a double-precision subroutine. Be sure that your version of the code calls this subroutine, and not the single-precision subroutine PSSYEVX that is used in the CRAY version. Note that unlike STG3R,
PSTG3R does not use STGLIB. The program PSTG3 is the
only parallel program that requires a different input file than the
corresponding serial code. Here is an example of a typical
input file:
EXAMPLE: PSTG3 input file:
dstg3p
S.S. 83-term 180-level R-matrix ICFT calculation of Ne5+ excitation
 &STG3A /
 &STG3B INAST=0 NAST=83 IPWINIT=14 IPWFINAL=25 TOLER=0.01 /
 &matrixdat NB=16 nprow=6 npcol=6 /
  0.0000  3.6921  7.0220  7.7952  8.5603  8.6704  9.1556  9.3993  9.8664 10.0832
 10.2098 10.3887 10.8124 11.0598 11.3925  9.6058  2.9225  9.8890  7.5953  8.4832
  9.7043 10.0176 10.3137  3.2676  8.4499  9.3798  9.7838 10.1002 10.3089 11.3546
  8.4354  9.6703 10.2976  8.6159  9.3053  9.6785  9.9798 10.3679 11.3468  8.3313
 10.2576  2.0958  6.5770  8.3289  8.9499  9.0501 10.2838 10.3977 10.5268 11.2582
  8.0723 10.1930  2.2688  7.9793  8.9943  9.5432 10.0785 10.2051 10.4024 11.2207
  0.9120  8.1504  9.3165 10.1553 10.2385  1.6233  7.4312  8.1987  8.9779  9.2783
  9.6171 10.2315 10.3855 10.3471 11.1345 11.2389  7.9988 10.0704 10.1911 10.1494
 10.3771 10.0179 10.2691

The only difference between the dstg3 and the dstg3p files is the namelist matrixdat.
The meaning of these parameters is described in the ScaLAPACK Users' Guide at http://www.netlib.org/scalapack/slug/. Briefly: NB is the block size used in the block-cyclic distribution of the Hamiltonian matrix over the processors, and nprow and npcol are the numbers of rows and columns of the ScaLAPACK processor grid; the product nprow*npcol must not exceed the number of processors requested for the run.
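For instance (the grid below is illustrative and not taken from the distributed input files), a 64-processor PSTG3 run could use an 8 x 8 processor grid with the same block size:

 &matrixdat NB=16 nprow=8 npcol=8 /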
Remember the limitation of 30 minutes for interactive sessions.
A way to overcome this problem is to run the code specifying a
particular group of partial waves (by using the IPWINIT and IPWFINAL
parameters in the input file). In this mode the final H.DAT file is created, or is appended to if it already exists.
WARNING !! If you are using IPWINIT and IPWFINAL, be sure that you are not repeating partial waves.
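For example (the split point is arbitrary and chosen only for illustration), the 83-term calculation above could be divided into two consecutive runs that together cover partial waves 14 through 25 without overlap:

first run:    &STG3B INAST=0 NAST=83 IPWINIT=14 IPWFINAL=19 TOLER=0.01 /
second run:   &STG3B INAST=0 NAST=83 IPWINIT=20 IPWFINAL=25 TOLER=0.01 /

The second run then appends its partial waves to the H.DAT file created by the first run.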
An alternative way to overcome the time limitation for interactive sessions is to submit a batch file, in this case to the debug queue. Here is an example of a batch script; it also illustrates other features, such as how to retrieve and store files on the HPSS storage system (be sure you have an account there!):
EXAMPLE:
pstg3.batch
#!/usr/bin/csh
#
#@ job_name = pstg3
#@ output = pstg3.out
#@ error = pstg3.error
#@ job_type = parallel
#@ network.MPI = csss,not_shared,us
#@ notification = never
#@ class = debug
#
#
#@ tasks_per_node = 16
#@ node = 6                      !! 96 processors
#@ wall_clock_limit = 00:20:00   !! 20 minutes
#
#@ queue
mkdir /$SCRATCH/stg3
cd /$SCRATCH/stg3
pwd
cp $HOME/ne5+180s/dstg3p .
cp $HOME/ne5+180s/rout2r .
cp $HOME/ne5+180s/AMAT.DAT .
cp $HOME/pcodes/pstg3.x .
echo " copy files OK "
# get STG2H.DAT from hpss
hsi hpss "cd ne5+180s ; get STG2H*.DAT "
# run pstg3
poe ./pstg3.x -procs 96
# copy files to original directory
cp H.DAT $HOME/ne5+180s/.
cp rout3r $HOME/ne5+180s/.
cp time96.dat $HOME/ne5+180s/.
# put H.DAT on hpss
hsi hpss "cd ne5+180s ; put H.DAT"
The working (preliminary) version of this code is called PF.F. At this time, you should not use other versions. A number of options from the serial code have not yet been implemented; some of these will not be implemented in the parallel version, while others will be implemented soon. In general, the code stops (or redefines the input variable and prints a message) when a non-implemented option is given in the input file. In the current version there is a restriction on the number of energy points in the input mesh: MXE has to be an exact multiple of the number of processors. The program will not stop if this requirement is not fulfilled, but it will not calculate all the energy points (an illustrative check is given after the list of output files below). The code generates the following output files:
routf
kmtls.dat000, kmtls.dat001, ... , kmtls.datNNN
OMEGA000, OMEGA001, ... , OMEGANNN
Each file contains the data for the energy points calculated on a particular processor (NNN).
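As an illustration of the MXE restriction mentioned above (the numbers are purely illustrative): with 256 processors, MXE=10000 violates the restriction, since 10000 is not divisible by 256, and some of the requested points will simply not be calculated, whereas MXE=10240 = 40 x 256 is acceptable. A quick csh check before submitting the job might look like this:

# sketch only: warn if the requested mesh size does not divide evenly
# among the MPI tasks (the values of mxe and nproc are illustrative)
set mxe = 10240
set nproc = 256
@ rem = $mxe % $nproc
if ($rem != 0) echo "WARNING: MXE is not a multiple of the processor count"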
A postprocessor is needed if one is interested in generating the OMEGA
file. This program is called OMEGASERIAL, and requires the input file
omegaprints.inp. Here is an example of the omegaprints.inp input
file:
EXAMPLE:
omegaprints.inp
! input for omegaprintserial code !
&filedata coup='ls' nproc=64 ibige=1 /
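A minimal interactive usage sketch, assuming pf.x has already completed in the current directory and that omegaserial.x and omegaprints.inp are present there (as in the batch scripts shown later):

# run the postprocessor; it collects the per-processor OMEGA files
# into a single OMEGA file
./omegaserial.x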
The files can be downloaded from the auxiliary directory on the web page.
In most cases PSTGF requires more than 30 minutes and/or more than 64 processors to run; therefore, the job will have to be submitted in batch. Here is an example of a batch script (also available in the auxiliary directory):
EXAMPLE: pf.batch
#!/usr/bin/csh
#
#@ job_name = pf1
#@ output = pf1.out
#@ error = pf1.error
#@ job_type = parallel
#@ network.MPI = csss,not_shared,us
#@ notification = never
#@ class = regular
#
#
#@ tasks_per_node = 16
#@ node = 16                     !! 256 processors
#@ wall_clock_limit = 00:30:00   !! 30 minutes
#
#@ queue
mkdir /$SCRATCH/pf1
cd /$SCRATCH/pf1
pwd
cp $HOME/ne5cont/pf.x .
cp $HOME/ne5cont/omegaserial.x .
cp $HOME/ne5cont/dstgf dstgf
cp $HOME/ne5cont/omegaprints.inp .
cp $HOME/ne5cont/H.DAT .
echo " copy files OK "
poe ./pf.x -procs 256
# construct the final OMEGA file
./omegaserial.x
# put big files on hpss
hsi hpss "cd ne5+180s/le1 ; put OMEGA ; mput jbinls*; mput kmtls.dat* "
# put files on the HOME directory
cp routf $HOME/ne5cont/routf
cp strength.dat $HOME/ne5cont/.
cp term.dat $HOME/ne5cont/.
The working (preliminary) version of this code is called PIC.F. Do not use other versions of this code. A number of options from the serial code are not implemented; some of them will not be implemented (for example, IMODE=1), and some will be implemented soon (for example, IMODE=-1). In general, the code stops (or redefines the input variable and prints a message) when a non-implemented option is given in the input file. In the current version there is a restriction on the number of energy points in the input mesh: MXE has to be an exact multiple of the number of processors. The program will not stop if this requirement is not fulfilled, but it will not calculate all the energy points.
WARNING !! The code reads the kmtls.datNNN files generated by PF.F; therefore, the same number of processors used in the PF.F run has to be used here.
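If you are unsure how many processors the PF.F run used, a simple check (assuming the kmtls.dat files are in the current directory) is to count them:

# the number of kmtls.dat files equals the number of processors used by PF.F;
# nproc in omegaprints.inp should match this value
ls kmtls.dat* | wc -l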
The code generates the following output files:
routicf
omega000, omega001, ... , omegaNNN
Each file contains the data for the energy points calculated on a particular processor (NNN).
The postprocessor program OMEGASERIAL.F is needed in order to generate
the total omega file.
Here is an example of the omegaprints.inp input
file for a PIC.F run:
EXAMPLE:
omegaprints.inp
! input for omegaprintserial code !
 &filedata coup='ic' nproc=64 ibige=0 /
The files can be downloaded from the
auxiliary directory on the web page.
Here is an example of a batch script (submitted to the premium queue):
EXAMPLE:
pic.batch
#!/usr/bin/csh
#
#@ job_name = pic1
#@ output = pic1.out
#@ error = pic1.error
#@ job_type = parallel
#@ network.MPI = csss,not_shared,us
#@ notification = never
#@ class = premium
#
#
#@ tasks_per_node = 16
#@ node = 16                     !! 256 processors
#@ wall_clock_limit = 01:00:00   !! 1 hour
#
#@ queue
mkdir /$SCRATCH/pic1
cd /$SCRATCH/pic1
pwd
cp $HOME/ne5cont/pic.x .
cp $HOME/ne5cont/omegaserial.x .
cp $HOME/ne5cont/dstgicf .
cp $HOME/ne5cont/omegaprints.inp .
cp $HOME/ne5cont/strength.dat .
cp $HOME/ne5cont/term.dat .
echo " copy files OK "
# get files from hpss
hsi hpss "cd ne5+180s/le1 ; mget jbinls*; mget kmtls.dat* "
echo " files from hpss OK "
poe ./pic.x -procs 256
# construct the final omega file
./omegaserial.x
# put files on hpss
hsi hpss "cd ne5+180s/le1 ; put omega; mput rout*; mput dst* "
# put files on the HOME directory
cp omega $HOME/ne5cont/omega
cp routicf $HOME/ne5cont/routicf
Here is an example of a batch script that combines both PF.F and
PIC.F in the same run:
EXAMPLE:
pfpic.batch
#!/usr/bin/csh
#
#@ job_name = pfpicfle1
#@ output = pfpicfle1.out
#@ error = pfpicfle1.error
#@ job_type = parallel
#@ network.MPI = csss,not_shared,us
#@ notification = never
#@ class = low
#
#
#@ tasks_per_node = 16
#@ node = 4                      !! 64 processors
#@ wall_clock_limit = 04:45:00
#
#@ queue
mkdir /$SCRATCH/pfpic1
cd /$SCRATCH/pfpic1
pwd
cp $HOME/ne5cont/pf.x .
cp $HOME/ne5cont/pic.x .
cp $HOME/ne5cont/omegaserial.x .
cp $HOME/ne5cont/dstgf.le1 dstgf
cp $HOME/ne5cont/dstgicf .
cp $HOME/ne5cont/TCCDW.DAT .
cp $HOME/ne5cont/omegaprintls.inp .
cp $HOME/ne5cont/omegaprintic.inp .
cp $HOME/ne5cont/H.DAT .
echo " copy files OK "
poe ./pf.x -procs 64
cp omegaprintls.inp omegaprints.inp
./omegaserial.x
poe ./pic.x -procs 64
cp omegaprintic.inp omegaprints.inp
./omegaserial.x
# put dstgf on hpss
hsi hpss "cd ne5+180s/le1 ; put omega; put OMEGA; mput rout*; mput dst* "
cp OMEGA $HOME/ne5cont/OMEGA.le1
cp omega $HOME/ne5cont/omega.le1
cp routf $HOME/ne5cont/routf.le1
cp routicf $HOME/ne5cont/routicf.le1
cp adasexj.in.form $HOME/ne5cont/.
cp strength.dat $HOME/ne5cont/.
cp term.dat $HOME/ne5cont/.