D. M. Mitnik and D. C. Griffin
Department of Physics, Rollins College, Winter Park, FL, USA
and N. R. Badnell
Department of Physics and Applied Physics, University of Strathclyde, Glasgow G4 0NG, UK
August 14, 2001
Three codes have been modified for operation on distributed-memory parallel computers: PSTG3R, PSTGF (the preliminary working version is called PF), and PSTGICF (the preliminary working version is called PIC). These codes have been tested on Cray T3E-900 and IBM-SP supercomputers, and we confine our description to their operation on those machines. However, they should run equally well on a Beowulf (Sun) cluster, and their use on such a machine should be more straightforward, since they can then be run in interactive mode.
The code pstg3r requires installation of the ScaLAPACK library, which is publicly available and can be downloaded from
http://www.netlib.org/scalapack/slug/. In this guide we explain how to use the parallel codes (and, very briefly, how to use the parallel machines) on the two machines mentioned above. The parallel codes (and some auxiliary files) can be downloaded from
http://vanadium.rollins.edu/codes , or from http://vanadium.rollins.edu/~dario/codes .
Use of the RMTRX-I codes on the IBM RS/6000 SP
Supercomputer at NERSC
The best place to find information about the use of the parallel computers at NERSC (and to learn more about parallel computers in general, including MPI, parallelization, batch files, etc.) is the web site of NERSC at
http://hpcf.nersc.gov/. Here we provide the basic information that you will need in order to begin working with the parallel ICFT R-matrix codes.
The IBM SP machine has 16 processors per node. It is recommended to work with a number of processors that is a multiple of 16. It is possible to work with an arbitrary number of processors, but this requires redefining some LoadLeveler keywords beforehand. Remember, you are charged for all 16 CPUs per node, no matter how many of them you are using.
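For example, the following LoadLeveler keywords (taken from the batch scripts shown later in this guide) request 4 nodes with 16 tasks each, i.e. a 64-processor job; you are charged for all 64 CPUs even if your code uses fewer:

#@ tasks_per_node = 16
#@ node = 4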
Since STG1 and STG2 run efficiently as serial codes, there are no parallel versions of them. The codes have comments in several places marking the Fortran statements that are machine dependent, so it is necessary to search for the words SUN, CRAY, and IBM and comment or uncomment the statements according to the machine you are working on. The only modifications required in these two codes are for timing. For the IBM-SP, the timing subroutine is called by:
c.......IBM TIMING
c.......at the beginning of the code:
        timei = rtc()
        .....
        .....
c.......at the end of the code:
        timef = rtc()
        time  = timef - timei
Beware that the IBM-SP does not recognize &END
to signal the end of
data in namelist input; instead it uses a /. Here is an example of an input
file:
S.S. 11-term, 20-level (no CI) R-matrix ICFT calculation of Ne5+ excitation
 &STG1A RELOP='MVD' /
 &STG1B MAXLA=2 MAXLT=14 MAXE=24 MAXC=25 /
When running these codes at NERSC in the interactive mode (non-batch), it is important to remember that there are strict limits on time, number of processors, and file size. Information on these limits can be obtained at:
http://hpcf.nersc.gov/running_jobs/ibm/#Resource. On the IBM-SP, the maximum time allowed for an interactive session is 30 minutes, the maximum number of processors is 64 (4 nodes), and the maximum size for open files is 2 GB. In order to make the maximum amount of memory available to your code, add these options to your compile lines (see http://hpcf.nersc.gov/computers/SP/#about):
-bmaxdata:0x80000000 -bmaxstack:0x10000000
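For example, a link line for pstg3.x might look as follows; the compiler wrapper name (mpxlf90), the optimization flag, and the object-file list are assumptions and should be adapted to your own makefile:

# illustrative link step only; adapt the compiler, flags, and file list
mpxlf90 -O3 -o pstg3.x *.o -bmaxdata:0x80000000 -bmaxstack:0x10000000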
It is better to work in a temporary directory (faster input/output, larger number of processors, and larger working files); the way to change to this directory is with
mkdir /$SCRATCH/namedir
cd /$SCRATCH/namedir
WARNING !! The environment variable $SCRATCH refers to a directory /scratch/scratchdirs/yourusername. $SCRATCH provides 204 GB of disk space and 12,000 inodes. The contents of $SCRATCH may be deleted at any time after the job finishes if the system's disks near capacity. In general, files in $SCRATCH will persist for at least 7 days, but users are "taking chances" by using $SCRATCH for storage after the job finishes and should not rely on it as "semi-permanent" file storage space.
It is good working practice to put copy statements (to your home directory) in the batch script file.
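For instance, a couple of copy statements of this kind at the end of the script (the file and directory names below follow the PSTG3 example later in this guide) ensure that the results survive a scratch purge:

# copy the main results back to a permanent location before the job ends
cp H.DAT $HOME/ne5+180s/.
cp rout3r $HOME/ne5+180s/.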
Your home directory can (and should) always be referred to by the environment variable $HOME. The absolute path to your home directory (e.g., /u2/dmitnik/) may change at any time without notice, but the value of $HOME will always be correct (the same applies to $SCRATCH).
STG1R will normally run well in your home directory in the interactive
mode.
For cases in which STG2 requires more than 30 minutes to run, the job has to be submitted in batch. Here is an example of a batch script that works in your own home directory:
EXAMPLE:
stg2r.batch
#!/usr/bin/csh
#@ job_name = stg2r
#@ output = stg2r.out
#@ error = stg2r.error
#@ job_type = serial
#@ notification = never
#@ class = regular
#
#
#@ wall_clock_limit = 04:00:00
#
#
#@ queue
../rcodes/stg2r.x

In order to submit this batch script file, type
llsubmit stg2r.batch

The status of the job can be monitored (it is also useful to know the batch job number) by typing
llqs | grep yourusername

If for some reason the job has to be cancelled, type
llcancel yourbatchjobnumber
The code PSTG3 uses the subroutine PDSYEVX from the ScaLAPACK library. The second letter, D, in the name indicates that it is a double-precision subroutine. Be sure that your version of the code calls this subroutine, and not the single-precision subroutine PSSYEVX that is used in the CRAY version. Note that unlike STG3R,
PSTG3R does not use STGLIB. The program PSTG3 is the
only parallel program that requires a different input file than the
corresponding serial code. Here is an example of a typical
input file:
EXAMPLE: PSTG3 input file:
dstg3p
S.S. 83-term 180-level R-matrix ICFT calculation of Ne5+ excitation
 &STG3A /
 &STG3B INAST=0 NAST=83 IPWINIT=14 IPWFINAL=25 TOLER=0.01 /
 &matrixdat NB=16 nprow=6 npcol=6 /
  0.0000  3.6921  7.0220  7.7952  8.5603  8.6704  9.1556  9.3993  9.8664 10.0832
 10.2098 10.3887 10.8124 11.0598 11.3925  9.6058  2.9225  9.8890  7.5953  8.4832
  9.7043 10.0176 10.3137  3.2676  8.4499  9.3798  9.7838 10.1002 10.3089 11.3546
  8.4354  9.6703 10.2976  8.6159  9.3053  9.6785  9.9798 10.3679 11.3468  8.3313
 10.2576  2.0958  6.5770  8.3289  8.9499  9.0501 10.2838 10.3977 10.5268 11.2582
  8.0723 10.1930  2.2688  7.9793  8.9943  9.5432 10.0785 10.2051 10.4024 11.2207
  0.9120  8.1504  9.3165 10.1553 10.2385  1.6233  7.4312  8.1987  8.9779  9.2783
  9.6171 10.2315 10.3855 10.3471 11.1345 11.2389  7.9988 10.0704 10.1911 10.1494
 10.3771 10.0179 10.2691

The only difference between the dstg3 and the dstg3p files is the namelist matrixdat.
The meaning of these parameters is described in the ScaLAPACK Users' Guide at http://www.netlib.org/scalapack/slug/. Briefly: NB is the block size used in the block-cyclic distribution of the Hamiltonian matrix over the processors, and nprow and npcol are the numbers of rows and columns of the ScaLAPACK processor grid; the product nprow*npcol must not exceed the number of processors requested for the run.
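For instance (the grid below is illustrative and not taken from the distributed input files), a 64-processor PSTG3 run could use an 8 x 8 processor grid with the same block size:

 &matrixdat NB=16 nprow=8 npcol=8 /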
Remember the limitation of 30 minutes for interactive sessions.
A way to overcome this problem is to run the code specifying a
particular group of partial waves (by using the IPWINIT and IPWFINAL
parameters in the input file). In this mode the final H.DAT file is created, or is appended to if it already exists.
WARNING !! If you are using IPWINIT and IPWFINAL, be sure that you are not repeating partial waves.
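For example (the split point is arbitrary and chosen only for illustration), the 83-term calculation above could be divided into two consecutive runs that together cover partial waves 14 through 25 without overlap:

first run:    &STG3B INAST=0 NAST=83 IPWINIT=14 IPWFINAL=19 TOLER=0.01 /
second run:   &STG3B INAST=0 NAST=83 IPWINIT=20 IPWFINAL=25 TOLER=0.01 /

The second run then appends its partial waves to the H.DAT file created by the first run.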
An alternative way to overcome the time limitation for interactive sessions is to submit a batch file, in this case to the debug queue. Here is an example of a batch script; it also illustrates other features, such as how to retrieve and store files on the HPSS storage system (be sure you have an account there!):
EXAMPLE:
pstg3.batch
#!/usr/bin/csh
#
#@ job_name = pstg3
#@ output = pstg3.out
#@ error = pstg3.error
#@ job_type = parallel
#@ network.MPI = csss,not_shared,us
#@ notification = never
#@ class = debug
#
#
#@ tasks_per_node = 16
#@ node = 6                      !! 96 processors
#@ wall_clock_limit = 00:20:00   !! 20 minutes
#
#@ queue
mkdir /$SCRATCH/stg3
cd /$SCRATCH/stg3
pwd
cp $HOME/ne5+180s/dstg3p .
cp $HOME/ne5+180s/rout2r .
cp $HOME/ne5+180s/AMAT.DAT .
cp $HOME/pcodes/pstg3.x .
echo " copy files OK "
# get STG2H.DAT from hpss
hsi hpss "cd ne5+180s ; get STG2H*.DAT "
# run pstg3
poe ./pstg3.x -procs 96
# copy files to original directory
cp H.DAT $HOME/ne5+180s/.
cp rout3r $HOME/ne5+180s/.
cp time96.dat $HOME/ne5+180s/.
# put H.DAT on hpss
hsi hpss "cd ne5+180s ; put H.DAT"
The working (preliminary) version of this code is called PF.F. At this time, you should not use other versions. A number of options from the serial code have not yet been implemented; some of these will not be implemented in the parallel version, while others will be implemented soon. In general, the code stops (or redefines the input variable and prints a message) when a non-implemented option is given in the input file. In the current version there is a restriction on the number of energy points in the input mesh: MXE has to be an exact multiple of the number of processors. The program will not stop if this requirement is not fulfilled, but it will not calculate all the energy points (an illustrative check is given after the list of output files below). The code generates the following output files:
routf
kmtls.dat000, kmtls.dat001, ... , kmtls.datNNN
OMEGA000, OMEGA001, ... , OMEGANNN
Each file contains the data for the energy points calculated on a particular processor (NNN).
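As an illustration of the MXE restriction mentioned above (the numbers are purely illustrative): with 256 processors, MXE=10000 violates the restriction, since 10000 is not divisible by 256, and some of the requested points will simply not be calculated, whereas MXE=10240 = 40 x 256 is acceptable. A quick csh check before submitting the job might look like this:

# sketch only: warn if the requested mesh size does not divide evenly
# among the MPI tasks (the values of mxe and nproc are illustrative)
set mxe = 10240
set nproc = 256
@ rem = $mxe % $nproc
if ($rem != 0) echo "WARNING: MXE is not a multiple of the processor count"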
A postprocessor is needed if one is interested in generating the OMEGA
file. This program is called OMEGASERIAL, and requires the input file
omegaprints.inp. Here is an example of the omegaprints.inp input
file:
EXAMPLE:
omegaprints.inp
! input for omegaprintserial code !
&filedata coup='ls' nproc=64 ibige=1 /
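A minimal interactive usage sketch, assuming pf.x has already completed in the current directory and that omegaserial.x and omegaprints.inp are present there (as in the batch scripts shown later):

# run the postprocessor; it collects the per-processor OMEGA files
# into a single OMEGA file
./omegaserial.x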
The files can be downloaded from the auxiliary directory on the web page.
In most cases PSTGF requires more than 30 minutes and/or more than 64 processors to run; therefore, the job will have to be submitted in batch. Here is an example of a batch script (also available in the auxiliary directory):
EXAMPLE: pf.batch
#!/usr/bin/csh
#
#@ job_name = pf1
#@ output = pf1.out
#@ error = pf1.error
#@ job_type = parallel
#@ network.MPI = csss,not_shared,us
#@ notification = never
#@ class = regular
#
#
#@ tasks_per_node = 16
#@ node = 16                     !! 256 processors
#@ wall_clock_limit = 00:30:00   !! 30 minutes
#
#@ queue
mkdir /$SCRATCH/pf1
cd /$SCRATCH/pf1
pwd
cp $HOME/ne5cont/pf.x .
cp $HOME/ne5cont/omegaserial.x .
cp $HOME/ne5cont/dstgf dstgf
cp $HOME/ne5cont/omegaprints.inp .
cp $HOME/ne5cont/H.DAT .
echo " copy files OK "
poe ./pf.x -procs 256
# construct the final OMEGA file
./omegaserial.x
# put big files on hpss
hsi hpss "cd ne5+180s/le1 ; put OMEGA ; mput jbinls*; mput kmtls.dat* "
# put files on the HOME directory
cp routf $HOME/ne5cont/routf
cp strength.dat $HOME/ne5cont/.
cp term.dat $HOME/ne5cont/.
The working (preliminary) version of this code is called PIC.F. Do not use other versions of this code. A number of options from the serial code are not implemented; some of them will not be implemented (for example, IMODE=1), and some will be implemented soon (for example, IMODE=-1). In general, the code stops (or redefines the input variable and prints a message) when a non-implemented option is given in the input file. In the current version there is a restriction on the number of energy points in the input mesh: MXE has to be an exact multiple of the number of processors. The program will not stop if this requirement is not fulfilled, but it will not calculate all the energy points.
WARNING !! The code reads the kmtls.datNNN files generated by PF.F; therefore, the same number of processors used in the PF.F run has to be used here.
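If you are unsure how many processors the PF.F run used, a simple check (assuming the kmtls.dat files are in the current directory) is to count them:

# the number of kmtls.dat files equals the number of processors used by PF.F;
# nproc in omegaprints.inp should match this value
ls kmtls.dat* | wc -l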
The code generates the following output files:
routicf
omega000, omega001, ... , omegaNNN
Each file contains the data for the energy points calculated on a particular processor (NNN).
The postprocessor program OMEGASERIAL.F is needed in order to generate
the total omega file.
Here is an example of the omegaprints.inp input
file for a PIC.F run:
EXAMPLE:
omegaprints.inp
! input for omegaprintserial code !
 &filedata coup='ic' nproc=64 ibige=0 /
The files can be downloaded from the
auxiliary directory on the web page.
Here is an example of a batch script (submitted to the premium queue):
EXAMPLE:
pic.batch
#!/usr/bin/csh
#
#@ job_name = pic1
#@ output = pic1.out
#@ error = pic1.error
#@ job_type = parallel
#@ network.MPI = csss,not_shared,us
#@ notification = never
#@ class = premium
#
#
#@ tasks_per_node = 16
#@ node = 16                     !! 256 processors
#@ wall_clock_limit = 01:00:00   !! 1 hour
#
#@ queue
mkdir /$SCRATCH/pic1
cd /$SCRATCH/pic1
pwd
cp $HOME/ne5cont/pic.x .
cp $HOME/ne5cont/omegaserial.x .
cp $HOME/ne5cont/dstgicf .
cp $HOME/ne5cont/omegaprints.inp .
cp $HOME/ne5cont/strength.dat .
cp $HOME/ne5cont/term.dat .
echo " copy files OK "
# get files from hpss
hsi hpss "cd ne5+180s/le1 ; mget jbinls*; mget kmtls.dat* "
echo " files from hpss OK "
poe ./pic.x -procs 256
# construct the final omega file
./omegaserial.x
# put files on hpss
hsi hpss "cd ne5+180s/le1 ; put omega; mput rout*; mput dst* "
# put files on the HOME directory
cp omega $HOME/ne5cont/omega
cp routicf $HOME/ne5cont/routicf
Here is an example of a batch script that combines both PF.F and
PIC.F in the same run:
EXAMPLE:
pfpic.batch
#!/usr/bin/csh
#
#@ job_name = pfpicfle1
#@ output = pfpicfle1.out
#@ error = pfpicfle1.error
#@ job_type = parallel
#@ network.MPI = csss,not_shared,us
#@ notification = never
#@ class = low
#
#
#@ tasks_per_node = 16
#@ node = 4                      !! 64 processors
#@ wall_clock_limit = 04:45:00
#
#@ queue
mkdir /$SCRATCH/pfpic1
cd /$SCRATCH/pfpic1
pwd
cp $HOME/ne5cont/pf.x .
cp $HOME/ne5cont/pic.x .
cp $HOME/ne5cont/omegaserial.x .
cp $HOME/ne5cont/dstgf.le1 dstgf
cp $HOME/ne5cont/dstgicf .
cp $HOME/ne5cont/TCCDW.DAT .
cp $HOME/ne5cont/omegaprintls.inp .
cp $HOME/ne5cont/omegaprintic.inp .
cp $HOME/ne5cont/H.DAT .
echo " copy files OK "
poe ./pf.x -procs 64
cp omegaprintls.inp omegaprints.inp
./omegaserial.x
poe ./pic.x -procs 64
cp omegaprintic.inp omegaprints.inp
./omegaserial.x
# put dstgf on hpss
hsi hpss "cd ne5+180s/le1 ; put omega; put OMEGA; mput rout*; mput dst* "
cp OMEGA $HOME/ne5cont/OMEGA.le1
cp omega $HOME/ne5cont/omega.le1
cp routf $HOME/ne5cont/routf.le1
cp routicf $HOME/ne5cont/routicf.le1
cp adasexj.in.form $HOME/ne5cont/.
cp strength.dat $HOME/ne5cont/.
cp term.dat $HOME/ne5cont/.