ADT Gordon

Table of Contents

1 Homepage

A set of notebooks for reproducible experiments validating the ADT Gordon application and algorithms, i.e. the Diodon-FMR-Chameleon-StarPU-NewMadeleine software stack; see the main web page.

The repository consists of several subdirectories describing how to install and use the different components (from low-level to high-level layers) of the software stack:

  1. New Madeleine: Communication library
  2. StarPU: Runtime System library
  3. Chameleon: Parallel Dense Linear Algebra library
  4. FMR: Random SVD library
  5. Diodon: Multidimensional Scaling library

The application can be deployed on new machines in two ways:

  1. by using a package manager, GNU Guix Non Free, which provides the Gordon packages
  2. by downloading, compiling, and installing each component from sources

The advantage of using Guix is that the installation is guaranteed to work on compatible machines where Guix is installed, e.g. PlaFRIM. In addition, Guix can build Docker and Singularity images, which makes it possible to deploy the software stack on any machine where Docker or Singularity is installed, e.g. MCIA CURTA.
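
Both kinds of images are produced with the guix pack command; a minimal sketch (the complete command lines are given in sections 2.4 and 2.5):

guix pack -f docker diodon-mpi    # Docker image tarball, loadable with "docker load"
guix pack -f squashfs diodon-mpi  # squashfs image runnable with Singularity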

To install Guix and the Gordon packages on a new system where root access is available, please refer to the section Guix packaging.

To get details about the installation of each component of the Gordon packages, refer to the section Installation guides.

2 Guix packaging

2.1 Installing Guix

Guix requires a running GNU/Linux system, GNU tar and Xz.

gpg --keyserver pgp.mit.edu --recv-keys 3CE464558A84FDC69DB40CFB090B11993D9AEBB5
wget https://git.savannah.gnu.org/cgit/guix.git/plain/etc/guix-install.sh
chmod +x guix-install.sh
sudo ./guix-install.sh

2.2 Add Guix HPC Non Free to get additional Gordon packages

The required Gordon packages are not official Guix packages (some components, namely Diodon and FMR, are not ready for an open source distribution). It is therefore necessary to add a channel providing the Gordon packages. Create a ~/.config/guix/channels.scm file with the following snippet:

(cons (channel
    (name 'guix-hpc-non-free)
    (url "https://gitlab.inria.fr/guix-hpc/guix-hpc-non-free.git"))
  %default-channels)

Update the Guix package definitions:

guix pull

Put the new guix first in the PATH:

PATH="$HOME/.config/guix/current/bin${PATH:+:}$PATH"
hash guix

For future shell sessions, add this to the ~/.bash_profile file:

export PATH="$HOME/.config/guix/current/bin${PATH:+:}$PATH"
export GUIX_LOCPATH="$HOME/.guix-profile/lib/locale"

The Gordon packages are now available:

guix search ^nmad
guix search ^starpu
guix search ^chameleon
guix search ^fmr
guix search ^diodon

2.3 Install Gordon packages with Guix

Open source projects such as NewMadeleine, StarPU, and Chameleon can be installed directly:

# to install NewMadeleine
guix install nmad hwloc rdma-core
# to install StarPU with NewMadeleine
guix install starpu --with-input=openmpi=nmad
# to install Chameleon (git master) with Intel MKL with StarPU with NewMadeleine
guix install chameleon --with-branch=chameleon=master --with-input=openblas=mkl --with-input=openmpi=nmad

By default the installed version is the latest identified release, OpenMPI is used as the communication library, and OpenBLAS provides the dense linear algebra kernels. To change this default behaviour we use the --with-input and --with-branch options. Notice that the --with-commit and --with-git-url options can also be used to change the sources and versions. In the last example we substitute NewMadeleine for OpenMPI and Intel MKL for OpenBLAS. In addition we use the master git branch of Chameleon to ensure that the functionality required by Diodon (such as the Gram algorithm) is available.
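
As a sketch of these two options, one could hypothetically pin Chameleon to a given commit or fetch it from an explicit git URL (the hash below is the one reused in section 3.3.3, and the URL is Chameleon's public repository):

# pin chameleon to a specific commit (illustrative hash)
guix install chameleon --with-commit=chameleon=b31d7575fb7d9c0e1ba2d8ec633e16cb83778e8b
# fetch chameleon from an explicit git URL (illustrative)
guix install chameleon --with-git-url=chameleon=https://gitlab.inria.fr/solverstack/chameleon.git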

To check what is installed and available in the current profile

guix package -I

Consider setting the necessary environment variables by running

GUIX_PROFILE="$HOME/.guix-profile"
. "$GUIX_PROFILE/etc/profile"

Check that it works

NMAD=`guix package -I nmad |awk '{print $4}'`
STARPU=`guix package -I starpu |awk '{print $4}'`
CHAMELEON=`guix package -I chameleon |awk '{print $4}'`
$NMAD/bin/mpiexec -np 4 $STARPU/lib/starpu/mpi/comm
$NMAD/bin/mpiexec -np 4 $CHAMELEON/bin/new-testing/snew-testing -o gemm -m 6400 -n 6400 -k 6400 -t 2 -P 2

FMR and Diodon are not open source projects and are not freely available. Access to their git repositories must be granted first; see FMR and Diodon.

Then install FMR and Diodon

guix install diodon fmr

To install Diodon and FMR with Chameleon (git master), Intel MKL, StarPU, and NewMadeleine:

guix install diodon-mpi fmr-mpi --with-branch=chameleon-mkl-mt=master --with-input=openmpi=nmad

Check that it works

NMAD=`guix package -I nmad |awk '{print $4}'`
DIODON=`guix package -I diodon-mpi |awk '{print $4}'`
$NMAD/bin/mpiexec -np 4 $DIODON/bin/mdsDriver -t 2 -gen 1 -rs 6400 -rk 3200 -os 0 -osvd 1

To remove packages

guix remove diodon fmr diodon-mpi fmr-mpi chameleon starpu nmad hwloc rdma-core

2.4 Singularity packaging of Diodon plus dependencies

To package Diodon and its dependencies within a Singularity image (OpenMPI stack):

singularity_diodon=`guix pack -f squashfs diodon-mpi fmr-mpi chameleon-mkl-mt mkl starpu hwloc openmpi openssh slurm hdf5 zlib bash coreutils inetutils util-linux procps git grep tar sed gzip which gawk perl emacs-minimal vim gcc-toolchain make cmake pkg-config -S /bin=bin --entry-point=/bin/bash`
cp $singularity_diodon diodon-pack.gz.squashfs
scp diodon-pack.gz.squashfs plafrim:

On a machine where Singularity is installed, Diodon can then be run as follows:

mpiexec -np 2 singularity exec diodon-pack.gz.squashfs mdsDriver -t 2 -gen 1 -rs 640 -rk 320 -os 0 -osvd 1

2.5 Docker packaging of Diodon plus dependencies

To package Diodon and its dependencies within a Docker image (OpenMPI stack):

docker_diodon=`guix pack -f docker diodon-mpi fmr-mpi chameleon-mkl-mt mkl starpu hwloc openmpi openssh slurm hdf5 zlib bash coreutils inetutils util-linux procps git grep tar sed gzip which gawk perl emacs-minimal vim gcc-toolchain make cmake pkg-config -S /bin=bin --entry-point=/bin/bash`
# Load the generated tarball as a docker image
docker_diodon_tag=`docker load --input $docker_diodon | grep "Loaded image: " | cut -d " " -f 3-`
# Rename the tag: check the generated image name with the "docker images" command, then retag it with a simpler name
docker tag $docker_diodon_tag guix/diodon-tmp

Create a Dockerfile inheriting from the image (tagged guix/diodon-tmp above):

FROM guix/diodon-tmp

# Create a directory for user 1000
RUN mkdir -p /builds
RUN chown -R 1000 /builds

ENTRYPOINT ["/bin/bash", "-l"]

# Enter the image as user 1000 in /builds
USER 1000
WORKDIR /builds
ENV HOME /builds

Then create the final Docker image from this Dockerfile.

docker build -t guix/diodon .

Test the image

docker run -it guix/diodon
# test starpu
STARPU=`pkg-config --variable=prefix libstarpu`
mpiexec -np 4 $STARPU/lib/starpu/mpi/comm
# test chameleon
CHAMELEON=`pkg-config --variable=prefix chameleon`
mpiexec -np 2 $CHAMELEON/bin/new-testing/snew-testing -H -o gemm -P 2 -t 2 -m 2000 -n 2000 -k 2000
# test diodon
mpiexec -np 4 mdsDriver -t 2 -gen 1 -rs 6400 -rk 3200 -os 0 -osvd 1

2.6 Save Guix state

Save the exact commit states of Guix and Guix HPC Non Free:

guix describe -f channels > channels-date.scm

This file can be used later to restore Guix to that saved state:

guix pull -C channels-date.scm

3 Installation guides

3.1 New Madeleine

3.1.1 Dependencies

  • autoconf (v 2.50 or later)
  • pkg-config
  • hwloc (optional, recommended)
  • libexpat XML parser
  • ibverbs/OFED for IB support (set $IBHOME if not installed in /usr)
  • MX for Myrinet support (set $MX_DIR if not installed in /usr)

3.1.2 Installation

To build all modules required by nmad, an installation script can be used with 2 different configuration files:

  • madmpi.conf: classical installation of nmad.
  • madmpi-mini.conf: install nmad without pioman.

git clone https://gitlab.inria.fr/pm2/pm2.git
cd pm2/scripts
./pm2-build-packages ./madmpi.conf --purge --prefix=$HOME/soft/x86_64

The ./madmpi-debug.conf file can be used to get debug symbols.
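
For example, a debug build would presumably follow the same pattern as the command above, only swapping the configuration file:

./pm2-build-packages ./madmpi-debug.conf --purge --prefix=$HOME/soft/x86_64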

3.1.3 Installation with Guix

Guix and Guix-HPC should already be installed at this point; see the Guix packaging section.

Thanks to the Guix-HPC packages developed here, we can set up an environment containing nmad and some of its dependencies. All the PM2 dependencies needed by nmad are installed alongside it, so the user does not need to take care of them. Some optional dependencies must be explicitly requested (hwloc, rdma-core, psm or psm2).

guix install nmad hwloc rdma-core

An nmad-mini package is also available.

3.2 StarPU

StarPU with nmad comes in 2 flavours:

  • starpu-nmad: StarPU uses the nmad native interface (nmad installed with madmpi.conf). In this case, nmad is in charge of communication progress through pioman.
  • starpu-madmpi: StarPU uses nmad as a classical MPI implementation (nmad installed with madmpi-mini.conf), and StarPU itself is in charge of communication progress.

3.2.1 Install StarPU

git clone https://scm.gforge.inria.fr/anonscm/git/starpu/starpu.git
cd starpu
./autogen.sh
  1. Starpu-madmpi
    module load ~/path/to/nmad/share/modules/pm2/madmpi-mini
    ./configure --prefix=$HOME/soft/x86_64/starpu_madmpi --enable-blas-lib=none --disable-mlr --disable-cuda --disable-opencl
    make -j5 install
    
  2. Starpu-nmad
    module load ~/path/to/nmad/share/modules/pm2/madmpi
    ./configure --prefix=$HOME/soft/x86_64/starpu_nmad --enable-nmad --enable-blas-lib=none --disable-mlr --disable-cuda --disable-opencl
    make -j5 install
    

    The --enable-debug flag can be used to get debug symbols.
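
    For instance, a hypothetical debug build of starpu-nmad would simply append the flag to the configure line above:

    ./configure --prefix=$HOME/soft/x86_64/starpu_nmad --enable-nmad --enable-debug --enable-blas-lib=none --disable-mlr --disable-cuda --disable-opencl
    make -j5 install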

3.2.2 Install StarPU with Guix

The following line sets up an environment where starpu and nmad are installed and ready to use; see section 2.

guix install starpu --with-input=openmpi=nmad nmad hwloc rdma-core

A starpu-madmpi package is also available with Guix:

guix install starpu --with-input=openmpi=nmad-mini nmad-mini hwloc rdma-core

3.3 Chameleon

3.3.1 Dependencies

  • git
  • gcc and gfortran
  • cmake
  • pkg-config
  • openmpi or new madeleine
  • starpu
  • netlib or openblas or mkl (intel)

Example of Debian packages:

sudo apt-get update
sudo apt-get install -y git build-essential cmake pkg-config
sudo apt-get install -y libopenmpi-dev
sudo apt-get install -y liblapacke-dev

For StarPU, see section 3.2.

3.3.2 Install Chameleon

git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
cd chameleon
mkdir -p build
cd build
cmake .. -DCHAMELEON_USE_MPI=ON -DBUILD_SHARED_LIBS=ON -DCMAKE_INSTALL_PREFIX=$HOME/soft/x86_64
make -j5 install

3.3.3 Install Chameleon with Guix

The following line sets up an environment where chameleon is installed and ready to use; see section 2.

Standard Chameleon, latest release:

guix install chameleon

Notice that several build variants exist:

  • chameleon (default) : with starpu - with mpi
  • chameleon-mkl-mt : with starpu - with mpi - with multithreaded MKL
  • chameleon-cuda : with starpu - with mpi - with cuda
  • chameleon-fxt : with starpu - with mpi - with fxt
  • chameleon-simgrid : with starpu - with mpi - with simgrid
  • chameleon-openmp : with openmp - without mpi
  • chameleon-parsec : with parsec - without mpi
  • chameleon-quark : with quark - without mpi
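
A variant is installed like the default package, for instance (taking the multithreaded MKL variant used elsewhere in this document):

guix install chameleon-mkl-mt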

Change the version

guix install chameleon --with-branch=chameleon=master
guix install chameleon --with-commit=chameleon=b31d7575fb7d9c0e1ba2d8ec633e16cb83778e8b

Change some dependencies

guix install chameleon --with-input=openblas=mkl --with-input=openmpi=nmad

3.4 FMR and Diodon

3.4.1 Dependencies

  • git
  • gcc and gfortran
  • cmake
  • mkl (Intel)
  • zlib
  • hdf5
  • openmpi or new madeleine
  • starpu
  • chameleon

Example of Debian packages:

sudo apt-get update
sudo apt-get install -y git build-essential cmake pkg-config
sudo apt-get install -y zlib1g-dev libhdf5-dev
sudo apt-get install -y libopenmpi-dev
sudo apt-get install -y liblapacke-dev

For the other components, see sections 3.1, 3.2, and 3.3.

3.4.2 Install FMR and Diodon

Access to the following git repositories (not public) is required for the steps below:

HOME_DIR=$PWD
git clone --recursive -b diodon git@gitlab.inria.fr:piblanch/fmr.git # contains submodule
cd fmr
mkdir -p build
cd build
cmake .. -DFMR_BUILD_TESTS=ON -DFMR_USE_HDF5=ON -DFMR_USE_CHAMELEON=ON -DCMAKE_INSTALL_PREFIX=$HOME/soft/x86_64
make -j5
sudo make install
cd $HOME_DIR
git clone git@gitlab.inria.fr:afranc/diodon.git
cd diodon/cpp
mkdir build
cd build
cmake .. -DDIODON_USE_CHAMELEON=ON -DCMAKE_INSTALL_PREFIX=$HOME/soft/x86_64
make -j5
./demos/mdsDriver -t 2 -gen 1 -rs 640 -rk 320 -os 0 -osvd 1

The -DCMAKE_BUILD_TYPE=Debug flag can be used to build with debug symbols.
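
For example, a hypothetical debug configuration of Diodon would add the flag to the cmake line above:

cmake .. -DDIODON_USE_CHAMELEON=ON -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=$HOME/soft/x86_64
make -j5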

3.4.3 Install Diodon with Guix

OpenMPI stack

guix install diodon-mpi

Nmad stack

guix install diodon-mpi --with-input=openmpi=nmad

Check that it works:

NMAD=`guix package -I nmad |awk '{print $4}'`
$NMAD/bin/mpiexec -np 2 mdsDriver -t 2 -gen 1 -rs 640 -rk 320 -os 0

To set up a "pure" environment to test the stack (OpenMPI) with uncommitted changes in Diodon and FMR, for example:

export DIODON=/home/.../git/diodon
export FMR=/home/.../git/fmr
guix environment --pure --ad-hoc diodon-mpi --with-source=diodon-mpi=$DIODON fmr-mpi --with-source=fmr-mpi=$FMR chameleon-mkl-mt --with-branch=chameleon-mkl-mt=master openmpi openssh -- /bin/bash --norc

4 Meetings

4.1 2020-09-24

4.1.1 GEMM and QR reference performance on Occigen

Kernel (Intel MKL) and GEMM algorithm (Chameleon) performance in Gflop/s:

         Kernel NT=1   Algo NT=22
Single   64            1345
Double   32            670

Kernel (Intel MKL) and QR algorithm (Chameleon) performance in Gflop/s:

         Kernel NT=1   Algo NT=22
Single   45            960
Double   22            480

4.1.2 Single precision

UC READH5 MAPH5 MIGRATE BUILDMAT GRAM GEMM QR SVD RSVD FROB POST WRITEH5 SAVE Total
L6 (99594-1) 488 24 1 517 7 150 19 197 621 6 11 21 31 1202
L2L3L6 (270983-6) 506 23 89 622 17 176 9 193 608 15 40 20 45 1370
Lodd (426548-16) 393 20 79 497 32 173 6 196 594 27 77 17 53 1329
Leven (616644-32) 414 22 110 550 57 186 5 199 617 47 118 39 99 1576
Lall (1043192-125) 301 15 226 548 152 142 5 199 540 112 175 48 124 1927

reunions/mds_occigen_simple_ts320_r10000.pdf

mds_occigen_simple_ts320_r10000.png

UC Size MPI NP Perf GEMM Eff GEMM 1 Eff GEMM 2 Perf QR Eff QR 1 Eff QR 2
L6 (99594-1) 99594 1 1321 0.98215613 0.86002604 980 1.0208333 0.90740741
L2L3L6 (270983-6) 270983 6 8340 1.0334572 0.90494792 5929 1.0293403 0.91496914
Lodd (426548-16) 426548 16 21019 0.97671933 0.85526530 14294 0.93059896 0.82719907
Leven (616644-32) 616644 32 41172 0.95659851 0.83764648 25638 0.83457031 0.74184028
Lall (1043192-125) 1043192 125 158181 0.94085353 0.82385938 41972 0.34976667 0.31090370

reunions/mds_occigen_simple_ts320_r10000_perf_gemm_qr.pdf

mds_occigen_simple_ts320_r10000_perf_gemm_qr.png

UC Size MPI NP Read Read speed Write Write speed
L6 (99594-1) 99594 1 488 38.768287 21 180.91474
L2L3L6 (270983-6) 270983 6 506 276.79845 20 516.85905
Lodd (426548-16) 426548 16 393 883.02572 17 957.14793
Leven (616644-32) 616644 32 414 1751.8574 39 603.15646
Lall (1043192-125) 1043192 125 301 6895.9179 48 829.05451

reunions/mds_occigen_simple_ts320_r10000_perf_read_write.pdf

mds_occigen_simple_ts320_r10000_perf_read_write.png

4.1.3 Double precision

UC READH5 MAPH5 MIGRATE BUILDMAT GRAM GEMM QR SVD RSVD FROB POST WRITEH5 SAVE Total
L6 (99594-2) 249 15 25 292 5 150 19 276 694 4 28 17 32 1061
L2L3L6 (270983-12) 233 18 94 349 15 187 9 314 758 12 56 29 69 1279
Lodd (426548-32) 203 15 99 321 27 177 5 341 752 22 133 40 98 1399
Leven (616644-75) 186 13 79 283 55 162 5 344 732 39 192 79 168 1564
Lall (1043192-250) 166 11 224 409 187 152 6 346 753 106 281 169 313 2525

reunions/mds_occigen_double_ts320_r10000.pdf

mds_occigen_double_ts320_r10000.png

UC Size MPI NP Perf GEMM Eff GEMM 1 Eff GEMM 2 Perf QR Eff QR 1 Eff QR 2
L6 (99594-2) 99594 2 1324 0.98805970 0.86197917 998 1.0395833 0.94507576
L2L3L6 (270983-12) 270983 12 7921 0.98519900 0.85948351 5759 0.99982639 0.90893308
Lodd (426548-32) 426548 32 20775 0.96898321 0.84533691 14751 0.96035156 0.87304688
Leven (616644-75) 616644 75 47484 0.94495522 0.824375 30269 0.84080556 0.76436869
Lall (1043192-250) 1043192 250 144456 0.86242388 0.752375 33905 0.28254167 0.25685606

reunions/mds_occigen_double_ts320_r10000_perf_gemm_qr.pdf

mds_occigen_double_ts320_r10000_perf_gemm_qr.png

UC Size MPI NP Read Read speed Write Write speed
L6 (99594-2) 99594 2 249 151.95923 17 446.96583
L2L3L6 (270983-12) 270983 12 233 1202.2319 29 712.90904
Lodd (426548-32) 426548 32 203 3419.0060 40 813.57574
Leven (616644-75) 616644 75 186 7798.5912 79 595.52156
Lall (1043192-250) 1043192 250 166 25008.088 169 470.94221

reunions/mds_occigen_double_ts320_r10000_perf_read_write.pdf

mds_occigen_double_ts320_r10000_perf_read_write.png

4.2 2020-09-02

4.2.1 Progress

  1. Writing the point cloud X (N, K) without centralization

    HDF5 write performed by process 0, centralizing contiguous LAPACK row blocks one after the other.

  2. Parallel HDF5 write of X

    The HDF5 interface called in Diodon was switched from C++ to C, because H5Cpp is not compatible with parallel HDF5.

  3. Developments in Chameleon
    • a multi-runtime build
    • rank-k random matrix generation
    • simgrid perfmodels
  4. Performance summary
    UC Size MPI NP BUILDMAT PRE RSVD POST SAVE Total
    L6 (99594-1) 99594 1 618 7 602 10 38 1275
    L2L3L6 (270983-6) 270983 6 706 19 592 55 58 1430
    Lodd (426548-16) 426548 16 715 34 587 118 94 1548
    Leven (616644-32) 616644 32 950 54 612 194 152 1962
    Lall (1043192-125) 1043192 125 595 152 552 229 488 2016

    reunions/mds_occigen_simple_gemm_ts320_r10000.pdf

    mds_occigen_simple_gemm_ts320_r10000.png

    UC Size MPI NP BUILDMAT PRE RSVD POST SAVE Total
    L6 (99594-1) 99594 2 373 5 692 34 56 1160
    L2L3L6 (270983-12) 270983 12 506 14 782 137 134 1573
    Lodd (426548-32) 426548 32 481 27 765 237 134 1644
    Leven (616644-75) 616644 75 372 56 762 232 260 1682
    Lall (1043192-250) 1043192 250 450 237 848 508 1200 3243

    reunions/mds_occigen_double_gemm_ts320_r10000.pdf

    mds_occigen_double_gemm_ts320_r10000.png

4.3 2020-07-29

4.3.1 Progress

  1. measuring reads of several HDF5 files in parallel (LAPACK)

    We keep the best solution, cf. multi3 (processes assigned to files in order while balancing the load across processes, e.g. f1 <-> p0, p1, p2 / f2 <-> p2, p3, p4 / f3 <-> p4, p5, etc.), plus tile-consistency handling (a tile cannot be split across several processes). Not much performance loss, except when NP ~ number of files, which can induce load imbalance.

    reunions/readh5multi8_Limpair_15-45.pdf

    readh5multi8_Limpair_15-45.png

  2. performance summary before the read optimization
    UC Size MPI NP Read H5 Pre RSVD Post Total
    L6 (99594-2) 99594 2 347 5 32 5 389
    L2L3L6 (270983-9) 270983 9 1478 14 56 18 1565
    Lodd (426548-25) 426548 25 2638 26 55 36 2755
    Leven (616644-50) 616644 50 5299 57 64 38 5458
    Lall (1043192-150) 1043192 150 5912 161 178 45 6296

    reunions/mds_occigen_gemm_ts320_r960.pdf

    mds_occigen_gemm_ts320_r960.png

  3. Implemented the read path LAPACK –> Chameleon –> migrate (2dbc) in FMR's ChameleonDenseWrapper

    Occigen benchmark with the new L2_L3_L6 read:

      Read H5 Migrate Pre [GE/SY]MM RSVD Post
    GEMM/1DBC 1590 0 12 13 32 15
    SYMM/1DBC 701 0 31 35 75 15
    GEMM/1DBNO 167 130 12 13 34 38
    SYMM/1DBNO 164 102 34 35 77 39

    reunions/L2_L3_L6_mds_occigen_np18_ts320.pdf

    L2_L3_L6_mds_occigen_np18_ts320.png

  4. Occigen benchmarks, NoCyclicOpti SYMM, R=960
    UC Size MPI NP Read H5 Pre RSVD Post Total
    L6 (99594-2) 99594 2 360 17 33 4 414
    L2L3L6 (270983-9) 270983 9 580 68 101 14 763
    Lodd (426548-25) 426548 25 431 72 149 157 809
    Leven (616644-50) 616644 50 465 101 293 479 1338
    Lall (1043192-150) 1043192 150 454 214 576 1036 2280

    reunions/mds_occigen_symm_ts320_r960.pdf

    mds_occigen_symm_ts320_r960.png

  5. SYMM vs GEMM comparison with Chameleon on Occigen

    NP=25, M=426548, N=960, double

    • dsymm: 80 s, 4371 Gflop/s
    • dgemm: 23 s, 15390 Gflop/s
  6. single precision tests
    • equivalent numerical results on the 1M case
    • performance OK at the GEMM level (161 Tflop/s on NP=150, N=10^6, R=9600 –> 1.1 Tflop/s per node, peak at 1.35 Tflop/s)
    • reads are no faster (the data are stored in double precision in the files, H5T_IEEE_F64LE)

4.3.2 Next objectives

  1. HIEPACS: improve SYMM performance?
  2. Florent: parallel HDF5 write of the point cloud X, so that the whole program runs distributed
  3. Florent: post-processing, write a Chameleon algorithm to scale X by sqrt(sigma_i)

4.4 2020-06-17

4.4.1 Progress

  1. Diodon MDS post-processing in Chameleon
    • after the random SVD, a post-processing step is needed for the MDS on U Sigma V^t:
      • select the columns of U corresponding to positive eigenvalues
      • U -> filter -> X, then scale each column i of X by sqrt(sigma_i)
    • this step used to be performed centralized on LAPACK-type objects, which limited the usable ranks K (number of columns of U), since they no longer fit in the memory of a single node when N is large (e.g. 1M)
    • development of functions in FMR (C++ class interfacing to Chameleon) to extract and update Chameleon sub-matrices
    • it is not highly optimized, but U and V^t can now be kept distributed in Chameleon and the MDS post-processing done there, which should allow testing the 1M case with ranks such as K=10,000

    Note: writing X to an HDF5 file in parallel with MPI has not been investigated yet. X therefore still has to be centralized for the sequential write. But handling the post-processing in parallel already saves a lot of memory by not centralizing U and V^t.

  2. Symmetric matrix handling
    • the input matrix is symmetric, but we were not yet exploiting this symmetry and were filling a full Chameleon matrix
    • adapted ChameleonDenseWrapper to handle symmetric matrices (copies, algorithms)
    • adapted the FMR gemm to call symm instead of gemm, gram; read the HDF5 files only on the upper triangular part, which corresponds by symmetry to the lower triangular part of the Chameleon matrix (plus the row/column-major formatting difference)
    • read times divided by 2, but GRAM is 2 to 3 times slower, and likewise SYMM compared with GEMM
  3. new distribution of the distance matrix
    • we try to limit the number of different files accessed by each process while balancing the load across processes
    • we replaced the default 2D block-cyclic distribution by a 2D block non-cyclic one, where the matrix is cut into horizontal slices spread evenly across the processes (process 0 gets the 1st slice, process 1 the 2nd, etc.); in practice it is not very well balanced because of the triangular structure, since the cut only takes rows into account
      • mode MAT_ALLOC_GLOBAL -> MAT_ALLOC_TILE
      • the first access to the matrices must be write-only, otherwise there is a bug
      • added data migration after the read to return to a 2D block-cyclic distribution, which is better for the following algorithms. This migration has a cost, but it can be overlapped with the read
        Read H5 Migrate Pre [GE/SY]MM RSVD Post
      GEMM/1DBC 2650 0 30 23 59 39
      SYMM/1DBC 1065 0 90 83 171 29
      SYMM/1DBN 930 118 62 61 128 164

      reunions/L1_L3_L5_L7_L9_mds_occigen_np25_ts320.pdf

      L1_L3_L5_L7_L9_mds_occigen_np25_ts320.png

  4. measuring reads of several HDF5 files in parallel (LAPACK)
    • the read time is still significant; we want to better understand the impact of our Chameleon tile-block reads from several processes
    • tested different read strategies using dedicated executables, i.e. outside the Diodon, Chameleon, StarPU stack
    • data are read contiguously in one block to fill a LAPACK-type buffer, to set aside the tiled-read aspect; we want to measure the impact of how file data are distributed across processes
    • different strategies:
      • multi1: 1, 2 or 3 processes assigned per file, e.g. f1 <-> p0, p1, p2 / f2 <-> p3, p4, p5 / etc.
      • multi2: each process reads a 1/NP share of each file, by row blocks mapped to each rank; rank i reads the i-th block of every file
      • multi3: processes assigned to files in order while balancing the load across processes, e.g. f1 <-> p0, p1, p2 / f2 <-> p2, p3, p4 / f3 <-> p4, p5, etc.
      • multi4: knapsack-style algorithm to try to minimize the number of different files handled per process. Loop while rows remain to assign: find the file with the most remaining rows, find the process with the fewest assigned rows, give that process as many rows as possible from that file, and so on.

      reunions/readh5multi_Limpair_15-45.pdf

      reunions/readh5multi2_Limpair_15-45.pdf

      reunions/readh5multi3_Limpair_15-45.pdf

      reunions/readh5multi4_Limpair_15-45.pdf

      readh5multi_Limpair_15-45.png

      readh5multi2_Limpair_15-45.png

      readh5multi3_Limpair_15-45.png

      readh5multi4_Limpair_15-45.png

    Conclusions: the multi3 and multi4 versions give good performance, limiting the number of different files handled per process while ensuring load balance. We also note that the times are much better than with Chameleon-style tiles.

4.4.2 Next objectives

4.5 2020-05-06

4.5.1 Progress

  1. Reading H5 files with a single thread + Chameleon async interface
    • modified Chameleon so that the execution of an algorithm's tasks can be forced onto a specific worker
    • adapted fmr and diodon to use this Chameleon feature, which required switching to the "async" interface
    • no better performance possible in async mode for now, because most of the internal algorithms are no longer async (they used to be), for internal optimization reasons (temporary work matrices)
    • wrote a Chameleon class in fmr to better encapsulate Chameleon functionality (init/finalize, option handling, …)
    • read times divided by 2 thanks to this, e.g. L2 L3 L6 (270k): 2500s -> 1166s on 9 nodes
    • the times also seem to decrease well with the number of nodes, 1166s (9N) -> 644s (18N)
      N MPI procs   L10.h5   L10_3.txt
      3             524      1076
      9             162      386
      18            86       187

      We observe a nearly proportional decrease with the number of nodes.

  2. Running the 1 million case on Occigen
    • copied the hdf5 files (700 GB) from irods on Curta to Occigen (scratch) with scp (going through sync)
    • prepared the configuration files L1_L3_L5_L7_L9.txt, L2_L4_L6_L8_L10.txt, L1_L2_L3_L4_L5_L6_L7_L8_L9_L10.txt
    • the computations ran and the MDS outputs are OK, cf. images
    • performance for the case N = 1M, prescribed rank R = 960, 150 nodes:
      • H5 read: 5900s
      • Gram: 161s
      • RSVD: 178s (GEMM 67s, QR 0.6s; several gemm and qr are needed)
      • H5 write of Sigma and X: 15s
    • generated the hdf5 files + images for all the test cases L1, L2, L_even, L_odd, L*

    L1_L2_L3_L4_L5_L6_L7_L8_L9_L10.png

  3. End of task 3: consolidation of the StarPU schedulers
    • added static, inlined bitmaps

    t3-prio.png

    t3-modular-eager.png

  4. Start of task 5.1: integration of parallel tasks
    • compiled starpu and chameleon on conan
    • exchanged with PAW and Terry to gather their experience
    • read Terry's thesis

4.6 2020-04-09

4.6.1 Progress

  • the N=1024000, K=N/100 case runs on Occigen (not on Curta, which misbehaves on many nodes; a problem with this machine?) with the following starpu parameters to limit task submission and hence communications
export STARPU_LIMIT_MIN_SUBMITTED_TASKS=6400
export STARPU_LIMIT_MAX_SUBMITTED_TASKS=7400
  • Perf: NP=150, NT=22
    • GEMM 326s, 66 Tflop/s (Chameleon alone reaches 92 Tf; this performance loss is to be understood)
    • QR 9s, 24 Tflop/s
    • RSVD 1095s
  • Reminder of read performance, matrix L2_L3_L6, N=270983, NP=10
    • ReadHDF5 2344.86 s
    • RSVD 96s
  • Getting started with NetCDF and PNetCDF to compare read times with HDF5
    Table 1: Performance of HDF5 file 10V-RbcL_S74 reading, N=23214, Data Size=4111 MB, on 4 Plafrim miriel nodes using the /beegfs disk partition
    Mode Time (s) Perf (MB/s)
    HDF5 P=1 T=1 18 228
    Chameleon P=1 T=22 50 88
    Chameleon P=1 T=1 23 178
    Chameleon P=2 T=1 12 342
    Chameleon P=3 T=1 8 514
    Chameleon P=4 T=1 6 685
    Table 2: Performance of HDF5 file L6 reading, N=99594, Data Size=75675 MB, on 4 Plafrim miriel nodes using the /beegfs disk partition
    Mode Time (s) Perf (MB/s)
    HDF5 P=1 T=1 371 204
    Chameleon P=4 T=1 125 582
    Chameleon P=8 T=1 65 1164
    Chameleon P=16 T=1 30 2522
  • Fixed a bug in the parallel Chameleon computation of the Gram matrix: this lets Romain Peressoni start studying new test cases built by "merging" several sample HDF5 files
  • Set up gitlab-ci for the FMR and Diodon projects; wrote a Gram unit test with chameleon in Diodon

4.7 2020-01-27

4.7.1 Progress

  • Curta unstable; installed and benchmarked the stack on Occigen
  • guix packaging -> fmr, diodon + mpi (chameleon+mkl+mt)
  • new kibana
  • memory problems with the 1 million system: matrix generation abnormally long (> 30 minutes); when submitting on more nodes (256 -> 512), a memory overflow that should not happen, since the theoretical footprint of the initial data is 15 GB on 64 GB nodes
    mpiexec --map-by node -mca pml ucx -mca btl '^uct,openib' -n 512 /home/fpruvost/gordon/install/bin/mdsDriver -d -osvd -t 22 -gen 1 -rs 1024000 -rk 10240 -os 0
    ...
    slurmstepd: error: Exceeded step memory limit at some point.
    slurmstepd: error: Exceeded job memory limit at some point.
    

    Yet I apply the following control through starpu

    export STARPU_LIMIT_CPU_MEM=50000
    export STARPU_LIMIT_MIN_SUBMITTED_TASKS=15000
    export STARPU_LIMIT_MAX_SUBMITTED_TASKS=16000
    
  • performance on CURTA
    Table 3: CPU times (s) on random matrices on CURTA with OpenMPI, K=9600
    # Nodes 2 5 10 15 30 60 100 150
    Size N 80000 160000 272000 320000 480000 640000 800000 960000
    Generate Random(NxN) 49 85 146 138 162 166 179 257
    GRAM (NxN) 3 7 57 23 45 87 134 205
    GEMM (NxN * NxK) 48 78 151 130 166 181 192 219
    Perf GEMM (Gflop/s) 2537 6245 9355 15012 26513 43340 63925 80752
    QR (NxK) 7.8 6.8 7.7 5.5 5.1 6.9 4.3 7.4
    Perf QR (Gflop/s) 1814 4214 6425 10495 17052 16857 34069 24667
    SVD (KxK) 150 143 144 144 142 153 193 155
    RSVD 294 353 533 488 601 658 759 1038
  • performance on Occigen
    Table 4: CPU times (s) on random matrices on Occigen with OpenMPI, K=N/100
    # Nodes 2 6 16 64 256 512
    Size N 64000 128000 256000 512000 768000  
    Generate Random(NxN) 4.6 11 32 65 69  
    GRAM (NxN) 2.2 5 16 80 214  
    GEMM (NxN * NxK) 4.4 11 36 91 118  
    Perf GEMM (Gflop/s) 1192 3583 9293 29375 76750  
    QR (NxK) 0.1 0.3 0.6 1.5 4.6  
    Perf QR (Gflop/s) 400 1424 5272 17192 19295  
    SVD (KxK) 0.6 1.2 6.7 46 260  
    RSVD 10 28 88 263 584  
  • large-scale performance problem in diodon on Occigen: the gemm degrades compared with the chameleon testing driver (2x slower on 256 nodes), to be analyzed … (memory problem?)
  • Nmad segfault on plafrim; waiting for a debugged nmad version for gordon (guix release)

4.8 2020-01-17

4.8.1 Progress

  • bootstrap via IB and PMI2 by default
    • benchmarked the initialization time
    • initialization seems OK, but there are issues with the pioman queues
  • profiling of the modular-eager scheduler
    • a lead for the difference between the eager and modular-eager schedulers
    • the best_impl component seems to be the cause of the performance difference between the two schedulers
  • QR memory-imbalance problem identified and fixed with Mathieu F.
  • tests on curta:
    • up to 640000 on 60 nodes
    • large memory consumption of the GEMM on big cases, "out of memory" in some cases, 50 GB at the start -> 96 GB memory limit. Most likely due to tiles duplicated for communications. Used the StarPU submission window, cf. STARPU_LIMIT_MIN|MAX_SUBMITTED_TASKS, and limited StarPU memory usage with STARPU_LIMIT_CPU_MEM.
    • curta had outages
    • job at N=800000 pending
Table 5: CPU times (s) on random matrices on CURTA with OpenMPI
# Nodes 2 5 10 15 30 60
Size N 80000 160000 272000 320000 480000 640000
Generate Random(NxN) 49 85 146 138 162 166
GRAM (NxN) 3 7 57 23 45 87
GEMM (NxN * NxK) 48 78 151 130 166 181
Perf GEMM (Gflop/s) 2537 6245 9355 15012 26513 43340
QR (NxK) 7.8 6.8 7.7 5.5 5.1 6.9
Perf QR (Gflop/s) 1814 4214 6425 10495 17052 16857
SVD (KxK) 150 143 144 144 142 153
RSVD 294 353 533 488 601 658
  • the INRA 1 million test case is ready and should be accessible on curta

4.8.2 Next objectives

  • 1 million results on curta

4.9 2019-12-17

4.9.1 Progress

  • bootstrap via IB and PMI2 finished (tested on plafrim)
    • only one global barrier still needed
    • launched benchmarks on the efficiency of the bootstrap over IB
  • StarPU schedulers
    • modified the scheduling animation
    • reduced memory usage
  • added FXT traces to get a finer grain in the scheduling
    • the performance differences for heft might be caused by a mutex
    • the differences for the eager scheduler are larger, but the cause is harder to identify (nested loops?)
  • reorganized the adt-gordon website https://adt-gordon.gitlabpages.inria.fr/adt-gordon/
  • initialized the gordon report in the adt-gordon git https://gitlab.inria.fr/adt-gordon/adt-gordon/tree/master/document/rapport
    • first highlight the software-stack problem, world-class components all connected to one another, then the applicative challenge
  • discussed post-processing U, Sigma and VT -> X (point cloud) distributed with chameleon; we have implementation ideas, it should not take very long
  • curta benchmarks: progressed up to 320K, memory problems, imbalance on process 0 during the hierarchical-tree QR, analysis in progress with M. Faverge
  • guix + singularity on curta and plafrim
  • problem with nmad on plafrim
    • how do we release nmad, how do we test nmad?
  • Jean Zay access to consider, A8 call or preparatory access
    • request preparatory access
    • then prepare the application in January, deadline 07/02

4.9.2 Next objectives

  • meeting in January 2020: Friday the 17th?
    • focused on writing the report
  • benchmarks of the bootstrap via IB on Occigen
  • understand and fix the memory-imbalance problem in the QR

4.10 2019-11-29

4.10.1 Progress

  • added IBUD (Infiniband Verbs Unreliable Datagram) (alex)
    • problems on plafrim
  • added the PSM/PSM2 drivers for the bootstrap
    • requires interface changes in the bootstrap modules
  • feedback on the segfault of the psm guix package (ludo)
  • read the StarPU scheduler code
  • tested the Pleiade data at 270k: OK, the stack manages to process it
  • the rank K could not be too large in FMR, because the result objects U, Sigma and V^T are LAPACK-type (centralized) objects -> memory problem for matrices of size 270K x 10K, for example.
    • developments in FMR and Diodon to remove this limitation: the singular-vector matrices U and V^T can now be either centralized Lapack objects or distributed Chameleon objects
  • Results on the L2_L3_L6 testcase K=960: We present the CPU time of the different stages of the Diodon Random SVD on CURTA nodes "compute":
    Table 6: CPU times (s) on the L2_L3_L6 (N=270983, K=960) testcase on CURTA with OpenMPI
    # Nodes 8 10 12 14
    READ HDF5 (NxN) 2612 2344 2051 1328
    GRAM (NxN) 48 42 40 40
    GEMM (NxN * NxK) 49 35 31 35
    Perf GEMM (Gflop/s) 2847 3940 4481 3937
    QR (NxK) 3.7 3.16 3.5 3.18
    Perf QR (Gflop/s) 133 158 141 156
    SVD (KxK) 0.51 0.53 0.46 0.9
    RSVD 121 96 87 98
  • Added an option in diodon to generate a random rank-K matrix, to avoid reading data from H5 files. Convenient for testing.

4.10.2 Next objectives

  • [FP] test the million case with randomly generated matrices on CURTA
  • [FP] look into possible access to Jean Zay
  • [FP] bootstrap article
  • [AG] make progress on the starpu schedulers

4.11 2019-10-15

4.11.1 Progress

  • FP: code for reading the diodon data from several hdf5 blocks is OK, to be tested on a large "sizing" case
  • AG:
    • added the exchange of the master's topology via pmi2
    • use of the IB driver without buffer

4.11.2 Next objectives

  • the M+12 milestone meeting is set for Wednesday, November 13
  • test the diodon+fmr+chameleon stack with openmpi and newmadeleine on larger INRA cases, i.e. >100k, e.g. aim for 300k or more on MCIA or Occigen depending on availability (Florent+INRA)
  • identify the problems with nmad+pmi2
  • start the discussion between Adrien and Samuel about refactoring the schedulers in StarPU
  • stabilize collectives (Philippe+Nathalie) for integration into stable releases of starpu and nmad

4.12 2019-09-30

4.12.1 Progress

  • FP: no progress, due to other activities

4.12.2 Next objectives

  • FP: see the objectives of the September 5 meeting

4.13 2019-09-05

4.13.1 Progress

  • discussion with O. Coulaud + A. Franc + JM Frigerio to agree on a format allowing a bigger problem to be assembled from the existing hdf5 files
    • Python prototype developed by Pleiade to assemble the problem
    • developed a script to split an existing hdf5 problem into sub-files, for validation
    • O. Coulaud developed a new interface in FMR to do the same thing in the C++ code
  • automated Chameleon performance tests: developed a pipeline gitlab-ci -> plafrim -> chameleon build in a reproducible guix environment -> jube to drive the benchmarks (Cartesian product of parameters + output parsing) -> elasticsearch DB -> kibana, cf. https://kibana.bordeaux.inria.fr

4.13.2 Next objectives

  1. develop a (random) rank-k matrix generator
    • directly in chameleon, which will allow testing an RSVD with arbitrarily chosen parameters
  2. assemble the big matrix in diodon with the chameleon mpi version

4.14 2019-07-26

4.14.1 Progress

  1. answer from CINES about the filtering subnets
    • prefer name resolution over listing the network interfaces
  2. bootstrap of the communications over IB
    • created a pmi2 name-service module
    • created a module to exchange the driver URLs, using the name-service module
    • added the PadicoBootstrap interface to the Control-minidriver module
    • published the driver URLs via pmi2
    • published the IB addresses via the pmi2/slurm name service
    • IB connection between the nodes (all towards one)
  3. QR integration in FMR
    • interfaced chameleon's QR algorithm with hierarchical trees (HQR) in FMR
    • functionality test comparing blas vs. chameleon OK
    • validated with the Diodon application
  4. cohabitation of chameleon + multithreaded (MT) MKL in Diodon
    • cmake changes to get Chameleon with MT MKL in Diodon
    • performance problem when switching from a chameleon algorithm to MT MKL, because the starpu threads busy-wait
    • test with the CHAMELEON_Pause function OK (pauses starpu activity). This behaviour should become the default in chameleon with the tile interface
  5. tests on CURTA with openmpi
    • installed the stack "by hand" since guix is not installed on curta (kept only the gcc compiler and MKL as modules)
    • performance problems, but they were due to the binding done by MPI; with the right parameters, performance is OK on the 100k case
  6. preparation of the v2 milestone/release
  7. meeting with Pleiade about the 1M test case
    • we do not have the 1M case at hand in a single hdf5 file
    • it is split into sub-matrices, since these correspond to samples that are meaningful to the biologists
    • a formalized mechanism is needed to read the hdf5 files and assemble a bigger case on the fly
    • work in progress at Pleiade on a Python prototype

4.14.2 Next objectives

  1. bootstrap of the communications over IB
    • exchange the topologies via pmi2 (for now we assume IB is always present)
    • create a polling thread on the IB card to receive the bootstrap messages
  2. big matrix assembly
  3. research report skeleton

4.15 2019-07-03

4.15.1 Progress

  1. scalability of the NewMadeleine launcher
    • added pmi2
    • gathered the PADICO_STARTUP_* arguments into a single variable (-compact)
    • aggregated the lines of the slurm configuration file
      • local computation of the UUID
      • local computation of PADICO_HOST_RANK/COUNT
      • without pmi2, PADICO_BOOT_RANK=%t (substituted by slurm with the rank)
  2. bug on Occigen
    • bug when connecting the nodes over UDP; it appears when nodes are on different subnets
  3. analysis of the cost of reading the distance matrix from an HDF5 file
    • cf. the mail sent to adt-gordon@inria.fr on 19/06/2019
    • examples of answers:
      • lsd-type machine
      • couple with an SZ compressor
      • split the file into several sub-files
      • optimize the configuration of the HDF5 installation
    • thread-safe HDF5 serializes the calls to the HDF5 library, so multithreading brings no speedup
    • we apparently do not benefit from the advantages of Lustre on 1 node; we should go about 4 times faster, around 1 GB/s; most likely due to the HDF5 serialization

4.15.2 Next objectives

  1. contact CINES about the filtering subnets
  2. bootstrap of the communications over IB, with address publication via pmi2
  3. priority: make progress on interfacing Diodon+FMR+Chameleon with the hierarchical QR algorithm
  4. matrix-read tests (if time remains after the QR interfacing)
    • test with parallel HDF5
    • look at starpu out-of-core + hdf5
    • generate a "dummy" matrix that still has the right rank properties (see with Emmanuel)
    • study the use of a compressor

4.16 2019-06-12

4.16.1 Progress

Milestone 1 meeting, cf. release v1.

4.16.2 Next objectives

Besides the main objectives already identified in the ADT document, some points remain to be clarified:

  • starpu hooks: understand the results
  • I/O: better understand the read performance of the HDF5 file
  • think about a better emacs+orgmode integration with guix, so that everyone can get the same environment with guix to test the experiment notebooks

4.17 2019-04-08

4.17.1 Progress

  1. basic building blocks for the Gram algorithm in chameleon are coded
    • merge request in progress in chameleon
    • we need to find time with Mathieu to write the Gram algorithm itself
  2. nightly benchmarks on plafrim, in progress
    • scripts are ready to drive the runs on plafrim (miriel) with job submission (slurm), all in a "pure", i.e. well-controlled, guix environment
    • the idea is to get chameleon performance with openmpi and newmadeleine
    • it remains to test the gitlab runner on plafrim, so that gitlab launches the tests automatically (schedule), and to process the results into graphs (elasticsearch+kibana?)

4.17.2 Next objectives

4.18 2019-03-22

4.18.1 Progress

  1. GEMM SUMMA performance OK
    • internal HiePACS discussion: we validated that the SUMMA GEMM is OK for us, ready for integration into Diodon
  2. New release chameleon-0.9.2
    • stabilizes the developments of the last few years in an archived version
    • https://gitlab.inria.fr/solverstack/chameleon/releases
    • the SUMMA GEMM is not in it. The algorithm is in a branch of Mathieu Faverge; it will have to be integrated in a future release
  3. GUIX packages
    • integrated the Gordon packages into GUIX-HPC rather than keeping them as a fork
    • tests with CUDA
  4. gemm progression benchmarks
    • communication progress in a dedicated thread + hooks:
      • using the hooks can slightly degrade gemm performance on few nodes.
    • progress only in the starpu hooks:
      • the more nodes, the closer we get to the performance with a dedicated thread. This behaviour is more pronounced in the sgemm cases.
  5. potrf progression benchmarks
    • the test invariably ends with a "node failure"; a ticket has been opened.
  6. gemm performance drop
    • similar behaviour can be observed with openmpi on plafrim.
    • no traces for now; we are waiting to see whether this behaviour reproduces with the SUMMA gemm (perhaps a badly chosen block size).

4.18.2 Next objectives

  • Florent:
    • code the Gram algorithm as tasks in chameleon
    • start setting up nightly benchmarks on plafrim via guix to compare the performance of chameleon + openmpi vs nmad
  • Adrien:
    • run the benchmark campaign with the SUMMA gemm
    • start implementing the pmi2 interface in nmad in preparation for runs on Occigen

4.19 2019-02-28

4.19.1 Progress

  1. communication-progress performance
    • measured the performance of the case where a dedicated thread handles communication progress (cf. bench_comms_progress).
  2. implementation of the hooks in starpu
    • it is now possible to enable hooks in starpu so that idle threads participate in communication progress
    • possibility to enable (already existing) hooks in starpu so that threads participate in communication progress after each task
    • 2 new environment variables:
      • STARPU_MPI_NMAD_PROG_HOOKS
      • STARPU_MPI_NMAD_IDLE_HOOKS
      • 3 possible values:
        • FORCED: poll all tasks from all queues
        • SINGLE: poll one task from local queue
        • HOOK: poll all tasks from local queue
  3. performance measurements of the hooks
    • preliminary tests:
      • communication progress only in the starpu hooks (no communication thread anymore, progress in the hooks, 1 more worker available)
      • hybrid communication progress (1 communication thread + progress in the hooks)
  4. unidentified deadlock
    • erratic behaviour on plafrim, hard to reproduce in debug mode with enough nodes…
  5. new performance at 256 nodes, fixed, and different PxQ configurations

    cf. the last figure, bottom right

    reunions/dgemm_openmpi_occigen_reu15022019.pdf

    • it behaves rather well with Q=1, 2
    • see what can be gained
    • Mathieu is thinking about an improvement of the algorithm, feasibility study: can we express what we want with the sequential task-based paradigm?
  6. traces to better understand the P=Q case
    • dgemm
    • M=K=57600 (180*320), N=1280 (4*320)
    • 9 nodes, with PxQ=9x1 and PxQ=3x3

    trace_dgemm_3x3.png

    trace_dgemm_9x1.png

    • many more communications in 3x3
    • is the communication cost normal?
    • experimented with the "lookahead" parameter in chameleon
      • it has an impact, but not enough to reach the 9x1 performance
      • at best on our case we reach 2 TFlop/s, whereas 9x1 reaches 5 TFlop/s
    • also tested on plafrim with nmad, no improvement

4.19.2 Next objectives

  • Adrien:
    • start implementing the pmi2 interface in nmad in preparation for runs on Occigen
    • find the cause of the performance drop around N=113000
    • measure the performance of the other gemm algorithms (see with Florent)
  • Florent:
    • check with the runtime team whether the communication cost is OK
    • see with hiepacs whether we experiment with variants of the SUMMA GEMM algorithm
    • coding/experiments/traces

4.20 2019-02-15

4.20.1 Progress

  1. dgemm chameleon+openmpi performance on Occigen
    • new performance campaign, on the dgemm this time, with more matrix sizes and different data-distribution configurations
    • STARPU_COMM_STATS set to 1
    • we keep the assumption that M is much larger than N, here M=100*N
    • we move away from the peak as the number of nodes increases
    • the PxQ data distribution matters a lot: performance is much better with a rectangular PxQ (P>>Q) than with a square PxQ

    reunions/dgemm_openmpi_occigen_reu15022019.pdf

  2. paje traces
    • we manage to generate traces, but there is a problem with the communication arrows, which do not appear -> to be fixed
    • we need to understand why it behaves so badly on a square PxQ grid
  3. Guix with MKL
    • Chameleon+openmpi+mkl installed with Guix works on plafrim
      • the installation and execution environment is fully controlled
      • a few workarounds remain in the slurm scripts that should be avoided; discussed with Ludovic Courtès, he is taking care of it
  4. Communication progress
    • the starpu MPI thread (handling the nmad callbacks) is no longer bound to a core
    • started adding hooks in starpu for when workers are idle (for opportunistic communication progress)
  5. Bug
    • a bug prevents the "clean" termination of the tests (blocks the following tests + excessive consumption of hours on Occigen)
  6. nmad on Occigen
    • impossible to run nmad on Occigen:
      • the srun version used does not allow passing long argument lines
      • ssh connections to the compute nodes are forbidden
    • look at the pmi2 library (related to task 2)

4.20.2 Next objectives

  • Determine the cause of the bug
  • Florent: analyze the paje traces of the dgemm to understand the performance drops
  • Adrien:
    • finish the implementation of the hooks
    • measure different communication-progress strategies

4.21 2019-02-01

4.21.1 Progress

  1. gemm chameleon+openmpi performance on Occigen
    • same size parameters as the previous benchmarks on plafrim
    • new gemm algorithm: performance x3 compared with master
    • but still far from the peak; more detailed profiling is needed

    reunions/gemm_openmpi_occigen_reu01022019.pdf

  2. changes to communication progress
    • added the case where pioman has a dedicated core
    • starpu can explicitly tell pioman at which level and how many threads pioman should use

4.21.2 Next objectives

  • Florent: profile the new GEMM; look at STARPU_COMM_STATS, STARPU_LIMIT_CPU_MEM and paje traces on 4 nodes, with a 50K matrix for example
  • Adrien: finish the integration of the new progression in pioman; stop binding the starpu MPI thread and use starpu_get_next_bindid with THREAD_ACTIVE so that the core is reserved; start inserting the hooks in starpu to make communications progress (see cpu_driver_run_once, !task)

4.22 2019-01-16

4.22.1 Objectives

  • guix on standby, OK. The ball is in Ludo's court. Florent will report on progress
  • compute the Gram matrix from the distance matrix as tasks. Florent
  • strong-scaling test of GEMM on the diodon sizes. Adrien
    • present the figures at the next meeting
  • re-run the existing performance non-regression benchmarks Chameleon+StarPU+Nmad+MKL (miriel, and without cuda since sirocco is not available). Nathalie+Alex
    • share the diodon matrix sizes (20k, 100k, etc.). Florent
    • anticipate the matrix sizes of the big cases to come, ~1M. Florent
    • modify the nightly bench to run the GEMM. Nathalie
    • in single and double precision. Nathalie
    • save the benchmarks (commits, parameters, performance)

4.22.2 Progress

  • guix still on standby, Ludo needs a reminder
  • Gram matrix computation: discussion with Mathieu Faverge to see what Chameleon is missing. A priori nothing insurmountable, but still some work, since we want a task-based implementation with MPI whose performance is not too bad
  • strong scalability of s|dgemm: Adrien ran the benchmarks on the plafrim miriel nodes (Omnipath network), which gives us reference performance. Performance is not good as soon as several nodes are used, as expected, since the algorithm is not designed for MPI. Everything is committed in the bench_init subfolder of the "adt-gordon" git
  • starpu non-regression performance: Nathalie added the s|dgemm tests to the dedicated page. Performance is well below what is expected, most likely a problem of configuration or of the matrix sizes considered

4.22.3 Next objectives

  • Code/validate the task-based Gram matrix construction algorithm in Chameleon. FP+MF
  • make progress on task T1 of the ADT. AG
  • starpu nightly benchmarks: fix the configuration to get the expected s|dgemm performance

4.23 2018-12-20

4.23.1 Objectives

  • simgrid: test a starpu+simgrid stack in guix
  • infiniband: check whether the ibverbs and psm2 libraries can be installed by guix
  • cuda: the driver part is loaded in the kernel, but that is not necessarily a problem for performance; test a chameleon+nmad stack with cuda in guix
  • intel mkl: question of a gcc+mkl stack instead of openblas; see with LC about adding an mkl package (without the intel compiler)
  • create a validation driver in diodon with chameleon gemm (nmad) where the numerical results can really be compared (no randomized algorithm), so that everyone is able to replay the experiment
  • talk with A. Franc to get large matrices, plafrim storage
  • experiment results on gitlab in adt-gordon
  • clarification: mpi only in chameleon, no parallel I/O for us for now

4.23.2 Progress

  • discussion about MKL in GUIX:
    • not so simple, e.g. an sh script generated and called by the installer with hard-coded paths to /bin/sh, whereas the guix build environment is isolated and only sees /gnu/store
    • possible, but not right away
    • put guix on standby on our side? i.e. deploy with spack?
  • upper-layers meeting on 13/12 with A. Franc, JM. Frigerio, O. Coulaud, E. Agullo:
    • explanation of the algorithm
    • discussion about the chameleon interface and how the matrix tiles will be filled (in preparation for the move to distributed)
    • a new task-based algorithm is needed to build the Gram matrix from the distance matrix
    • access to the large INRA data on plafrim, diodon project space
  • gemm validation driver with diodon data OK in multithreaded mode
    • timers on both the blas and chameleon gemms, for comparison and error testing
    • first with the lapack interface
    • then tile, filling the tiles via chameleon's build function (the user provides a function able to fill any tile generically), copying from lapack to tiles
    • then reading the subset corresponding to each tile directly from the hdf5 file

4.23.3 Questions to clarify

  • guix on standby?
  • the idea of quickly running experiments on a large matrix on plafrim is not clear to me, since diodon is not designed for distributed data…
    • time is lost doing the lapack -> tile copy and the tile distribution, while this is not how we will want to ship the code in the end
    • hence the work on the hdf5-to-tile interface, directly, without going through a large centralized lapack-style vector
    • if a v0 benchmark is necessary: compare multithreaded mkl with multithreaded chameleon on a 100000 matrix on one node
  • too many meetings gathering everyone? beginning of the project, so probably fine

4.23.4 Minutes and next objectives

  • guix on standby, OK. The ball is in Ludo's court. Florent will report on progress
  • compute the Gram matrix from the distance matrix as tasks. Florent
  • tutorial/benchmark on one plafrim node? NO
  • strong-scaling test of GEMM on the diodon sizes. Adrien
    • present the figures at the next meeting
  • re-run the existing performance non-regression benchmarks Chameleon+StarPU+Nmad+MKL (miriel, and without cuda since sirocco is not available). Nathalie+Alex
    • share the diodon matrix sizes (20k, 100k, etc.). Florent
    • anticipate the matrix sizes of the big cases to come, ~1M. Florent
    • modify the nightly bench to run the GEMM. Nathalie
    • in single and double precision. Nathalie
    • save the benchmarks (commits, parameters, performance)

4.24 2018-12-07

4.24.1 Objectives

For this meeting it was agreed that Adrien and Florent would have finished the Nmad-StarPU-Chameleon-Metabarcoding integration, that the whole stack would be ported to Guix, and that we would have the first performance results on an input dataset. Initial work on task 1.2 (coordinated approach to communication progression in StarPU) should also have been started.

4.24.2 Progress

  • Getting familiar with the software ecosystems
    • starpu+nmad stack
    • diodon+FMR+Chameleon stack
  • Getting familiar with the Guix tool
  • Extension of the Chameleon and StarPU packages to cover more variants (mpi, fxt)
  • Development of new packages
    • Nmad stack (puk, pioman, padico)
    • starpu+nmad or madmpi
    • chameleon+nmad or madmpi
    • FMR
    • Diodon
  • First tests of the stack on plafrim (chameleon with nmad/madmpi)
  • Creation of starpu+cuda and chameleon+cuda packages (in progress)
  • First integration of Chameleon into FMR
    • development of a C++ wrapper
    • not distributed, since Diodon and FMR cannot handle distributed data for now
    • Chameleon's LAPACK interface = copy of the data from LAPACK format to Chameleon's tiled format
  • First integration of Diodon+FMR+Chameleon
    • cf. "mdsDriverChameleon" in the diodon demos
  • test procedures written in the adt-gordon/adt-gordon gitlab

4.24.3 Difficulties

  • Guix is not so simple for newcomers
    • we are far from mastering it; we will get there gradually
    • we also discovered some limitations along the way
  • Guix is meant to distribute released versions = the interface is not designed to handle git branches
    • discussions with L. Courtès; development in progress so that git branches can be used at any desired commit, directly from the CLI (tedious to have to hard-code this in the packages)
  • Guix is open-source oriented = problem downloading FMR and Diodon, which are private projects on gitlab
    • discussions with L. Courtès; development in progress to add ssh support to the git fetch
  • Guix: no Intel MKL for the moment; possible in the future?
  • Is the Chameleon-StarPU-Nmad stack functional?
  • On the Diodon-FMR side: do not underestimate the work needed for the move to distributed memory; who will do it?

4.24.4 Major work areas

  1. control of the environment via Guix packages: the ball is in LC's court, we are waiting to follow up
  2. validation of the Chameleon-StarPU-Nmad stack in Guix
  3. go further in the Diodon+Chameleon integration: fetch the Diodon data directly in tiled form, add a tiled interface to the C++ wrapper
  4. Diodon MPI: decentralized data reading

4.24.5 Minutes and next objectives

  • simgrid: test a starpu+simgrid stack in guix
  • infiniband: check whether the ibverbs and psm2 libraries can be installed through guix
  • cuda: the driver part is loaded in the kernel but not necessarily a problem for performance; test a chameleon+nmad stack with cuda in guix
  • intel mkl: question of a gcc+mkl stack instead of openblas; see with LC about adding an mkl package (without the intel compiler)
  • create a validation driver in diodon with a chameleon (nmad) gemm where numerical results can really be compared (no randomized algorithm), so that everyone is able to replay the experiment
  • talk with A. Franc to get large matrices, plafrim storage
  • experiment results on gitlab in adt-gordon
  • clarification: MPI only in chameleon, no parallel I/O for us for now
  1. Upper-layers meetings
    1. December 13, 2018

      Attendees: AF, JMF, OC, EA, FP

      Questions:

      • pointers to understand the mdsDriver algorithm
      • how to view and validate the results
      • where to find the data, how to read and understand it
      • interfaces from diodon data to chameleon tiles
      • major steps of the Diodon-Chameleon integration
      1. understanding the mdsDriver algorithm
        • explanation at the whiteboard
        • clarifies the main steps and parameters well
          • input: distance matrix D between pairs of individuals, of size N*N, symmetric; only one triangular part needs to be stored
          • transformation of D into the Gram matrix G
          • random SVD (from the FMR lib) on G
          • Rank = K: number of columns used to build the image space of G, chosen arbitrarily such that K << N, for example K=500
          • output: a vector S (Sigma) of the singular values of G, of size K presumably, and a rectangular matrix X of size N*K, the projection of G onto the K-dimensional space

        linalg_method_random_svd.jpg
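
        For reference, a standard double-centering formula consistent with the transformation of D into G described above (the exact variant implemented in Diodon may differ) is:

        $$ G = -\frac{1}{2} \, J D^{(2)} J, \qquad J = I - \frac{1}{N}\mathbf{1}\mathbf{1}^t, \qquad (D^{(2)})_{ij} = D_{ij}^2 $$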

      2. validating the results
        • in mds, the rank K must be taken large enough for the results to be stable
        • otherwise, the mds output data is not huge = N*K with K<<N (K ~ 500, N ~ 10^6), so the data fits in RAM and the outputs can easily be saved in hdf5
      3. data availability
        • matrices available on plafrim in the diodon project space
        • see with OC and J. Lelaurain for access to the diodon project on plafrim
        • see with JMF for the matrix metadata and hdf5 commands
        1. pointers and metadata for the distance matrices
          Name plafrim path (/projets/diodon/) N
          10V-RbcL_S74 malabar/june2017/10V-RbcL_S74.h5 23214
          L6 diatoms/L6.h5 99594
          • metadata command (size, shape, e.g. full or triangular since symmetric) on the matrices
          h5ls L6.h5
          
      4. interfaces from diodon data to chameleon tiles
        • we will have to use the CHAMELEON_zbuild_Tile interface, to which a callback function must be given to fill the sub-blocks of the matrix
          • call the HDF5 API to read the data directly from the file?
          • the D->G pretreatment has to be done beforehand, doesn't it?
        • so we will have D, and we will then have to develop a parallel task-based algorithm (in a chameleon branch, e.g. "diodon") to compute G
        • the metadata, e.g. file name, possible decomposition into several files, matrix shape (e.g. upper triangular), can be passed through the opaque user_data structure
        /* The callback 'user_build_callback' is expected to build the block of matrix [row_min, row_max] x [col_min, col_max]
         * (with both min and max values included in the intervals, indices starting at 0 as in C, NOT 1 as in Fortran)
         * and store it at the address 'buffer' with leading dimension 'ld'
         */
        user_build_callback(row_min, row_max, col_min, col_max, buffer, ld, user_data);
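
        /* A hedged sketch (not Diodon's actual code) of such a callback that
         * reads the block straight from an HDF5 file; the dataset name
         * "/distance" and the double-precision type are assumptions.
         * Requires <hdf5.h> and <stdlib.h>. */
        void hdf5_build_callback(int row_min, int row_max, int col_min, int col_max,
                                 void *buffer, int ld, void *user_data)
        {
            const char *filename = (const char *)user_data; /* metadata passed via user_data */
            int m = row_max - row_min + 1;
            int n = col_max - col_min + 1;
            hid_t file = H5Fopen(filename, H5F_ACC_RDONLY, H5P_DEFAULT);
            hid_t dset = H5Dopen2(file, "/distance", H5P_DEFAULT);
            hid_t fspace = H5Dget_space(dset);
            hsize_t offset[2] = { (hsize_t)row_min, (hsize_t)col_min };
            hsize_t count[2]  = { (hsize_t)m, (hsize_t)n };
            /* select the [row_min,row_max] x [col_min,col_max] block in the file */
            H5Sselect_hyperslab(fspace, H5S_SELECT_SET, offset, NULL, count, NULL);
            hid_t mspace = H5Screate_simple(2, count, NULL);
            double *tmp = malloc((size_t)m * (size_t)n * sizeof(double));
            H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, tmp);
            /* HDF5 stores row-major; the tile is column-major with leading dimension ld */
            for (int j = 0; j < n; j++)
                for (int i = 0; i < m; i++)
                    ((double *)buffer)[(size_t)j * ld + i] = tmp[(size_t)i * n + j];
            free(tmp);
            H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
        }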
        
      5. major steps of the Diodon-Chameleon integration

        Note: about the first GEMM called in Diodon, Y = G ω, i.e. milestone 1 at T0+6

        1. In Core - Centralized - Lapack interface
        2. In Core - Centralized - Tile interface
        3. [Out of Core - Centralized - Tile interface]
        4. In Core - Decentralized - Tile interface
        5. [Out of Core - Decentralized - Tile interface]

        diodon_chameleon_plan.jpg

5 Release: Proof of concept

5.1 Introduction

This document aims at validating the first release (out of four) of the Gordon ADT. This release is the "proof of concept". The goal is twofold:

  • The first is to improve the StarPU/New Madeleine integration regarding the assignment of communication tasks to workers.
  • The second is to validate the Diodon/FMR/Chameleon/StarPU/NewMadeleine stack on the first algorithm that needs to be parallelized, i.e. the first matrix product Y = G Ω. We need to show that the software stack builds and can be executed on PlaFRIM on a medium-sized metabarcoding testcase of size 100000 on a couple of PlaFRIM "miriel" nodes.

5.2 Work done to achieve this release

5.2.1 Guix packaging

To make the experiments reproducible we use the Guix package manager. The guix packages we have developed specifically for this project are hosted on the Inria gitlab.

List of packages developed and experimented with:

  • puk, pioman, padicotm, nmad, nmad-mini
  • starpu, starpu-fxt
  • chameleon, chameleon-fxt
  • fmr, diodon (private projects, in "Non Free" repository)

In addition to these developments, several features have been added to Guix as a result of our feedback to Ludovic Courtès, one of the Guix developers:

  • Intel mkl package (in "Non Free" repository)
  • slurm package working on PlaFRIM
  • thread-safe hdf5 package (it was not thread-safe before)
  • ability to change the url of a git repository
  • ability to choose the branch and the commit of a git repository (see the example below)
  • ability to keep specific environment variable definitions when building a clean Guix environment (not polluted by the user's own)
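
As an illustration of the git-related items, these package transformation options can be combined on the command line (the chameleon upstream URL is the one used later in this document):

guix install chameleon --with-git-url=chameleon=https://gitlab.inria.fr/solverstack/chameleon.git \
                       --with-branch=chameleon=master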

5.2.2 Dialogue mechanism for polling threads

To avoid contention between worker threads and the polling thread, thread binding must be handled with care. At the beginning of the project this was done by pioman, the part of the PM2 package in charge of creating polling threads. The default polling strategy was to create 1 idle thread per NUMA node and 1 timer thread per MPI process. For this release, we added the possibility for pioman to create a bound thread dedicated to polling.

We also have the possibility to give pioman a way of distributing polling threads via the function int piom_ltask_set_bound_thread_indexes(int level, int *indexes, int size). This function takes 3 arguments:

  • level: level of topology object (i.e. a hwloc_obj_type_t)
  • indexes: an array of logical ids within level
  • size: size of indexes

For now, StarPU uses this function to bind one polling thread to the same core as the "MPI" thread. This thread is used to manage callbacks when using StarPU with the nmad backend.

To use this new mechanism between StarPU and NewMadeleine, 2 environment variables have to be set:

  • PIOM_DEDICATED=1
  • PIOM_DEDICATED_WAIT=1

A complete list of environment variables related to polling thread binding (an example follows the list):

  • PIOM_DEDICATED: polling threads will be bound. Default: 0.
  • PIOM_DEDICATED_DISTRIB: gives a way of distributing polling threads within the topology object (all, odd, even, first, last). Default: last.
  • PIOM_DEDICATED_LEVEL: type of topology object to bind threads on (machine, node, socket, core, pu). Default: socket.
  • PIOM_DEDICATED_WAIT: pioman waits for an external program to give a topology level and a list of logical indexes within this level via int piom_ltask_set_bound_thread_indexes(int level, int *indexes, int size). PIOM_DEDICATED_DISTRIB and PIOM_DEDICATED_LEVEL are ignored when this variable is set to 1.
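
For instance, a sketch of a run with one dedicated polling thread bound on the first core of each socket (mdsDriver stands in here for any StarPU+nmad program; the input file and rank are placeholders):

export PIOM_DEDICATED=1
export PIOM_DEDICATED_LEVEL=socket
export PIOM_DEDICATED_DISTRIB=first
mpiexec -np 2 mdsDriver -if input.h5 -r 960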

5.2.3 Progression hooks in StarPU

Another way to improve communication progress is to insert hooks into StarPU worker threads. A hook mechanism was already present in StarPU: these hooks can be called after every task. In addition to this mechanism, we added another form of hooks that are called every time a worker is idle.

Environment variables related to StarPU hooks:

  • STARPU_MPI_NMAD_PROG_HOOKS: register a hook for polling after each task.
  • STARPU_MPI_NMAD_IDLE_HOOKS: register a hook for polling when a CPU is idle.

Possible values are (see the example below):

  • FORCED: poll all tasks from all queues.
  • HOOK: poll all tasks from the local queue.
  • SINGLE: poll one task from the local queue.
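
A hedged example combining both hooks (the values are arbitrary choices for illustration; mdsDriver, the input file and the rank are placeholders):

export STARPU_MPI_NMAD_PROG_HOOKS=SINGLE
export STARPU_MPI_NMAD_IDLE_HOOKS=FORCED
mpiexec -np 2 mdsDriver -if input.h5 -r 960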

These mechanisms are now implemented, but there is no global strategy yet, as we have not been able to do further testing because of several limitations presented later.

5.2.4 Chameleon benchmarks GEMM algorithm

At the beginning of the project, the task-based matrix multiplication (GEMM) algorithm was not efficient when performed on several nodes with MPI. Mathieu Faverge has developed a SUMMA version and we have performed benchmarks that assess the scalability of the algorithm. This new version of the algorithm has been integrated into the Chameleon master branch (stable).

dgemm_openmpi_occigen_reu15022019.png

Figure 25: DGEMM performances on occigen with OpenMPI

Notice that performance is highly impacted by the data placement; the run parameters were the following (a reproduction sketch follows the list):

  • dgemm
  • M=K=57600 (180*320), N=1280 (4*320)
  • 9 nodes, 2D block-cyclic distribution, PxQ=3x3 and PxQ=9x1
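
Such a run could be sketched with the chameleon_dtesting driver used later in this document (the thread count per node and the absence of an explicit process grid option are assumptions that depend on the machine and the Chameleon version):

mpiexec -np 9 chameleon_dtesting -t 24 -o gemm -b 320 -m 57600 -k 57600 -n 1280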

trace_dgemm_3x3.png

Figure 26: DGEMM trace, 2D block cyclic = 3x3

trace_dgemm_9x1.png

Figure 27: DGEMM trace, 2D block cyclic = 9x1

5.2.5 FMR "Chameleon" Dense Wrapper

FMR is a library developed by Olivier Coulaud in the HiePACS team that provides random SVD algorithms. Diodon, the metabarcoding application software, needs FMR to compute SVDs on very large matrices.

FMR encapsulates dense linear algebra calls in "DenseWrapper" classes, allowing different linear algebra implementations to be used generically in the FMR algorithms.

Thus we have developed a ChameleonDenseWrapper class providing some BLAS features through the Chameleon library (initialization, destruction, copy, GEMM, QR), and we have adapted some other classes and subroutines to be able to use either BlasDenseWrapper (centralized data) or ChameleonDenseWrapper (centralized or distributed data).

5.2.6 Task-based distance matrix reading

One of the first operations performed in Diodon, the application software, is to read a square distance matrix A from an HDF5 file. To avoid centralizing the matrix, as was done until then, we have added the possibility to read the distance matrix with the "map" Chameleon routine. This routine applies the same function to all tiles of a Chameleon matrix. It needs some user data (the HDF5 file, for example) and a generic function that defines how to read the sub-block of the matrix corresponding to a tile from the HDF5 file and fill that Chameleon tile. The algorithm is trivially parallel and loads the matrix into memory in a distributed way.

This parallel I/O algorithm has shown better performance than reading the HDF5 file sequentially, but it is not scalable. The cost of the tasks reading subsets with the HDF5 "read" routine should be studied in more detail later.

5.2.7 Task-based algorithm of the Gram matrix

After reading the distance matrix, a pretreatment is performed to transform it into a "Gram" matrix, A -> G. This was previously computed in Diodon with a hand-made algorithm parallelized with OpenMP.

We have developed a task-based algorithm, integrated into Chameleon, to perform this operation in a parallel and distributed way.

5.2.8 Random values generation

The RSVD algorithm requires generating a random matrix Ω. The centralized FMR uses the std::normal_distribution routine. Chameleon provides a parallel algorithm for this, PLRNT, so we directly call this routine to generate Ω. We should investigate in more detail whether this routine yields similar results in terms of Random SVD quality.

5.2.9 Integration and test in Diodon

After having developed all these basic components, we have plugged the FMR+Chameleon stack into Diodon by modifying the CMake project and making some adjustments to the "mdsDriver" source code so that it works with the FMR ChameleonDenseWrapper and MPI.

Finally, tests on PlaFRIM have shown that we are now able to perform the beginning of the random SVD algorithm (HDF5 file reading, Gram matrix computation, first matrix product Y = G Ω on the large matrix) with the FMR+Chameleon+StarPU+NewMadeleine stack. The trace on the L6 testcase, of about 100K rows, shows that the Gram algorithm, the generation of Ω and the GEMM perform well on one and two nodes. For this matrix size (M=99594, N=960) the number of tasks in the GEMM does not justify taking more than two nodes: the GEMM performance is good and the CPU time is largely dominated by the I/O.

Table 7: Performance in seconds
Operations BLAS (1 PU) Chameleon 2 nodes (48 PU) Chameleon 3 nodes (72 PU) Chameleon 5 nodes (120 PU)
Read Distance 511 150 150 89
Gram 160 14 12 8
GEMM 1173 17 12 9

See in the following traces the performance of each step:

  • Reading the HDF5 file in cyan
  • Gram in yellow
  • Generating Ω in cyan (zoom in to see it)
  • GEMM in green

L6_N2.jpg

Figure 28: L6 testcase trace with Chameleon on 2 miriel nodes

L6_N2_z1.jpg

Figure 29: L6 testcase trace with Chameleon on 2 miriel nodes, zoom after reading, Gram in yellow

L6_N2_z2.jpg

Figure 30: L6 testcase trace with Chameleon on 2 miriel nodes, zoom on Ω generation in cyan

L6_N3.jpg

Figure 31: L6 testcase trace with Chameleon on 3 miriel nodes with MPI communications

5.3 Reproducible experiment

5.3.1 StarPU and New Madeleine (T1)

Fixed versions for this release

  • StarPU: latest release 1.3.2
  • PM2: we use the latest release (2019-05-13, revision 28048) defined as a guix package in the guix-hpc repo. PM2 depends on revision 5248 of the external PadicoTM repo.

5.3.2 Diodon, FMR and Chameleon (T6)

  1. Prerequisites

    Guix is the package manager used to set up the environment and install the software stack. Chameleon, StarPU, New Madeleine and their dependencies will be installed with Guix. To get the packages one must have access to the guix-hpc non-free repository (for Intel MKL).

    The other software, FMR and Diodon, will be compiled directly with cmake because it is not free software and cannot be downloaded freely (authentication is required). To get FMR and Diodon one must have access to their git repositories (see the clone commands below).

    Other requirements:

    • an account on plafrim
    • an account on gitlab.inria.fr with your ssh key configured

    Connect to plafrim.

    ssh plafrim
    

    Define git commits to use

    STARPU_COMMIT=61b7dc1f262b4fb2d1114523fea49d9f8bf78d87
    CHAMELEON_COMMIT=b10ae63a184064328c86c60320c57fd241a2ea81
    FMR_COMMIT=4d6fb42f881a36ffa88baaffd4d02ea88d075caf
    DIODON_COMMIT=b83ce759e083c3019eeb0643b34f914ab814e677
    

    The New Madeleine stack uses SVN, which guix cannot handle, so New Madeleine commits are fixed in the definition of the guix packages. As a result, the proper version of New Madeleine depends on the guix-hpc project state. We have saved the guix commit state into the file channels_gordon-v1.scm, which can be found beside this v1.org file.
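
    For the record, such a channels file can be regenerated from a working environment with guix describe:

    guix describe -f channels > channels_gordon-v1.scm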

    Define the working directory to use

    GORDONDIR=$HOME/gordon
    mkdir -p $GORDONDIR
    

    Clone the adt-gordon git repository

    git clone git@gitlab.inria.fr:adt-gordon/adt-gordon.git $GORDONDIR/adt-gordon
    

    This requires that your local ssh key is enabled even on plafrim. If it is not the case you can, for example, add ForwardAgent yes for this host in your local $HOME/.ssh/config file, as sketched below.
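
    For example, a minimal sketch (adapt the host name to your own SSH configuration):

    cat >> $HOME/.ssh/config <<EOF
    Host plafrim
        ForwardAgent yes
    EOF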

    Clone the guix-hpc-non-free project

    git clone git@gitlab.inria.fr:guix-hpc/guix-hpc-non-free.git $GORDONDIR/guix-hpc-non-free/
    

    Setup Guix version to the fixed state for this version

    sed -i -e "s#/tmp#$GORDONDIR#g" $GORDONDIR/adt-gordon/releases/v1/channels_gordon-v1.scm
    guix pull -C $GORDONDIR/adt-gordon/releases/v1/channels_gordon-v1.scm
    

    Clone the FMR source code

    git clone --recursive -b gemm_test git@gitlab.inria.fr:piblanch/fmr.git $GORDONDIR/fmr
    

    Clone the Diodon source code

    git clone --recursive -b gemm_test git@gitlab.inria.fr:afranc/diodon.git $GORDONDIR/diodon
    
  2. Build the Diodon software stack
    1. Prepare scripts

      This is the definition of scripts to use for building.

      if [ "$1" == "openmpi" ]
      then
        guix environment -v 2 --pure --preserve=^SLURM fmr --with-input=chameleon=chameleon-fxt --with-commit=starpu-fxt=$3 --with-commit=chameleon-fxt=$4 --with-input=openblas=mkl --ad-hoc slurm git coreutils inetutils util-linux procps grep tar sed gzip which gawk perl zlib openssh mkl hwloc openmpi starpu-fxt chameleon-fxt -- /bin/bash --norc diodon_build_sub.sh "$@"
      elif [ "$1" == "nmad" ]
      then
        guix environment -v 2 --pure --preserve=^SLURM fmr --with-input=chameleon=chameleon-fxt --with-commit=starpu-fxt=$3 --with-commit=chameleon-fxt=$4 --with-input=openblas=mkl --with-input=openmpi=nmad --ad-hoc slurm git coreutils inetutils util-linux procps grep tar sed gzip which gawk perl zlib openssh mkl hwloc nmad starpu-fxt chameleon-fxt -- /bin/bash --norc diodon_build_sub.sh "$@"
      else
        echo "Error: first parameter should be a supported MPI implementation, openmpi or nmad."
      fi
      
      MPI=$1
      GORDONDIR=$2
      FMR_DIR=$GORDONDIR/fmr
      DIODON_DIR=$GORDONDIR/diodon/cpp
      
      BUILD_DIR="build-$MPI"
      cd $FMR_DIR
      git checkout $5
      rm -rf $BUILD_DIR
      mkdir $BUILD_DIR
      cd $BUILD_DIR
      cmake .. -DCMAKE_INSTALL_PREFIX=$PWD/install -DFMR_USE_HDF5=ON -DFMR_USE_CHAMELEON=ON
      make -j5 install
      
      cd $DIODON_DIR
      git checkout $6
      rm -rf $BUILD_DIR
      mkdir $BUILD_DIR
      cd $BUILD_DIR
      cmake .. -DDIODON_USE_CHAMELEON=ON -DDIODON_USE_INTERNAL_FMR=OFF -DFMR_INCLUDE_DIR=$FMR_DIR/$BUILD_DIR/install/include
      make -j5
      
    2. Build
      1. OpenMPI stack

        Build the stack using the scripts

        cd $GORDONDIR/adt-gordon/releases/v1/
        MPI=openmpi
        ./diodon_build.sh $MPI $GORDONDIR $STARPU_COMMIT $CHAMELEON_COMMIT $FMR_COMMIT $DIODON_COMMIT
        
      2. New Madeleine stack

        Build the stack using the scripts

        cd $GORDONDIR/adt-gordon/releases/v1/
        MPI=nmad
        ./diodon_build.sh $MPI $GORDONDIR $STARPU_COMMIT $CHAMELEON_COMMIT $FMR_COMMIT $DIODON_COMMIT
        

5.3.3 Execute Diodon

  1. Prepare scripts

    This is the definition of scripts to use for performing the experiments.

    #SBATCH --constraint MirielOPA
    #SBATCH --exclusive
    #SBATCH --ntasks-per-node=1
    #SBATCH --threads-per-core=1
    
    # to avoid a lock during fetching chameleon branch in parallel
    export XDG_CACHE_HOME=/tmp/guix-$$
    
    # execute our bench script in a pure guix environment
    if [ "$1" == "openmpi" ]
    then
      exec guix environment -v 2 --pure --preserve=^SLURM fmr --with-input=chameleon=chameleon-fxt --with-commit=starpu-fxt=$3 --with-commit=chameleon-fxt=$4 --with-input=openblas=mkl --ad-hoc slurm coreutils inetutils util-linux procps grep tar sed gzip which gawk perl zlib openssh mkl hwloc openmpi starpu-fxt chameleon-fxt -- /bin/bash --norc diodon_run.sh "$@"
    elif [ "$1" == "nmad" ]
    then
      exec guix environment -v 2 --pure --preserve=^SLURM fmr --with-input=chameleon=chameleon-fxt --with-commit=starpu-fxt=$3 --with-commit=chameleon-fxt=$4 --with-input=openblas=mkl --with-input=openmpi=nmad --ad-hoc slurm coreutils inetutils util-linux procps grep tar sed gzip which gawk perl zlib openssh mkl hwloc nmad starpu-fxt chameleon-fxt -- /bin/bash --norc diodon_run.sh "$@"
    else
      echo "Error: first parameter should be a supported MPI implementation, openmpi or nmad."
    fi
    
    # clean tmp
    rm -rf /tmp/guix-$$
    

    Diodon execution shell script called by the slurm script

    MPI=$1
    GORDONDIR=$2
    HDF5FILE=$5
    RANK=$6
    TRACEDIR=$7
    mkdir -p $TRACEDIR
    
    NT=23
    
    IS_NMAD=`mpiexec -h |grep NMAD |head -n 1`
    if [[ ! -z "$IS_NMAD" ]]
    then
      export MPI_FLAGS="-DNMAD_DRIVER=psm2 -DLD_LIBRARY_PATH=${GUIX_ENVIRONMENT}/lib"
      export PIOM_DEDICATED=1
      export PIOM_DEDICATED_WAIT=1
    fi
    
    STARPU_FXT_PREFIX=$TRACEDIR/ $GUIX_ENVIRONMENT/bin/mpiexec $MPI_FLAGS -np $SLURM_NPROCS hwloc-bind machine:0 -- "$GORDONDIR/diodon/cpp/build-$MPI/demos/mdsDriver" -if $HDF5FILE -r $RANK -t $NT
    
    $GUIX_ENVIRONMENT/bin/starpu_fxt_tool -i $TRACEDIR/prof_file_${USER}_* -o $TRACEDIR/paje.trace
    
  2. Execute experiments

    Define where the HDF5 matrices are stored on lustre for better I/O performance

    MATRIXDIR=/lustre/diodon
    mkdir -p $MATRIXDIR
    cp -n /projets/diodon/malabar/june2017/rbcL/10V-RbcL_S74.h5 $MATRIXDIR
    cp -n /projets/diodon/diatoms/L6.h5 $MATRIXDIR
    

    First testcase of size 20000 on one node with OpenMPI

    cd $GORDONDIR/adt-gordon/releases/v1
    MPI=openmpi
    TESTCASE=10V-RbcL_S74
    HDF5FILE=$MATRIXDIR/$TESTCASE.h5
    NP=1
    TIME=00:05:00
    JOB_NAME=$TESTCASE\_$NP
    RANK=315
    TRACEDIR=/lustre/$USER/diodon/trace/$TESTCASE
    rm /lustre/$USER/diodon/trace/$TESTCASE/prof_file_*
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" --nodes=$NP --time=$TIME -p court diodon.sl $MPI $GORDONDIR $STARPU_COMMIT $CHAMELEON_COMMIT $HDF5FILE $RANK $TRACEDIR
    squeue -u $USER
    

    Second testcase of size 100000 on two nodes with OpenMPI

    cd $GORDONDIR/adt-gordon/releases/v1
    MPI=openmpi
    TESTCASE=L6
    HDF5FILE=$MATRIXDIR/$TESTCASE.h5
    NP=2
    TIME=00:10:00
    JOB_NAME=$TESTCASE\_$NP
    RANK=955
    TRACEDIR=/lustre/$USER/diodon/trace/$TESTCASE
    rm /lustre/$USER/diodon/trace/$TESTCASE/prof_file_*
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" --nodes=$NP --time=$TIME -p court diodon.sl $MPI $GORDONDIR $STARPU_COMMIT $CHAMELEON_COMMIT $HDF5FILE $RANK $TRACEDIR
    squeue -u $USER
    

    Third testcase of size 100000 on two nodes with NewMadeleine

    cd $GORDONDIR/adt-gordon/releases/v1
    MPI=nmad
    TESTCASE=L6
    HDF5FILE=$MATRIXDIR/$TESTCASE.h5
    NP=2
    TIME=00:10:00
    JOB_NAME=$TESTCASE\_$NP
    RANK=955
    TRACEDIR=/lustre/$USER/diodon/trace/$TESTCASE
    rm /lustre/$USER/diodon/trace/$TESTCASE/prof_file_*
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" --nodes=$NP --time=$TIME -p court diodon.sl $MPI $GORDONDIR $STARPU_COMMIT $CHAMELEON_COMMIT $HDF5FILE $RANK $TRACEDIR
    squeue -u $USER
    

    Once jobs have terminated, one can print outputs

    cd $GORDONDIR/adt-gordon/releases/v1
    cat *.out
    
  3. Visualizing paje trace with Vite

    One can also visualize the paje trace generated through StarPU with the ViTE software. Installation example with apt for a Debian operating system

    sudo apt-get install -y vite
    
    cd /tmp
    PLAFRIMUSER=$USER
    TESTCASE=10V-RbcL_S74
    TRACEDIR=/lustre/$PLAFRIMUSER/diodon/trace/$TESTCASE
    scp plafrim:$TRACEDIR/paje.trace .
    

    One can then visualize the trace with vite (do not run this in emacs, do it in your own shell!)

    vite /tmp/paje.trace &
    

    And if you are brave, try with TESTCASE=L6! The trace size is 1.4 GB.

    cd /tmp
    PLAFRIMUSER=$USER
    TESTCASE=L6
    TRACEDIR=/lustre/$PLAFRIMUSER/diodon/trace/$TESTCASE
    scp plafrim:$TRACEDIR/paje.trace .
    
    vite /tmp/paje.trace &
    

5.3.4 Auxiliary

Regenerate the scripts in tangle blocks if they are not up-to-date

cd $GORDONDIR/adt-gordon/releases/v1
guix package -i emacs
echo v1.org | ~/.guix-profile/bin/emacs v1.org --batch -f org-babel-tangle-file

6 Release: Operating version

6.1 Initial objectives

This document aims at validating the second release (out of four) of the Gordon ADT. This release is the "Operating version". The objective is to get a working version of the Diodon software stack using the efficient parallel GEMM and QR algorithms of Chameleon. MadMPI should scale over hundreds of nodes and StarPU should have consolidated its scheduling modules.

6.2 Work done to achieve this release

6.2.1 Using IB for the control channel

The control channel is established when launching a NewMadeleine program. Its purpose is to exchange information between nodes in order to build a consistent network stack across them. This channel is opened between every node and the master node and is basically a TCP connection. One limitation of TCP is that we cannot open an arbitrarily large number of connections because of the system cap on file descriptors. To overcome this cap, the idea was to use Infiniband instead of TCP for the control channel. Every IB connection uses a unique address, so the idea was to exchange such addresses via the PMI2 "key-value" space for every connection. The problem is that before retrieving an address from this space, every node needs to perform a fence operation; so for every connection we need a global synchronization, which makes this solution unsuitable. Such a mechanism will become possible with PMIx (similar to PMI2 but more oriented towards exascale usage), whose specification states that fence operations are no longer needed. We can still use IB for the control channel, but we have to use UDP to exchange the addresses. Doing so reduces the number of file descriptors used by the master node (n - 1 fds), which was one of the limitations of the previous implementation. When using UDP to exchange IB addresses, only one socket is used by each node.

6.2.2 Benchmarks

Some problems were encountered while trying to benchmark the previous developments on Occigen:

  • when exchanging IB addresses with UDP, an ACK is sent when an address is received. However, it seems that many of these ACKs are dropped by the network and some nodes wait indefinitely for them.
  • the network cards used in Occigen need a synchronization after connection. When a message is received too soon, an error appears.

These two issues can be fixed by using IBUP (Infiniband Verbs Unreliable Datagram) instead of UDP to exchange IB addresses (cf. Alexandre).

6.2.3 Integration of the Chameleon QR algorithm with hierarchical trees in FMR

The classical QR algorithm in Chameleon involved too many communications, with a huge impact on performance when using multiple nodes. A new QR algorithm with hierarchical trees has been integrated into Chameleon. The FMR QR class has been extended to call this Chameleon algorithm, and a unit test has been developed in FMR, just like for the GEMM.

6.2.4 Assembling a large matrix from several HDF5 files

The testcases coming from Inra, distance matrices, are stored in HDF5 files. The largest testcase Inra can provide is about a 100 000 by 100 000 matrix. This size is not so challenging, and the goal of the project is to solve problems ten times larger. From Inra's point of view, the interest is to be able to perform studies on several large samples whose distance information is stored in different HDF5 files. We could construct the associated large problem and store the big matrix in one HDF5 file, but this would be a waste of time and disk storage since the data is already stored in existing files (existing distance matrices of sub-samples). A new way of reading the data to construct the A matrix has been developed, able to assemble a big problem by reading the data from different files. Discussions between Inra and Inria have led to a format, i.e. a txt file, defining the large problem as a function of the existing files. This development has been validated with sequential BLAS and in parallel with Chameleon, so that we are able to read a Chameleon tile from several HDF5 files even if the tile straddles different files.

6.2.5 Generation of larger testcases

Inra is producing the missing matrices that will allow testing the Diodon software on large testcases (N > 100 000). This work is not finished yet.

6.2.6 Experiments on CURTA supercomputer

The goal is also to prove that the Diodon software and its dependencies (FMR, Chameleon, StarPU, MadMPI) can be easily installed and used on real-life supercomputers. Experiments have been performed on CURTA, and the next section provides a hands-on session demonstrating how to install the whole software stack, with usage examples.

6.3 Experiments on MCIA CURTA

Let's assume we have access to the CURTA supercomputer and that we will work in the directory "gordon" once logged in.

SSH connection to CURTA; define an appropriate "curta" alias depending on how you access the cluster and on your SSH configuration

ssh curta

Define the working directory to use and move the matrices into it

WORKDIR=$HOME/gordon
mkdir -p $WORKDIR

From your local computer, copy the input matrices to CURTA

scp plafrim:/projets/diodon/atlas_guyane_trnH/atlas_guyane_trnH.h5 . && scp atlas_guyane_trnH.h5 curta:gordon/
scp plafrim:/projets/diodon/malabar/june2017/rbcL/10V-RbcL_S74.h5 . && scp 10V-RbcL_S74.h5 curta:gordon/
scp plafrim:/projets/diodon/diatoms/L6.h5 . && scp L6.h5 curta:gordon/

6.3.1 Install the software

Guix was not installed on CURTA at that time, so we installed the software components one by one.

  1. Environment on CURTA

    Define and load the software environment

    module purge
    module add cmake/3.14.3
    module add compiler/intel/2019.3.199
    module add gcc/7.3.0
    export PATH=$HOME/install/bin:$PATH
    export CPATH=$HOME/install/include:$CPATH
    export LIBRARY_PATH=$HOME/install/lib:$LIBRARY_PATH
    export LD_LIBRARY_PATH=$HOME/install/lib:$LD_LIBRARY_PATH
    export PKG_CONFIG_PATH=$HOME/install/lib/pkgconfig:$PKG_CONFIG_PATH
    
  2. Install fxt
    cd $WORKDIR
    wget http://download.savannah.nongnu.org/releases/fkt/fxt-0.3.9.tar.gz
    tar xf fxt-0.3.9.tar.gz
    cd fxt-0.3.9/
    ./configure --prefix=$WORKDIR/install
    make -j5 install
    
  3. Install hwloc
    cd $WORKDIR
    wget https://download.open-mpi.org/release/hwloc/v2.0/hwloc-2.0.4.tar.gz
    tar xf hwloc-2.0.4.tar.gz
    cd hwloc-2.0.4/
    ./configure --prefix=$WORKDIR/install
    make -j5 install
    
  4. Install openmpi
    cd $WORKDIR
    wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.1.tar.bz2
    tar xf openmpi-4.0.1.tar.bz2
    cd openmpi-4.0.1/
    ./configure --enable-mpi-ext=affinity --with-sge --with-hwloc=external --with-libevent --enable-openib-control-hdr-padding --enable-openib-udcm --enable-openib-rdmacm --enable-openib-rdmacm-ibaddr --with-pmi --prefix=$WORKDIR/install
    make -j5 install
    
  5. Install hdf5
    cd $WORKDIR
    wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.5/src/hdf5-1.10.5.tar.gz
    tar xf hdf5-1.10.5.tar.gz
    cd hdf5-1.10.5
    ./autogen.sh
    CFLAGS="-fPIC" CXXFLAGS="-fPIC" ./configure --enable-cxx --enable-fortran --enable-threadsafe --with-pthread --enable-unsupported --prefix=$WORKDIR/install
    make -j5 install
    
  6. Install starpu
    cd $WORKDIR
    wget http://starpu.gforge.inria.fr/files/starpu-1.3.2/starpu-1.3.2.tar.gz
    tar xf starpu-1.3.2.tar.gz
    cd starpu-1.3.2/
    ./configure --enable-blas-lib=none --disable-mlr --with-fxt --prefix=$WORKDIR/install
    make -j5 install
    
  7. Install chameleon
    cd $WORKDIR
    git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
    cd chameleon
    git checkout 8e2330472ca8042466aa4b516307d65df600f0ee
    mkdir -p build
    cd build
    cmake .. -DCHAMELEON_USE_MPI=ON -DBLA_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_Fortran_COMPILER=gfortran -DBUILD_SHARED_LIBS=ON -DCMAKE_INSTALL_PREFIX=$WORKDIR/install
    make -j5 install
    
  8. Install fmr
    cd $WORKDIR
    git clone --recursive git@gitlab.inria.fr:piblanch/fmr.git
    cd fmr
    git checkout diodon
    git checkout a829d393f1c893488daf5d6c955663a6cf822961
    mkdir -p build
    cd build
    cmake .. -DFMR_USE_HDF5=ON -DFMR_USE_MKL_AS_BLAS=ON -DFMR_USE_CHAMELEON=ON -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=$WORKDIR/install
    make -j5 install
    
  9. Install diodon
    cd $WORKDIR
    git clone git@gitlab.inria.fr:afranc/diodon.git
    cd diodon/cpp
    git checkout 91127b500e4334d7ddaa3003ed19c14bc9e2f186
    mkdir -p build
    cd build
    cmake .. -DDIODON_USE_INTERNAL_FMR=OFF -DDIODON_USE_CHAMELEON=ON -DBLA_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_Fortran_COMPILER=gfortran -DCMAKE_PREFIX_PATH=$WORKDIR/install
    make -j5
    
  10. Install the stack with NewMadeleine

    NewMadeleine can be used to replace OpenMPI.

    cd $WORKDIR
    svn checkout https://scm.gforge.inria.fr/anonscm/svn/pm2/trunk pm2/
    cd pm2/
    svn up -r28444
    cd scripts/
    ./pm2-build-packages ./madmpi.conf --prefix=$HOME/install-nmad
    

    We can then install all the previous modules (StarPU, Chameleon, etc.) linked against NewMadeleine in the install-nmad directory.

6.3.2 Run experiments

  1. Some examples of Diodon execution
    1. Small test atlas_guyane_trnH
      TESTCASE=atlas_guyane_trnH
      FILENAME=$WORKDIR/$TESTCASE.h5
      RANK=1502
      NP=1
      salloc --nodes=$NP --tasks-per-node=1 --cpus-per-task=32 --constraint=compute
      STARPU_FXT_PREFIX=$HOME/ mpiexec --map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$SLURM_CPUS_PER_TASK -x STARPU_FXT_PREFIX --output-filename $TESTCASE-$RANK-$NP $WORKDIR/diodon/cpp/build/demos/mdsDriver -if $FILENAME -r $RANK -os 0 -t 30 -writeOutput 0
      

      To connect to a node after an "salloc"

      srun --pty --jobid <jobid> /bin/bash -i
      
    2. Medium testcase 10V-RbcL_S74
      TESTCASE=10V-RbcL_S74
      FILENAME=$WORKDIR/$TESTCASE.h5
      RANK=960
      NP=1
      salloc --nodes=$NP --tasks-per-node=1 --cpus-per-task=32 --constraint=compute
      STARPU_FXT_PREFIX=$HOME/ mpiexec --map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$SLURM_CPUS_PER_TASK -x STARPU_FXT_PREFIX --output-filename $TESTCASE-$RANK-$NP $WORKDIR/diodon/cpp/build/demos/mdsDriver -if $FILENAME -r $RANK -os 0 -t 30 -writeOutput 0
      
    3. Large testcase L6
      TESTCASE=L6
      FILENAME=$WORKDIR/$TESTCASE.h5
      RANK=9600
      NP=2
      salloc --nodes=$NP --tasks-per-node=1 --cpus-per-task=32 --constraint=compute
      mpiexec --map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$SLURM_CPUS_PER_TASK -x STARPU_FXT_PREFIX --output-filename $TESTCASE-$RANK-$NP $WORKDIR/diodon/cpp/build/demos/mdsDriver -if $FILENAME -r $RANK -os 0 -t 30 -writeOutput 0
      
  2. Results on the L6 testcase

    We present the CPU time of the different stages of the Diodon Random SVD on CURTA "compute" nodes:

    Table 8: CPU times (s) on the L6 testcase on CURTA with OpenMPI
    # Nodes 1 (bigmem) 2 3 4 5 6 7
    READ HDF5 (NxN) 377 204 137 107 81 70 61
    GRAM (NxN) 20 12 9 8 7 7 7
    GEMM (NxN * NxK) 158 81 68 41 40 28 26
    Perfs GEMM(Gflop\s) 1200 2349 2783 4583 4662 6580 7289
    QR (NxK) 27 15 13 11 12 11 11
    Perfs QR (Gflop\s) 637 1182 1316 1526 1437 1510 1509
    SVD (KxK) 270 153 153 145 153 152 154
    RSVD 764 498 381 304 311 272 267
    Table 9: CPU times (s) on the L6 testcase on CURTA with NewMadeleine
    # Nodes 1 (bigmem) 2 3 4 5 6 7
    READ HDF5 (NxN) 377 210 134 102 84 68 59
    GRAM (NxN) 20 13 10 8 8 8 7
    GEMM (NxN * NxK) 158 107 71 69 45 42 51
    Perfs GEMM(Gflop\s) 1200 1772 2663 2743 4204 4490 3664
    QR (NxK) 27 17 13 10 14 13 13
    Perfs QR (Gflop\s) 637 1011 1304 1656 1255 1329 1365
    SVD (KxK) 270 170 153 143 153 154 154
    RSVD 764 525 380 339 324 308 314

    results.png

    Figure 32: CPU times (s) on the L6 testcase on CURTA - Compare OpenMPI/Nmad from 1 to 7 nodes

    Characteristics:

    • CURTA "compute" nodes: SD530, 2 Intel Xeon Gold SKL-6130 processors @ 2.1 GHz (32 cpus per node), 96 GB RAM
    • CURTA "bigmem" node: SR950, 4 Intel Xeon Gold SKL-6130 processors @ 2.1 GHz (64 cpus), 3 TB RAM
    • N = 99594
    • K = 9600
    • we use 30 threads on each node to reserve one thread for the scheduler and one thread for MPI communications
    • RSVD CPU time is made of several GEMM and QR operations distributed over the nodes thanks to Chameleon and one SVD performed in parallel multithreaded on one node with MKL. It does not take into account the preprocessing part, made of HDF5 file reading and GRAM treatment.

    For the one-node case we had to use another kind of node with more memory, from the "bigmem" partition. Why the MKL SVD with 30 threads does not show the same performance on the "bigmem" and "compute" nodes is not well understood. Maybe this is because the execution is done on a shared node that cannot be reserved exclusively.

    Here is an execution trace of Diodon with K=960 (the trace was too big with K=9600). There are some idle parts between some of the algorithms that should be investigated more carefully once the testcases become harder to solve. These idle parts may be related to FxT trace recording through StarPU.

    L6.jpg

    Figure 33: Diodon RSVD trace, L6, K=960, QxP=2x1

7 Release: Production version

7.1 Objectives

The goal is to deliver a release of the metabarcoding application Diodon ready to solve a Multidimensional Scaling problem efficiently on a distance matrix of size one million.

The main challenge is to change the parallelism paradigm in Diodon, from a shared memory parallelism to a distributed one. A second challenge is to be able to reuse existing data to study very large datasets. We have to be able to build our distance matrices based on existing submatrices stored on disk in dozens of HDF5 files.

Several steps have to be distributed:

  1. Reading of the distance matrix \(D\) stored in several (55) hdf5 files
  2. Pre-treatment i.e. compute the Gramian matrix \(G\) from \(D\)
  3. Random Singular Value Decomposition of \(G\): $$G = U \Sigma V^t$$
  4. Post-treatments i.e. compute the cloud of points \(X\): $$ X = U^+ \begin{pmatrix} \sigma^+_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma^+_k \end{pmatrix}, \sigma^+ \text{the positive eigenvalues, } U^+ \text{the associated vectors} $$
  5. Writing of \(X\) in a hdf5 file.

7.2 Work done to achieve this release

7.2.1 Chameleon asynchronous interface

Chameleon's asynchronous interface avoids synchronization barriers from the runtime used internally by Chameleon. Hence the subtasks of Chameleon algorithms called in sequence can be pipelined, yielding better performance. This asynchronous interface has been exposed in FMR: a new function, async_manager, has been added to FMR that can be called to handle the objects required by the Chameleon interface, request a synchronization, etc.

Unfortunately, the GEMM and QR algorithms are not asynchronous in Chameleon for now: some temporary workspaces are used inside the supposedly asynchronous parallel algorithms and are not exposed in the interface. These workspaces must be cleaned up before leaving the function, which requires a barrier waiting for the completion of the tasks.

In addition, GRAM involves accumulating sums of squares of values over entire columns. These operations are not expensive, but they require all the data to be available before performing the costly computations, that is, the update of the values with the sums of squares. This means that pipelining the matrix building and GRAM is not obvious. Additional work is needed to really take advantage of the asynchronous interface calls and make the overall MDS algorithm pipelined with Chameleon/StarPU.
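
At run time, the asynchronous interface and the task submission window are controlled through environment variables, as set in the Slurm script of section 7.3.2:

export CHAMELEON_ASYNC=1
export STARPU_LIMIT_MIN_SUBMITTED_TASKS=6400
export STARPU_LIMIT_MAX_SUBMITTED_TASKS=7400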

7.2.2 Handle symmetric matrix format

The input distance matrix is symmetric. Until now we read the symmetric part of the matrix, filled a full Chameleon matrix, and performed GRAM, GEMM, etc. on the full matrix. To optimize memory usage we have adapted our matrix assembly and the Chameleon calls to handle a symmetric format for the distance matrix. This revealed that Chameleon's SYMM is three times slower than the GEMM for large matrices of size one million; this algorithm should be optimized later. The same observation holds for GRAM. For that reason we use the full matrix format and the GEMM algorithm in the benchmarks presented hereafter.

7.2.3 HDF5 files parallel reading optimizations

In the previous release the HDF5 file reading was performed with the Map function of Chameleon. All worker threads of every MPI process read the tiles in parallel following the 2D block-cyclic distribution of the Chameleon matrix (P=NP, Q=1). The problem is that for large matrices, and when increasing the number of resources, the reading performance drops so much that the reading step represented 90% of the total time (a reading time 10 times the RSVD time with rank=10000). See table 10 and figure 34 for the performance results at that time.

Table 10: Diodon MDS on occigen HSW24, TS=320, R=960; the UC, the size and the number of nodes used are given in parentheses
UC Size MPI NP Read H5 Pre RSVD Post Total
L6 (99594-2) 99594 2 347 5 32 5 389
L2L3L6 (270983-9) 270983 9 1478 14 56 18 1565
Lodd (426548-25) 426548 25 2638 26 55 36 2755
Leven (616644-50) 616644 50 5299 57 64 38 5458
Lall (1043192-150) 1043192 150 5912 161 178 45 6296

mds_occigen_gemm_ts320_r960.png

Figure 34: Diodon MDS on occigen HSW24, TS=320, R=960; the UC, the size and the number of nodes used are given in parentheses

Let's discuss how we optimized this step.

First, the HDF5 library is not guaranteed thread-safe and there is a lock in HDF5 that prevents calling it from multiple threads. We get better performance by handling all the reading tasks on a single CPU worker. We have modified Chameleon so that, thanks to StarPU, it can force the execution of all the tasks of one algorithm on a specific worker. See in table 11 the behavior of reading the matrix with a single HDF5 call versus multiple calls through the Chameleon Map function, tile by tile, using one or several threads and one or several MPI processes.

Table 11: Performance of HDF5 file 10V-RbcL_S74 reading, N=23214, Data Size=4111 MB, on 4 PlaFRIM miriel nodes using the /beegfs disk partition
Mode Time (s) Perf (MB/s)
HDF5 P=1 T=1 18 228
Chameleon P=1 T=22 50 88
Chameleon P=1 T=1 23 178
Chameleon P=2 T=1 12 342
Chameleon P=3 T=1 8 514
Chameleon P=4 T=1 6 685

Second, we have observed that reading our matrix from several HDF5 files impacts performance. One can see in table 12 that reading a matrix from three files instead of one involves a performance loss of more than a factor of 2. We can also see the impact of the tile size: increasing it improves performance and reduces the gap a little. This is just for testing; of course we will use the tile size best suited to costly linear algebra algorithms such as GEMM and QR, and 320 is a fine candidate.

Table 12: Example times (s) for reading a 166501-dof matrix by Chameleon tiles on PlaFRIM bora
Tile Size L10.h5 L10_3.txt
320 86 183
480 79 148
640 80 129
800 88 117
960 74 111

Finally, reading tile by tile from the different MPI processes in multiple HDF5 files at the same time leads to poor performance. Indeed, following the 2D block-cyclic distribution, all MPI processes read tiles in every file in parallel, causing many concurrent accesses. We performed some tests modifying the initial distribution of Chameleon tiles in order to limit this behavior: the time was reduced, but not enough to be satisfactory. We have done many experiments outside our Diodon application in order to find an efficient algorithm for reading our dozens of HDF5 files and building the distributed Chameleon matrix by tiles. The result is an algorithm that first reads large blocks of rows following a simple 1D non-cyclic block distribution, i.e. splitting the set of HDF5 files into blocks of rows allocated in a balanced way over the MPI processes (and nodes). This way there are far fewer calls to the HDF5 read routine, each performed on a much larger data block and with a limited number of different MPI processes handling each file. See figure 35, which shows the time spent by each MPI process to read a testcase of size 426548.

readh5multi8_Limpair_15-45.png

Figure 35: Read H5 L1L3L5L7L9 (15 files) on Occigen, NP=15, 30, 45

The new algorithm optimizes the way we read data from HDF5 files: we obtain large data blocks over the MPI processes, corresponding to sub-blocks of rows, and we then have to build the tiled matrix distributed in the 2D block-cyclic fashion more suitable for the subsequent Gram, GEMM and QR calls. This is done by first building the Chameleon matrix, following the simple 1D distribution used for reading, with a Map function, and then migrating the tiles by calling the RUNTIME_data_migrate function in a loop over the tiles. The resulting ReadHDF5+BuildChameleonMatrix algorithm is ten times faster than the original algorithm reading the files tile by tile directly with the Chameleon Map. On the one-million testcase the time for this step goes from 6000s to 600s, a time comparable to the Random SVD computing time.

Finally, we read the HDF5 matrices in a loop of StarPU tasks over blocks of rows of size NB x N (a panel), where NB is the tile size used in Chameleon and N the total number of columns of the matrix described in the HDF5 file. For each block we submit a task that reads the block with an HDF5 call. We then submit tasks to copy the tiles contained in the panel into the Chameleon matrices. Finally, we submit a task freeing the panel, so that only the Chameleon matrix tiles and one panel at a time are held in memory. The read tasks, performed by one worker per node, the filling of the Chameleon tiles and the data migration are thus pipelined, and the overall matrix-building algorithm takes roughly the time needed to read the HDF5 files. With this strategy we gained around 30% of CPU time for building the matrix compared to the version with a synchronization between each step (read H5 files, fill Chameleon tiles, migrate data to the 2D block-cyclic distribution).
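
The reading strategy is selected at run time with an environment variable, as done in the Slurm script of section 7.3.2:

# 1D row-block reading over MPI processes instead of tile by tile
export H5_DISTRIB_2DNOCYCLIC=1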

7.2.4 Ability to get/set Chameleon submatrices in FMR

For the posttreatments we need to extract specific columns of the matrix \(U\) to compute \(U^+\), which is necessary to compute the output data \(X\), the cloud of points. Chameleon provides an API to transform a LAPACK-format matrix into a Chameleon one and, conversely, from Chameleon's format to the LAPACK one, the LAPACK buffer being centralized on MPI process rank 0. Nevertheless, there is no ready-to-use function to set and get an arbitrary submatrix (a sub-part of the full matrix) with a LAPACK-format array. There is a way to get a submatrix in Chameleon, using the desc_submatrix object, but the starting indexes must correspond to the beginning of a Chameleon tile.

We have added a function in FMR to get arbitrary Chameleon submatrices. Relying on the existing desc_submatrix function, which works on tiles, we get the submatrix into a LAPACK buffer respecting the tiled structure. We then extract the sub-part, which does not necessarily respect the tiled structure, into the caller's ready-to-be-filled LAPACK buffer.

For writing into an arbitrary sub-part of the Chameleon matrix we use the existing Map function, which applies a generic function on each tile. By writing an adequate generic function we are able to update any part of the Chameleon matrix. Because the tiles are distinct pieces of data, the Map algorithm is trivially parallel. The price to pay is that the input data used for the update, as a LAPACK buffer, must be copied on every MPI process.

7.2.5 Parallelize posttreatments with Chameleon

Thanks to the ability to get and set Chameleon submatrices, we can easily apply the posttreatments on the distributed Chameleon matrices coming from the output of the random SVD. The filtering of singular vectors associated with positive eigenvalues, to form \(U^+\), can be done by successively calling the submatrix get on block columns one tile wide spanning all the rows of the full matrix. Inside this loop over column tiles we get the tile block columns in a LAPACK buffer, then we extract the columns of interest.

This algorithm took several minutes for large cases, so we proposed a StarPU task algorithm looping over the tiles of \(U\) to copy the columns of interest directly into \(U^+\). The resulting algorithm makes the posttreatment step last just one second, where it took around three minutes before on large cases.

To compute the scaling yielding \(X\), we simply call the Map function with an appropriate function that multiplies each column of \(U^+\) by \(\sigma_i\).

7.2.6 HDF5 file writing with Chameleon

After the posttreatment, the resulting cloud of points \(X\) must be saved and written to disk in an HDF5 file. The size of \(X\) is much smaller than that of \(D\) because the number of columns is at least one hundred times smaller. Still, this object is distributed in order to avoid memory imbalance and so that we can study Multidimensional Scaling with a Random SVD parametrized by any value of the prescribed rank R. Thus, we have to save \(X\) in a distributed way.

A concurrent write of all the tiles with the Map function from every MPI process is not possible: the performance would be too poor. We do something similar to the reading process: convert the tiled matrix into large blocks of rows distributed over the MPI processes, then call parallel HDF5 to write the LAPACK blocks. This is done thanks to a task-based algorithm on top of StarPU. For the parallel writing we had to change our HDF5 dependency from the sequential one to the parallel MPI one.
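
In the Guix environments of section 7.3.1 this corresponds to depending on the hdf5-parallel-openmpi package rather than the sequential hdf5 one, e.g.:

guix install hdf5-parallel-openmpi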

7.3 Experiments on PlaFRIM

Requirements:

  • an account on PlaFRIM
  • an account on gitlab inria with your ssh key configured
  • read access to the Diodon project directory /projets/diodon in PlaFRIM

7.3.1 Install the application with Guix-HPC

Connect to plafrim.

ssh plafrim

Clone the adt-gordon git repository

git clone git@gitlab.inria.fr:adt-gordon/adt-gordon.git
export GORDONDIR=$PWD/adt-gordon/releases/v3
cd $GORDONDIR

This requires that your local ssh key is enabled even on plafrim. If it is not the case you can for example add ForwardAgent yes in your local $HOME/.ssh/config file.

Setup Guix version to the fixed state for this version

guix pull --allow-downgrades -C $GORDONDIR/channels_gordon-v3.scm
CHAMELEON_COMMIT=768d1c7ce93da3b2a9c13fd18f4a8a43d6a8b2be
FMR_COMMIT=d49dc1f4beab838e36db19207fb1cc6c64c919bd
DIODON_COMMIT=ccd7085b18373282a4d3f8be53201e730486e06a

Define the git commits to use for the unreleased libraries, then install the application and be ready to use it

With OpenMPI

guix environment --pure \
		 --preserve=^GORDON \
		 --ad-hoc openmpi intel-mpi-benchmarks hdf5-parallel-openmpi chameleon-mkl-mt diodon-mpi \
			  gnuplot python python-numpy python-matplotlib python-h5py \
			  bash coreutils inetutils openssh vim emacs gawk grep sed slurm@19 \
		 --with-commit=chameleon-mkl-mt=$CHAMELEON_COMMIT \
		 --with-commit=fmr-mpi=$FMR_COMMIT \
		 --with-commit=diodon-mpi=$DIODON_COMMIT \
		 -- /bin/bash --norc

With Nmad

guix environment --pure \
		 --preserve=^GORDON \
		 --ad-hoc nmad hdf5-parallel-openmpi@1.10.7.gordon --with-input=openmpi=nmad chameleon-mkl-mt diodon-mpi --with-input=hdf5-parallel-openmpi@1.10.7=hdf5-parallel-openmpi@1.10.7.gordon \
			  gnuplot python python-numpy python-matplotlib python-h5py \
			  bash emacs grep gawk gzip coreutils inetutils openssh procps sed slurm@19 tar vim util-linux which zlib \
		 --with-commit=chameleon-mkl-mt=$CHAMELEON_COMMIT \
		 --with-commit=fmr-mpi=$FMR_COMMIT \
		 --with-commit=diodon-mpi=$DIODON_COMMIT \
		 -- /bin/bash --norc

Minimalist test

salloc -n 2 --time=00:01:00 --exclusive --ntasks-per-node=1 --threads-per-core=1 mpiexec -np 2 mdsDriver -if /beegfs/diodon/gordon/atlas_guyane_trnH.h5 -off h5 -outfiles atlas_guyane_trnH.mds

You should get the following results on screen

[diodon] rsvd quality estimation ||Psy_r|| / ||G||= 0.997076
[diodon] mds quality estimation ||Psy^+_r|| / ||G||= 0.82941
Singular values: =[
1.95533e+06 1.29561e+06 992370 948580 684702 628921 560903 448407 432766 368975  [...] 17016.3 16507.3 16217.8 16134.6 15470 14414.7 14202.6 13970.5 13222.4 12925.3 12247.5 ]

7.3.2 Execute Diodon benchmarks

  1. Slurm script used on Plafrim

    Slurm script reference. You don't need to tangle it. This script is committed and located in $GORDONDIR/plafrim/diodon.sl

    #SBATCH --constraint bora
    #SBATCH --exclusive
    #SBATCH --ntasks-per-node=1
    #SBATCH --threads-per-core=1
    
    # check number of arguments
    if [ "$#" -lt 4 ]
    then
      echo -e "This script requires\n 1. The testcase name\n 2. The path to a HDF5 file\n 3. The RSVD prescribed rank\n 4. The floating point operation precision (s or d)\n Example:"
      echo "sbatch  `basename $0` atlas_guyane_trnH /beegfs/diodon/gordon/atlas_guyane_trnH.h5 960 d"
      exit 1
    fi
    TESTCASE=$1
    HDF5FILE=$2
    RANK=$3
    PREC=$4
    
    # location where to read and write matrices
    MATRIXDIR=/beegfs/diodon/gordon
    
    # arithmetic precision s (simple) or d (double)
    DIODONPREC=""
    if [ $PREC == "d" ]; then DIODONPREC="-d"; fi
    
    IS_NMAD=`mpiexec -h |grep NMAD |head -n 1`
    if [[ ! -z "$IS_NMAD" ]]
    then
      export MPI_FLAGS=""
    else
      export MPI_FLAGS="--map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$SLURM_CPUS_ON_NODE"
    fi
    
    # activate timings of Diodon and FMR operations
    export FMR_TIME=1
    export DIODON_TIME=1
    
    # Chameleon+StarPU performs better when one CPU is dedicated to the
    # StarPU scheduler and one to MPI communications
    NTHREADS=$((SLURM_CPUS_ON_NODE-2))
    
    # read H5 files by big blocks or rows distributed over MPI processes
    # and not by Chameleon's tiles
    export H5_DISTRIB_2DNOCYCLIC=1
    
    # activate ASYNC interface in Chameleon
    export CHAMELEON_ASYNC=1
    # tile size in Chameleon
    export CHAMELEON_TILE_SIZE=320
    
    # submission window in StarPU to limit number of submitted tasks and
    # avoid to start some MPI communications too early. This optimizes
    # main memory consumption.
    export STARPU_LIMIT_MIN_SUBMITTED_TASKS=6400
    export STARPU_LIMIT_MAX_SUBMITTED_TASKS=7400
    
    # test mpi and get hostnames
    time mpiexec $MPI_FLAGS -np $SLURM_NPROCS hostname 1>&2
    
    # test mpi PingPong between two nodes
    time mpiexec $MPI_FLAGS -np $SLURM_NPROCS IMB-MPI1 PingPong 1>&2
    
    # save performance of the GEMM on one node for reference
    B=$CHAMELEON_TILE_SIZE
    K=$((5*B))
    M=$((SLURM_CPUS_ON_NODE*K))
    if [ $PREC == "s" ]; then time mpiexec -np 1 chameleon_stesting -t $SLURM_CPUS_ON_NODE -o gemm -b $B -k $K -m $M -n $M; fi
    if [ $PREC == "d" ]; then time mpiexec -np 1 chameleon_dtesting -t $SLURM_CPUS_ON_NODE -o gemm -b $B -k $K -m $M -n $M; fi
    
    # launch the diodon application: mdsDriver which performs a
    # MultiDimensional Scaling analysis of a distance matrix. The main
    # input is the distance matrix given with one or several h5 files. The
    # main output is the cloud of points given by the MDS saved as a h5
    # file locally.
    time mpiexec $MPI_FLAGS -np $SLURM_NPROCS mdsDriver -if $HDF5FILE -r $RANK $DIODONPREC -t $NTHREADS -off h5 -outfiles $MATRIXDIR/$TESTCASE.mds -os 0
    
  2. Run benchmark
    cd $GORDONDIR
    
    MATRIXDIR=/beegfs/diodon/gordon
    
    LISTMDS=""
    LISTJOB=""
    
    PREC=s
    
    TESTCASE=atlas_guyane_trnH
    SIZE=1502
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/atlas_guyane_trnH.h5
    NP=1
    RANK=1000
    TIME=00:01:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME --partition=routage \
           plafrim/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=10V-RbcL_S74
    SIZE=23214
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/10V-RbcL_S74.h5
    NP=1
    RANK=1000
    TIME=00:05:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME --partition=routage \
           plafrim/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L6
    SIZE=99594
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L6.h5
    NP=2
    RANK=5000
    TIME=00:15:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME --partition=routage \
           plafrim/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L2_L3_L6
    SIZE=270983
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L2_L3_L6.txt
    NP=6
    RANK=10000
    TIME=00:30:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME --partition=routage \
           plafrim/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    squeue -u $USER
    

7.3.3 Posttreatments

LISTJOB="diodon_s_1_1502_1000 diodon_s_1_23214_1000 diodon_s_2_99594_5000 diodon_s_6_270983_10000"
LISTMDS="atlas_guyane_trnH.mds.h5 10V-RbcL_S74.mds.h5 L6.mds.h5 L2_L3_L6.mds.h5"
  1. Performance graphs
    BENCHCASE="bench_diodon_plafrim_s"
    ./loop_parse_results.sh $BENCHCASE "$LISTJOB"
    

    You should get the following files:

    • bench_diodon_plafrim_s_time.dat
    • bench_diodon_plafrim_s_perf.dat
    export GORDON_RESULT_TIME=$BENCHCASE\_time.dat
    gnuplot script_time.gp
    export GORDON_RESULT_PERF=$BENCHCASE\_perf.dat
    gnuplot script_perf.gp
    

    You should get the following files:

    • times.png
    • perfs.png

    times.png

    Figure 36: CPU times of Diodon

    perfs.png

    Figure 37: Performance of Chameleon GEMM and QR

  2. Images of the cloud of points
    for i in $LISTMDS; do python3 ../../scripts/nuage_png.py $MATRIXDIR/$i; done
    

    atlas_guyane_trnH.png

    Figure 38: Multidimensional scaling of atlas_guyane_trnH

    10V-RbcL_S74.png

    Figure 39: Multidimensional scaling of 10V-RbcL_S74

    L6.png

    Figure 40: Multidimensional scaling of L6

    L2_L3_L6.png

    Figure 41: Multidimensional scaling of L2_L3_L6

7.4 Experiments on MCIA CURTA

Requirements: an account on MCIA CURTA, membership of the fastmds iRODS project (used to fetch the input matrices, see below), and Guix installed on your local machine to build the Singularity image.

7.4.1 Install Guix on your machine

Install Guix and add the Guix HPC Non Free channel on your machine exactly as described in the section Guix packaging at the beginning of this document. Once guix pull has completed and the updated guix is first in your PATH, the Gordon packages are available:

guix search ^diodon

7.4.2 Create the singularity image

On your own machine do the following.

Clone the adt-gordon git repository

git clone git@gitlab.inria.fr:adt-gordon/adt-gordon.git
export GORDONDIR=$PWD/adt-gordon/releases/v3
cd $GORDONDIR

Set up Guix to the channel state pinned for this release

guix pull --allow-downgrades -C $GORDONDIR/channels_gordon-v3.scm

Define the git commits to use for the libraries that are not yet released, then build the Singularity image:

CHAMELEON_COMMIT=b2ca80097111713c383975cebb894598acb027e8
FMR_COMMIT=9fddfffde733dca2d23539880c7bb52e6f62472d
DIODON_COMMIT=14a21522288c01db4f8806e8eb72864d42315098

singularity_diodon=`guix pack -f squashfs \
				 bash slurm openssh openmpi intel-mpi-benchmarks hdf5-parallel-openmpi chameleon-mkl-mt diodon-mpi \
				 gnuplot python python-numpy python-matplotlib python-h5py \
				 coreutils emacs gawk grep inetutils vim sed \
				 --with-commit=chameleon-mkl-mt=$CHAMELEON_COMMIT \
				 --with-commit=fmr-mpi=$FMR_COMMIT \
				 --with-commit=diodon-mpi=$DIODON_COMMIT \
				 -S /bin=bin --entry-point=/bin/bash`
cp $singularity_diodon diodon-pack.gz.squashfs
# copy the singularity image on the supercomputer, e.g. 'curta'
scp diodon-pack.gz.squashfs curta:
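
Optionally, if Singularity is installed on your machine, a quick sanity check that the image exposes the expected binaries (a sketch, not required):

singularity exec diodon-pack.gz.squashfs /bin/bash -c 'command -v mdsDriver'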

7.4.3 Copy the matrices to your scratch directory on CURTA

Login on CURTA

ssh curta

Clone the adt-gordon git repository

git clone git@gitlab.inria.fr:adt-gordon/adt-gordon.git
export GORDONDIR=$PWD/adt-gordon/releases/v3

This requires that your local SSH key can be used from curta. If it cannot, enable agent forwarding with ForwardAgent yes in your local $HOME/.ssh/config file, as in the plafrim example above.

Copy files

module add irods/mcia
MATRIXDIR=/scratch/$USER/gordon/
mkdir -p $MATRIXDIR
cd $MATRIXDIR
# copy h5 distance matrices from irods
# case L2_L3_L6
LISTFILE="L2.h5 L2_L3.h5 L2_L6.h5 L3.h5 L3_L6.h5 L6.h5"
for file in $LISTFILE; do iget /MCIA/home/fastmds/DiatomLeman/Hdf5_gzip/$file .; done
# case L1_L3_L5_L7
LISTFILE="L1.h5 L1_L3.h5 L1_L5.h5 L1_L7.h5 L1_L9.h5 L3.h5 L3_L5.h5 L3_L7.h5 L3_L9.h5 L5.h5 L5_L7.h5 L5_L9.h5 L7.h5 L7_L9.h5 L9.h5"
for file in $LISTFILE; do iget /MCIA/home/fastmds/DiatomLeman/Hdf5_gzip/$file .; done
# copy metadata files from adt-gordon repository
LISTFILE="L1_L2_L3_L4_L5_L6_L7_L8_L9_L10.txt  L1_L3_L5_L7_L9.txt  L2_L3_L6.txt  L2_L4_L6_L8_L10.txt"
for file in $LISTFILE; do cp $GORDONDIR/$file .; done

7.4.4 Execute Diodon benchmarks

  1. Slurm script used on CURTA

    Slurm script reference. You don't need to tangle it. This script is committed and located at $GORDONDIR/curta/diodon.sl

    #!/usr/bin/env bash
    #SBATCH --constraint=compute
    #SBATCH --tasks-per-node=1
    #SBATCH --cpus-per-task=32
    
    # check number of arguments
    if [ "$#" -lt 4 ]
    then
      echo -e "This script requires\n 1. The testcase name\n 2. The path to a HDF5 file\n 3. The RSVD prescribed rank\n 4. The floating point operation precision (s or d)\n Example:"
      echo "sbatch  `basename $0` atlas_guyane_trnH $MATRIXDIR/L6.h5 960 d"
      exit 1
    fi
    TESTCASE=$1
    HDF5FILE=$2
    RANK=$3
    PREC=$4
    
    # location where to read and write matrices
    MATRIXDIR=/scratch/$USER/gordon
    
    # arithmetic precision s (simple) or d (double)
    DIODONPREC=""
    if [ $PREC == "d" ]; then DIODONPREC="-d"; fi
    
    # activate timings of Diodon and FMR operations
    export SINGULARITYENV_FMR_TIME=1
    export SINGULARITYENV_DIODON_TIME=1
    
    # Chameleon+StarPU performs better when one CPU is dedicated to the
    # StarPU scheduler and one to MPI communications
    NTHREADS=$((SLURM_CPUS_ON_NODE-2))
    
    # read H5 files by big blocks or rows distributed over MPI processes
    # and not by Chameleon's tiles
    export SINGULARITYENV_H5_DISTRIB_2DNOCYCLIC=1
    
    # activate ASYNC interface in Chameleon
    export SINGULARITYENV_CHAMELEON_ASYNC=1
    # tile size in Chameleon
    export SINGULARITYENV_CHAMELEON_TILE_SIZE=320
    
    # submission window in StarPU to limit number of submitted tasks and
    # avoid to start some MPI communications too early. This optimizes
    # main memory consumption.
    export SINGULARITYENV_STARPU_LIMIT_MIN_SUBMITTED_TASKS=6400
    export SINGULARITYENV_STARPU_LIMIT_MAX_SUBMITTED_TASKS=7400
    
    # setup modules required: openmpi and singularity
    module purge
    module add openmpi
    module add singularity
    
    # test mpi and get hostnames
    time mpiexec --map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$SLURM_CPUS_ON_NODE -np $SLURM_NPROCS \
         hostname 1>&2
    
    # test mpi PingPong between two nodes
    time mpiexec --map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$SLURM_CPUS_ON_NODE -np $SLURM_NPROCS \
         singularity exec ~/diodon-pack.gz.squashfs \
         IMB-MPI1 PingPong 1>&2
    
    # launch the diodon application: mdsDriver which performs a
    # MultiDimensional Scaling analysis of a distance matrix. The main
    # input is the distance matrix given with one or several h5 files. The
    # main output is the cloud of points given by the MDS saved as a h5
    # file locally.
    time mpiexec --map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$SLURM_CPUS_ON_NODE -np $SLURM_NPROCS \
         singularity exec ~/diodon-pack.gz.squashfs \
         mdsDriver -if $HDF5FILE -r $RANK $DIODONPREC -t $NTHREADS -off h5 -outfiles $MATRIXDIR/$TESTCASE.mds -os 0
    
    # save performance of the GEMM on one node for reference
    B=$SINGULARITYENV_CHAMELEON_TILE_SIZE
    K=$((5*B))
    M=$((SLURM_CPUS_ON_NODE*K))
    if [ $PREC == "s" ]; then time mpiexec --map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$SLURM_CPUS_ON_NODE -np 1 singularity exec ~/diodon-pack.gz.squashfs chameleon_stesting -t $SLURM_CPUS_ON_NODE -o gemm -b $B -k $K -m $M -n $M; fi
    if [ $PREC == "d" ]; then time mpiexec --map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$SLURM_CPUS_ON_NODE -np 1 singularity exec ~/diodon-pack.gz.squashfs chameleon_dtesting -t $SLURM_CPUS_ON_NODE -o gemm -b $B -k $K -m $M -n $M; fi
    
  2. Run benchmark
    cd $GORDONDIR
    
    MATRIXDIR=/scratch/$USER/gordon
    
    LISTMDS=""
    LISTJOB=""
    
    PREC=s
    
    TESTCASE=L6
    SIZE=99594
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L6.h5
    NP=2
    RANK=5000
    TIME=00:15:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME \
           curta/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L2_L3_L6
    SIZE=270983
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L2_L3_L6.txt
    NP=10
    RANK=10000
    PREC=s
    TIME=00:30:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME \
           curta/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L1_L3_L5_L7_L9
    SIZE=426548
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L1_L3_L5_L7_L9.txt
    NP=25
    RANK=10000
    PREC=s
    TIME=01:00:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME \
           curta/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    squeue -u $USER
    

7.4.5 Posttreatments

LISTJOB="diodon_s_2_99594_5000 diodon_s_10_270983_10000 diodon_s_25_426548_10000"
LISTMDS="L6.mds.h5 L2_L3_L6.mds.h5 L1_L3_L5_L7_L9.mds.h5"

We need Singularity to access the post-treatment executables.

module purge
module add singularity
  1. Performance graphs
    BENCHCASE="bench_diodon_curta_s"
    ./loop_parse_results.sh $BENCHCASE "$LISTJOB"
    

    You should get the following files:

    • bench_diodon_curta_s_time.dat
    • bench_diodon_curta_s_perf.dat
    export GORDON_RESULT_TIME=$BENCHCASE\_time.dat
    singularity exec ~/diodon-pack.gz.squashfs gnuplot script_time.gp
    export GORDON_RESULT_PERF=$BENCHCASE\_perf.dat
    singularity exec ~/diodon-pack.gz.squashfs gnuplot script_perf.gp
    

    You should get the following files:

    • times.png
    • perfs.png

    times.png

    Figure 42: CPU times of Diodon

    perfs.png

    Figure 43: Performance of Chameleon GEMM and QR

  2. Images of the cloud of points
    for i in $LISTMDS; do singularity exec ~/diodon-pack.gz.squashfs python3 ../../scripts/nuage_png.py $MATRIXDIR/$i; done
    

    L6.png

    Figure 44: Multidimensional scaling of L6

    L2_L3_L6.png

    Figure 45: Multidimensional scaling of L2_L3_L6

    L1_L3_L5_L7_L9.png

    Figure 46: Multidimensional scaling of L1_L3_L5_L7_L9

7.5 Experiments on Occigen (CPUs)

Let's assume we have access to the Occigen supercomputer. Define an appropriate "occigen" SSH alias according to how you access the cluster and your SSH configuration, for example as sketched below.
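
A minimal sketch of such an alias in your local ~/.ssh/config (User is a placeholder; add a ProxyJump line if you reach the machine through a gateway):

Host occigen
    # placeholders: adapt User, and the proxy if any, to your access
    HostName occigen.cines.fr
    User your_login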

Guix is not currently installed on Occigen, so we install the software one by one.

7.5.1 Generate an archive containing the source tarballs

Here is the script to generate an archive of the source tarballs of the Gordon software stack (it requires an SSH key registered on gitlab.inria.fr and access granted to the diodon project):

#!/bin/bash
PM2_COMMIT=8317021ced771ba585ba78e0d88053aa53436f62
CHAMELEON_COMMIT=33fb79f1a9102133b5b8bfbe1efa4dcb390be097
CHAMELEON_MORSE_CMAKE_COMMIT=dac4ef11820f9c00c1dfaaa593c3a2463da0469f
CHAMELEON_HQR_COMMIT=a0ec4d899c07cadc860a92808294ec3599c3cf17
DIODON_COMMIT=ccd7085b18373282a4d3f8be53201e730486e06a
mkdir -p gordon-v3
cd gordon-v3
git clone git@gitlab.inria.fr:adt-gordon/adt-gordon.git
wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.6/src/hdf5-1.10.6.tar.gz
wget https://download.open-mpi.org/release/hwloc/v2.1/hwloc-2.1.0.tar.gz
wget https://github.com/openucx/ucx/releases/download/v1.7.0/ucx-1.7.0.tar.gz
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.2.tar.gz
wget https://github.com/intel/mpi-benchmarks/archive/IMB-v2019.6.tar.gz
wget http://download.savannah.nongnu.org/releases/fkt/fxt-0.3.9.tar.gz
wget http://starpu.gforge.inria.fr/files/starpu-1.3.7/starpu-1.3.7.tar.gz
git clone https://gitlab.inria.fr/pm2/pm2
cd pm2 && git checkout $PM2_COMMIT && cd ..
git clone --recursive git@gitlab.inria.fr:solverstack/chameleon.git
cd chameleon && git checkout $CHAMELEON_COMMIT
cd cmake_modules/morse_cmake/ && git checkout $CHAMELEON_MORSE_CMAKE_COMMIT && cd ../..
cd hqr/ && git checkout $CHAMELEON_HQR_COMMIT && cd ..
cd ..
git clone git@gitlab.inria.fr:afranc/diodon.git
cd diodon && git checkout $DIODON_COMMIT && cd ..
wget https://sourceforge.net/projects/gnuplot/files/gnuplot/5.4.0/gnuplot-5.4.0.tar.gz
cd ..
tar czf gordon-v3.tar.gz gordon-v3/

On your laptop with internet access:

./generate_release_gordon.sh
scp gordon-v3.tar.gz occigen:

7.5.2 Copy the matrices to your scratch directory on Occigen

TODO: prepare files in a directory shared by the group 'ces1926'.

From your local computer, copy the input matrices from Curta to Occigen. You need an account on Curta and on Occigen, and membership of the fastmds iRODS project; see the previous section about experiments on Curta.

Adapt the following script, replacing 'fpruvost' with your login.

#!/bin/bash
set -x

# purpose: copy hdf5 files from curta to occigen SSH

# usage: copy_hdf5_files.sh filelist a7Ad12rT
# args: file containing the list of files, occigen password

# dependencies:
# packages: openssh

# check number of arguments
if [ "$#" -lt 2 ]
then
    echo -e "This script requires\n 1. a file containing the list of hdf5 files to copy\n 2. password occigen\n Example:"
    echo "`basename $0` filelist a7Ad12rT"
    exit 1
fi

# handle arguments
args=( "$@" )
filelist="${args[0]}"
passwordoccigen="${args[1]}"

list=`cat $filelist`

for file in $list
do
    ssh fpruvost@curta.mcia.univ-bordeaux.fr "module add irods/mcia && mkdir -p /scratch/fpruvost/gordon/ && iget /MCIA/home/fastmds/DiatomLeman/Hdf5_gzip/$file /scratch/fpruvost/gordon/"
    sshpass -p $passwordoccigen scp -3 -o ProxyCommand="ssh fpruvost@acces1.bordeaux.inria.fr nc %h %p" fpruvost@curta.mcia.univ-bordeaux.fr:/scratch/fpruvost/gordon/$file fpruvost@occigen.cines.fr:/scratch/cnt0021/ces1926/fpruvost/gordon/
done

7.5.3 Install the software

Login on occigen

ssh occigen

Clone the adt-gordon git repository

git clone git@gitlab.inria.fr:adt-gordon/adt-gordon.git
export GORDONDIR=$PWD/adt-gordon/releases/v3

This requires that your local SSH key can be used from occigen. If it cannot, enable agent forwarding with ForwardAgent yes in your local $HOME/.ssh/config file, as in the plafrim example above.

  1. Environment on Occigen

    Define and load the software environment. This script is already committed at occigen/set_env.sh.

    • BUILDDIR is where we build (Occigen strongly limits the number of files we can keep in the home directory, so we build in the scratch space),
    • INSTALLDIR is where we install libraries,
    • MATRIXDIR is where we store and read the matrices
    #!/bin/bash
    module purge
    module add gcc/8.3.0
    module add cmake/3.16.1
    module add mkl
    export CPATH=$MKL_INC_DIR:$CPATH
    
    export CC=gcc
    export CXX=g++
    export FC=gfortran
    
    export BUILDDIR=$SCRATCHDIR/gordon
    export INSTALLDIR=$HOME/gordon/install
    # export INSTALLDIR=$HOME/gordon/install-nmad
    export MATRIXDIR=$SCRATCHDIR/gordon
    mkdir -p $BUILDDIR
    mkdir -p $INSTALLDIR
    mkdir -p $MATRIXDIR
    
    export PATH=$INSTALLDIR/bin:$PATH
    export CPATH=$INSTALLDIR/include:$CPATH
    export LIBRARY_PATH=$INSTALLDIR/lib:$LIBRARY_PATH
    export LD_LIBRARY_PATH=$INSTALLDIR/lib:$LD_LIBRARY_PATH
    export PKG_CONFIG_PATH=$INSTALLDIR/lib/pkgconfig:$PKG_CONFIG_PATH
    
  2. Prepare tarballs in the BUILDDIR
    . $GORDONDIR/occigen/set_env.sh
    cp $GORDONDIR/*.txt $MATRIXDIR/
    cd $BUILDDIR && mv ~/gordon-v3.tar.gz . && tar xvf gordon-v3.tar.gz
    
  3. Install hwloc
    cd $BUILDDIR/gordon-v3
    tar xvf hwloc-2.1.0.tar.gz
    cd hwloc-2.1.0/
    ./configure --prefix=$INSTALLDIR
    make -j5 install
    
  4. Install ucx
    cd $BUILDDIR/gordon-v3
    tar xvf ucx-1.7.0.tar.gz
    cd ucx-1.7.0/
    ./contrib/configure-release --enable-optimizations --enable-mt --prefix=$INSTALLDIR
    make -j5 install
    
  5. Install openmpi
    cd $BUILDDIR/gordon-v3
    tar xvf openmpi-4.0.2.tar.gz
    cd openmpi-4.0.2/
    ./configure --with-ucx --enable-mca-no-build=btl-uct --with-hwloc=$INSTALLDIR --prefix=$INSTALLDIR
    make -j5 install
    
  6. Install intel-mpi-benchmarks
    cd $BUILDDIR/gordon-v3
    tar xvf IMB-v2019.6.tar.gz
    cd mpi-benchmarks-IMB-v2019.6/
    CC=$INSTALLDIR/bin/mpicc CXX=$INSTALLDIR/bin/mpic++ make IMB-MPI1
    cp IMB-MPI1 $INSTALLDIR/bin
    
  7. Install Nmad instead of OpenMPI and intel-mpi-benchmark (optional)
    cd $BUILDDIR/gordon-v3/pm2/scripts
    ./pm2-build-packages ./madmpi.conf --prefix=$INSTALLDIR
    
  8. Install hdf5
    cd $BUILDDIR/gordon-v3
    tar xvf hdf5-1.10.6.tar.gz
    cd hdf5-1.10.6
    ./autogen.sh
    CC=mpicc FC=mpif90 CFLAGS="-fPIC" CXXFLAGS="-fPIC" ./configure --enable-shared --disable-static --disable-tests --enable-threadsafe --with-pthread --enable-unsupported --enable-parallel --prefix=$INSTALLDIR/
    make -j5 install
    
  9. Install starpu
    cd $BUILDDIR/gordon-v3
    tar xvf starpu-1.3.7.tar.gz
    cd starpu-1.3.7/
    ./configure --disable-build-doc --disable-build-examples --disable-build-tests --enable-blas-lib=none --disable-starpufft --disable-mlr --disable-opencl --disable-cuda --disable-hdf5 --disable-fortran --prefix=$INSTALLDIR
    make -j5 install
    
  10. Install chameleon
    cd $BUILDDIR/gordon-v3/chameleon
    mkdir -p build
    cd build
    cmake .. -DCHAMELEON_USE_MPI=ON -DBLA_VENDOR=Intel10_64lp -DBUILD_SHARED_LIBS=ON -DCMAKE_INSTALL_PREFIX=$INSTALLDIR
    make -j5 install
    
  11. Install diodon
    cd $BUILDDIR/gordon-v3/diodon
    mkdir -p build
    cd build
    cmake ../cpp -DDIODON_USE_CHAMELEON=ON -DDIODON_USE_MKL_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$INSTALLDIR
    make -j5 install
    
  12. Install gnuplot (posttreatment)
    cd $BUILDDIR/gordon-v3/
    tar xvf gnuplot-5.4.0.tar.gz
    cd gnuplot-5.4.0/
    ./configure --prefix=$INSTALLDIR
    make -j5 install
    

7.5.4 Run experiments

  1. Slurm script used on Occigen

    Slurm script reference. You don't need to tangle it. This script is committed and located at $GORDONDIR/occigen/diodon.sl

    #!/usr/bin/env bash
    #SBATCH --mem=118000
    #SBATCH --constraint=HSW24
    #SBATCH --exclusive
    #SBATCH --ntasks-per-node=1
    #SBATCH --threads-per-core=1
    
    set -x
    
    . $GORDONDIR/occigen/set_env.sh
    
    # check number of arguments
    if [ "$#" -lt 4 ]
    then
        echo -e "This script requires\n 1. The testcase name\n 2. The path to a HDF5 file\n 3. The RSVD prescribed rank\n 4. The floating point operation precision (s or d)\n Example:"
        echo "sbatch  `basename $0` ~/test.h5 3200 320 d"
        exit 1
    fi
    
    TESTCASE=$1
    FILENAME=$2
    RANK=$3
    PREC=$4
    
    # location where to read and write matrices
    MATRIXDIR=$SCRATCHDIR/gordon
    
    IS_NMAD=`mpiexec -h |grep NMAD |head -n 1`
    if [[ ! -z "$IS_NMAD" ]]
    then
      export MPI_FLAGS=""
    else
      export MPI_FLAGS="--map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$SLURM_CPUS_ON_NODE"
    fi
    
    # arithmetic precision s (simple) or d (double)
    DIODONPREC=""
    if [ $PREC == "d" ]; then DIODONPREC="-d"; fi
    
    # activate timings of Diodon and FMR operations
    export FMR_TIME=1
    export DIODON_TIME=1
    
    # Chameleon+StarPU performs better when one CPU is dedicated to the
    # StarPU scheduler and one to MPI communications
    NTHREADS=$((SLURM_CPUS_ON_NODE-2))
    
    # read H5 files by big blocks or rows distributed over MPI processes
    # and not by Chameleon's tiles
    export H5_DISTRIB_2DNOCYCLIC=1
    
    # activate ASYNC interface in Chameleon
    export CHAMELEON_ASYNC=1
    # tile size in Chameleon
    export CHAMELEON_TILE_SIZE=320
    
    # force name of hostname to avoid independent calibration of each node
    export STARPU_HOSTNAME=occigen
    # submission window in StarPU to limit number of submitted tasks and
    # avoid to start some MPI communications too early. This optimizes
    # main memory consumption.
    export STARPU_LIMIT_MIN_SUBMITTED_TASKS=6400
    export STARPU_LIMIT_MAX_SUBMITTED_TASKS=7400
    
    # test mpi and get hostnames
    time mpiexec -np $SLURM_NPROCS hostname 1>&2
    
    # test mpi PingPong between two nodes
    time mpiexec -np $SLURM_NPROCS $INSTALLDIR/bin/IMB-MPI1 PingPong 1>&2
    
    # save performance of the GEMM on one node for reference
    B=$CHAMELEON_TILE_SIZE
    K=$((5*B))
    M=$((SLURM_CPUS_ON_NODE*K))
    if [ $PREC == "s" ]; then time mpiexec -np 1 $INSTALLDIR/bin/chameleon_stesting -t $SLURM_CPUS_ON_NODE -o gemm -b $B -k $K -m $M -n $M; fi
    if [ $PREC == "d" ]; then time mpiexec -np 1 $INSTALLDIR/bin/chameleon_dtesting -t $SLURM_CPUS_ON_NODE -o gemm -b $B -k $K -m $M -n $M; fi
    
    # launch the diodon application: mdsDriver which performs a
    # MultiDimensional Scaling analysis of a distance matrix. The main
    # input is the distance matrix given with one or several h5 files. The
    # main output is the cloud of points given by the MDS saved as a h5
    # file locally.
    time mpiexec $MPI_FLAGS -np $SLURM_NPROCS $INSTALLDIR/bin/mdsDriver -t $NTHREADS -if $FILENAME -r $RANK -os 0 -outfiles $MATRIXDIR/$TESTCASE.mds -format h5 $DIODONPREC
    
  2. Run benchmark

    Advice: to avoid performance losses while reading the HDF5 files, do not let several jobs run at the same time. Some input files are shared between jobs, and if MPI processes from different jobs read the same file concurrently, every job slows down. One way to serialize the submissions with Slurm job dependencies is sketched after the list of jobs below.

    LISTMDS=""
    LISTJOB=""
    
    cd $GORDONDIR
    
    MATRIXDIR=$SCRATCHDIR/gordon
    PREC=s
    RANK=10000
    
    TESTCASE=10V-RbcL_S74
    SIZE=23214
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/10V-RbcL_S74.h5
    NP=1
    TIME=00:30:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME occigen/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L6
    SIZE=99594
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L6.h5
    NP=1
    TIME=00:30:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME occigen/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L2_L3_L6
    SIZE=270983
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L2_L3_L6.txt
    NP=6
    TIME=00:30:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME occigen/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L1_L3_L5_L7_L9
    SIZE=426548
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L1_L3_L5_L7_L9.txt
    NP=16
    TIME=01:00:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME occigen/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L2_L4_L6_L8_L10
    SIZE=616644
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L2_L4_L6_L8_L10.txt
    NP=32
    TIME=01:00:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME occigen/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L1_L2_L3_L4_L5_L6_L7_L8_L9_L10
    SIZE=1043192
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L1_L2_L3_L4_L5_L6_L7_L8_L9_L10.txt
    NP=100
    TIME=01:00:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME occigen/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
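
    As announced in the advice above, submissions can be chained with Slurm job dependencies so that two jobs never read the same HDF5 files concurrently. A minimal sketch for two consecutive submissions (sbatch --parsable prints the job id; redefine TESTCASE, HDF5FILE, NP, TIME and JOB_NAME between the two calls as in the blocks above):

    # submit a first job and capture its Slurm job id
    JOBID=$(sbatch --parsable --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
                   --nodes=$NP --time=$TIME occigen/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC)
    # the next job starts only once the previous one has terminated, whatever its exit state
    sbatch --dependency=afterany:$JOBID --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME occigen/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC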
    

7.5.5 Posttreatments

cd $GORDONDIR
. $GORDONDIR/occigen/set_env.sh
LISTJOB="diodon_s_1_99594_10000 diodon_s_6_270983_10000 diodon_s_16_426548_10000 diodon_s_32_616644_10000 diodon_s_100_1043192_10000"
LISTMDS="L6.mds.h5 L2_L3_L6.mds.h5 L1_L3_L5_L7_L9.mds.h5 L2_L4_L6_L8_L10.mds.h5 L1_L2_L3_L4_L5_L6_L7_L8_L9_L10.mds.h5"
  1. Performance graphs
    BENCHCASE="bench_diodon_occigen_s"
    ./loop_parse_results.sh $BENCHCASE "$LISTJOB"
    

    You should get the following files:

    • bench_diodon_occigen_s_time.dat
    • bench_diodon_occigen_s_perf.dat
    export GORDON_RESULT_TIME=$BENCHCASE\_time.dat
    gnuplot script_time.gp
    export GORDON_RESULT_PERF=$BENCHCASE\_perf.dat
    gnuplot script_perf.gp
    

    You should get the following files:

    • times.png
    • perfs.png

    times.png

    Figure 47: CPU times of Diodon

    perfs.png

    Figure 48: Performance of Chameleon GEMM and QR

    times2.png

    Figure 49: CPU times of Diodon with StarPU task algorithms

    perfs2.png

    Figure 50: Performance of Chameleon GEMM and QR

  2. Images of the cloud of points
    python3 -m pip install --user virtualenv
    python3 -m venv env
    source env/bin/activate
    pip install numpy matplotlib h5py
    for i in $LISTMDS; do python3 ../../scripts/nuage_png.py $MATRIXDIR/$i; done
    

    L6.png

    Figure 51: Multidimensional scaling of L6

    L2_L3_L6.png

    Figure 52: Multidimensional scaling of L2_L3_L6

    L1_L3_L5_L7_L9.png

    Figure 53: Multidimensional scaling of L1_L3_L5_L7_L9

    L2_L4_L6_L8_L10.png

    Figure 54: Multidimensional scaling of L2_L4_L6_L8_L10

    L1_L2_L3_L4_L5_L6_L7_L8_L9_L10.png

    Figure 55: Multidimensional scaling of L1_L2_L3_L4_L5_L6_L7_L8_L9_L10

7.6 Experiments on Jean Zay (CPUs+GPUs)

7.6.1 Generate an archive containing the source tarballs

See previous section Occigen to get the script.

On your laptop with internet access:

./generate_release_gordon.sh
scp gordon-v3.tar.gz jzay:

7.6.2 Install the software

Let's assume we have access to the Jean Zay supercomputer. To connect over SSH, define an appropriate "jzay" alias according to how you access the cluster and your SSH configuration (similar to the "occigen" alias example above).

ssh jzay

Clone the adt-gordon git repository

git clone https://gitlab.inria.fr/adt-gordon/adt-gordon.git
export GORDONDIR=$PWD/adt-gordon/releases/v3

From your local computer, copy the input matrices to jzay; they must end up in the $SCRATCH/gordon directory used by the scripts below.
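
A possible sketch, assuming the matrices sit in a local ./gordon-matrices directory and are staged through the home directory on Jean Zay (paths are placeholders to adapt):

rsync -avP ./gordon-matrices/ jzay:gordon-matrices/
# then move them into the scratch space used by the scripts below
ssh jzay 'mkdir -p $SCRATCH/gordon && mv ~/gordon-matrices/* $SCRATCH/gordon/'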

Guix is not currently installed on Jean Zay, so we install the software one by one.

  1. Environment on JZAY

    Define and load the software environment. This script is already committed at jzay/set_env.sh.

    • BUILDDIR is where we build libraries,
    • INSTALLDIR is where we install libraries,
    • MATRIXDIR is where we store and read the matrices
    #!/bin/bash
    module purge
    module add gcc/4.8.5
    module add cmake/3.14.4
    module add intel-mkl/19.0.4
    module add cuda/10.1.2
    # required for starpu ?
    export CPATH=/gpfslocalsys/cuda/10.1.2/include:$CPATH
    
    export CC=gcc
    export CXX=g++
    export FC=gfortran
    
    export BUILDDIR=$SCRATCH/gordon
    export INSTALLDIR=$WORK/gordon/install
    export MATRIXDIR=$SCRATCH/gordon
    mkdir -p $BUILDDIR
    mkdir -p $INSTALLDIR
    mkdir -p $MATRIXDIR
    
    export PATH=$INSTALLDIR/bin:$PATH
    export CPATH=$INSTALLDIR/include:$CPATH
    export LIBRARY_PATH=$INSTALLDIR/lib:$LIBRARY_PATH
    export LD_LIBRARY_PATH=$INSTALLDIR/lib:$LD_LIBRARY_PATH
    export PKG_CONFIG_PATH=$INSTALLDIR/lib/pkgconfig:$PKG_CONFIG_PATH
    
  2. Prepare tarballs in the BUILDDIR
    . $GORDONDIR/jzay/set_env.sh
    cp $GORDONDIR/*.txt $MATRIXDIR/
    cd $BUILDDIR
    CHAMELEON_COMMIT=768d1c7ce93da3b2a9c13fd18f4a8a43d6a8b2be
    DIODON_COMMIT=061331b960635fe1a841d48ed72f77800869670b
    mkdir -p gordon-v3
    cd gordon-v3
    git clone git@gitlab.inria.fr:adt-gordon/adt-gordon.git
    wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.6/src/hdf5-1.10.6.tar.gz
    wget https://download.open-mpi.org/release/hwloc/v2.1/hwloc-2.1.0.tar.gz
    wget https://github.com/openucx/ucx/releases/download/v1.7.0/ucx-1.7.0.tar.gz
    wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.2.tar.gz
    wget https://github.com/intel/mpi-benchmarks/archive/IMB-v2019.6.tar.gz
    wget http://download.savannah.nongnu.org/releases/fkt/fxt-0.3.9.tar.gz
    wget http://starpu.gforge.inria.fr/files/starpu-1.3.7/starpu-1.3.7.tar.gz
    git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
    cd chameleon && git checkout $CHAMELEON_COMMIT && cd ..
    git clone https://gitlab.inria.fr/afranc/diodon.git
    cd diodon && git checkout $DIODON_COMMIT && cd ..
    wget https://sourceforge.net/projects/gnuplot/files/gnuplot/5.4.0/gnuplot-5.4.0.tar.gz
    
  3. Install hwloc
    cd $BUILDDIR/gordon-v3
    tar xvf hwloc-2.1.0.tar.gz
    cd hwloc-2.1.0/
    ./configure --disable-nvml --prefix=$INSTALLDIR
    make -j5 install
    
  4. Install ucx
    cd $BUILDDIR/gordon-v3
    tar xvf ucx-1.7.0.tar.gz
    cd ucx-1.7.0/
    ./contrib/configure-release --enable-optimizations --enable-mt --prefix=$INSTALLDIR
    make -j5 install
    
  5. Install openmpi
    cd $BUILDDIR/gordon-v3
    tar xvf openmpi-4.0.2.tar.gz
    cd openmpi-4.0.2/
    ./configure --with-ucx --enable-mca-no-build=btl-uct --enable-mpi-ext=affinity --with-sge --with-hwloc=$INSTALLDIR --with-libevent --enable-openib-control-hdr-padding --enable-openib-udcm --enable-openib-rdmacm --enable-openib-rdmacm-ibaddr --with-pmi=/gpfslocalsys/slurm/current --with-cuda=$CUDA_PATH --prefix=$INSTALLDIR
    make -j5 install
    
  6. Install intel-mpi-benchmarks
    cd $BUILDDIR/gordon-v3
    tar xvf IMB-v2019.6.tar.gz
    cd mpi-benchmarks-IMB-v2019.6/
    CC=$INSTALLDIR/bin/mpicc CXX=$INSTALLDIR/bin/mpic++ make IMB-MPI1
    cp IMB-MPI1 $INSTALLDIR/bin
    
  7. Install hdf5
    cd $BUILDDIR/gordon-v3
    tar xvf hdf5-1.10.6.tar.gz
    cd hdf5-1.10.6
    ./autogen.sh
    CC=mpicc FC=mpif90 CFLAGS="-fPIC" CXXFLAGS="-fPIC" ./configure --enable-shared --disable-static --disable-tests --enable-cxx --enable-fortran --enable-threadsafe --with-pthread --enable-unsupported --enable-parallel --prefix=$INSTALLDIR/
    make -j5 install
    
  8. Install starpu
    cd $BUILDDIR/gordon-v3
    tar xvf starpu-1.3.7.tar.gz
    cd starpu-1.3.7/
    ./configure --disable-build-doc --disable-build-examples --disable-build-tests --enable-blas-lib=none --disable-starpufft --disable-mlr --disable-opencl --disable-hdf5 --disable-fortran --prefix=$INSTALLDIR
    make -j5 install
    
  9. Install chameleon
    cd $BUILDDIR/gordon-v3/chameleon
    mkdir -p build
    cd build
    cmake .. -DCHAMELEON_USE_MPI=ON -DCHAMELEON_USE_CUDA=ON -DBLA_VENDOR=Intel10_64lp -DBUILD_SHARED_LIBS=ON -DCMAKE_INSTALL_PREFIX=$INSTALLDIR
    make -j5 install
    
  10. Install diodon
    cd $BUILDDIR/gordon-v3/diodon
    mkdir -p build
    cd build
    cmake ../cpp -DDIODON_USE_CHAMELEON=ON -DDIODON_USE_MKL_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$INSTALLDIR
    make -j5 install
    
  11. Install gnuplot (posttreatment)
    cd $BUILDDIR/gordon-v3/
    tar xvf gnuplot-5.4.0.tar.gz
    cd gnuplot-5.4.0/
    ./configure --prefix=$INSTALLDIR
    make -j5 install
    

7.6.3 Run experiments

  1. Slurm script used on Jean Zay

    Slurm script reference. You don't need to tangle it. This script is committed and located at $GORDONDIR/jzay/diodon.sl

    #!/usr/bin/env bash
    #SBATCH --account hyb@gpu
    #SBATCH --qos=qos_gpu-dev
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:4
    #SBATCH --cpus-per-task=40
    #SBATCH --hint=nomultithread
    #SBATCH --exclusive
    
    set -x
    
    . $GORDONDIR/jzay/set_env.sh
    
    # check number of arguments
    if [ "$#" -lt 4 ]
    then
        echo -e "This script requires\n 1. The testcase name\n 2. The path to a HDF5 file\n 3. The RSVD prescribed rank\n 4. The floating point operation precision (s or d)\n Example:"
        echo "sbatch  `basename $0` ~/test.h5 3200 320 d"
        exit 1
    fi
    
    TESTCASE=$1
    FILENAME=$2
    RANK=$3
    PREC=$4
    
    # Number of GPUS
    NCUDAS=4
    
    # location where to read and write matrices
    MATRIXDIR=$SCRATCH/gordon
    
    # arithmetic precision s (simple) or d (double)
    DIODONPREC=""
    if [ $PREC == "d" ]; then DIODONPREC="-d"; fi
    
    # activate timings of Diodon and FMR operations
    export FMR_TIME=1
    export DIODON_TIME=1
    
    # Chameleon+StarPU performs better when one CPU is dedicated to the
    # StarPU scheduler, one to MPI communications, and one per GPU
    NTHREADS=$((SLURM_CPUS_ON_NODE-2-NCUDAS))
    
    # read H5 files by big blocks or rows distributed over MPI processes
    # and not by Chameleon's tiles
    export H5_DISTRIB_2DNOCYCLIC=1
    
    # activate ASYNC interface in Chameleon
    export CHAMELEON_ASYNC=1
    # tile size in Chameleon
    export CHAMELEON_TILE_SIZE=1280
    # number of GPUs in Chameleon
    export CHAMELEON_NUM_CUDAS=$NCUDAS
    
    # force name of hostname to avoid independent calibration of each node
    export STARPU_HOSTNAME=jzaygpu
    # submission window in StarPU to limit number of submitted tasks and
    # avoid to start some MPI communications too early. This optimizes
    # main memory consumption.
    export STARPU_LIMIT_MIN_SUBMITTED_TASKS=6400
    export STARPU_LIMIT_MAX_SUBMITTED_TASKS=7400
    # control number of cuda streams
    export STARPU_NWORKER_PER_CUDA=4
    
    # test mpi and get hostnames
    time mpiexec -np $SLURM_NPROCS hostname 1>&2
    
    # test mpi PingPong between two nodes
    time mpiexec -np $SLURM_NPROCS $INSTALLDIR/bin/IMB-MPI1 PingPong 1>&2
    
    # save performance of the GEMM on one node for reference
    B=$CHAMELEON_TILE_SIZE
    K=$((1*B))
    M=$((SLURM_CPUS_ON_NODE*K))
    if [ $PREC == "s" ]; then time mpiexec -np 1 $INSTALLDIR/bin/chameleon_stesting -t $NTHREADS -g $NCUDAS -o gemm -b $B -n $M; fi
    if [ $PREC == "d" ]; then time mpiexec -np 1 $INSTALLDIR/bin/chameleon_dtesting -t $NTHREADS -g $NCUDAS -o gemm -b $B -n $M; fi
    
    # launch the diodon application: mdsDriver which performs a
    # MultiDimensional Scaling analysis of a distance matrix. The main
    # input is the distance matrix given with one or several h5 files. The
    # main output is the cloud of points given by the MDS saved as a h5
    # file locally.
    time mpiexec -np $SLURM_NPROCS --map-by ppr:1:node:pe=40 $INSTALLDIR/bin/mdsDriver -t $NTHREADS -if $FILENAME -r $RANK -os 0 -outfiles $MATRIXDIR/$TESTCASE.mds -format h5 $DIODONPREC
    
  2. Run benchmark

    Advice: to avoid performance losses while reading the HDF5 files, do not let several jobs run at the same time. Some input files are shared between jobs, and if MPI processes from different jobs read the same file concurrently, every job slows down (see the job-dependency sketch in the Occigen section).

    LISTMDS=""
    LISTJOB=""
    
    export GORDONDIR=$HOME/adt-gordon/releases/v3
    . $GORDONDIR/jzay/set_env.sh
    cd $GORDONDIR
    
    MATRIXDIR=$SCRATCH/gordon
    PREC=s
    RANK=10000
    
    TESTCASE=10V-RbcL_S74
    SIZE=23214
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/10V-RbcL_S74.h5
    NP=1
    TIME=00:30:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --account hyb@gpu --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME jzay/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L6
    SIZE=99594
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L6.h5
    NP=1
    TIME=00:30:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --account hyb@gpu --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME jzay/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L2_L3_L6
    SIZE=270983
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L2_L3_L6.txt
    NP=4
    TIME=01:00:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --account hyb@gpu --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME jzay/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L1_L3_L5_L7_L9
    SIZE=426548
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L1_L3_L5_L7_L9.txt
    NP=10
    TIME=01:00:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --account hyb@gpu --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME jzay/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L2_L4_L6_L8_L10
    SIZE=616644
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L2_L4_L6_L8_L10.txt
    NP=25
    TIME=01:00:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --account hyb@gpu --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME jzay/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    
    TESTCASE=L1_L2_L3_L4_L5_L6_L7_L8_L9_L10
    SIZE=1043192
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    HDF5FILE=$MATRIXDIR/L1_L2_L3_L4_L5_L6_L7_L8_L9_L10.txt
    NP=100
    TIME=01:30:00
    JOB_NAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    LISTJOB="$LISTJOB $JOB_NAME"
    sbatch --account hyb@gpu --job-name="$JOB_NAME" --output="$JOB_NAME.out" --error="$JOB_NAME.err" \
           --nodes=$NP --time=$TIME jzay/diodon.sl $TESTCASE $HDF5FILE $RANK $PREC
    

7.7 Experiments on Hawk

7.7.1 Generate an archive containing the source tarballs

See previous section Occigen to get the script.

On your laptop with internet access:

./generate_release_gordon.sh
scp gordon-v3.tar.gz hawk:

7.7.2 Install the software

Let's assume we have access to the HLRS Hawk supercomputer. To connect over SSH, define an appropriate "hawk" alias according to how you access the cluster and your SSH configuration.

ssh hawk

From your local computer, copy the input matrices to hawk into a MATRIXDIR, for example allocated as follows:

MATRIXDIR=`ws_allocate gordon 60`
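
ws_allocate creates (or extends) a workspace named gordon valid for 60 days and prints its path; the companion ws_list tool from the same HLRS workspace suite lists your workspaces and their expiry dates:

ws_list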

Prepare tarballs

cd $HOME
tar xvf gordon-v3.tar.gz
export GORDONDIR=$HOME/gordon-v3/adt-gordon/releases/v3
. $GORDONDIR/hawk/set_env.sh
cp $GORDONDIR/*.txt $MATRIXDIR/
  1. Environment on Hawk

    Define and load the software environment. This script is already committed at hawk/set_env.sh.

    • BUILDDIR is where we build libraries,
    • INSTALLDIR is where we install libraries,
    • MATRIXDIR is where we store and read the matrices
    #!/bin/bash
    module purge
    module add system/site_names
    module add system/ws/1.3.1
    module add system/wrappers/1.0
    module add gcc/9.2.0
    #module add mpt/2.23
    module add cmake/3.16.4
    module add mkl/19.1.0
    
    export CC=gcc
    export CXX=g++
    export FC=gfortran
    
    export BUILDDIR=$HOME
    export INSTALLDIR=$HOME/gordon-v3/install
    mkdir -p $INSTALLDIR
    #MATRIXDIR=`ws_allocate diodon 60`
    export MATRIXDIR=/lustre/cray/ws9/5/ws/iprflpru-diodon
    
    export PATH=$INSTALLDIR/bin:$PATH
    export CPATH=$INSTALLDIR/include:$CPATH
    export LIBRARY_PATH=$INSTALLDIR/lib:$LIBRARY_PATH
    export LD_LIBRARY_PATH=$INSTALLDIR/lib:$LD_LIBRARY_PATH
    export PKG_CONFIG_PATH=$INSTALLDIR/lib/pkgconfig:$PKG_CONFIG_PATH
    
  2. Install hwloc
    cd $BUILDDIR/gordon-v3
    tar xvf hwloc-2.1.0.tar.gz
    cd hwloc-2.1.0/
    ./configure --disable-nvml --prefix=$INSTALLDIR
    make -j5 install
    
  3. Install ucx
    cd $BUILDDIR/gordon-v3
    tar xvf ucx-1.7.0.tar.gz
    cd ucx-1.7.0/
    ./contrib/configure-release --enable-optimizations --enable-mt --prefix=$INSTALLDIR
    make -j5 install
    
  4. Install openmpi
    cd $BUILDDIR/gordon-v3
    tar xvf openmpi-4.0.2.tar.gz
    cd openmpi-4.0.2/
    #./configure --with-ucx --enable-mca-no-build=btl-uct --with-hwloc=$INSTALLDIR --prefix=$INSTALLDIR
    ./configure --with-tm=/opt/pbs --enable-mca-no-build=btl-uct --with-hwloc=$INSTALLDIR --prefix=$INSTALLDIR
    make -j5 install
    
  5. Install intel-mpi-benchmarks
    cd $BUILDDIR/gordon-v3
    tar xvf IMB-v2019.6.tar.gz
    cd mpi-benchmarks-IMB-v2019.6/
    CC=$INSTALLDIR/bin/mpicc CXX=$INSTALLDIR/bin/mpic++ make IMB-MPI1
    cp IMB-MPI1 $INSTALLDIR/bin
    
  6. Install hdf5
    cd $BUILDDIR/gordon-v3
    tar xvf hdf5-1.10.6.tar.gz
    cd hdf5-1.10.6
    ./autogen.sh
    CC=mpicc FC=mpif90 CFLAGS="-fPIC" CXXFLAGS="-fPIC" ./configure --enable-shared --disable-static --disable-tests --enable-cxx --enable-fortran --enable-threadsafe --with-pthread --enable-unsupported --enable-parallel --prefix=$INSTALLDIR/
    make -j5 install
    
  7. Install starpu
    cd $BUILDDIR/gordon-v3
    tar xvf starpu-1.3.7.tar.gz
    cd starpu-1.3.7/
    ./configure --disable-build-doc --disable-build-examples --disable-build-tests --enable-blas-lib=none --disable-starpufft --disable-mlr --disable-opencl --disable-hdf5 --disable-fortran --enable-maxcpus=128 --prefix=$INSTALLDIR
    make -j5 install
    
  8. Install chameleon
    cd $BUILDDIR/gordon-v3/chameleon
    mkdir -p build
    cd build
    cmake .. -DCHAMELEON_USE_MPI=ON -DBLA_VENDOR=Intel10_64lp -DBUILD_SHARED_LIBS=ON -DCMAKE_INSTALL_PREFIX=$INSTALLDIR
    make -j5 install
    
  9. Install diodon
    cd $BUILDDIR/gordon-v3/diodon
    git apply $GORDONDIR/hawk/diodon_hawk.patch
    mkdir -p build
    cd build
    cmake ../cpp -DDIODON_USE_CHAMELEON=ON -DDIODON_USE_MKL_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$INSTALLDIR
    make -j5 install
    
  10. Install gnuplot (posttreatment)
    cd $BUILDDIR/gordon-v3/
    tar xvf gnuplot-5.4.0.tar.gz
    cd gnuplot-5.4.0/
    ./configure --prefix=$INSTALLDIR
    make -j5 install
    

7.7.3 Run experiments

  1. PBS script used on Hawk

    PBS script reference. You don't need to tangle it. This script is committed and located at $GORDONDIR/hawk/diodon.pbs

    #!/bin/bash
    #PBS -N diodon
    #PBS -e diodon.err
    #PBS -o diodon.out
    #PBS -l select=1:ncpus=128:mpiprocs=1
    #PBS -l walltime=00:15:00
    
    set -x
    env 1>&2
    
    # Change to the directory that the job was submitted from
    cd $PBS_O_WORKDIR
    
    # set env
    . $HOME/gordon-v3/adt-gordon/releases/v3/hawk/set_env.sh
    
    # arithmetic precision s (simple) or d (double)
    DIODONPREC=""
    if [ $PREC == "d" ]; then DIODONPREC="-d"; fi
    
    # activate timings of Diodon and FMR operations
    export FMR_TIME=1
    export DIODON_TIME=1
    
    # number of nodes
    NNODES=`cat $PBS_NODEFILE |wc -l`
    
    # Chameleon+StarPU performs better when one CPU is dedicated to the
    # StarPU scheduler and one to MPI communications
    NTHREADS=126
    
    # read H5 files by big blocks or rows distributed over MPI processes
    # and not by Chameleon's tiles
    export H5_DISTRIB_2DNOCYCLIC=1
    
    # tile size in Chameleon
    export CHAMELEON_TILE_SIZE=740
    
    # force name of hostname to avoid independent calibration of each node
    export STARPU_HOSTNAME=hawk
    # submission window in StarPU to limit number of submitted tasks and
    # avoid to start some MPI communications too early. This optimizes
    # main memory consumption.
    #export STARPU_LIMIT_MIN_SUBMITTED_TASKS=6400
    #export STARPU_LIMIT_MAX_SUBMITTED_TASKS=7400
    
    # check MPI
    time mpiexec -n 2 $INSTALLDIR/bin/IMB-MPI1 PingPong 1>&2
    
    # check Chameleon on one node
    B=$CHAMELEON_TILE_SIZE
    K=$((1*B))
    N=$((126*K))
    export MKL_NUM_THREADS=1
    if [ $PREC == "s" ]; then time mpiexec -n 1 $INSTALLDIR/bin/chameleon_stesting -o gemm -t 126 -n $N -k $K -b $B; fi
    if [ $PREC == "d" ]; then time mpiexec -n 1 $INSTALLDIR/bin/chameleon_dtesting -o gemm -t 126 -n $N -k $K -b $B; fi
    unset MKL_NUM_THREADS
    
    # launch diodon on all the allocated nodes
    time mpiexec --map-by ppr:1:node:pe=128 $INSTALLDIR/bin/mdsDriver -t $NTHREADS -if $FILENAME -r $RANK -os 0 -outfiles $MATRIXDIR/$TESTCASE.mds -format h5 $DIODONPREC -nfrb
    
  2. Run benchmark

    Advice: to avoid performance losses while reading the HDF5 files, do not let several jobs run at the same time. Some input files are shared between jobs, and if MPI processes from different jobs read the same file concurrently, every job slows down. A PBS variant of the job-dependency chaining is sketched after the list of jobs below.

    LISTMDS=""
    LISTJOB=""
    
    export GORDONDIR=$HOME/gordon-v3/adt-gordon/releases/v3
    . $GORDONDIR/hawk/set_env.sh
    
    cd $GORDONDIR
    PREC=s
    RANK=10000
    
    TESTCASE=10V-RbcL_S74
    SIZE=23214
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    FILENAME=$MATRIXDIR/10V-RbcL_S74.h5
    NP=1
    TIME=00:15:00
    JOBNAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    rm -f $JOBNAME.err $JOBNAME.out
    LISTJOB="$LISTJOB $JOBNAME"
    qsub -e $JOBNAME.err -o $JOBNAME.out -l select=$NP:ncpus=128:mpiprocs=1 -l walltime=$TIME -v "TESTCASE=$TESTCASE, FILENAME=$FILENAME, RANK=$RANK, PREC=$PREC" hawk/diodon.pbs
    
    TESTCASE=L6
    SIZE=99594
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    FILENAME=$MATRIXDIR/L6.h5
    NP=1
    TIME=00:30:00
    JOBNAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    rm -f $JOBNAME.err $JOBNAME.out
    LISTJOB="$LISTJOB $JOBNAME"
    qsub -e $JOBNAME.err -o $JOBNAME.out -l select=$NP:ncpus=128:mpiprocs=1 -l walltime=$TIME -v "TESTCASE=$TESTCASE, FILENAME=$FILENAME, RANK=$RANK, PREC=$PREC" hawk/diodon.pbs
    
    TESTCASE=L2_L3_L6
    SIZE=270983
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    FILENAME=$MATRIXDIR/L2_L3_L6.txt
    NP=3
    TIME=00:30:00
    JOBNAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    rm -f $JOBNAME.err $JOBNAME.out
    LISTJOB="$LISTJOB $JOBNAME"
    qsub -e $JOBNAME.err -o $JOBNAME.out -l select=$NP:ncpus=128:mpiprocs=1 -l walltime=$TIME -v "TESTCASE=$TESTCASE, FILENAME=$FILENAME, RANK=$RANK, PREC=$PREC" hawk/diodon.pbs
    
    TESTCASE=L1_L3_L5_L7_L9
    SIZE=426548
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    FILENAME=$MATRIXDIR/L1_L3_L5_L7_L9.txt
    NP=8
    TIME=01:00:00
    JOBNAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    rm -f $JOBNAME.err $JOBNAME.out
    LISTJOB="$LISTJOB $JOBNAME"
    qsub -e $JOBNAME.err -o $JOBNAME.out -l select=$NP:ncpus=128:mpiprocs=1 -l walltime=$TIME -v "TESTCASE=$TESTCASE, FILENAME=$FILENAME, RANK=$RANK, PREC=$PREC" hawk/diodon.pbs
    
    TESTCASE=L2_L4_L6_L8_L10
    SIZE=616644
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    FILENAME=$MATRIXDIR/L2_L4_L6_L8_L10.txt
    NP=16
    TIME=01:00:00
    JOBNAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    rm -f $JOBNAME.err $JOBNAME.out
    LISTJOB="$LISTJOB $JOBNAME"
    qsub -e $JOBNAME.err -o $JOBNAME.out -l select=$NP:ncpus=128:mpiprocs=1 -l walltime=$TIME -v "TESTCASE=$TESTCASE, FILENAME=$FILENAME, RANK=$RANK, PREC=$PREC" hawk/diodon.pbs
    
    TESTCASE=L1_L2_L3_L4_L5_L6_L7_L8_L9_L10
    SIZE=1043192
    LISTMDS="$LISTMDS $TESTCASE.mds.h5"
    FILENAME=$MATRIXDIR/L1_L2_L3_L4_L5_L6_L7_L8_L9_L10.txt
    NP=50
    TIME=01:00:00
    JOBNAME=diodon\_$PREC\_$NP\_$SIZE\_$RANK
    rm -f $JOBNAME.err $JOBNAME.out
    LISTJOB="$LISTJOB $JOBNAME"
    qsub -e $JOBNAME.err -o $JOBNAME.out -l select=$NP:ncpus=128:mpiprocs=1 -l walltime=$TIME -v "TESTCASE=$TESTCASE, FILENAME=$FILENAME, RANK=$RANK, PREC=$PREC" hawk/diodon.pbs
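
    As on the Slurm machines, submissions can be chained with PBS job dependencies so that two jobs never read the same HDF5 files concurrently. A minimal sketch for two consecutive submissions (qsub prints the job id on stdout, reused with -W depend; redefine the per-testcase variables between the two calls as above):

    # submit a first job and capture its PBS job id
    JOBID=$(qsub -e $JOBNAME.err -o $JOBNAME.out -l select=$NP:ncpus=128:mpiprocs=1 -l walltime=$TIME \
                 -v "TESTCASE=$TESTCASE, FILENAME=$FILENAME, RANK=$RANK, PREC=$PREC" hawk/diodon.pbs)
    # the next job starts only once the previous one has terminated, whatever its exit state
    qsub -W depend=afterany:$JOBID -e $JOBNAME.err -o $JOBNAME.out -l select=$NP:ncpus=128:mpiprocs=1 \
         -l walltime=$TIME -v "TESTCASE=$TESTCASE, FILENAME=$FILENAME, RANK=$RANK, PREC=$PREC" hawk/diodon.pbs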
    

7.7.4 Posttreatments

export GORDONDIR=$HOME/gordon-v3/adt-gordon/releases/v3
. $GORDONDIR/hawk/set_env.sh
cd $GORDONDIR
LISTJOB="diodon_s_1_99594_10000 diodon_s_3_270983_10000 diodon_s_8_426548_10000 diodon_s_16_616644_10000 diodon_s_50_1043192_10000"
LISTMDS="L6.mds.h5 L2_L3_L6.mds.h5 L1_L3_L5_L7_L9.mds.h5 L2_L4_L6_L8_L10.mds.h5 L1_L2_L3_L4_L5_L6_L7_L8_L9_L10.mds.h5"
  1. Performance graphs
    BENCHCASE="bench_diodon_hawk_s"
    ./loop_parse_results.sh $BENCHCASE "$LISTJOB"
    

    You should get the following files:

    • bench_diodon_hawk_s_time.dat
    • bench_diodon_hawk_s_perf.dat
    export GORDON_RESULT_TIME=$BENCHCASE\_time.dat
    gnuplot script_time.gp
    export GORDON_RESULT_PERF=$BENCHCASE\_perf.dat
    gnuplot script_perf.gp
    

    You should get the following files:

    • times.png
    • perfs.png

    times.png

    Figure 56: CPU times of Diodon

    perfs.png

    Figure 57: Performance of Chameleon GEMM and QR

Author: Gordon fellows

Created: 2021-11-24 Wed 13:33

Emacs 25.2.2 (Org mode 8.2.10)
