Matt,

Thank you for your reply! My system has 8 NUMA nodes, so the memory bandwidth can increase by up to a factor of 8 in parallel computations. In other words, each node of the big cluster works as a small cluster of 8 nodes. Of course, this holds only if the cost of communication between the NUMA nodes is small. The total memory on a single cluster node is 128 GB, which is enough to fit my application.
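For reference, the three runs below were launched along the following lines (a sketch, not the literal job scripts: "mpirun -np 48" stands in for whatever launcher/binding flags our MVAPICH2 installation uses, and I have omitted my application's own arguments):

    # (1) built-in PETSc LU, sequential
    ./caat -ksp_type preonly -pc_type lu -log_view

    # (2) MUMPS through PETSc, 1 MPI process
    ./caat -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps -log_view

    # (3) MUMPS, 48 MPI processes on one cluster node, with the ranks spread
    #     over the 8 NUMA nodes (exact binding options are cluster-specific)
    mpirun -np 48 ./caat -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps -log_view

    # the NUMA layout of a node can be checked with, e.g.,
    numactl --hardware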
Below is the output of -log_view for three cases:

(1) BUILT-IN PETSC LU SOLVER

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./caat on a arch-linux-c-opt named d24cepyc110.crc.nd.edu with 1 processor, by akozlov Sat Oct 17 03:58:23 2020
Using 0 OpenMP threads
Using Petsc Release Version 3.13.6, unknown

                         Max       Max/Min     Avg        Total
Time (sec):           5.551e+03     1.000   5.551e+03
Objects:              1.000e+01     1.000   1.000e+01
Flop:                 1.255e+13     1.000   1.255e+13  1.255e+13
Flop/sec:             2.261e+09     1.000   2.261e+09  2.261e+09
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flop
                          and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
 0:      Main Stage: 5.5509e+03 100.0%  1.2551e+13 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase          %F - percent flop in this phase
      %M - percent messages in this phase      %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                               --- Global ---   --- Stage ----   Total
                   Max Ratio  Max      Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

MatSolve               1 1.0 7.3267e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0     6246
MatLUFactorSym         1 1.0 1.0673e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatLUFactorNum         1 1.0 5.5350e+03 1.0 1.25e+13 1.0 0.0e+00 0.0e+00 0.0e+00 100 100  0  0  0  100 100  0  0  0     2267
MatAssemblyBegin       1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatAssemblyEnd         1 1.0 1.0247e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatGetRowIJ            1 1.0 1.4306e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatGetOrdering         1 1.0 1.2596e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecSet                 4 1.0 9.3985e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecAssemblyBegin       2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecAssemblyEnd         2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSetUp               1 1.0 1.6689e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSolve               1 1.0 7.3284e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0     6245
PCSetUp                1 1.0 5.5458e+03 1.0 1.25e+13 1.0 0.0e+00 0.0e+00 0.0e+00 100 100  0  0  0  100 100  0  0  0     2262
PCApply                1 1.0 7.3267e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0     6246
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     2              2  11501999992     0.
              Vector     2              2      3761520     0.
       Krylov Solver     1              1         1408     0.
      Preconditioner     1              1         1184     0.
           Index Set     3              3      1412088     0.
              Viewer     1              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 7.15256e-08
#PETSc Option Table entries:
-ksp_type preonly
-log_view
-pc_type lu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 16 sizeof(PetscInt) 4
Configure options: --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl --with-g=1 --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi --with-scalar-type=complex --with-clanguage=c --with-openmp --with-debugging=0 COPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" FOPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" CXXOPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" --download-superlu_dist --download-mumps --download-scalapack --download-metis --download-cmake --download-parmetis --download-ptscotch
-----------------------------------------
Libraries compiled on 2020-10-14 10:52:17 on epycfe.crc.nd.edu
Machine characteristics: Linux-3.10.0-1160.2.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
Using PETSc directory: /afs/crc.nd.edu/user/a/akozlov/Private/petsc
Using PETSc arch: arch-linux-c-opt
-----------------------------------------
Using C compiler: mpicc -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2 -fopenmp
Using Fortran compiler: mpif90 -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2 -fopenmp
-----------------------------------------
Using include paths: -I/afs/crc.nd.edu/user/a/akozlov/Private/petsc/include -I/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/include -I/opt/crc/v/valgrind/3.14/ompi/include
-----------------------------------------
Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -lpetsc -Wl,-rpath,/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -Wl,-rpath,/opt/crc/i/intel/19.0/mkl -L/opt/crc/i/intel/19.0/mkl -Wl,-rpath,/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib -L/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib -Wl,-rpath,/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7 -L/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7 -Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64 -L/opt/crc/i/intel/19.0/mkl/lib/intel64 -Wl,-rpath,/opt/crc/i/intel/19.0/lib/intel64 -L/opt/crc/i/intel/19.0/lib/intel64 -Wl,-rpath,/opt/crc/i/intel/19.0/lib64 -L/opt/crc/i/intel/19.0/lib64 -Wl,-rpath,/afs/crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin -L/afs/crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin -Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64_lin -L/opt/crc/i/intel/19.0/mkl/lib/intel64_lin -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lscalapack -lsuperlu_dist -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lptesmumps -lptscotchparmetis -lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lX11 -lparmetis -lmetis -lstdc++ -ldl -lmpifort -lmpi -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lifport -lifcoremt_pic -limf -lsvml -lm -lipgo -lirc -lpthread -lgcc_s -lirc_s -lrt -lquadmath -lstdc++ -ldl
-----------------------------------------

(2) EXTERNAL PACKAGE MUMPS, 1 MPI PROCESS

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./caat on a arch-linux-c-opt named d24cepyc068.crc.nd.edu with 1 processor, by akozlov Sat Oct 17 01:55:20 2020
Using 0 OpenMP threads
Using Petsc Release Version 3.13.6, unknown

                         Max       Max/Min     Avg        Total
Time (sec):           1.075e+02     1.000   1.075e+02
Objects:              9.000e+00     1.000   9.000e+00
Flop:                 1.959e+12     1.000   1.959e+12  1.959e+12
Flop/sec:             1.823e+10     1.000   1.823e+10  1.823e+10
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flop
                          and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
 0:      Main Stage: 1.0747e+02 100.0%  1.9594e+12 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%

[Phase summary legend identical to case (1); omitted here.]

------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                               --- Global ---   --- Stage ----   Total
                   Max Ratio  Max      Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

MatSolve               1 1.0 3.1965e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00 0.0e+00   0 100  0  0  0    0 100  0  0  0  6126201
MatLUFactorSym         1 1.0 2.3141e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   2   0  0  0  0    2   0  0  0  0        0
MatLUFactorNum         1 1.0 1.0001e+02 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 0.0e+00  93   0  0  0  0   93   0  0  0  0       12
MatAssemblyBegin       1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatAssemblyEnd         1 1.0 1.0067e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatGetRowIJ            1 1.0 1.8650e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatGetOrdering         1 1.0 1.3029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecCopy                1 1.0 1.0943e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecSet                 4 1.0 9.2626e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecAssemblyBegin       2 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecAssemblyEnd         2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSetUp               1 1.0 1.6689e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSolve               1 1.0 3.1981e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00 0.0e+00   0 100  0  0  0    0 100  0  0  0  6123146
PCSetUp                1 1.0 1.0251e+02 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 0.0e+00  95   0  0  0  0   95   0  0  0  0       11
PCApply                1 1.0 3.1965e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00 0.0e+00   0 100  0  0  0    0 100  0  0  0  6126096
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     2              2     59441612     0.
              Vector     2              2      3761520     0.
       Krylov Solver     1              1         1408     0.
      Preconditioner     1              1         1184     0.
           Index Set     2              2       941392     0.
              Viewer     1              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 4.76837e-08
#PETSc Option Table entries:
-ksp_type preonly
-log_view
-pc_factor_mat_solver_type mumps
-pc_type lu
#End of PETSc Option Table entries
[Compiler, configure-options, and library information identical to case (1); omitted here.]
-----------------------------------------

(3) EXTERNAL PACKAGE MUMPS, 48 MPI PROCESSES ON A SINGLE CLUSTER NODE WITH 8 NUMA NODES

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./caat on a arch-linux-c-opt named d24cepyc069.crc.nd.edu with 48 processors, by akozlov Sat Oct 17 04:40:25 2020
Using 0 OpenMP threads
Using Petsc Release Version 3.13.6, unknown

                         Max       Max/Min     Avg        Total
Time (sec):           1.415e+01     1.000   1.415e+01
Objects:              3.000e+01     1.000   3.000e+01
Flop:                 4.855e+10     1.637   4.084e+10  1.960e+12
Flop/sec:             3.431e+09     1.637   2.886e+09  1.385e+11
MPI Messages:         1.180e+02     2.682   8.169e+01  3.921e+03
MPI Message Lengths:  1.559e+05     5.589   1.238e+03  4.855e+06
MPI Reductions:       4.000e+01     1.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flop
                          and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
 0:      Main Stage: 1.4150e+01 100.0%  1.9602e+12 100.0%  3.921e+03 100.0%  1.238e+03     100.0%  3.100e+01  77.5%

[Phase summary legend identical to case (1); omitted here.]

------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                               --- Global ---   --- Stage ----   Total
                   Max Ratio  Max      Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          5 1.0 1.0707e-02   3.3 0.00e+00 0.0 7.8e+02 4.0e+00 5.0e+00   0   0 20  0 12    0   0 20  0 16        0
BuildTwoSidedF         3 1.0 8.6837e-03   7.8 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00   0   0  0  0  8    0   0  0  0 10        0
MatSolve               1 1.0 6.6314e-02   1.0 4.85e+10 1.6 3.5e+03 1.2e+03 6.0e+00   0 100 90 87 15    0 100 90 87 19 29529617
MatLUFactorSym         1 1.0 2.4322e+00   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  17   0  0  0 10   17   0  0  0 13        0
MatLUFactorNum         1 1.0 5.8816e+00   1.0 5.08e+07 1.8 0.0e+00 0.0e+00 0.0e+00  42   0  0  0  0   42   0  0  0  0      332
MatAssemblyBegin       1 1.0 7.3917e-03  57.6 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00   0   0  0  0  2    0   0  0  0  3        0
MatAssemblyEnd         1 1.0 2.5823e-02   1.0 0.00e+00 0.0 3.8e+02 1.6e+03 5.0e+00   0   0 10 13 12    0   0 10 13 16        0
MatGetRowIJ            1 1.0 3.5763e-06   2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatGetOrdering         1 1.0 9.2506e-05   3.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecSet                 4 1.0 5.3000e-04  60.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecAssemblyBegin       2 1.0 2.2390e-03  19.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00   0   0  0  0  5    0   0  0  0  6        0
VecAssemblyEnd         2 1.0 9.7752e-06   2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecScatterBegin        2 1.0 1.6036e-03  12.8 0.00e+00 0.0 5.9e+02 4.8e+03 1.0e+00   0   0 15 58  2    0   0 15 58  3        0
VecScatterEnd          2 1.0 2.0087e-03  38.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
SFSetGraph             2 1.0 1.5259e-05   5.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
SFSetUp                3 1.0 3.3023e-03   2.9 0.00e+00 0.0 1.6e+03 7.0e+02 2.0e+00   0   0 40 23  5    0   0 40 23  6        0
SFBcastOpBegin         2 1.0 1.5953e-03  13.7 0.00e+00 0.0 5.9e+02 4.8e+03 1.0e+00   0   0 15 58  2    0   0 15 58  3        0
SFBcastOpEnd           2 1.0 2.0008e-03  45.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
SFPack                 2 1.0 1.4646e-03 361.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
SFUnpack               2 1.0 4.1723e-05  29.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSetUp               1 1.0 3.0994e-06   3.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSolve               1 1.0 6.6350e-02   1.0 4.85e+10 1.6 3.5e+03 1.2e+03 6.0e+00   0 100 90 87 15    0 100 90 87 19 29513594
PCSetUp                1 1.0 8.4679e+00   1.0 5.08e+07 1.8 0.0e+00 0.0e+00 1.0e+01  60   0  0  0 25   60   0  0  0 32      230
PCApply                1 1.0 6.6319e-02   1.0 4.85e+10 1.6 3.5e+03 1.2e+03 6.0e+00   0 100 90 87 15    0 100 90 87 19 29527282
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     4              4      1224428     0.
         Vec Scatter     3              3         2400     0.
              Vector     8              8      1923424     0.
           Index Set     9              9        32392     0.
   Star Forest Graph     3              3         3376     0.
       Krylov Solver     1              1         1408     0.
      Preconditioner     1              1         1160     0.
              Viewer     1              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 7.15256e-08
Average time for MPI_Barrier(): 3.48091e-06
Average time for zero size MPI_Send(): 2.49843e-06
#PETSc Option Table entries:
-ksp_type preonly
-log_view
-pc_factor_mat_solver_type mumps
-pc_type lu
#End of PETSc Option Table entries
[Compiler, configure-options, and library information identical to case (1); omitted here.]
-----------------------------------------

On Sat, Oct 17, 2020 at 12:33 AM Matthew Knepley <knep...@gmail.com> wrote:

> On Fri, Oct 16, 2020 at 11:48 PM Alexey Kozlov <alexey.v.kozlo...@nd.edu> wrote:
>
>> Thank you for your advice! My sparse matrix seems to be very stiff so I have decided to concentrate on the direct solvers. I have very good results with MUMPS. Due to a lack of time I haven’t got a good result with SuperLU_DIST and haven’t compiled PETSc with Pastix yet but I have a feeling that MUMPS is the best. I have run a sequential test case with built-in PETSc LU (-pc_type lu -ksp_type preonly) and MUMPs (-pc_type lu -ksp_type preonly -pc_factor_mat_solver_type mumps) with default settings and found that MUMPs was about 50 times faster than the built-in LU and used about 3 times less RAM. Do you have any idea why it could be?
>
> The numbers do not sound realistic, but of course we do not have your particular problem. In particular, the memory figure seems impossible.
>
>> My test case has about 100,000 complex equations with about 3,000,000 non-zeros. PETSc was compiled with the following options: ./configure --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl --enable-g --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi --with-scalar-type=complex --with-clanguage=c --with-openmp --with-debugging=0 COPTFLAGS='-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2' FOPTFLAGS='-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2' CXXOPTFLAGS='-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2' --download-superlu_dist --download-mumps --download-scalapack --download-metis --download-cmake --download-parmetis --download-ptscotch.
>>
>> Running MUPMS in parallel using MPI also gave me a significant gain in performance (about 10 times on a single cluster node).
>
> Again, this does not appear to make sense. The performance should be limited by memory bandwidth, and a single cluster node will not usually have 10x the bandwidth of a CPU, although it might be possible with a very old CPU.
>
> It would help to understand the performance if you would send the output of -log_view.
>
> Thanks,
>
> Matt
>
>> Could you, please, advise me whether I can adjust some options for the direct solvers to improve performance? Should I try MUMPS in OpenMP mode?
>>
>> On Sat, Sep 19, 2020 at 7:40 AM Mark Adams <mfad...@lbl.gov> wrote:
>>
>>> As Jed said high frequency is hard. AMG, as-is, can be adapted (https://link.springer.com/article/10.1007/s00466-006-0047-8) with parameters.
>>> AMG for convection: use richardson/sor and not chebyshev smoothers and in smoothed aggregation (gamg) don't smooth (-pc_gamg_agg_nsmooths 0).
>>> Mark
>>>
>>> On Sat, Sep 19, 2020 at 2:11 AM Alexey Kozlov <alexey.v.kozlo...@nd.edu> wrote:
>>>
>>>> Thanks a lot! I'll check them out.
>>>>
>>>> On Sat, Sep 19, 2020 at 1:41 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>>> These are small enough that likely sparse direct solvers are the best use of your time and for general efficiency.
>>>>>
>>>>> PETSc supports 3 parallel direct solvers, SuperLU_DIST, MUMPs and Pastix. I recommend configuring PETSc for all three of them and then comparing them for problems of interest to you.
>>>>>
>>>>> --download-superlu_dist --download-mumps --download-pastix --download-scalapack (used by MUMPS) --download-metis --download-parmetis --download-ptscotch
>>>>>
>>>>> Barry
>>>>>
>>>>> On Sep 18, 2020, at 11:28 PM, Alexey Kozlov <alexey.v.kozlo...@nd.edu> wrote:
>>>>>
>>>>> Thanks for the tips! My matrix is complex and unsymmetric. My typical test case has of the order of one million equations. I use a 2nd-order finite-difference scheme with 19-point stencil, so my typical test case uses several GB of RAM.
>>>>>
>>>>> On Fri, Sep 18, 2020 at 11:52 PM Jed Brown <j...@jedbrown.org> wrote:
>>>>>
>>>>>> Unfortunately, those are hard problems in which the "good" methods are technical and hard to make black-box. There are "sweeping" methods that solve on 2D "slabs" with PML boundary conditions, H-matrix based methods, and fancy multigrid methods. Attempting to solve with STRUMPACK is probably the easiest thing to try (--download-strumpack).
>>>>>>
>>>>>> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MATSOLVERSSTRUMPACK.html
>>>>>>
>>>>>> Is the matrix complex symmetric?
>>>>>>
>>>>>> Note that you can use a direct solver (MUMPS, STRUMPACK, etc.) for a 3D problem like this if you have enough memory. I'm assuming the memory or time is unacceptable and you want an iterative method with much lower setup costs.
>>>>>>
>>>>>> Alexey Kozlov <alexey.v.kozlo...@nd.edu> writes:
>>>>>>
>>>>>> > Dear all,
>>>>>> >
>>>>>> > I am solving a convected wave equation in a frequency domain. This equation is a 3D Helmholtz equation with added first-order derivatives and mixed derivatives, and with complex coefficients. The discretized PDE results in a sparse linear system (about 10^6 equations) which is solved in PETSc. I am having difficulty with the code convergence at high frequency, skewed grid, and high Mach number. I suspect it may be due to the preconditioner I use. I am currently using the ILU preconditioner with the number of fill levels 2 or 3, and BCGS or GMRES solvers. I suspect the state of the art has evolved and there are better preconditioners for Helmholtz-like problems. Could you, please, advise me on a better preconditioner?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Alexey
>>>>>> >
>>>>>> > --
>>>>>> > Alexey V. Kozlov
>>>>>> > Research Scientist
>>>>>> > Department of Aerospace and Mechanical Engineering
>>>>>> > University of Notre Dame
>>>>>> > 117 Hessert Center
>>>>>> > Notre Dame, IN 46556-5684
>>>>>> > Phone: (574) 631-4335
>>>>>> > Fax: (574) 631-8355
>>>>>> > Email: akoz...@nd.edu
>>>>>
>>>>> --
>>>>> Alexey V. Kozlov
>>>>> Research Scientist
>>>>> Department of Aerospace and Mechanical Engineering
>>>>> University of Notre Dame
>>>>> 117 Hessert Center
>>>>> Notre Dame, IN 46556-5684
>>>>> Phone: (574) 631-4335
>>>>> Fax: (574) 631-8355
>>>>> Email: akoz...@nd.edu
>>>>
>>>> --
>>>> Alexey V. Kozlov
>>>> Research Scientist
>>>> Department of Aerospace and Mechanical Engineering
>>>> University of Notre Dame
>>>> 117 Hessert Center
>>>> Notre Dame, IN 46556-5684
>>>> Phone: (574) 631-4335
>>>> Fax: (574) 631-8355
>>>> Email: akoz...@nd.edu
>>
>> --
>> Alexey V. Kozlov
>> Research Scientist
>> Department of Aerospace and Mechanical Engineering
>> University of Notre Dame
>> 117 Hessert Center
>> Notre Dame, IN 46556-5684
>> Phone: (574) 631-4335
>> Fax: (574) 631-8355
>> Email: akoz...@nd.edu
>
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>

--
Alexey V. Kozlov
Research Scientist
Department of Aerospace and Mechanical Engineering
University of Notre Dame
117 Hessert Center
Notre Dame, IN 46556-5684
Phone: (574) 631-4335
Fax: (574) 631-8355
Email: akoz...@nd.edu