Matt,

Thank you for your reply! My system has 8 NUMA nodes, so the memory bandwidth can increase by up to a factor of 8 in parallel computations. In other words, each node of the big cluster works as a small cluster of 8 nodes. Of course, this holds only if the cost of communication between the NUMA nodes is small. The total memory on a single cluster node is 128 GB, which is enough to fit my application.
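For reference, the three runs below were launched along the following lines (a sketch, not the literal job scripts: "mpirun -np 48" stands in for whatever launcher/binding flags our MVAPICH2 installation uses, and I have omitted my application's own arguments):

    # (1) built-in PETSc LU, sequential
    ./caat -ksp_type preonly -pc_type lu -log_view

    # (2) MUMPS through PETSc, 1 MPI process
    ./caat -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps -log_view

    # (3) MUMPS, 48 MPI processes on one cluster node, with the ranks spread
    #     over the 8 NUMA nodes (exact binding options are cluster-specific)
    mpirun -np 48 ./caat -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps -log_view

    # the NUMA layout of a node can be checked with, e.g.,
    numactl --hardware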
Below is the output of -log_view for three cases:

(1) BUILT-IN PETSC LU SOLVER

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./caat on a arch-linux-c-opt named d24cepyc110.crc.nd.edu with 1 processor, by akozlov Sat Oct 17 03:58:23 2020
Using 0 OpenMP threads
Using Petsc Release Version 3.13.6, unknown

                         Max       Max/Min     Avg        Total
Time (sec):           5.551e+03     1.000   5.551e+03
Objects:              1.000e+01     1.000   1.000e+01
Flop:                 1.255e+13     1.000   1.255e+13  1.255e+13
Flop/sec:             2.261e+09     1.000   2.261e+09  2.261e+09
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flop
                          and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
 0:      Main Stage: 5.5509e+03 100.0%  1.2551e+13 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase          %F - percent flop in this phase
      %M - percent messages in this phase      %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                               --- Global ---   --- Stage ----   Total
                   Max Ratio  Max      Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

MatSolve               1 1.0 7.3267e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0     6246
MatLUFactorSym         1 1.0 1.0673e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatLUFactorNum         1 1.0 5.5350e+03 1.0 1.25e+13 1.0 0.0e+00 0.0e+00 0.0e+00 100 100  0  0  0  100 100  0  0  0     2267
MatAssemblyBegin       1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatAssemblyEnd         1 1.0 1.0247e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatGetRowIJ            1 1.0 1.4306e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatGetOrdering         1 1.0 1.2596e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecSet                 4 1.0 9.3985e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecAssemblyBegin       2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecAssemblyEnd         2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSetUp               1 1.0 1.6689e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSolve               1 1.0 7.3284e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0     6245
PCSetUp                1 1.0 5.5458e+03 1.0 1.25e+13 1.0 0.0e+00 0.0e+00 0.0e+00 100 100  0  0  0  100 100  0  0  0     2262
PCApply                1 1.0 7.3267e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0     6246
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     2              2  11501999992     0.
              Vector     2              2      3761520     0.
       Krylov Solver     1              1         1408     0.
      Preconditioner     1              1         1184     0.
           Index Set     3              3      1412088     0.
              Viewer     1              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 7.15256e-08
#PETSc Option Table entries:
-ksp_type preonly
-log_view
-pc_type lu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 16 sizeof(PetscInt) 4
Configure options: --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl --with-g=1 --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi --with-scalar-type=complex --with-clanguage=c --with-openmp --with-debugging=0 COPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" FOPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" CXXOPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" --download-superlu_dist --download-mumps --download-scalapack --download-metis --download-cmake --download-parmetis --download-ptscotch
-----------------------------------------
Libraries compiled on 2020-10-14 10:52:17 on epycfe.crc.nd.edu
Machine characteristics: Linux-3.10.0-1160.2.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
Using PETSc directory: /afs/crc.nd.edu/user/a/akozlov/Private/petsc
Using PETSc arch: arch-linux-c-opt
-----------------------------------------
Using C compiler: mpicc -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2 -fopenmp
Using Fortran compiler: mpif90 -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2 -fopenmp
-----------------------------------------
Using include paths: -I/afs/crc.nd.edu/user/a/akozlov/Private/petsc/include -I/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/include -I/opt/crc/v/valgrind/3.14/ompi/include
-----------------------------------------
Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -lpetsc -Wl,-rpath,/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -Wl,-rpath,/opt/crc/i/intel/19.0/mkl -L/opt/crc/i/intel/19.0/mkl -Wl,-rpath,/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib -L/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib -Wl,-rpath,/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7 -L/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7 -Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64 -L/opt/crc/i/intel/19.0/mkl/lib/intel64 -Wl,-rpath,/opt/crc/i/intel/19.0/lib/intel64 -L/opt/crc/i/intel/19.0/lib/intel64 -Wl,-rpath,/opt/crc/i/intel/19.0/lib64 -L/opt/crc/i/intel/19.0/lib64 -Wl,-rpath,/afs/crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin -L/afs/crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin -Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64_lin -L/opt/crc/i/intel/19.0/mkl/lib/intel64_lin -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lscalapack -lsuperlu_dist -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lptesmumps -lptscotchparmetis -lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lX11 -lparmetis -lmetis -lstdc++ -ldl -lmpifort -lmpi -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lifport -lifcoremt_pic -limf -lsvml -lm -lipgo -lirc -lpthread -lgcc_s -lirc_s -lrt -lquadmath -lstdc++ -ldl
-----------------------------------------

(2) EXTERNAL PACKAGE MUMPS, 1 MPI PROCESS

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./caat on a arch-linux-c-opt named d24cepyc068.crc.nd.edu with 1 processor, by akozlov Sat Oct 17 01:55:20 2020
Using 0 OpenMP threads
Using Petsc Release Version 3.13.6, unknown

                         Max       Max/Min     Avg        Total
Time (sec):           1.075e+02     1.000   1.075e+02
Objects:              9.000e+00     1.000   9.000e+00
Flop:                 1.959e+12     1.000   1.959e+12  1.959e+12
Flop/sec:             1.823e+10     1.000   1.823e+10  1.823e+10
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flop
                          and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
 0:      Main Stage: 1.0747e+02 100.0%  1.9594e+12 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%

[Phase summary legend identical to case (1); omitted here.]

------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                               --- Global ---   --- Stage ----   Total
                   Max Ratio  Max      Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

MatSolve               1 1.0 3.1965e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00 0.0e+00   0 100  0  0  0    0 100  0  0  0  6126201
MatLUFactorSym         1 1.0 2.3141e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   2   0  0  0  0    2   0  0  0  0        0
MatLUFactorNum         1 1.0 1.0001e+02 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 0.0e+00  93   0  0  0  0   93   0  0  0  0       12
MatAssemblyBegin       1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatAssemblyEnd         1 1.0 1.0067e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatGetRowIJ            1 1.0 1.8650e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatGetOrdering         1 1.0 1.3029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecCopy                1 1.0 1.0943e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecSet                 4 1.0 9.2626e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecAssemblyBegin       2 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecAssemblyEnd         2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSetUp               1 1.0 1.6689e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSolve               1 1.0 3.1981e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00 0.0e+00   0 100  0  0  0    0 100  0  0  0  6123146
PCSetUp                1 1.0 1.0251e+02 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 0.0e+00  95   0  0  0  0   95   0  0  0  0       11
PCApply                1 1.0 3.1965e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00 0.0e+00   0 100  0  0  0    0 100  0  0  0  6126096
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     2              2     59441612     0.
              Vector     2              2      3761520     0.
       Krylov Solver     1              1         1408     0.
      Preconditioner     1              1         1184     0.
           Index Set     2              2       941392     0.
              Viewer     1              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 4.76837e-08
#PETSc Option Table entries:
-ksp_type preonly
-log_view
-pc_factor_mat_solver_type mumps
-pc_type lu
#End of PETSc Option Table entries
[Compiler, configure-options, and library information identical to case (1); omitted here.]
-----------------------------------------

(3) EXTERNAL PACKAGE MUMPS, 48 MPI PROCESSES ON A SINGLE CLUSTER NODE WITH 8 NUMA NODES

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./caat on a arch-linux-c-opt named d24cepyc069.crc.nd.edu with 48 processors, by akozlov Sat Oct 17 04:40:25 2020
Using 0 OpenMP threads
Using Petsc Release Version 3.13.6, unknown

                         Max       Max/Min     Avg        Total
Time (sec):           1.415e+01     1.000   1.415e+01
Objects:              3.000e+01     1.000   3.000e+01
Flop:                 4.855e+10     1.637   4.084e+10  1.960e+12
Flop/sec:             3.431e+09     1.637   2.886e+09  1.385e+11
MPI Messages:         1.180e+02     2.682   8.169e+01  3.921e+03
MPI Message Lengths:  1.559e+05     5.589   1.238e+03  4.855e+06
MPI Reductions:       4.000e+01     1.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flop
                          and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
 0:      Main Stage: 1.4150e+01 100.0%  1.9602e+12 100.0%  3.921e+03 100.0%  1.238e+03     100.0%  3.100e+01  77.5%

[Phase summary legend identical to case (1); omitted here.]

------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                               --- Global ---   --- Stage ----   Total
                   Max Ratio  Max      Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          5 1.0 1.0707e-02   3.3 0.00e+00 0.0 7.8e+02 4.0e+00 5.0e+00   0   0 20  0 12    0   0 20  0 16        0
BuildTwoSidedF         3 1.0 8.6837e-03   7.8 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00   0   0  0  0  8    0   0  0  0 10        0
MatSolve               1 1.0 6.6314e-02   1.0 4.85e+10 1.6 3.5e+03 1.2e+03 6.0e+00   0 100 90 87 15    0 100 90 87 19 29529617
MatLUFactorSym         1 1.0 2.4322e+00   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  17   0  0  0 10   17   0  0  0 13        0
MatLUFactorNum         1 1.0 5.8816e+00   1.0 5.08e+07 1.8 0.0e+00 0.0e+00 0.0e+00  42   0  0  0  0   42   0  0  0  0      332
MatAssemblyBegin       1 1.0 7.3917e-03  57.6 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00   0   0  0  0  2    0   0  0  0  3        0
MatAssemblyEnd         1 1.0 2.5823e-02   1.0 0.00e+00 0.0 3.8e+02 1.6e+03 5.0e+00   0   0 10 13 12    0   0 10 13 16        0
MatGetRowIJ            1 1.0 3.5763e-06   2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
MatGetOrdering         1 1.0 9.2506e-05   3.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecSet                 4 1.0 5.3000e-04  60.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecAssemblyBegin       2 1.0 2.2390e-03  19.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00   0   0  0  0  5    0   0  0  0  6        0
VecAssemblyEnd         2 1.0 9.7752e-06   2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
VecScatterBegin        2 1.0 1.6036e-03  12.8 0.00e+00 0.0 5.9e+02 4.8e+03 1.0e+00   0   0 15 58  2    0   0 15 58  3        0
VecScatterEnd          2 1.0 2.0087e-03  38.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
SFSetGraph             2 1.0 1.5259e-05   5.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
SFSetUp                3 1.0 3.3023e-03   2.9 0.00e+00 0.0 1.6e+03 7.0e+02 2.0e+00   0   0 40 23  5    0   0 40 23  6        0
SFBcastOpBegin         2 1.0 1.5953e-03  13.7 0.00e+00 0.0 5.9e+02 4.8e+03 1.0e+00   0   0 15 58  2    0   0 15 58  3        0
SFBcastOpEnd           2 1.0 2.0008e-03  45.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
SFPack                 2 1.0 1.4646e-03 361.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
SFUnpack               2 1.0 4.1723e-05  29.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSetUp               1 1.0 3.0994e-06   3.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0        0
KSPSolve               1 1.0 6.6350e-02   1.0 4.85e+10 1.6 3.5e+03 1.2e+03 6.0e+00   0 100 90 87 15    0 100 90 87 19 29513594
PCSetUp                1 1.0 8.4679e+00   1.0 5.08e+07 1.8 0.0e+00 0.0e+00 1.0e+01  60   0  0  0 25   60   0  0  0 32      230
PCApply                1 1.0 6.6319e-02   1.0 4.85e+10 1.6 3.5e+03 1.2e+03 6.0e+00   0 100 90 87 15    0 100 90 87 19 29527282
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     4              4      1224428     0.
         Vec Scatter     3              3         2400     0.
              Vector     8              8      1923424     0.
           Index Set     9              9        32392     0.
   Star Forest Graph     3              3         3376     0.
       Krylov Solver     1              1         1408     0.
      Preconditioner     1              1         1160     0.
              Viewer     1              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 7.15256e-08
Average time for MPI_Barrier(): 3.48091e-06
Average time for zero size MPI_Send(): 2.49843e-06
#PETSc Option Table entries:
-ksp_type preonly
-log_view
-pc_factor_mat_solver_type mumps
-pc_type lu
#End of PETSc Option Table entries
[Compiler, configure-options, and library information identical to case (1); omitted here.]
-----------------------------------------

On Sat, Oct 17, 2020 at 12:33 AM Matthew Knepley <knep...@gmail.com> wrote:

> On Fri, Oct 16, 2020 at 11:48 PM Alexey Kozlov <alexey.v.kozlo...@nd.edu> wrote:
>
>> Thank you for your advice! My sparse matrix seems to be very stiff so I have decided to concentrate on the direct solvers. I have very good results with MUMPS. Due to a lack of time I haven’t got a good result with SuperLU_DIST and haven’t compiled PETSc with Pastix yet but I have a feeling that MUMPS is the best. I have run a sequential test case with built-in PETSc LU (-pc_type lu -ksp_type preonly) and MUMPs (-pc_type lu -ksp_type preonly -pc_factor_mat_solver_type mumps) with default settings and found that MUMPs was about 50 times faster than the built-in LU and used about 3 times less RAM. Do you have any idea why it could be?
>
> The numbers do not sound realistic, but of course we do not have your particular problem. In particular, the memory figure seems impossible.
>
>> My test case has about 100,000 complex equations with about 3,000,000 non-zeros. PETSc was compiled with the following options: ./configure --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl --enable-g --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi --with-scalar-type=complex --with-clanguage=c --with-openmp --with-debugging=0 COPTFLAGS='-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2' FOPTFLAGS='-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2' CXXOPTFLAGS='-mkl=parallel -O2 -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2' --download-superlu_dist --download-mumps --download-scalapack --download-metis --download-cmake --download-parmetis --download-ptscotch.
>>
>> Running MUPMS in parallel using MPI also gave me a significant gain in performance (about 10 times on a single cluster node).
>
> Again, this does not appear to make sense. The performance should be limited by memory bandwidth, and a single cluster node will not usually have 10x the bandwidth of a CPU, although it might be possible with a very old CPU.
>
> It would help to understand the performance if you would send the output of -log_view.
>
> Thanks,
>
> Matt
>
>> Could you, please, advise me whether I can adjust some options for the direct solvers to improve performance? Should I try MUMPS in OpenMP mode?
>>
>> On Sat, Sep 19, 2020 at 7:40 AM Mark Adams <mfad...@lbl.gov> wrote:
>>
>>> As Jed said high frequency is hard. AMG, as-is, can be adapted (https://link.springer.com/article/10.1007/s00466-006-0047-8) with parameters.
>>> AMG for convection: use richardson/sor and not chebyshev smoothers and in smoothed aggregation (gamg) don't smooth (-pc_gamg_agg_nsmooths 0).
>>> Mark
>>>
>>> On Sat, Sep 19, 2020 at 2:11 AM Alexey Kozlov <alexey.v.kozlo...@nd.edu> wrote:
>>>
>>>> Thanks a lot! I'll check them out.
>>>>
>>>> On Sat, Sep 19, 2020 at 1:41 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>>> These are small enough that likely sparse direct solvers are the best use of your time and for general efficiency.
>>>>>
>>>>> PETSc supports 3 parallel direct solvers, SuperLU_DIST, MUMPs and Pastix. I recommend configuring PETSc for all three of them and then comparing them for problems of interest to you.
>>>>>
>>>>> --download-superlu_dist --download-mumps --download-pastix --download-scalapack (used by MUMPS) --download-metis --download-parmetis --download-ptscotch
>>>>>
>>>>> Barry
>>>>>
>>>>> On Sep 18, 2020, at 11:28 PM, Alexey Kozlov <alexey.v.kozlo...@nd.edu> wrote:
>>>>>
>>>>> Thanks for the tips! My matrix is complex and unsymmetric. My typical test case has of the order of one million equations. I use a 2nd-order finite-difference scheme with 19-point stencil, so my typical test case uses several GB of RAM.
>>>>>
>>>>> On Fri, Sep 18, 2020 at 11:52 PM Jed Brown <j...@jedbrown.org> wrote:
>>>>>
>>>>>> Unfortunately, those are hard problems in which the "good" methods are technical and hard to make black-box. There are "sweeping" methods that solve on 2D "slabs" with PML boundary conditions, H-matrix based methods, and fancy multigrid methods. Attempting to solve with STRUMPACK is probably the easiest thing to try (--download-strumpack).
>>>>>>
>>>>>> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MATSOLVERSSTRUMPACK.html
>>>>>>
>>>>>> Is the matrix complex symmetric?
>>>>>>
>>>>>> Note that you can use a direct solver (MUMPS, STRUMPACK, etc.) for a 3D problem like this if you have enough memory. I'm assuming the memory or time is unacceptable and you want an iterative method with much lower setup costs.
>>>>>>
>>>>>> Alexey Kozlov <alexey.v.kozlo...@nd.edu> writes:
>>>>>>
>>>>>> > Dear all,
>>>>>> >
>>>>>> > I am solving a convected wave equation in a frequency domain. This equation is a 3D Helmholtz equation with added first-order derivatives and mixed derivatives, and with complex coefficients. The discretized PDE results in a sparse linear system (about 10^6 equations) which is solved in PETSc. I am having difficulty with the code convergence at high frequency, skewed grid, and high Mach number. I suspect it may be due to the preconditioner I use. I am currently using the ILU preconditioner with the number of fill levels 2 or 3, and BCGS or GMRES solvers. I suspect the state of the art has evolved and there are better preconditioners for Helmholtz-like problems. Could you, please, advise me on a better preconditioner?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Alexey
>>>>>> >
>>>>>> > --
>>>>>> > Alexey V. Kozlov
>>>>>> > Research Scientist
>>>>>> > Department of Aerospace and Mechanical Engineering
>>>>>> > University of Notre Dame
>>>>>> > 117 Hessert Center
>>>>>> > Notre Dame, IN 46556-5684
>>>>>> > Phone: (574) 631-4335
>>>>>> > Fax: (574) 631-8355
>>>>>> > Email: akoz...@nd.edu
>>>>>
>>>>> --
>>>>> Alexey V. Kozlov
>>>>> Research Scientist
>>>>> Department of Aerospace and Mechanical Engineering
>>>>> University of Notre Dame
>>>>> 117 Hessert Center
>>>>> Notre Dame, IN 46556-5684
>>>>> Phone: (574) 631-4335
>>>>> Fax: (574) 631-8355
>>>>> Email: akoz...@nd.edu
>>>>
>>>> --
>>>> Alexey V. Kozlov
>>>> Research Scientist
>>>> Department of Aerospace and Mechanical Engineering
>>>> University of Notre Dame
>>>> 117 Hessert Center
>>>> Notre Dame, IN 46556-5684
>>>> Phone: (574) 631-4335
>>>> Fax: (574) 631-8355
>>>> Email: akoz...@nd.edu
>>
>> --
>> Alexey V. Kozlov
>> Research Scientist
>> Department of Aerospace and Mechanical Engineering
>> University of Notre Dame
>> 117 Hessert Center
>> Notre Dame, IN 46556-5684
>> Phone: (574) 631-4335
>> Fax: (574) 631-8355
>> Email: akoz...@nd.edu
>
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>

--
Alexey V. Kozlov
Research Scientist
Department of Aerospace and Mechanical Engineering
University of Notre Dame
117 Hessert Center
Notre Dame, IN 46556-5684
Phone: (574) 631-4335
Fax: (574) 631-8355
Email: akoz...@nd.edu