On Tue, Aug 20, 2024 at 1:10 PM neil liu <liufi...@gmail.com> wrote:
> Thanks a lot for your explanation, Stefano. Very helpful.
> Yes, I am using DMPlex to read a tetrahedral mesh from Gmsh. With ParMETIS, the
> scaling performance is improved a lot.
> I will read your paper about how to change the basis for the Nedelec elements.
>
>   cpu #   time for 500 KSP steps (s)   parallel efficiency
>     2              546
>     4              224                       120%
>     8              170                        80%
>
> These results are much better than the previous attempt. I then checked the time
> spent in several PETSc built-in functions during the KSP solve:
>
>   Function    time (2 cpus)   time (4 cpus)   time (8 cpus)
>   VecMDot         78.32           43.28           30.47
>   VecMAXPY        92.95           48.37           30.798
>   MatMult        246.08          126.63           82.94
>
> It seems the scaling from 4 to 8 cpus is not as good as from 2 to 4 cpus.
> Am I missing something?

Did you normalize by the number of calls?

  Thanks,

    Matt

> Thanks a lot,
>
> Xiaodong
>
> On Mon, Aug 19, 2024 at 4:15 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>
>> It seems you are using DMPLEX to handle the mesh, correct?
>> If so, you should configure using --download-parmetis to get a better domain
>> decomposition, since the default one just splits the cells into chunks as they
>> are ordered.
>> This results in a large number of primal dofs on average (191, from the output
>> of -ksp_view)
>> ...
>> Primal dofs : 176 204 191
>> ...
>> which slows down the solver setup.
>>
>> Again, you should not use approximate local solvers with BDDC unless you know
>> what you are doing. The theory for approximate solvers in BDDC is limited and
>> covers only SPD problems.
>> Looking at the output of -log_view, the coarse problem setup (PCBDDCCSet) and
>> the primal functions setup (PCBDDCCorr) cost 35 and 63 seconds, respectively.
>> Also, the 500 applications of the GAMG preconditioner for the Neumann solver
>> (PCBDDCNeuS) take 129 seconds out of the 400 seconds of total solve time.
>>
>> PCBDDCTopo    1 1.0 3.1563e-01    1.0 1.11e+06 3.4 1.6e+03 3.9e+04 3.8e+01  0  0  1  0  2   0  0  1  0  2    19
>> PCBDDCLKSP    2 1.0 2.0423e+00    1.7 9.31e+08 1.2 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0  3378
>> PCBDDCLWor    1 1.0 3.9178e-02   13.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> PCBDDCCorr    1 1.0 6.3981e+01    2.2 8.16e+10 1.6 0.0e+00 0.0e+00 0.0e+00 11 11  0  0  0  11 11  0  0  0  8900
>> PCBDDCCSet    1 1.0 3.5453e+01 4564.9 1.06e+05 1.7 1.2e+03 5.3e+03 5.0e+01  2  0  1  0  3   2  0  1  0  3     0
>> PCBDDCCKSP    1 1.0 6.3266e-01    1.3 0.00e+00 0.0 3.3e+02 1.1e+02 2.2e+01  0  0  0  0  1   0  0  0  0  1     0
>> PCBDDCScal    1 1.0 6.8274e-03    1.3 1.11e+06 3.4 5.6e+01 3.2e+05 0.0e+00  0  0  0  0  0   0  0  0  0  0   894
>> PCBDDCDirS 1000 1.0 6.0420e+00    3.5 6.64e+09 5.4 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  2995
>> PCBDDCNeuS  500 1.0 1.2901e+02    2.1 8.28e+10 1.2 0.0e+00 0.0e+00 0.0e+00 22 12  0  0  0  22 12  0  0  0  4828
>> PCBDDCCoaS  500 1.0 5.8757e-01    1.8 1.09e+09 1.0 2.8e+04 7.4e+02 5.0e+02  0  0 17  0 28   0  0 17  0 31 14901
>>
>> Finally, if I look at the residual history, I see a sharp decrease followed by a
>> very long plateau. This indicates a bad coarse space; as I said before, there is
>> no hope of finding a suitable coarse space without first changing the basis of
>> the Nedelec elements, which is done automatically if you prescribe the discrete
>> gradient operator (see the paper linked in my previous message).
>>
>> On Sun, Aug 18, 2024 at 00:37 neil liu <liufi...@gmail.com> wrote:
>>
>>> Hi, Stefano,
>>> Please see the attached for the information with 4 and 8 CPUs for the complex
>>> matrix.
>>> I am solving the Maxwell equations (attached) using 2nd-order Nedelec elements
>>> (two dofs per edge and two dofs per face).
>>> The computational domain consists of different media, e.g., vacuum and
>>> substrate (different permittivity).
>>> A PML is used to truncate the computational domain, absorbing the outgoing wave
>>> and introducing complex numbers into the matrix.
>>>
>>> Thanks a lot for your suggestions. I will try MUMPS.
>>> For now, I just want to fiddle with PETSc's built-in features to learn more
>>> about them.
>>> Yes, 5000 is large; a smaller restart value, e.g., 30, converges very slowly.
>>>
>>> Thanks a lot.
>>>
>>> Have a good weekend.
>>>
>>> On Sat, Aug 17, 2024 at 9:23 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>
>>>> Please include the output of -log_view -ksp_view -ksp_monitor so we can
>>>> understand what's happening.
>>>>
>>>> Can you please share the equations you are solving so we can provide
>>>> suggestions on the solver configuration?
>>>> As I said, solving Nedelec-type discretizations is challenging and not a job
>>>> for off-the-shelf, black-box solvers.
>>>>
>>>> Below are some comments:
>>>>
>>>> - You use a redundant SVD approach for the coarse solve, which can be
>>>>   inefficient if your coarse space grows. You can use a parallel direct solver
>>>>   like MUMPS (reconfigure with --download-mumps and use
>>>>   -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps).
>>>> - Why use ILU for the Dirichlet problem and GAMG for the Neumann problem? With
>>>>   8 processes and 300K total dofs, you will have around 40K dofs per process,
>>>>   which is fine for a direct solver like MUMPS
>>>>   (-pc_bddc_dirichlet_pc_factor_mat_solver_type mumps, same for Neumann). With
>>>>   Nedelec dofs and the sparsity pattern they induce, I believe you can push to
>>>>   80K dofs per process with good performance.
>>>> - Why a restart of 5000 for GMRES? It is highly inefficient to
>>>>   re-orthogonalize against such a large set of vectors.
>>>>
>>>> On Fri, Aug 16, 2024 at 00:04 neil liu <liufi...@gmail.com> wrote:
>>>>
>>>>> Dear PETSc developers,
>>>>>
>>>>> Thanks for your previous help. Now PCBDDC can converge to 1e-8 with
>>>>>
>>>>> petsc-3.21.1/petsc/arch-linux-c-opt/bin/mpirun -n 8 ./app -pc_type bddc
>>>>> -pc_bddc_coarse_redundant_pc_type svd -ksp_error_if_not_converged -mat_type is
>>>>> -ksp_monitor -ksp_rtol 1e-8 -ksp_gmres_restart 5000 -ksp_view
>>>>> -pc_bddc_use_local_mat_graph 0 -pc_bddc_dirichlet_pc_type ilu
>>>>> -pc_bddc_neumann_pc_type gamg -pc_bddc_neumann_pc_gamg_esteig_ksp_max_it 10
>>>>> -ksp_converged_reason -pc_bddc_neumann_approximate -ksp_max_it 500 -log_view
>>>>>
>>>>> I then ran two cases for a strong scaling test. Case 1 involves only real
>>>>> numbers (tetra #: 49,152; dof #: 324,224) for the matrix and rhs. Case 2
>>>>> involves complex numbers (tetra #: 95,336; dof #: 611,432) due to the PML.
>>>>>
>>>>> Case 1:
>>>>>   cpu #   Time for 500 KSP steps (s)   Parallel efficiency   PCSetUp time (s)
>>>>>     2             234.7                                            3.12
>>>>>     4             126.6                       0.92                 1.62
>>>>>     8              84.97                      0.69                 1.26
>>>>>
>>>>> However, for Case 2:
>>>>>   cpu #   Time for 500 KSP steps (s)   Parallel efficiency   PCSetUp time (s)
>>>>>     2             584.5                                            8.61
>>>>>     4             376.8                       0.77                 6.56
>>>>>     8             459.6                       0.31                66.47
>>>>>
>>>>> For these two cases, I checked the PCSetUp time as an example. It seems the
>>>>> 8-cpu run for Case 2 spends far too much time in PCSetUp.
>>>>> Do you have any ideas about what is going on here?
>>>>> Thanks,
>>>>> Xiaodong
>>>>
>>>> --
>>>> Stefano
>>>
>>
>> --
>> Stefano
>

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
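
For reference, the change of basis Stefano recommends is prescribed through the C API rather than a command-line option. Below is a minimal sketch, not code from this thread, of how a user-assembled discrete gradient matrix G could be handed to PCBDDC via PCBDDCSetDiscreteGradient. The helper name SetupBDDCWithGradient is made up here, and it is assumed that the application has already assembled A (the MATIS system matrix) and G (the vertex-to-edge gradient map for the 2nd-order Nedelec space, in AIJ format); the order/field/global/conforming arguments would need to match the actual discretization.

  #include <petscksp.h>

  /* Sketch only: attach a user-assembled discrete gradient G to PCBDDC so the
   * preconditioner can change the basis of the Nedelec dofs. A is the MATIS
   * system matrix; both A and G are assumed to be assembled elsewhere.      */
  static PetscErrorCode SetupBDDCWithGradient(KSP ksp, Mat A, Mat G)
  {
    PC pc;

    PetscFunctionBeginUser;
    PetscCall(KSPSetOperators(ksp, A, A));
    PetscCall(KSPGetPC(ksp, &pc));
    PetscCall(PCSetType(pc, PCBDDC));
    /* order = 2 for the 2nd-order Nedelec space, field id 0, G given in global
     * numbering, conforming tetrahedral mesh; adjust to the discretization.  */
    PetscCall(PCBDDCSetDiscreteGradient(pc, G, 2, 0, PETSC_TRUE, PETSC_TRUE));
    PetscCall(KSPSetFromOptions(ksp)); /* -pc_bddc_* options still apply */
    PetscFunctionReturn(PETSC_SUCCESS);
  }

The PCBDDCSetDiscreteGradient manual page documents the meaning of the order, field, global, and conforming arguments; the values shown above are illustrative guesses for this problem, not something verified in the thread.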
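Pulling Stefano's suggestions together, a revised run command might look like the sketch below. Only the coarse-solver options are quoted verbatim from his message; the lu sub-PC types for the Dirichlet and Neumann local solvers and the restart value of 50 are untested assumptions added here for illustration.

  mpirun -n 8 ./app -pc_type bddc -mat_type is -ksp_rtol 1e-8 -ksp_max_it 500 \
    -ksp_gmres_restart 50 \
    -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps \
    -pc_bddc_dirichlet_pc_type lu -pc_bddc_dirichlet_pc_factor_mat_solver_type mumps \
    -pc_bddc_neumann_pc_type lu -pc_bddc_neumann_pc_factor_mat_solver_type mumps \
    -ksp_monitor -ksp_converged_reason -log_view

This drops the approximate Neumann solver (-pc_bddc_neumann_approximate with GAMG) and the ILU Dirichlet solver in favor of exact local MUMPS factorizations, in line with the advice above.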