On Tue, Aug 20, 2024 at 2:31 PM neil liu <liufi...@gmail.com> wrote:
> Thanks a lot for this explanation, Matt. I will explore whether the matrix
> has the same size and sparsity.
>

I think it is much more likely that you just exhausted the memory bandwidth on the node.

  Thanks,

     Matt

> On Tue, Aug 20, 2024 at 1:45 PM Matthew Knepley <knep...@gmail.com> wrote:
>
>> On Tue, Aug 20, 2024 at 1:36 PM neil liu <liufi...@gmail.com> wrote:
>>
>>> Hi, Matt,
>>> I think the time listed here represents the maximum total time across
>>> the different processors.
>>>
>>> Thanks a lot.
>>>
>>>                           2 cpus                       4 cpus                       8 cpus
>>> Event        Count     Time (sec)        Count     Time (sec)        Count     Time (sec)
>>>              Max Ratio Max        Ratio  Max Ratio Max        Ratio  Max Ratio Max        Ratio
>>> VecMDot       530 1.0  7.8320e+01 1.0     530 1.0  4.3285e+01 1.1     530 1.0  3.0476e+01 1.1
>>> VecMAXPY      534 1.0  9.2954e+01 1.0     534 1.0  4.8378e+01 1.1     534 1.0  3.0798e+01 1.1
>>> MatMult      8055 1.0  2.4608e+02 1.0    8103 1.0  1.2663e+02 1.0    8367 1.0  8.2942e+01 1.1
>>>
>>
>> For the number of calls listed:
>>
>> 1) The number of MatMult calls goes up, so you should normalize for that, but
>> you still only get about a 1.6x speedup. However, this is all matrix-vector
>> multiplications. Are we sure the matrices have the same size and sparsity?
>>
>> 2) MAXPY is also about 1.6x.
>>
>> 3) MDot probably does not see the latency of one node, so again it is not
>> speeding up as much as you might want.
>>
>> This looks like you are using a single node with 2, 4, and 8 processes. The
>> memory bandwidth is exhausted somewhere before 8 processes (maybe 6), so you
>> cease to see speedup. You can check this by running `make streams` on the node.
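>>
>> As a rough check with the numbers above: per call, MatMult costs about
>> 2.4608e+02/8055 = 3.1e-2 s on 2 processes, 1.2663e+02/8103 = 1.6e-2 s on 4,
>> and 8.2942e+01/8367 = 9.9e-3 s on 8, i.e. only about a 1.6x step speedup from
>> 4 to 8 instead of the ideal 2x. For the bandwidth check, something along these
>> lines from the PETSc source directory should work (the arch name is taken from
>> your earlier command and may differ on your machine; NPMAX caps the number of
>> MPI processes tested):
>>
>>   make PETSC_ARCH=arch-linux-c-opt streams NPMAX=8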
>>
>>   Thanks,
>>
>>      Matt
>>
>>> On Tue, Aug 20, 2024 at 1:16 PM Matthew Knepley <knep...@gmail.com> wrote:
>>>
>>>> On Tue, Aug 20, 2024 at 1:10 PM neil liu <liufi...@gmail.com> wrote:
>>>>
>>>>> Thanks a lot for your explanation, Stefano. Very helpful.
>>>>> Yes, I am using DMPlex to read a tetrahedral mesh from Gmsh. With
>>>>> ParMETIS, the scaling performance is improved a lot.
>>>>> I will read your paper about how to change the basis for the Nedelec
>>>>> elements.
>>>>>
>>>>> cpu #   time for 500 KSP steps (s)   parallel efficiency
>>>>> 2       546
>>>>> 4       224                          120%
>>>>> 8       170                           80%
>>>>>
>>>>> These results are much better than the previous attempt. I then checked
>>>>> the time spent in several PETSc built-in functions for the KSP solver.
>>>>>
>>>>> Function    time (2 cpus)   time (4 cpus)   time (8 cpus)
>>>>> VecMDot     78.32           43.28           30.47
>>>>> VecMAXPY    92.95           48.37           30.798
>>>>> MatMult     246.08          126.63          82.94
>>>>>
>>>>> It seems that from 4 cpus to 8 cpus the scaling is not as good as from
>>>>> 2 cpus to 4 cpus.
>>>>> Am I missing something?
>>>>>
>>>>
>>>> Did you normalize by the number of calls?
>>>>
>>>>   Thanks,
>>>>
>>>>      Matt
>>>>
>>>>> Thanks a lot,
>>>>>
>>>>> Xiaodong
>>>>>
>>>>> On Mon, Aug 19, 2024 at 4:15 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>
>>>>>> It seems you are using DMPLEX to handle the mesh, correct?
>>>>>> If so, you should configure using --download-parmetis to get a better
>>>>>> domain decomposition, since the default one just splits the cells into
>>>>>> chunks as they are ordered.
>>>>>> This results in a large number of primal dofs on average (191, from
>>>>>> the output of ksp_view)
>>>>>> ...
>>>>>> Primal dofs : 176 204 191
>>>>>> ...
>>>>>> which slows down the solver setup.
>>>>>>
>>>>>> Again, you should not use approximate local solvers with BDDC unless
>>>>>> you know what you are doing. The theory for approximate solvers for
>>>>>> BDDC is limited and covers only SPD problems.
>>>>>>
>>>>>> Looking at the output of log_view, the coarse problem setup (PCBDDCCSet)
>>>>>> and the primal functions setup (PCBDDCCorr) cost 35 and 63 seconds,
>>>>>> respectively. Also, the 500 applications of the GAMG preconditioner for
>>>>>> the Neumann solver (PCBDDCNeuS) take 129 seconds out of the 400 seconds
>>>>>> of total solve time.
>>>>>>
>>>>>> PCBDDCTopo     1 1.0 3.1563e-01    1.0 1.11e+06 3.4 1.6e+03 3.9e+04 3.8e+01  0  0  1  0  2   0  0  1  0  2    19
>>>>>> PCBDDCLKSP     2 1.0 2.0423e+00    1.7 9.31e+08 1.2 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0  3378
>>>>>> PCBDDCLWor     1 1.0 3.9178e-02   13.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>>> PCBDDCCorr     1 1.0 6.3981e+01    2.2 8.16e+10 1.6 0.0e+00 0.0e+00 0.0e+00 11 11  0  0  0  11 11  0  0  0  8900
>>>>>> PCBDDCCSet     1 1.0 3.5453e+01 4564.9 1.06e+05 1.7 1.2e+03 5.3e+03 5.0e+01  2  0  1  0  3   2  0  1  0  3     0
>>>>>> PCBDDCCKSP     1 1.0 6.3266e-01    1.3 0.00e+00 0.0 3.3e+02 1.1e+02 2.2e+01  0  0  0  0  1   0  0  0  0  1     0
>>>>>> PCBDDCScal     1 1.0 6.8274e-03    1.3 1.11e+06 3.4 5.6e+01 3.2e+05 0.0e+00  0  0  0  0  0   0  0  0  0  0   894
>>>>>> PCBDDCDirS  1000 1.0 6.0420e+00    3.5 6.64e+09 5.4 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  2995
>>>>>> PCBDDCNeuS   500 1.0 1.2901e+02    2.1 8.28e+10 1.2 0.0e+00 0.0e+00 0.0e+00 22 12  0  0  0  22 12  0  0  0  4828
>>>>>> PCBDDCCoaS   500 1.0 5.8757e-01    1.8 1.09e+09 1.0 2.8e+04 7.4e+02 5.0e+02  0  0 17  0 28   0  0 17  0 31 14901
>>>>>>
>>>>>> Finally, if I look at the residual history, I see a sharp decrease and
>>>>>> then a very long plateau. This indicates a bad coarse space; as I said
>>>>>> before, there is no hope of finding a suitable coarse space without first
>>>>>> changing the basis of the Nedelec elements, which is done automatically
>>>>>> if you prescribe the discrete gradient operator (see the paper I linked
>>>>>> to in my previous communication).
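>>>>>>
>>>>>> In code, the hook for this is PCBDDCSetDiscreteGradient(). A minimal
>>>>>> sketch, not a drop-in (the helper name AttachDiscreteGradient is made up,
>>>>>> G is assumed to be the discrete gradient matrix you assemble for your
>>>>>> discretization, and the order/field/ordering/conformity arguments are
>>>>>> assumptions you should adjust for your setup):
>>>>>>
>>>>>>   #include <petscksp.h>
>>>>>>
>>>>>>   /* Attach a user-assembled discrete gradient G (rows: Nedelec edge/face
>>>>>>      dofs, cols: nodal dofs) so PCBDDC can change the Nedelec basis. */
>>>>>>   static PetscErrorCode AttachDiscreteGradient(KSP ksp, Mat G)
>>>>>>   {
>>>>>>     PC pc;
>>>>>>
>>>>>>     PetscFunctionBeginUser;
>>>>>>     PetscCall(KSPGetPC(ksp, &pc));
>>>>>>     PetscCall(PCSetType(pc, PCBDDC));
>>>>>>     /* order 2 (2nd-order Nedelec space), field 0, global row ordering,
>>>>>>        conforming mesh: assumed values, see the manual page. */
>>>>>>     PetscCall(PCBDDCSetDiscreteGradient(pc, G, 2, 0, PETSC_TRUE, PETSC_TRUE));
>>>>>>     PetscFunctionReturn(PETSC_SUCCESS);
>>>>>>   }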
>>>>>>
>>>>>> On Sun, Aug 18, 2024 at 00:37 neil liu <liufi...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, Stefano,
>>>>>>> Please see the attached for the information with 4 and 8 CPUs for
>>>>>>> the complex matrix.
>>>>>>> I am solving the Maxwell equations (attached) using 2nd-order Nedelec
>>>>>>> elements (two dofs on each edge and two dofs on each face).
>>>>>>> The computational domain consists of different media, e.g., vacuum
>>>>>>> and substrate (different permittivity).
>>>>>>> A PML is used to truncate the computational domain, absorbing the
>>>>>>> outgoing wave and introducing complex numbers into the matrix.
>>>>>>>
>>>>>>> Thanks a lot for your suggestions. I will try MUMPS.
>>>>>>> For now, I just want to fiddle with PETSc's built-in features to learn
>>>>>>> more about them.
>>>>>>> Yes, 5000 is large. A smaller value, e.g., 30, converges very slowly.
>>>>>>>
>>>>>>> Thanks a lot.
>>>>>>>
>>>>>>> Have a good weekend.
>>>>>>>
>>>>>>> On Sat, Aug 17, 2024 at 9:23 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Please include the output of -log_view -ksp_view -ksp_monitor so we
>>>>>>>> can understand what's happening.
>>>>>>>>
>>>>>>>> Can you please share the equations you are solving so we can provide
>>>>>>>> suggestions on the solver configuration?
>>>>>>>> As I said, solving Nedelec-type discretizations is challenging, and
>>>>>>>> not a job for off-the-shelf, black-box solvers.
>>>>>>>>
>>>>>>>> Below are some comments (a combined example follows the list):
>>>>>>>>
>>>>>>>> - You use a redundant SVD approach for the coarse solve, which can be
>>>>>>>>   inefficient if your coarse space grows. You can use a parallel direct
>>>>>>>>   solver like MUMPS instead (reconfigure with --download-mumps and use
>>>>>>>>   -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps).
>>>>>>>> - Why use ILU for the Dirichlet problem and GAMG for the Neumann
>>>>>>>>   problem? With 8 processes and 300K total dofs, you will have around
>>>>>>>>   40K dofs per process, which is fine for a direct solver like MUMPS
>>>>>>>>   (-pc_bddc_dirichlet_pc_factor_mat_solver_type mumps, and the same for
>>>>>>>>   Neumann). With Nedelec dofs and the sparsity pattern they induce, I
>>>>>>>>   believe you can push to 80K dofs per process with good performance.
>>>>>>>> - Why a restart of 5000 for GMRES? It is highly inefficient to
>>>>>>>>   re-orthogonalize such a large set of vectors.
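>>>>>>>>
>>>>>>>> Putting those together, one possible command line to try (a sketch only:
>>>>>>>> it reuses the executable and options from your earlier run, and the
>>>>>>>> restart value of 100 is just an example to replace 5000):
>>>>>>>>
>>>>>>>>   mpirun -n 8 ./app -pc_type bddc -mat_type is \
>>>>>>>>     -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps \
>>>>>>>>     -pc_bddc_dirichlet_pc_type lu -pc_bddc_dirichlet_pc_factor_mat_solver_type mumps \
>>>>>>>>     -pc_bddc_neumann_pc_type lu -pc_bddc_neumann_pc_factor_mat_solver_type mumps \
>>>>>>>>     -ksp_gmres_restart 100 -ksp_rtol 1e-8 -ksp_monitor -ksp_converged_reason -log_view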
>>>>>>>>
>>>>>>>> On Fri, Aug 16, 2024 at 00:04 neil liu <liufi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Dear PETSc developers,
>>>>>>>>>
>>>>>>>>> Thanks for your previous help. Now PCBDDC can converge to 1e-8 with
>>>>>>>>>
>>>>>>>>> petsc-3.21.1/petsc/arch-linux-c-opt/bin/mpirun -n 8 ./app -pc_type bddc
>>>>>>>>> -pc_bddc_coarse_redundant_pc_type svd -ksp_error_if_not_converged
>>>>>>>>> -mat_type is -ksp_monitor -ksp_rtol 1e-8 -ksp_gmres_restart 5000 -ksp_view
>>>>>>>>> -pc_bddc_use_local_mat_graph 0 -pc_bddc_dirichlet_pc_type ilu
>>>>>>>>> -pc_bddc_neumann_pc_type gamg -pc_bddc_neumann_pc_gamg_esteig_ksp_max_it 10
>>>>>>>>> -ksp_converged_reason -pc_bddc_neumann_approximate -ksp_max_it 500 -log_view
>>>>>>>>>
>>>>>>>>> Then I used two cases for a strong scaling test. One case involves only
>>>>>>>>> real numbers (tetra #: 49,152; dof #: 324,224) for the matrix and rhs.
>>>>>>>>> The second case involves complex numbers (tetra #: 95,336; dof #: 611,432)
>>>>>>>>> due to the PML.
>>>>>>>>>
>>>>>>>>> Case 1:
>>>>>>>>> cpu #   Time for 500 ksp steps (s)   Parallel efficiency   PCSetUp time (s)
>>>>>>>>> 2       234.7                                              3.12
>>>>>>>>> 4       126.6                        0.92                  1.62
>>>>>>>>> 8       84.97                        0.69                  1.26
>>>>>>>>>
>>>>>>>>> However, for Case 2:
>>>>>>>>> cpu #   Time for 500 ksp steps (s)   Parallel efficiency   PCSetUp time (s)
>>>>>>>>> 2       584.5                                              8.61
>>>>>>>>> 4       376.8                        0.77                  6.56
>>>>>>>>> 8       459.6                        0.31                  66.47
>>>>>>>>>
>>>>>>>>> For these two cases, I checked the time for PCSetUp as an example.
>>>>>>>>> It seems that with 8 cpus Case 2 spends far too much time in PCSetUp.
>>>>>>>>> Do you have any ideas about what is going on here?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Xiaodong
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Stefano
>>>>>>
>>>>>> --
>>>>>> Stefano
>>>>>

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
  -- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/