See the detailed discussion at https://petsc.org/main/manual/streams/
 


> On Aug 20, 2024, at 5:53 PM, Matthew Knepley <knep...@gmail.com> wrote:
> 
> On Tue, Aug 20, 2024 at 2:31 PM neil liu <liufi...@gmail.com> wrote:
>> Thanks a lot for this explanation, Matt. I will explore whether the matrices 
>> have the same size and sparsity.
> 
> I think it is much more likely that you just exhausted bandwidth on the node.
> 
>   Thanks,
> 
>     Matt
>  
>> On Tue, Aug 20, 2024 at 1:45 PM Matthew Knepley <knep...@gmail.com> wrote:
>>> On Tue, Aug 20, 2024 at 1:36 PM neil liu <liufi...@gmail.com> wrote:
>>>> Hi, Matt, 
>>>> I think the time listed here represents the maximum total time across 
>>>> different processors.
>>>> 
>>>> Thanks a lot. 
>>>>                 --------- 2 cpus ---------    --------- 4 cpus ---------    --------- 8 cpus ---------
>>>> Event           Count  Ratio  Time (sec)  Ratio   Count  Ratio  Time (sec)  Ratio   Count  Ratio  Time (sec)  Ratio
>>>> VecMDot           530    1.0  7.8320e+01    1.0     530    1.0  4.3285e+01    1.1     530    1.0  3.0476e+01    1.1
>>>> VecMAXPY          534    1.0  9.2954e+01    1.0     534    1.0  4.8378e+01    1.1     534    1.0  3.0798e+01    1.1
>>>> MatMult          8055    1.0  2.4608e+02    1.0    8103    1.0  1.2663e+02    1.0    8367    1.0  8.2942e+01    1.1
>>> 
>>> Looking at the number of calls listed:
>>> 
>>> 1) The number of MatMults goes up, so you should normalize for that, but you
>>>    still only get about a 1.6x speedup (worked out just after these points).
>>>    However, this lumps all the multiplications together: are we sure the
>>>    matrices involved have the same size and sparsity?
>>> 
>>> 2) VecMAXPY also shows about a 1.6x speedup.
>>> 
>>> 3) VecMDot probably does not see the latency of a single node, so again it
>>>    is not speeding up as much as you might want.
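>>> 
>>> For example, normalizing MatMult by its call count: 1.2663e+02 s / 8103 calls
>>> is about 1.6e-2 s per call on 4 procs, versus 8.2942e+01 s / 8367 calls, about
>>> 9.9e-3 s per call on 8 procs, i.e. only roughly a 1.6x speedup for a doubling
>>> of the process count. The same arithmetic for VecMAXPY (4.8378e+01/534 versus
>>> 3.0798e+01/534) also gives roughly 1.6x.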
>>> 
>>> This looks like you are using a single node with 2, 4, and 8 procs. The 
>>> memory bandwidth is exhausted sometime before 8 procs
>>> (maybe 6), so you cease to see speedup. You can check this by running `make 
>>> streams` on the node.
>>> 
>>>   Thanks,
>>> 
>>>      Matt
>>>  
>>>> On Tue, Aug 20, 2024 at 1:16 PM Matthew Knepley <knep...@gmail.com> wrote:
>>>>> On Tue, Aug 20, 2024 at 1:10 PM neil liu <liufi...@gmail.com> wrote:
>>>>>> Thanks a lot for your explanation, Stefano. Very helpful. 
>>>>>> Yes, I am using DMPlex to read a tetrahedral mesh from Gmsh. With ParMETIS,
>>>>>> the scaling performance is improved a lot.
>>>>>> I will read your paper about how to change the basis for Nedelec 
>>>>>> elements. 
>>>>>> 
>>>>>> cpu #    Time for 500 KSP steps (s)    Parallel efficiency
>>>>>>   2      546                            --
>>>>>>   4      224                            120%
>>>>>>   8      170                             80%
>>>>>> These results are much better than the previous attempt. Then I checked the
>>>>>> time spent in several PETSc built-in functions during the KSP solve.
>>>>>> 
>>>>>> Function      Time (s), 2 cpus    Time (s), 4 cpus    Time (s), 8 cpus
>>>>>> VecMDot        78.32               43.28               30.47
>>>>>> VecMAXPY       92.95               48.37               30.798
>>>>>> MatMult       246.08              126.63               82.94
>>>>>> 
>>>>>> It seems the scaling from 4 to 8 CPUs is not as good as from 2 to 4 CPUs.
>>>>>> Am I missing something?
>>>>> 
>>>>> Did you normalize by the number of calls?
>>>>> 
>>>>>   Thanks,
>>>>> 
>>>>>      Matt
>>>>>  
>>>>>> Thanks a lot,
>>>>>> 
>>>>>> Xiaodong 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Aug 19, 2024 at 4:15 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>> It seems you are using DMPLEX to handle the mesh, correct?
>>>>>>> If so, you should configure using --download-parmetis to get a better domain
>>>>>>> decomposition, since the default partitioner just splits the cells into chunks
>>>>>>> in the order they are stored (a code sketch for selecting ParMETIS is given
>>>>>>> below).
>>>>>>> This results in a large number of primal dofs on average (191, from the 
>>>>>>>  output of ksp_view)
>>>>>>> ...
>>>>>>> Primal    dofs   : 176 204 191
>>>>>>> ...
>>>>>>> that slows down the solver setup.
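>>>>>>> 
>>>>>>> Once PETSc has been configured with ParMETIS, it can typically be selected with
>>>>>>> the -petscpartitioner_type parmetis option, or explicitly in code. A minimal
>>>>>>> sketch of the code route (dm and dmDist are placeholder names for your own
>>>>>>> DMPlex objects, and ParMETIS may already be picked as the default partitioner
>>>>>>> once it is installed):
>>>>>>> 
>>>>>>>   #include <petscdmplex.h>   /* at the top of the source file */
>>>>>>> 
>>>>>>>   PetscPartitioner part;
>>>>>>>   DM               dmDist = NULL;
>>>>>>> 
>>>>>>>   PetscCall(DMPlexGetPartitioner(dm, &part));                          /* partitioner attached to the serial mesh */
>>>>>>>   PetscCall(PetscPartitionerSetType(part, PETSCPARTITIONERPARMETIS));  /* request a ParMETIS decomposition */
>>>>>>>   PetscCall(DMPlexDistribute(dm, 0, NULL, &dmDist));                   /* redistribute the cells across the ranks */
>>>>>>>   if (dmDist) {
>>>>>>>     PetscCall(DMDestroy(&dm));
>>>>>>>     dm = dmDist;
>>>>>>>   }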
>>>>>>> 
>>>>>>> Again, you should not use approximate local solvers with BDDC unless you know
>>>>>>> what you are doing. The theory for approximate solvers in BDDC is limited and
>>>>>>> covers only SPD problems.
>>>>>>> Looking at the output of -log_view, the coarse problem setup (PCBDDCCSet) and
>>>>>>> the primal functions setup (PCBDDCCorr) cost 35 and 63 seconds, respectively.
>>>>>>> Also, the 500 applications of the GAMG preconditioner for the Neumann solver
>>>>>>> (PCBDDCNeuS) take 129 seconds out of the 400 seconds of total solve time.
>>>>>>> 
>>>>>>> PCBDDCTopo       1 1.0 3.1563e-01    1.0 1.11e+06 3.4 1.6e+03 3.9e+04 3.8e+01  0  0  1  0  2   0  0  1  0  2    19
>>>>>>> PCBDDCLKSP       2 1.0 2.0423e+00    1.7 9.31e+08 1.2 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0  3378
>>>>>>> PCBDDCLWor       1 1.0 3.9178e-02   13.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>>>> PCBDDCCorr       1 1.0 6.3981e+01    2.2 8.16e+10 1.6 0.0e+00 0.0e+00 0.0e+00 11 11  0  0  0  11 11  0  0  0  8900
>>>>>>> PCBDDCCSet       1 1.0 3.5453e+01 4564.9 1.06e+05 1.7 1.2e+03 5.3e+03 5.0e+01  2  0  1  0  3   2  0  1  0  3     0
>>>>>>> PCBDDCCKSP       1 1.0 6.3266e-01    1.3 0.00e+00 0.0 3.3e+02 1.1e+02 2.2e+01  0  0  0  0  1   0  0  0  0  1     0
>>>>>>> PCBDDCScal       1 1.0 6.8274e-03    1.3 1.11e+06 3.4 5.6e+01 3.2e+05 0.0e+00  0  0  0  0  0   0  0  0  0  0   894
>>>>>>> PCBDDCDirS    1000 1.0 6.0420e+00    3.5 6.64e+09 5.4 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  2995
>>>>>>> PCBDDCNeuS     500 1.0 1.2901e+02    2.1 8.28e+10 1.2 0.0e+00 0.0e+00 0.0e+00 22 12  0  0  0  22 12  0  0  0  4828
>>>>>>> PCBDDCCoaS     500 1.0 5.8757e-01    1.8 1.09e+09 1.0 2.8e+04 7.4e+02 5.0e+02  0  0 17  0 28   0  0 17  0 31 14901
>>>>>>> 
>>>>>>> Finally, if I look at the residual history, I see a sharp decrease and 
>>>>>>> a very long plateau. This indicates a bad coarse space; as I said 
>>>>>>> before, there's no hope of finding a suitable coarse space without 
>>>>>>> first changing the basis of the Nedelec elements, which is done 
>>>>>>> automatically if you prescribe the discrete gradient operator (see the 
>>>>>>> paper I have linked to in my previous communication).
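>>>>>>> 
>>>>>>> For what it's worth, a minimal sketch of how the discrete gradient could be
>>>>>>> handed to PCBDDC from the application code; the gradient matrix G (mapping the
>>>>>>> nodal H1 space into the Nedelec space) has to be assembled by the application
>>>>>>> and is only assumed here, ksp is the solver object, and the trailing arguments
>>>>>>> (order, field, ordering and conformity flags) must match your discretization
>>>>>>> (see the PCBDDCSetDiscreteGradient manual page):
>>>>>>> 
>>>>>>>   Mat G;    /* discrete gradient: rows are Nedelec dofs, columns are nodal dofs */
>>>>>>>   PC  pc;
>>>>>>> 
>>>>>>>   /* ... assemble G from the mesh (application code, not shown) ... */
>>>>>>>   PetscCall(KSPGetPC(ksp, &pc));
>>>>>>>   PetscCall(PCSetType(pc, PCBDDC));
>>>>>>>   PetscCall(PCBDDCSetDiscreteGradient(pc, G,
>>>>>>>                                       2,           /* order of the Nedelec space */
>>>>>>>                                       0,           /* field id of the Nedelec dofs */
>>>>>>>                                       PETSC_TRUE,  /* rows of G in global ordering (adjust to your setup) */
>>>>>>>                                       PETSC_TRUE   /* conforming mesh */));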
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sun, Aug 18, 2024 at 00:37 neil liu <liufi...@gmail.com> wrote:
>>>>>>>> Hi, Stefano, 
>>>>>>>> Please see the attached for the information with 4 and 8 CPUs for the 
>>>>>>>> complex matrix.
>>>>>>>> I am solving the Maxwell equations (attached) using 2nd-order Nedelec elements
>>>>>>>> (two dofs per edge and two dofs per face).
>>>>>>>> The computational domain consists of different media, e.g., vacuum and
>>>>>>>> substrate (different permittivity).
>>>>>>>> A PML is used to truncate the computational domain, absorbing the outgoing
>>>>>>>> wave and introducing complex numbers into the matrix.
>>>>>>>> 
>>>>>>>> Thanks a lot for your suggestions. I will try MUMPS. 
>>>>>>>> For now, I just want to fiddle with PETSc's built-in features to learn more
>>>>>>>> about them.
>>>>>>>> Yes, 5000 is large; a smaller value, e.g., 30, converges very slowly.
>>>>>>>> 
>>>>>>>> Thanks a lot. 
>>>>>>>> 
>>>>>>>> Have a good weekend. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Aug 17, 2024 at 9:23 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>>> Please include the output of -log_view -ksp_view -ksp_monitor to 
>>>>>>>>> understand what's happening.
>>>>>>>>> 
>>>>>>>>> Can you please share the equations you are solving so we can provide 
>>>>>>>>> suggestions on the solver configuration?
>>>>>>>>> As I said, solving Nedelec-type discretizations is challenging and not a job
>>>>>>>>> for off-the-shelf, black-box solvers.
>>>>>>>>> 
>>>>>>>>> Below are some comments:
>>>>>>>>> 
>>>>>>>>> - You use a redundant SVD approach for the coarse solve, which can be
>>>>>>>>>   inefficient if your coarse space grows. You can use a parallel direct solver
>>>>>>>>>   like MUMPS (reconfigure with --download-mumps and use
>>>>>>>>>   -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps).
>>>>>>>>> - Why use ILU for the Dirichlet problem and GAMG for the Neumann problem? With
>>>>>>>>>   8 processes and 300K total dofs, you will have around 40K dofs per process,
>>>>>>>>>   which is ok for a direct solver like MUMPS
>>>>>>>>>   (-pc_bddc_dirichlet_pc_factor_mat_solver_type mumps, same for Neumann). With
>>>>>>>>>   Nedelec dofs and the sparsity pattern they induce, I believe you can push to
>>>>>>>>>   80K dofs per process with good performance.
>>>>>>>>> - Why a restart of 5000 for GMRES? It is highly inefficient to
>>>>>>>>>   re-orthogonalize such a large set of vectors (see the short sketch below).
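>>>>>>>>> 
>>>>>>>>> If the restart is set in code rather than on the command line, a minimal
>>>>>>>>> sketch (ksp being your solver object) would be:
>>>>>>>>> 
>>>>>>>>>   PetscCall(KSPSetType(ksp, KSPGMRES));
>>>>>>>>>   PetscCall(KSPGMRESSetRestart(ksp, 30)); /* shorter Krylov basis, far less re-orthogonalization per cycle */
>>>>>>>>> 
>>>>>>>>> A small restart only pays off once the preconditioner is strong enough that
>>>>>>>>> convergence does not rely on a very large Krylov space.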
>>>>>>>>> 
>>>>>>>>> On Fri, Aug 16, 2024 at 00:04 neil liu <liufi...@gmail.com> wrote:
>>>>>>>>>> Dear Petsc developers, 
>>>>>>>>>> 
>>>>>>>>>> Thanks for your previous help. Now PCBDDC can converge to 1e-8 with:
>>>>>>>>>> 
>>>>>>>>>> petsc-3.21.1/petsc/arch-linux-c-opt/bin/mpirun -n 8 ./app -pc_type 
>>>>>>>>>> bddc -pc_bddc_coarse_redundant_pc_type svd   
>>>>>>>>>> -ksp_error_if_not_converged -mat_type is -ksp_monitor -ksp_rtol 1e-8 
>>>>>>>>>> -ksp_gmres_restart 5000 -ksp_view -pc_bddc_use_local_mat_graph 0  
>>>>>>>>>> -pc_bddc_dirichlet_pc_type ilu -pc_bddc_neumann_pc_type gamg 
>>>>>>>>>> -pc_bddc_neumann_pc_gamg_esteig_ksp_max_it 10 -ksp_converged_reason 
>>>>>>>>>> -pc_bddc_neumann_approximate -ksp_max_it 500 -log_view
>>>>>>>>>> 
>>>>>>>>>> Then I used two cases for a strong scaling test. One case involves only real
>>>>>>>>>> numbers (tetra #: 49,152; dof #: 324,224) for the matrix and rhs. The second
>>>>>>>>>> case involves complex numbers (tetra #: 95,336; dof #: 611,432) due to the PML.
>>>>>>>>>> 
>>>>>>>>>> Case 1:
>>>>>>>>>> cpu #    Time for 500 KSP steps (s)    Parallel efficiency    PCSetUp time (s)
>>>>>>>>>>   2      234.7                         --                      3.12
>>>>>>>>>>   4      126.6                         0.92                    1.62
>>>>>>>>>>   8       84.97                        0.69                    1.26
>>>>>>>>>> 
>>>>>>>>>> However, for Case 2:
>>>>>>>>>> cpu #    Time for 500 KSP steps (s)    Parallel efficiency    PCSetUp time (s)
>>>>>>>>>>   2      584.5                         --                      8.61
>>>>>>>>>>   4      376.8                         0.77                    6.56
>>>>>>>>>>   8      459.6                         0.31                   66.47
>>>>>>>>>> For these two cases, I checked the time for PCSetUp as an example. It seems
>>>>>>>>>> that with 8 CPUs, Case 2 uses too much time in PCSetUp.
>>>>>>>>>> Do you have any ideas about what is going on here?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Xiaodong 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Stefano
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Stefano
>>>>> 
>>>>> 
>>>>> --
>>>>> What most experimenters take for granted before they begin their 
>>>>> experiments is infinitely more interesting than any results to which 
>>>>> their experiments lead.
>>>>> -- Norbert Wiener
>>>>> 
>>>>> https://www.cse.buffalo.edu/~knepley/
>>> 
>>> 
>>> --
>>> What most experimenters take for granted before they begin their 
>>> experiments is infinitely more interesting than any results to which their 
>>> experiments lead.
>>> -- Norbert Wiener
>>> 
>>> https://www.cse.buffalo.edu/~knepley/
> 
> 
> --
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/
