> On 21 Jun 2024, at 6:42 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> I remember there are some MKL env vars to print MKL routines called.
The environment variable is MKL_VERBOSE.

Thanks,
Pierre

> Maybe we can try it to see what MKL routines are really used, and then we can understand why some PETSc functions did not speed up.
>
> --Junchao Zhang


On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:

Hi Barry, sorry for my last results. I didn’t fully understand the stage profiling and logging in PETSc; now I only record the KSPSolve() stage of my program. Sample code is as follows:

    // Static variable to keep track of the stage counter
    static int stageCounter = 1;

    // Generate a unique stage name
    std::ostringstream oss;
    oss << "Stage " << stageCounter << " of Code";
    std::string stageName = oss.str();

    // Register the stage and push it so the following KSPSolve() is logged under it
    PetscLogStage stagenum;
    PetscLogStageRegister(stageName.c_str(), &stagenum);
    PetscLogStagePush(stagenum);

    KSPSolve(*ksp_ptr, b, x);

    PetscLogStagePop();
    stageCounter++;

I have attached my new logging results; there is 1 main stage and 4 other stages, each corresponding to one KSPSolve() call.

To provide some additional background, if you recall, I have been trying to get an efficient iterative solution using multithreading. I found that by compiling PETSc with the Intel MKL library instead of OpenBLAS, I am able to perform the sparse matrix-vector multiplication faster; I am using MATSEQAIJMKL. This makes the shell matrix-vector product in each iteration scale well with the number of threads. However, I found that the total GMRES solve time (~KSPSolve() time) is not scaling well with the number of threads.

From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMRESOrthog(). I ran my program with different numbers of threads and plotted the time consumption of PCApply() and KSPGMRESOrthog() against the number of threads. I found that these two operations are not scaling with the threads at all! My results are attached as a PDF to give you a clear view.

My question is: from my understanding, PCApply involves MatSolve(), and KSPGMRESOrthog() consists of many vector operations, so why can’t these two parts scale well with the number of threads when the Intel MKL library is linked?

Thank you,
Yongzhong
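A minimal sketch of how an assembled AIJ matrix can be switched to the MKL-backed type mentioned above, assuming a sequential MATSEQAIJ matrix named A, a PETSc build configured against MKL, and the PetscCall() error-checking macro; the helper name UseMklAij is only for illustration:

    #include <petscmat.h>

    /* Convert an assembled MATSEQAIJ matrix in place to MATSEQAIJMKL so that
       MatMult()/MatMultAdd() dispatch to MKL's sparse kernels. */
    static PetscErrorCode UseMklAij(Mat *A)
    {
      PetscFunctionBeginUser;
      PetscCall(MatConvert(*A, MATSEQAIJMKL, MAT_INPLACE_MATRIX, A));
      /* Alternatively, request the type from the options database with
         -mat_type seqaijmkl before MatSetFromOptions() and assembly. */
      PetscFunctionReturn(PETSC_SUCCESS);
    }

Only the matrices stored as (SEQ)AIJMKL benefit from this; operations implemented by PETSc's own kernels are unaffected by the conversion.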
From: Barry Smith <bsm...@petsc.dev>
Date: Friday, June 14, 2024 at 11:36 AM
To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio <piero.trive...@utoronto.ca>
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

I am a bit confused. Without the initial guess computation, there are still a bunch of events I don't understand:

MatTranspose          79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMatMultSym        110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatMatMultNum         90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatMatMatMultSym      20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatRARtSym            25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatMatTrnMultSym      25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMatTrnMultNum      25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0   275
MatTrnMatMultSym      10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatTrnMatMultNum      10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

In addition, there are many more VecMAXPY than VecMDot (in GMRES they are each done the same number of times):

VecMDot             5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 10  0  0  0   8 10  0  0  0 12016
VecMAXPY           22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20  0  0  0  39 20  0  0  0  4913

Finally, there is a huge number of

MatMultAdd        258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00  7 29  0  0  0   7 29  0  0  0 43025

Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve()?

The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone, so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code.

Barry


On Jun 14, 2024, at 1:19 AM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:

Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration counts are quite close to those with KSPGuess, specifically

KSPGuess Object: 1 MPI process
  type: fischer
  Model 1, size 200

However, I found that at higher frequency the number of iteration steps is significantly higher than with KSPGuess; I have attached both of the results for your reference.

Moreover, could I ask why the run without the KSPGuess options can be used as a baseline comparison? What are we comparing here? How does it relate to the performance issue/bottleneck I found (“I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for a matrix-vector product multiplied by the number of iterations”)?

Thank you!
Yongzhong
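For reference, a minimal sketch of how the Fischer initial-guess object reported above (“type: fischer, Model 1, size 200”) is typically selected; the KSP name ksp and the use of PetscCall() are assumptions:

    #include <petscksp.h>

    /* Use Fischer extrapolation (model 1, history size 200) to generate
       initial guesses for a sequence of related solves with the same KSP. */
    KSPGuess guess;
    PetscCall(KSPGetGuess(ksp, &guess));
    PetscCall(KSPGuessSetType(guess, KSPGUESSFISCHER));
    PetscCall(KSPGuessFischerSetModel(guess, 1, 200));

    /* Or from the options database:
       -ksp_guess_type fischer -ksp_guess_fischer_model 1,200 */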
From: Barry Smith <bsm...@petsc.dev>
Date: Thursday, June 13, 2024 at 2:14 PM
To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio <piero.trive...@utoronto.ca>
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

Can you please run the same thing without the KSPGuess option(s) for a baseline comparison?

Thanks

Barry


On Jun 13, 2024, at 1:27 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:

Hi Matt,

I have rerun the program with the keys you provided. The system output when performing the KSP solve and the final PETSc log output are stored in the attached .txt file for your reference.

Thanks!
Yongzhong


From: Matthew Knepley <knep...@gmail.com>
Date: Wednesday, June 12, 2024 at 6:46 PM
To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio <piero.trive...@utoronto.ca>
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

On Wed, Jun 12, 2024 at 6:36 PM Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:

Dear PETSc developers,

I hope this email finds you well.

I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for a matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time.

For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than what the per-iteration matrix-vector product time indicates when multiple threads are used. Here are a few details of my setup:

Matrix Type: Shell system matrix
Preconditioner: Shell PC
Parallel Environment: Using Intel MKL as PETSc's BLAS/LAPACK library, multithreading is enabled
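A minimal sketch of how a shell operator and shell preconditioner of this kind are typically wired into a KSP; the names MyShellMult, MyShellPCApply, ctx, n, and ksp are placeholders, not from the thread:

    #include <petscksp.h>

    /* y = A*x: the action of the shell operator, computed from user data in ctx. */
    static PetscErrorCode MyShellMult(Mat A, Vec x, Vec y)
    {
      void *ctx;
      PetscFunctionBeginUser;
      PetscCall(MatShellGetContext(A, &ctx));
      /* ... apply the operator to x, store the result in y ... */
      PetscFunctionReturn(PETSC_SUCCESS);
    }

    /* y = M^{-1} x: the action of the shell preconditioner. */
    static PetscErrorCode MyShellPCApply(PC pc, Vec x, Vec y)
    {
      PetscFunctionBeginUser;
      /* ... apply the approximate inverse to x, store the result in y ... */
      PetscFunctionReturn(PETSC_SUCCESS);
    }

    /* Inside the setup code (n = problem size, ctx = user data, ksp already created): */
    Mat A;
    PC  pc;
    PetscCall(MatCreateShell(PETSC_COMM_SELF, n, n, n, n, ctx, &A));
    PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyShellMult));
    PetscCall(KSPSetOperators(ksp, A, A));
    PetscCall(KSPGetPC(ksp, &pc));
    PetscCall(PCSetType(pc, PCSHELL));
    PetscCall(PCShellSetApply(pc, MyShellPCApply));

With this setup, every MatMult reported by -log_view corresponds to a call to MyShellMult, while PCApply times MyShellPCApply.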
I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time.

Have you observed the same issue? Could you please share some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated.


For any performance question like this, we need to see the output of your code run with

-ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view

Thanks,

Matt


Thank you for your time and assistance.

Best regards,
Yongzhong

-----------------------------------------------------------
Yongzhong Li
PhD student | Electromagnetics Group
Department of Electrical & Computer Engineering
University of Toronto
http://www.modelics.org

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

<ksp_petsc_log.txt>
<ksp_petsc_log.txt><ksp_petsc_log_noguess.txt>
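As a supplement to the -log_view output requested above, a sketch of how the KSPSolve and MatMult timings could also be queried programmatically and compared against the chrono measurements; it assumes logging is active (e.g. -log_view or PetscLogDefaultBegin()) and that stage 0 is the main stage:

    #include <petscksp.h>

    /* Query PETSc's own timers for the built-in MatMult and KSPSolve events
       on the main stage (stage 0), after the solves have completed. */
    PetscLogEvent      eMatMult, eKSPSolve;
    PetscEventPerfInfo matmult, kspsolve;

    PetscCall(PetscLogEventGetId("MatMult", &eMatMult));
    PetscCall(PetscLogEventGetId("KSPSolve", &eKSPSolve));
    PetscCall(PetscLogEventGetPerfInfo(0, eMatMult, &matmult));
    PetscCall(PetscLogEventGetPerfInfo(0, eKSPSolve, &kspsolve));
    PetscCall(PetscPrintf(PETSC_COMM_SELF,
              "MatMult:  %d calls, %g s\nKSPSolve: %d calls, %g s\n",
              matmult.count, (double)matmult.time,
              kspsolve.count, (double)kspsolve.time));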