> On 21 Jun 2024, at 6:42 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
> 
> I remember there are some MKL environment variables that print the MKL 
> routines being called. 

The environment variable is MKL_VERBOSE
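
For illustration, a minimal sketch: the variable is typically set in the shell before launching the program, for example `MKL_VERBOSE=1`. The helper `mklVerboseEnabled` below is hypothetical (it is not part of MKL or PETSc) and only confirms that the setting is visible to the running process, which can help when a job launcher or wrapper script may have dropped it. Treating an empty value or "0" as disabled is an assumption of this sketch.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not an MKL or PETSc API): reports whether MKL_VERBOSE
// is present in this process's environment with a value other than "" or "0".
// Treating "" and "0" as disabled is an assumption made for this sketch.
bool mklVerboseEnabled() {
    const char* raw = std::getenv("MKL_VERBOSE");
    if (raw == nullptr) {
        return false;  // variable is not set at all
    }
    const std::string value(raw);
    return !value.empty() && value != "0";
}
```

Calling `mklVerboseEnabled()` once near program start and printing the result is enough to tell whether the verbose output should be expected in the run.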

Thanks,
Pierre

> Maybe we can try it to see which MKL routines are actually called; then we 
> can understand why some PETSc functions did not speed up.
> 
> --Junchao Zhang
> 
> 
> On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>>  
>> Hi Barry, sorry about my last results. I didn’t fully understand stage 
>> profiling and logging in PETSc. Now I record only the KSPSolve() stage of my 
>> program. Some sample code follows:
>> 
>>                 // Static variable to keep track of the stage counter
>>                 static int stageCounter = 1;
>> 
>>                 // Generate a unique stage name (needs <sstream> and <string>)
>>                 std::ostringstream oss;
>>                 oss << "Stage " << stageCounter << " of Code";
>>                 std::string stageName = oss.str();
>> 
>>                 // Register the stage and make it the active logging stage
>>                 PetscLogStage stagenum;
>>                 PetscLogStageRegister(stageName.c_str(), &stagenum);
>>                 PetscLogStagePush(stagenum);
>> 
>>                 // Events logged during the solve are attributed to this stage
>>                 KSPSolve(*ksp_ptr, b, x);
>> 
>>                 // Close the stage and advance the counter for the next solve
>>                 PetscLogStagePop();
>>                 stageCounter++;
>> 
>> I have attached my new logging results. There is 1 main stage and 4 other 
>> stages, each of which is one KSPSolve() call.
>> 
>> To provide some additional background: if you recall, I have been trying to 
>> get an efficient iterative solution using multithreading. I found that by 
>> compiling PETSc with the Intel MKL library instead of OpenBLAS, I can 
>> perform sparse matrix-vector multiplication faster; I am using MATSEQAIJMKL. 
>> This makes the shell matrix-vector product in each iteration scale well with 
>> the number of threads. However, the total GMRES solve time (~KSPSolve() 
>> time) does not scale well with the number of threads.
>> 
>> From the logging results I learned that during KSPSolve() there is CPU 
>> overhead in PCApply() and KSPGMRESOrthog(). I ran my program with different 
>> numbers of threads and plotted the time spent in PCApply() and 
>> KSPGMRESOrthog() against the number of threads. These two operations do not 
>> scale with the number of threads at all! My results are attached as a PDF 
>> to give you a clear view.
>> 
>> My question is:
>> 
>> From my understanding, PCApply() involves MatSolve(), and KSPGMRESOrthog() 
>> performs many vector operations, so why can’t these two parts scale well 
>> with the number of threads when the Intel MKL library is linked?
>> 
>> Thank you,
>> Yongzhong
>> 
>>  
>> 
>> From: Barry Smith <bsm...@petsc.dev>
>> Date: Friday, June 14, 2024 at 11:36 AM
>> To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
>> Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio 
>> <piero.trive...@utoronto.ca>
>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance 
>> Issue
>> 
>>  
>> 
>>    I am a bit confused. Without the initial guess computation, there are 
>> still a bunch of events I don't understand 
>> 
>>  
>> 
>> MatTranspose          79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatMatMultSym        110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatMatMultNum         90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatMatMatMultSym      20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatRARtSym            25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatMatTrnMultSym      25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatMatTrnMultNum      25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0   275
>> MatTrnMatMultSym      10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatTrnMatMultNum      10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> 
>>  
>> 
>> In addition, there are many more VecMAXPY than VecMDot calls (in GMRES they 
>> are each done the same number of times):
>> 
>>  
>> 
>> VecMDot             5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 10  0  0  0   8 10  0  0  0 12016
>> VecMAXPY           22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20  0  0  0  39 20  0  0  0  4913
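
The remark about equal call counts can be illustrated with a toy orthogonalization step. This is only a sketch under stated assumptions: it uses classical Gram-Schmidt without iterative refinement, plain `std::vector` storage, and hypothetical names (`multiDot`, `multiAxpy`, `orthogonalize`, `OpCounts`); it is not PETSc's implementation. Each step performs exactly one multi-dot (the role VecMDot plays) and one multi-AXPY (the role VecMAXPY plays).

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Counters standing in for the VecMDot / VecMAXPY call counts in -log_view.
struct OpCounts {
    int mdot = 0;
    int maxpy = 0;
};

// Multi-dot: h[j] = <w, V[j]> for every basis vector, counted as one call.
std::vector<double> multiDot(const Vec& w, const std::vector<Vec>& V, OpCounts& counts) {
    ++counts.mdot;
    std::vector<double> h(V.size(), 0.0);
    for (std::size_t j = 0; j < V.size(); ++j)
        for (std::size_t i = 0; i < w.size(); ++i)
            h[j] += w[i] * V[j][i];
    return h;
}

// Multi-AXPY: w += sum_j alpha[j] * V[j], counted as one call.
void multiAxpy(Vec& w, const std::vector<double>& alpha, const std::vector<Vec>& V, OpCounts& counts) {
    ++counts.maxpy;
    for (std::size_t j = 0; j < V.size(); ++j)
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] += alpha[j] * V[j][i];
}

// One classical Gram-Schmidt step: project w onto the basis V, then subtract
// the projections. Returns the projection coefficients.
std::vector<double> orthogonalize(Vec& w, const std::vector<Vec>& V, OpCounts& counts) {
    std::vector<double> h = multiDot(w, V, counts);
    std::vector<double> minusH(h.size());
    for (std::size_t j = 0; j < h.size(); ++j) minusH[j] = -h[j];
    multiAxpy(w, minusH, V, counts);
    return h;
}
```

In this sketch the two counters advance together, so a large surplus of multi-AXPY calls in a log would suggest that something other than the orthogonalization step is also calling it.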
>> 
>>  
>> 
>> Finally there are a huge number of 
>> 
>>  
>> 
>> MatMultAdd        258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00  7 29  0  0  0   7 29  0  0  0 43025
>> 
>>  
>> 
>> Are you making calls to all these routines? Are you doing this inside your 
>> MatMult() or before you call KSPSolve?
>> 
>>  
>> 
>> The reason I wanted you to make a simpler run without the initial guess code 
>> is that your events are far more complicated than would be produced by GMRES 
>> alone so it is not possible to understand the behavior you are seeing 
>> without fully understanding all the events happening in the code.
>> 
>>  
>> 
>>   Barry
>> 
>>  
>> 
>> 
>> 
>> 
>> On Jun 14, 2024, at 1:19 AM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>> 
>>  
>> 
>> Thanks, I have attached the results without using any KSPGuess. At low 
>> frequency, the number of iteration steps is quite close to that with 
>> KSPGuess, specifically:
>> 
>>   KSPGuess Object: 1 MPI process
>> 
>>     type: fischer
>> 
>>     Model 1, size 200
>> 
>> However, I found that at higher frequency the number of iteration steps is 
>> significantly higher than with KSPGuess. I have attached both sets of 
>> results for your reference.
>> 
>> Moreover, could I ask why the run without the KSPGuess options can be used 
>> as a baseline comparison? What are we comparing here? How does it relate to 
>> the performance issue/bottleneck I found? “I have noticed that the time 
>> taken by KSPSolve is almost two times greater than the CPU time for 
>> matrix-vector product multiplied by the number of iteration steps.” 
>> 
>> Thank you!
>> Yongzhong
>> 
>>  
>> From: Barry Smith <bsm...@petsc.dev>
>> Date: Thursday, June 13, 2024 at 2:14 PM
>> To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
>> Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio 
>> <piero.trive...@utoronto.ca>
>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance 
>> Issue
>> 
>>  
>> 
>>   Can you please run the same thing without the KSPGuess option(s) for a 
>> baseline comparison?
>> 
>>  
>>    Thanks
>> 
>>  
>>    Barry
>> 
>>  
>> 
>> On Jun 13, 2024, at 1:27 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>> 
>>  
>> Hi Matt,
>> 
>> I have rerun the program with the keys you provided. The system output when 
>> performing ksp solve and the final petsc log output were stored in a .txt 
>> file attached for your reference.
>> 
>> Thanks!
>> Yongzhong
>> 
>>  
>> From: Matthew Knepley <knep...@gmail.com>
>> Date: Wednesday, June 12, 2024 at 6:46 PM
>> To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
>> Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio 
>> <piero.trive...@utoronto.ca>
>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance 
>> Issue
>> 
>> On Wed, Jun 12, 2024 at 6:36 PM Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>> 
>> Dear PETSc’s developers,
>> 
>> I hope this email finds you well.
>> 
>> I am currently working on a project using PETSc and have encountered a 
>> performance issue with the KSPSolve function. Specifically, I have noticed 
>> that the time taken by KSPSolve is almost two times greater than the CPU 
>> time for matrix-vector product multiplied by the number of iteration steps. 
>> I use C++ chrono to record CPU time.
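
One point worth checking when comparing such timings is which clock is being read. As a hedged sketch: `std::chrono::steady_clock` reports elapsed (wall-clock) time, while `std::clock` reports processor time charged to the process, which on POSIX systems is assumed here to accumulate across all threads. The `Timing` struct and `timeBoth` helper are hypothetical names introduced only for illustration.

```cpp
#include <chrono>
#include <ctime>
#include <functional>

// Result of timing one piece of work with two different clocks.
struct Timing {
    double wallSeconds;  // elapsed real time (std::chrono::steady_clock)
    double cpuSeconds;   // processor time charged to the process (std::clock)
};

// Hypothetical helper: runs `work` once and reports both measurements so
// they can be compared side by side.
Timing timeBoth(const std::function<void()>& work) {
    const auto wallStart = std::chrono::steady_clock::now();
    const std::clock_t cpuStart = std::clock();

    work();

    const std::clock_t cpuEnd = std::clock();
    const auto wallEnd = std::chrono::steady_clock::now();

    Timing t;
    t.wallSeconds = std::chrono::duration<double>(wallEnd - wallStart).count();
    t.cpuSeconds = static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC;
    return t;
}
```

With a multithreaded matrix-vector product, `cpuSeconds` can exceed `wallSeconds`, so the per-iteration product time and the total solve time are only comparable when both are taken from the same kind of clock.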
>> 
>> For context, I am using a shell system matrix A. Despite my efforts to 
>> parallelize the matrix-vector product (Ax), when multiple threads are used 
>> the overall solve time remains higher than the per-iteration matrix-vector 
>> product time would suggest. Here are a few details of my setup:
>> 
>> Matrix Type: Shell system matrix
>> Preconditioner: Shell PC
>> Parallel Environment: Using Intel MKL as PETSc’s BLAS/LAPACK library, 
>> multithreading is enabled
>> 
>> I have considered several potential reasons, such as preconditioner setup, 
>> additional solver operations, and the inherent overhead of using a shell 
>> system matrix. However, since KSPSolve is a high-level API, I have been 
>> unable to pinpoint the exact cause of the increased solve time.
>> 
>> Have you observed the same issue? Could you please share some guidance 
>> on how to diagnose and address this performance discrepancy? Any insights or 
>> recommendations you could offer would be greatly appreciated.
>> 
>>  
>> 
>> For any performance question like this, we need to see the output of your 
>> code run with
>> 
>>  
>> 
>>   -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view
>> 
>>  
>> 
>>   Thanks,
>> 
>>  
>> 
>>      Matt
>> 
>>  
>> 
>> Thank you for your time and assistance.
>> 
>> Best regards,
>> 
>> Yongzhong
>> 
>> -----------------------------------------------------------
>> 
>> Yongzhong Li
>> 
>> PhD student | Electromagnetics Group
>> 
>> Department of Electrical & Computer Engineering
>> 
>> University of Toronto
>> 
>> https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!cxTM09LsKoYUA08P97agSWfNaQ7kgSux1FjxDwySQtW7Eg2OyUPt_464qMf8D4fDNGWVJRXvPqZTEgKvCtkt7A$
>>  
>> 
>> 
>> 
>>  
>> 
>> -- 
>> 
>> What most experimenters take for granted before they begin their experiments 
>> is infinitely more interesting than any results to which their experiments 
>> lead.
>> -- Norbert Wiener
>> 
>>  
>> 
>> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!cxTM09LsKoYUA08P97agSWfNaQ7kgSux1FjxDwySQtW7Eg2OyUPt_464qMf8D4fDNGWVJRXvPqZTEgISAv2xYg$
>> <ksp_petsc_log.txt>
>> 
>>  
>> 
>> <ksp_petsc_log.txt><ksp_petsc_log_noguess.txt>
>> 
