Hi, Steven Thank you for the bug report and test example. You were right. The MatCopy(A, B,..) implementation was wrong when B was on the device. I have a fix at https://urldefense.us/v3/__https://gitlab.com/petsc/petsc/-/merge_requests/8288__;!!G_uCfscf7eWS!e15LlybplgvD-3BTmMELbBy9EiU1biKGxn_qlVmPVnTPUjMYT9nZ7kf3ZtjPPhbQL7zFDI8QGuKk_V-tW9IYCL3xFVri$ , and will add your test to the MR tomorrow.
Thanks! --Junchao Zhang On Mon, Apr 7, 2025 at 4:57 PM Steven Dargaville < dargaville.ste...@gmail.com> wrote: > Hi > > I have some code that computes a MatMatMult, then modifies one of the > matrices and calls the MatMatMult again with MAT_REUSE_MATRIX. This gives > the correct results with mat_types aij, aijkokkos and either run on a CPU > or a GPU. > > In some cases one of the matrices is diagonal so I have been modifying my > code to call MatDiagonalScale instead and have seen some peculiar > behaviour. If the mat type is aij, everything works correctly on the CPU > and GPU. If the mat type is aijkokkos, everything works correctly on the > CPU but when run on a machine with a GPU the results differ. > > I have some example code below that shows this, it prints out four > matrices; the first two should match and the last two should match. To see > the failure run with "-mat_type aijkokkos" on a machine with a GPU (again > this gives the correct results with aijkokkos run on a CPU). I get the > output: > > Mat Object: 1 MPI process > type: seqaijkokkos > row 0: (0, 4.) (1, 6.) > row 1: (0, 10.) (1, 14.) > Mat Object: 1 MPI process > type: seqaijkokkos > row 0: (0, 4.) (1, 6.) > row 1: (0, 10.) (1, 14.) > Mat Object: 1 MPI process > type: seqaijkokkos > row 0: (0, 6.) (1, 9.) > row 1: (0, 15.) (1, 21.) > Mat Object: 1 MPI process > type: seqaijkokkos > row 0: (0, 12.) (1, 18.) > row 1: (0, 30.) (1, 42.) > > I have narrowed down this failure to a MatCopy call, where the values of > result_diag should be overwritten with A before calling MatDiagonalScale. > The results with aijkokos on a GPU suggest the values of result_diag are > not being changed. If instead of using the MatCopy I destroy result_diag > and call MatDuplicate, the results are correct. You can trigger the correct > behaviour with "-mat_type aijkokkos -correct". > > To me this looks like the values are out of sync between the device/host, > but I would have thought a call to MatCopy would be safe. Any thoughts on > what I might be doing wrong? > > Thanks for your help! > Steven > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > static char help[] = "Test matmat products with matdiagonal on gpus \n\n"; > > #include <iostream> > #include <petscmat.h> > > > int main(int argc, char **args) > { > const PetscInt inds[] = {0, 1}; > PetscScalar avals[] = {2, 3, 5, 7}; > Mat A, B_diag, B_aij_diag, result, result_diag; > Vec diag; > > PetscCall(PetscInitialize(&argc, &args, NULL, help)); > > // Create matrix to start > PetscCall(MatCreateFromOptions(PETSC_COMM_WORLD, NULL, 1, 2, 2, 2, 2, > &A)); > PetscCall(MatSetUp(A)); > PetscCall(MatSetValues(A, 2, inds, 2, inds, avals, INSERT_VALUES)); > PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY)); > PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY)); > > // Create a matdiagonal matrix > // Will be the matching vec type as A > PetscCall(MatCreateVecs(A, &diag, NULL)); > PetscCall(VecSet(diag, 2.0)); > PetscCall(MatCreateDiagonal(diag, &B_diag)); > > // Create the same matrix as the matdiagonal but in aij format > PetscCall(MatCreateFromOptions(PETSC_COMM_WORLD, NULL, 1, 2, 2, 2, 2, > &B_aij_diag)); > PetscCall(MatSetUp(B_aij_diag)); > PetscCall(MatDiagonalSet(B_aij_diag, diag, INSERT_VALUES)); > PetscCall(MatAssemblyBegin(B_aij_diag, MAT_FINAL_ASSEMBLY)); > PetscCall(MatAssemblyEnd(B_aij_diag, MAT_FINAL_ASSEMBLY)); > PetscCall(VecDestroy(&diag)); > > // ~~~~~~~~~~~~~ > // Do an initial matmatmult > // A * B_aij_diag > // and then > // A * B_diag but just using MatDiagonalScale > // ~~~~~~~~~~~~~ > > // aij * aij > PetscCall(MatMatMult(A, B_aij_diag, MAT_INITIAL_MATRIX, 1.5, &result)); > PetscCall(MatView(result, PETSC_VIEWER_STDOUT_WORLD)); > > // aij * diagonal > PetscCall(MatDuplicate(A, MAT_COPY_VALUES, &result_diag)); > PetscCall(MatDiagonalGetDiagonal(B_diag, &diag)); > PetscCall(MatDiagonalScale(result_diag, NULL, diag)); > PetscCall(MatDiagonalRestoreDiagonal(B_diag, &diag)); > PetscCall(MatView(result_diag, PETSC_VIEWER_STDOUT_WORLD)); > > // ~~~~~~~~~~~~~ > // Now let's modify the diagonal and do it again with "reuse" > // ~~~~~~~~~~~~~ > PetscCall(MatDiagonalGetDiagonal(B_diag, &diag)); > PetscCall(VecSet(diag, 3.0)); > PetscCall(MatDiagonalSet(B_aij_diag, diag, INSERT_VALUES)); > PetscCall(MatDiagonalRestoreDiagonal(B_diag, &diag)); > > // aij * aij > PetscCall(MatMatMult(A, B_aij_diag, MAT_REUSE_MATRIX, 1.5, &result)); > PetscCall(MatView(result, PETSC_VIEWER_STDOUT_WORLD)); > > PetscBool correct = PETSC_FALSE; > PetscCall(PetscOptionsGetBool(NULL, NULL, "-correct", &correct, NULL)); > > if (!correct) > { > // This gives the wrong results below when run on gpu > // Results suggest this isn't copied > PetscCall(MatCopy(A, result_diag, SAME_NONZERO_PATTERN)); > } > else > { > // This gives the correct results below > PetscCall(MatDestroy(&result_diag)); > PetscCall(MatDuplicate(A, MAT_COPY_VALUES, &result_diag)); > } > > // aij * diagonal > PetscCall(MatDiagonalGetDiagonal(B_diag, &diag)); > PetscCall(MatDiagonalScale(result_diag, NULL, diag)); > PetscCall(MatDiagonalRestoreDiagonal(B_diag, &diag)); > PetscCall(MatView(result_diag, PETSC_VIEWER_STDOUT_WORLD)); > > PetscCall(PetscFinalize()); > return 0; > } >