Re: [PATCH] fortran: Factor the evaluation of MINLOCK/MAXLOC's BACK argument
Le 11/07/2024 à 22:49, Harald Anlauf a écrit : Hi Mikael, Am 11.07.24 um 21:55 schrieb Mikael Morin: From: Mikael Morin Hello, I discovered this while testing the inline MINLOC/MAXLOC (aka PR90608) patches. Regression tested on x86_64-linux. OK for master? this is a nice finding! (NAG seems to fail on the cases with array size 0, while Intel gets it right.) The commit message promises to cover all variations ("with/out NANs"?) but I fail to see these. Were these removed in the submission? No, it's actually type with NAN vs type without NAN; the code generated is different between integral types (which don't have NANs) and floating point types (which may have NANs). I'll rephrase to integral vs floating-point types. Otherwise the patch looks pretty simple and is OK for mainline. But do not forget to s/MINLOCK/MINLOC/ in the summary. Good catch, thanks. Thanks for the patch! Thanks for the review. Harald -- 8< -- Move the evaluation of the BACK argument out of the loop in the inline code generated for MINLOC or MAXLOC. For that, add a new (scalar) element associated with BACK to the scalarization loop chain, evaluate the argument with the context of that element, and let the scalarizer do its job. The problem was not only a missed optimisation, but also a wrong code one in the cases where the expression associated with BACK is not free of side-effects, making multiple evaluations observable. The new tests check the evaluation count of the BACK argument, and try to cover all the variations (with/out NANs, constant or unknown shape, absent or scalar or array MASK) supported by the inline implementation of the functions. Care has been taken to not check the case of a constant .FALSE. MASK, for which the evaluation of BACK can be elided. gcc/fortran/ChangeLog: * trans-intrinsic.cc (gfc_conv_intrinsic_minmaxloc): Create a new scalar scalarization chain element if BACK is present. Add it to the loop. Set the scalarization chain before evaluating the argument. gcc/testsuite/ChangeLog: * gfortran.dg/maxloc_5.f90: New test. * gfortran.dg/minloc_5.f90: New test. --- gcc/fortran/trans-intrinsic.cc | 10 + gcc/testsuite/gfortran.dg/maxloc_5.f90 | 257 + gcc/testsuite/gfortran.dg/minloc_5.f90 | 257 + 3 files changed, 524 insertions(+) create mode 100644 gcc/testsuite/gfortran.dg/maxloc_5.f90 create mode 100644 gcc/testsuite/gfortran.dg/minloc_5.f90 diff --git a/gcc/fortran/trans-intrinsic.cc b/gcc/fortran/trans-intrinsic.cc index 5ea10e84060..cadbd177452 100644 --- a/gcc/fortran/trans-intrinsic.cc +++ b/gcc/fortran/trans-intrinsic.cc @@ -5325,6 +5325,7 @@ gfc_conv_intrinsic_minmaxloc (gfc_se * se, gfc_expr * expr, enum tree_code op) gfc_actual_arglist *actual; gfc_ss *arrayss; gfc_ss *maskss; + gfc_ss *backss; gfc_se arrayse; gfc_se maskse; gfc_expr *arrayexpr; @@ -5390,6 +5391,11 @@ gfc_conv_intrinsic_minmaxloc (gfc_se * se, gfc_expr * expr, enum tree_code op) && maskexpr->symtree->n.sym->attr.dummy && maskexpr->symtree->n.sym->attr.optional; backexpr = actual->next->next->expr; + if (backexpr) + backss = gfc_get_scalar_ss (gfc_ss_terminator, backexpr); + else + backss = nullptr; + nonempty = NULL; if (maskexpr && maskexpr->rank != 0) { @@ -5449,6 +5455,9 @@ gfc_conv_intrinsic_minmaxloc (gfc_se * se, gfc_expr * expr, enum tree_code op) if (maskss) gfc_add_ss_to_loop (&loop, maskss); + if (backss) + gfc_add_ss_to_loop (&loop, backss); + gfc_add_ss_to_loop (&loop, arrayss); /* Initialize the loop. */ @@ -5535,6 +5544,7 @@ gfc_conv_intrinsic_minmaxloc (gfc_se * se, gfc_expr * expr, enum tree_code op) gfc_add_block_to_block (&block, &arrayse.pre); gfc_init_se (&backse, NULL); + backse.ss = backss; gfc_conv_expr_val (&backse, backexpr); gfc_add_block_to_block (&block, &backse.pre); diff --git a/gcc/testsuite/gfortran.dg/maxloc_5.f90 b/gcc/testsuite/gfortran.dg/maxloc_5.f90 new file mode 100644 index 000..5d722450c8f --- /dev/null +++ b/gcc/testsuite/gfortran.dg/maxloc_5.f90 @@ -0,0 +1,257 @@ +! { dg-do run } +! +! Check that the evaluation of MAXLOC's BACK argument is made only once +! before the scalarisation loops. + +program p + implicit none + integer, parameter :: data10(*) = (/ 7, 4, 7, 6, 6, 4, 6, 3, 9, 8 /) + logical, parameter :: mask10(*) = (/ .false., .true., .false., & + .false., .true., .true., & + .true. , .true., .false., & + .false. /) + integer :: calls_count = 0 + call check_int_const_shape + call check_int_const_shape_scalar_mask + call check_int_const_shape_array_mask + call check_int_const_shape_optional_mask_present + call check_int_const_shape_optional_mask_absent + call check_int_const_shape_empty + call check_int
RE: Lower zeroing array assignment to memset for allocatable arrays
> -Original Message- > From: Harald Anlauf > Sent: Friday, July 12, 2024 1:52 AM > To: Prathamesh Kulkarni ; gcc- > patc...@gcc.gnu.org; fortran@gcc.gnu.org > Subject: Re: Lower zeroing array assignment to memset for allocatable > arrays > > External email: Use caution opening links or attachments > > > Hi Prathamesh! Hi Harald, > > Am 11.07.24 um 12:16 schrieb Prathamesh Kulkarni: > > > > > >> -Original Message- > >> From: Harald Anlauf > >> Sent: Thursday, July 11, 2024 12:53 AM > >> To: Prathamesh Kulkarni ; gcc- > >> patc...@gcc.gnu.org; fortran@gcc.gnu.org > >> Subject: Re: Lower zeroing array assignment to memset for > allocatable > >> arrays > >> > >> External email: Use caution opening links or attachments > >> > >> > >> Hi Prathamesh, > >> > >> Am 10.07.24 um 13:22 schrieb Prathamesh Kulkarni: > >>> Hi, > >>> The attached patch lowers zeroing array assignment to memset for > >> allocatable arrays. > >>> > >>> For example: > >>> subroutine test(z, n) > >>> implicit none > >>> integer :: n > >>> real(4), allocatable :: z(:,:,:) > >>> > >>> allocate(z(n, 8192, 2048)) > >>> z = 0 > >>> end subroutine > >>> > >>> results in following call to memset instead of 3 nested loops for > z > >> = 0: > >>> (void) __builtin_memset ((void *) z->data, 0, (unsigned > long) > >>> MAX_EXPR dim[0].ubound - z->dim[0].lbound, -1> + 1) * > >>> (MAX_EXPR dim[1].ubound - z->dim[1].lbound, -1> + 1)) * > >> (MAX_EXPR > >>> dim[2].ubound - z->dim[2].lbound, -1> + 1)) * 4)); > >>> > >>> The patch significantly improves speedup for an internal Fortran > >> application on AArch64 -mcpu=grace (and potentially on other > AArch64 > >> cores too). > >>> Bootstrapped+tested on aarch64-linux-gnu. > >>> Does the patch look OK to commit ? > >> > >> no, it is NOT ok. > >> > >> Consider: > >> > >> subroutine test0 (n, z) > >> implicit none > >> integer :: n > >> real, pointer :: z(:,:,:) ! need not be contiguous! > >> z = 0 > >> end subroutine > >> > >> After your patch this also generates a memset, but this cannot be > >> true in general. One would need to have a test on contiguity of > the > >> array before memset can be used. > >> > >> In principle this is a nice idea, and IIRC there exists a very old > PR > >> on this (by Thomas König?). So it might be worth pursuing. > > Hi Harald, > > Thanks for the suggestions! > > The attached patch checks gfc_is_simply_contiguous(expr, true, > false) > > before lowering to memset, which avoids generating memset for your > example above. > > This is much better, as it avoids generating false memsets where it > should not. However, you now miss cases where the array is a > component reference, as in: > > subroutine test_dt (dt) >implicit none >type t > real, allocatable :: x(:,:,:) ! contiguous! > real, pointer, contiguous :: y(:,:,:) ! contiguous! > real, pointer :: z(:,:,:) ! need not be > contiguous! >end type t >type(t) :: dt >dt% x = 0 ! memset possible! >dt% y = 0 ! memset possible! >dt% z = 0 ! memset NOT possible! > end subroutine > > You'll need to cycle through the component references and apply the > check for contiguity to the ultimate component, not the top level. > > Can you have another look? Thanks for the review! It seems that component references are not currently handled even for static size arrays ? For eg: subroutine test_dt (dt, y) implicit none real :: y (10, 20, 30) type t real :: x(10, 20, 30) end type t type(t) :: dt y = 0 dt% x = 0 end subroutine With trunk, it generates memset for 'y' but not for dt%x. That happens because copyable_array_p returns false for dt%x, because expr->ref->next is non NULL: /* First check it's an array. */ if (expr->rank < 1 || !expr->ref || expr->ref->next) return false; and gfc_full_array_ref_p(expr) bails out if expr->ref->type != REF_ARRAY. Looking thru git history, it seems both the checks were added in 18eaa2c0cd20 to fix PR33370. (Even after removing these checks, the previous patch bails out from gfc_trans_zero_assign because GFC_DESCRIPTOR_TYPE_P (type) returns false for component ref and ends up returning NULL_TREE) I am working on extending the patch to handle component refs for statically sized as well as allocatable arrays. Since it looks like a bigger change and an extension to current functionality, will it be OK to commit the previous patch as-is (if it looks correct) and address component refs in follow up one ? Thanks, Prathamesh > > Thanks, > Harald > > > Bootstrapped+tested on aarch64-linux-gnu. > > Does the attached patch look OK ? > > > > Signed-off-by: Prathamesh Kulkarni > > > > Thanks, > > Prathamesh > >> > >> Thanks, > >> Harald > >> > >> > >>> Signed-off-by: Prathamesh Kulkarni > >>> > >>> Thanks, > >>> Prathamesh > >
Re: Lower zeroing array assignment to memset for allocatable arrays
Hi Prathamesh, Am 12.07.24 um 15:31 schrieb Prathamesh Kulkarni: It seems that component references are not currently handled even for static size arrays ? For eg: subroutine test_dt (dt, y) implicit none real :: y (10, 20, 30) type t real :: x(10, 20, 30) end type t type(t) :: dt y = 0 dt% x = 0 end subroutine With trunk, it generates memset for 'y' but not for dt%x. That happens because copyable_array_p returns false for dt%x, because expr->ref->next is non NULL: /* First check it's an array. */ if (expr->rank < 1 || !expr->ref || expr->ref->next) return false; and gfc_full_array_ref_p(expr) bails out if expr->ref->type != REF_ARRAY. Indeed that check (as is) prevents the use of component refs. (I just tried to modify the this part to cycle thru the refs, but then I get regressions in the testsuite for some of the coarray tests. Furthermore, gfc_trans_zero_assign would need further changes to handle even the constant shapes from above.) Looking thru git history, it seems both the checks were added in 18eaa2c0cd20 to fix PR33370. (Even after removing these checks, the previous patch bails out from gfc_trans_zero_assign because GFC_DESCRIPTOR_TYPE_P (type) returns false for component ref and ends up returning NULL_TREE) I am working on extending the patch to handle component refs for statically sized as well as allocatable arrays. Since it looks like a bigger change and an extension to current functionality, will it be OK to commit the previous patch as-is (if it looks correct) and address component refs in follow up one ? I agree that it is reasonable to defer the handling of arrays as components of derived types, and recommend to do the following: - replace "&& gfc_is_simply_contiguous (expr, true, false))" in your last patch by "&& gfc_is_simply_contiguous (expr, false, false))", as that would also allow to treat z(:,::1,:) = 0 as contiguous if z is allocatable or a contiguous pointer. - open a PR in bugzilla to track the missed-optimization for the cases we discussed here, and link the discussion in the ML. Your patch then will be OK for mainline. Thanks, Harald Thanks, Prathamesh Thanks, Harald Bootstrapped+tested on aarch64-linux-gnu. Does the attached patch look OK ? Signed-off-by: Prathamesh Kulkarni Thanks, Prathamesh Thanks, Harald Signed-off-by: Prathamesh Kulkarni Thanks, Prathamesh