[Bug sanitizer/97696] New: ICE since ASAN_MARK does not handle poly_int sized varibales

2020-11-03 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696

Bug ID: 97696
   Summary: ICE since ASAN_MARK does not handle poly_int sized
varibales
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: ice-checking
  Severity: normal
  Priority: P3
 Component: sanitizer
  Assignee: unassigned at gcc dot gnu.org
  Reporter: matmal01 at gcc dot gnu.org
CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org,
jakub at gcc dot gnu.org, kcc at gcc dot gnu.org, marxin at 
gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

asan_expand_mark_ifn asserts that the length to check is a SHWI.
(i.e. it uses `gcc_assert (tree_fits_shwi_p (len))` ).

It attempts to ensure this by avoiding VLA's in `gimplify_decl_expr`.
poly_int sized decls were added, and they were not treated as VLA's since
commit 22b62991 (SVN r275870).

Since then, poly_int sized variables can have ASAN_MARK called on them, which
means the `len` parameter of ASAN_MARK can be a poly_int causing an ICE in
asan_expand_mark_ifn  (n.b. in order to emit an ASAN_CHECK on a poly_int sized
variable so that the ASAN_MARK is not removed in the sanopt pass we need to
pass the poly_int sized variable to a builtin memory function).


An example  (modified from gcc/testsuite/c-c++-common/asan/pr80308.c):



(v3) work-lin:gcc [Tue 12:25:10] % cat ~/asan-ice.c
#include 

__attribute__((noinline, noclone)) int
foo (char *a)
{
  int i, j = 0;
  asm volatile ("" : "+r" (a) : : "memory");
  for (i = 0; i < 12; i++)
j += a[i];
  return j;
}

int
main ()
{
  int i, j = 0;
  for (i = 0; i < 4; i++)
{
  char a[12];
  __SVInt8_t freq;
  __builtin_bcmp (&freq, a, 10);
  __builtin_memset (a, 0, sizeof (a));
  j += foo (a);
}
  return j;
}


(v3) work-lin:gcc [Tue 12:31:53] %
/installdir/aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
-march=armv8.6-a+sve -fsanitize=address -fsanitize-address-use-after-scope
~/asan-ice.c -S  -o /dev/null
during GIMPLE pass: sanopt
/home/matmal01/asan-ice.c: In function ‘main’:
/home/matmal01/asan-ice.c:14:1: internal compiler error: in
asan_expand_mark_ifn, at asan.c:3235
   14 | main ()
  | ^~~~
0xdde454 asan_expand_mark_ifn(gimple_stmt_iterator*)
/builddir/src/gcc/gcc/asan.c:3235
0xdf6b7a execute
/builddir/src/gcc/gcc/sanopt.c:1341
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.

[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized varibales

2020-11-03 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696

--- Comment #1 from Matthew Malcomson  ---
I guess this may also happen for the emission of ASAN_MARK in
`gimple_target_expr`, but haven't yet been able to trigger that.

[Bug sanitizer/97941] [HWASAN] use After free not working as per expectation

2020-11-23 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97941

--- Comment #1 from Matthew Malcomson  ---
Hi Akhilesh,

No that's certainly not a known issue -- thanks for reporting it!

I'm having trouble reproducing your issue, do you mind giving a little more
information on your command line and the machine you're running on etc?

One point that seems worth looking into is that the line numbers on your
backtrace don't seem to match up with the line numbers in my source tree.
(e.g. GetAccessInfo is given line number 383 of hwasan_linux.cpp, while in my
source tree that function spans lines 328-376).  Have you made any
modifications to the source?  Or maybe you're running a different libsanitizer
version?
For reference I'm using libsanitizer from LLVM hash
6e7dd1e3e1170080b76b5dcc5716bdd974343233, and the sha256sum of hwasan_linux.cpp
in my source tree is
3986e9f4e519409e7c73a7b97722125300afc4dc1f44a3f966fedf679329fd0a.

Based on what line number `HwasanOnSIGTRAP` calls `GetAccessInfo` in my source
tree, and assuming the offset between our line numbers are the same for the
GetAccessInfo line in your stack trace, it seems that the SEGV happens when
dereferencing the address that caused the signal.

That value should be the address of the `brk` instruction in __hwasan_load1
(having been inlined from `SigTrap` in hwasan_checks.h) which caught the bad
access, but the value of 0x30 which caused this SEGV is clearly not that value.

If the offset between our line numbers is a bit different, then getting that
address might make a bit more sense.  There are various struct member accesses
via pointers that `GetAccessInfo` recieves.
However, those arguments are just taken from the siginfo_t and ucontext_t
pointers that the kernel provides on receipt of a deadly signal.
I haven't found any access in that function which look like they would have an
offset of 0x30 from a NULL pointer, although I guess different kernel versions
would have different offsets.

What kernel are you running on?  Is there any chance the signal handler
HwasanOnDeadlySignal is getting a NULL pointer as one of its arguments?
For reference I happen to be running on a linux kernel based off of commit
585e5b17b9 (but with some modifications that should not affect anything -- just
config changes so I can build the kernel itself with -fsanitize=hwaddress).


Just for reference -- what I see when compiling your testcase:


ubuntu@ubuntu:~/working-directory/temp/pr97941$
../../gcc-hwasan-install/bin/gcc -fsanitize=hwaddress ./test.c -o test
./test.c: In function ‘main’:
./test.c:2:20: warning: implicit declaration of function ‘malloc’
[-Wimplicit-function-declaration]
2 |   char *x = (char*)malloc(10 * sizeof(char*));
  |^~
./test.c:1:1: note: include ‘’ or provide a declaration of ‘malloc’
  +++ |+#include 
1 | int main() {
./test.c:2:20: warning: incompatible implicit declaration of built-in function
‘malloc’ [-Wbuiltin-declaration-mismatch]
2 |   char *x = (char*)malloc(10 * sizeof(char*));
  |^~
./test.c:2:20: note: include ‘’ or provide a declaration of ‘malloc’
./test.c:3:3: warning: implicit declaration of function ‘free’
[-Wimplicit-function-declaration]
3 |   free(x);
  |   ^~~~
./test.c:3:3: note: include ‘’ or provide a declaration of ‘free’
./test.c:3:3: warning: incompatible implicit declaration of built-in function
‘free’ [-Wbuiltin-declaration-mismatch]
./test.c:3:3: note: include ‘’ or provide a declaration of ‘free’
ubuntu@ubuntu:~/working-directory/temp/pr97941$
LD_LIBRARY_PATH=~/working-directory/gcc-hwasan-install/lib64 ./test
==8600==ERROR: HWAddressSanitizer: tag-mismatch on address 0xefe00005 at pc
0xa828be70
READ of size 1 at 0xefe00005 tags: e2/d5 (ptr/mem) in thread T0
#0 0xa828be6c in SigTrap<0>
../../../../gcc-source/libsanitizer/hwasan/hwasan_checks.h:27
#1 0xa828be6c in CheckAddress<(__hwasan::ErrorAction)0,
(__hwasan::AccessType)0, 0>
../../../../gcc-source/libsanitizer/hwasan/hwasan_checks.h:88
#2 0xa828be6c in __hwasan_load1
../../../../gcc-source/libsanitizer/hwasan/hwasan.cpp:375
#3 0x400944 in main
(/home/ubuntu/working-directory/temp/pr97941/test+0x400944)
#4 0xa81598dc in __libc_start_main
(/lib/aarch64-linux-gnu/libc.so.6+0x1f8dc)

[0xefe0,0xefe00060) is a small unallocated heap chunk; size: 96
offset: 5
0xefe00005 is located 5 bytes inside of 80-byte region
[0xefe0,0xefe00050)
freed by thread T0 here:
#0 0xa828d64c in __sanitizer_free
../../../../gcc-source/libsanitizer/hwasan/hwasan_interceptors.cpp:108
#1 0x400934 in main
(/home/ubuntu/working-directory/temp/pr97941/test+0x400934)
#2 0xa81598dc in __libc_start_main
(/lib/aarch64-linux-gnu/libc.so.6+0x1f8dc)
#3 0x400814  (/home/ubuntu/working-directory/temp/pr97941/test+0x400814)

previously allocated here:
#0 0xa828db30 in __sanitizer_malloc
../../../../gcc-source/libsanitizer/hwasan/hwasan

[Bug sanitizer/97941] [HWASAN] use After free not working as per expectation

2020-12-11 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97941

Matthew Malcomson  changed:

   What|Removed |Added

 Resolution|--- |WORKSFORME
 Status|NEW |RESOLVED

--- Comment #2 from Matthew Malcomson  ---
Resolving since this works for me and haven't any extra information to believe
that's a coincidence.

[Bug sanitizer/100665] [hwsanitizer] nested funtion pointer is tagged but never checked.

2021-05-27 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100665

--- Comment #1 from Matthew Malcomson  ---
Hi there.
I believe this is how it should work (if I'm understanding & remembering
correctly).

When creating a nested function, we make a single object on the stack that
includes all variables used in the nested function plus a trampoline.
This is called the "nonlocal frame struct" as described in gcc/tree-nested.c.

That single object gets a single tag like all other objects in tagged memory
(trying to separate the closed-over objects from the trampoline and argument
pointers would be pretty awkward when the object is just one struct as far as
the expand code is concerned).

That tag is checked when accessing the closed over variables (i.e. big_array in
the example), so we definitely want to tag the object.

Given that, the question of whether the function pointer (i.e. the pointer to
the trampoline inside that object) should be tagged when passed elsewhere then
has a few benefits:
1) In this case there is no check performed, but there may be checks performed
   if e.g. this function pointer gets cast to an integer pointer and some code
   elsewhere attempts to read that integer.
2) This is just more self-consistent.  Every pointer to a tagged object is
   tagged with the same value.
3) There are hardware extensions to automatically check memory accesses.  If
the
   function pointer is not tagged in this case then (at least for AArch64) the
   PC-relative ldr's in the trampoline stored in that structure will end up
   without a tag and I believe that would trigger a fault.

Point (1) is the main one.  In general when passing a pointer into another
function we don't know if it's going to be accessed or not, so we always need
to
pass tagged pointers.

[Bug sanitizer/100665] [hwsanitizer] nested funtion pointer is tagged but never checked.

2021-06-01 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100665

Matthew Malcomson  changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Matthew Malcomson  ---
(In reply to Hongtao.liu from comment #2)
> (In reply to Matthew Malcomson from comment #1)
> > Given that, the question of whether the function pointer (i.e. the pointer 
> > to
> > the trampoline inside that object) should be tagged when passed elsewhere
> > then
> > has a few benefits:
> > 1) In this case there is no check performed, but there may be checks
> > performed
> >if e.g. this function pointer gets cast to an integer pointer and some
> > code
> >elsewhere attempts to read that integer.
> I'm not sure there're cases where code pointers are casted to integer
> pointers. But consider the above comment, I agree that tag is needed for the
> object.

Fair ;-).
My reasoning was along the lines of "it's an escaped pointer, and I don't know
what other code will do with it" than actually expecting that to happen.

[Bug sanitizer/101744] [12 regression] hwasan new failures since r12-2424

2021-08-05 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101744

--- Comment #7 from Matthew Malcomson  ---
Hi there,

I didn't check all the new tests that Christophe mentioned, but all those I
checked had `dg-require-effective-target hwaddress_exec` in them.

The test that determines that effective target should only pass with a modern
enough kernel (one that supports passing tagged pointers to its syscalls).
It is still failing on my native AArch64 machine.

For anyone that is seeing them -- what kernel version are you running?
If your kernel has not changed could you manually run the check and see if it
passes and why?

I've unfortunately lost my testing environment.  I'm working on getting it back
but will be a while before I can see if I can reproduce the failures on a
machine with the required kernel.

[Bug target/114905] New: aarch64 locally_streaming function ICE in dwarf2cfi due to mismatched CFA instructions in prologue/epilogue

2024-05-01 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114905

Bug ID: 114905
   Summary: aarch64 locally_streaming function ICE in dwarf2cfi
due to mismatched CFA instructions in
prologue/epilogue
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: matmal01 at gcc dot gnu.org
  Target Milestone: ---

Bug observed (testcase + ICE) is below.  I believe this happens because we use
`aarch64_add_sp` to adjust the stack pointer when `maybe_ne (sve_callee_saves,
0)` in `aarch64_expand_epilogue`.  This marks the adjustment as adjusting the
CFA.  However in `aarch64_expand_prologue` we might have set the CFA to the
frame pointer (instead of the stack pointer) if `frame_pointer_needed &&
frame_size.is_constant()`.
Hence when both these conditions are held we have a CFA adjust note that
affects a different register to the current CFA register.



vshcmd: > cat streaming-prologues.c 
[[arm::locally_streaming,arm::streaming_compatible]] void   
no_gprs_saved_very_streaming (__SVBool_t x) 
{   
  asm (""); 
}   

gnu-work [13:47:36] $   
vshcmd: > ${install_dir}/aarch64-none-linux-gnu-gcc \   
vshcmd: > streaming-prologues.c \   
vshcmd: > -fdiagnostics-plain-output -O -fomit-frame-pointer
-fstack-clash-protection\   
vshcmd: > -march=armv9-a+sme -mtune=generic -moverride=tune=none \  
vshcmd: > -fdump-rtl-all-all \  
vshcmd: > -S -o locally_streaming_1_scp.s   
gnu-work [13:47:38] $ > > > > > during RTL pass: dwarf2 
dump file: locally_streaming_1_scp.c.356r.dwarf2
streaming-prologues.c: In function ‘no_gprs_saved_very_streaming’:  
streaming-prologues.c:5:1: internal compiler error: in
dwarf2out_frame_debug_adjust_cfa, at dwarf2cfi.cc:1339  
0xa540bd dwarf2out_frame_debug_adjust_cfa   
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/dwarf2cfi.cc:1339
0xa540bd dwarf2out_frame_debug  
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/dwarf2cfi.cc:2277
0xa540bd scan_insn_after
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/dwarf2cfi.cc:2726
0xa557e0 scan_trace 
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/dwarf2cfi.cc:2893
0xa562cf create_cfi_notes   
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/dwarf2cfi.cc:2938
0xa562cf execute_dwarf2_frame   
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/dwarf2cfi.cc:3309
0xa562cf execute
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/dwarf2cfi.cc:3797
Please submit a full bug report, with preprocessed source (by using
-freport-bug).  
Please include the complete backtrace with any bug report.  
See  for instructions.   
gnu-work [13:47:39] $

[Bug target/114906] New: aarch64 locally_streaming ICE in aarch64_expand_prologue

2024-05-01 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114906

Bug ID: 114906
   Summary: aarch64 locally_streaming ICE in
aarch64_expand_prologue
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: matmal01 at gcc dot gnu.org
  Target Milestone: ---

Bug (testcase + ICE) below.
I believe this is because:
1) We save `r20` below `VG_REGNUM` in `aarch64_layout_frame` (and above the
point that `bytes_below_hard_fp` describes).
2) Despite that save of `r20` causing us to also set
`frame.wb_push_candidate1`, because we have a poly-int sized frame (due to the
-O0 in this case, but I don't think has to be -O0) we still end up in the
"General case" in `aarch64_layout_frame`.
3) Hence we end up with `initial_adjust` non zero, `sve_callee_adjust`
non-zero, and the `VG_REGNUM` not pointing to the same place as
`bytes_below_hard_fp` because there is that r20 saved in between.

My initial guess would be that we should simply change the assertion that
failed to check that VG_REGNUM is *greater than or equal to* `bytes_below_sp`.
To be honest I'm not entirely sure what this assertion is there for so would
not like to actually make that suggestion.  The commit message of ad4df8cd080c
seems to say the assertion is there to ensure that the allocation of VG_REGNUM
is not folded into the initial_allocation, but I don't 100% follow what's going
on.


vshcmd: > cat ../streaming-prologues.c  
[[arm::locally_streaming]] void 
with_callee_saved_regs (__SVBool_t x)   
{   
  asm ("" : : : "r20"); 
}   
testing [14:47:20] $
vshcmd: > ${install_dir}/aarch64-none-linux-gnu-gcc \   
vshcmd: >   ../streaming-prologues.c \  
vshcmd: >   -fdiagnostics-plain-output -O0 -fstack-clash-protection \   
vshcmd: >   -march=armv9-a+sme -mtune=generic -moverride=tune=none \
vshcmd: >   -S -o prologues-with-streaming-1.s  
> > > > during RTL pass: late_pro_and_epilogue  
> > > > 
> > > > 
> > > > 
> > > >
../streaming-prologues.c: In function ‘with_callee_saved_regs’: 
../streaming-prologues.c:5:1: internal compiler error: in
aarch64_expand_prologue, at config/aarch64/aarch64.cc:9705  
0x142e3af aarch64_expand_prologue() 
   
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/config/aarch64/aarch64.cc:9701   
0x1a7eee7 gen_prologue()
   
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/config/aarch64/aarch64.md:1008   
0x140219f target_gen_prologue   
   
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/config/aarch64/aarch64.md:8121   
0xb8d242 make_prologue_seq  
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/function.cc:5818 
0xb8d3aa thread_prologue_and_epilogue_insns()   
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/function.cc:6053 
0xb8dc4e rest_of_handle_thread_prologue_and_epilogue
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/function.cc:6567 
0xb8dcbf execute
/workspace/GNU-toolchain/fsf-trunk/src/gcc/gcc/function.cc:6692 
Please submit a full bug report, with preprocessed source (by using
-freport-bug).  
Please include the complete backtrace with any bug report.  
See  for instructions.   
testing [14:47:22] $

[Bug target/115043] New: aarch64 locally_streaming function appears to have CFA note on wrong instruction in prologue

2024-05-11 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115043

Bug ID: 115043
   Summary: aarch64 locally_streaming function appears to have CFA
note on wrong instruction in prologue
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: matmal01 at gcc dot gnu.org
  Target Milestone: ---

Apologies if I'm misunderstanding something here -- but I noticed this RTL
sequence and I believe the `REG_CFA_DEF_CFA` note is on the wrong insn.

I have not observed wrong behaviour coming from this, but figured still worth a
bug report in case it is indeed wrong.

There seem to be a pair of instructions, one doing some special SME operation
and another storing the stack pointer into x11.  The instruction doing the
special SME thing has a note saying that it sets the CFA to x11.  I would have
expected the note to be on the insn after that records SP into x11.



vshcmd: > cat basic-streaming.c 
[[arm::locally_streaming]] void 
no_gprs_saved (__SVBool_t x)
{   
  asm (""); 
}   
gnu-work [13:19:27] $   
vshcmd: > ${install_dir}/aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
\   
vshcmd: >   basic-streaming.c \ 
vshcmd: >   -fdiagnostics-plain-output -march=armv8.2-a+sme+sve
-fno-stack-protector \  
vshcmd: >   -fdump-rtl-all-all \
vshcmd: >   -O -fshrink-wrap -fstack-clash-protection -g -S -o /dev/null
> > > > gnu-work [13:19:36] $   
> > > > 
> > > > 
> > > > 
> > > >
vshcmd: > # I'm surprised that the REG_CFA_DEF_CFA note is on the instruction   
vshcmd: > # just before we move the stack pointer into x11. 
vshcmd: > grep -C 4 REG_CFA_DEF_CFA.*x11
basic-streaming.c.*.late_pro_and_epilogue   
(insn/f 15 14 16 2 (set (reg:DI 13 x13) 
(const:DI (unspec:DI [  
(const_int 288 [0x120]) 
] UNSPEC_SME_VQ))) "basic-streaming.c":3:1 -1   
 (expr_list:REG_CFA_DEF_CFA (reg:DI 11 x11) 
(nil))) 
(insn 16 15 17 2 (set (reg:DI 11 x11)   
(reg/f:DI 31 sp)) "basic-streaming.c":3:1 -1
 (nil)) 
gnu-work [13:21:21] $

[Bug tree-optimization/116776] Complex if conditions not hoisted from loop

2024-09-19 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116776

--- Comment #1 from Matthew Malcomson  ---
N.b. from experimentation it seems that gcc 11 didn't move any part of the
condition outside of the loop, and since gcc 12 part of the condition has been
moved outside the loop.

I don't think this hoisting has ever happened.

[Bug tree-optimization/116776] New: Complex if conditions not hoisted from loop

2024-09-19 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116776

Bug ID: 116776
   Summary: Complex if conditions not hoisted from loop
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: matmal01 at gcc dot gnu.org
  Target Milestone: ---

The condition in the following loop does not get hoisted at `-O3` on GCC trunk.
Simplifying the condition (by either removing some of the `shouldthischange`
checks, or simplifying the `shouldthischange` function) allows hoisting.

N.b. Some of the condition gets hoisted, just not all.
N.b. having the condition inside the loop blocks vectorisation when compiled
with `-march=armv8.6-a+sve+sve2`.


```
struct teststruct {
unsigned long dims[2];
double *data;
bool ** allocated;
};

bool shouldthischange(struct teststruct *v, int b, int l) {
return 
// true ||
v->dims[1] > l
&& v->allocated[b][l]
;
}

void DoLoop(struct teststruct *x, struct teststruct *y, struct teststruct *z,
unsigned long len)
{
for (unsigned long i = 0; i < len; i++)
if (shouldthischange(x, 0, 0) && shouldthischange(y, 0, 0) &&
shouldthischange(z, 0, 0)) {
z->data[i] = x->data[i] + y->data[i];
}
}
```

[Bug target/117991] [15 regression] RISC-V: g++/template/builtin-speculation-overloads[14].C assertion error since addition in r15-6042-g9ed094a817e

2025-02-12 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117991

Matthew Malcomson  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |matmal01 at gcc dot 
gnu.org

--- Comment #3 from Matthew Malcomson  ---
Created attachment 60477
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60477&action=edit
Proposed patch -- update testsuite

[Bug target/117991] [15 regression] RISC-V: g++/template/builtin-speculation-overloads[14].C assertion error since addition in r15-6042-g9ed094a817e

2025-02-08 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117991

--- Comment #2 from Matthew Malcomson  ---
(In reply to Jeffrey A. Law from comment #1)
> Still occurring on the trunk.  In my case I saw them in a native build &
> test scenario.

Ah -- apologies I missed when this was raised -- will look into this next week.

[Bug middle-end/119108] New: [15 Regression] AArch64 Commit 'vect: Force alignment peeling ...' (https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=68326d5d1a593d) causes regression in Snappy workload for

2025-03-04 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119108

Bug ID: 119108
   Summary: [15 Regression] AArch64 Commit 'vect: Force alignment
peeling ...'
(https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=68326d5
d1a593d) causes regression  in Snappy workload for
-mcpu=neoverse-v2.
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: matmal01 at gcc dot gnu.org
  Target Milestone: ---

Created attachment 60650
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60650&action=edit
Script to reproduce the observed slowdown.

Have observed a slowdown after the referenced commit.

Attaching script for reproduction.  Results when `master` is commit 78380fd7f
inlined below (numbers are percentage change in time from "TOT with problem
commit reverted" to "TOT" -- positive numbers demonstrating the peeling has
caused a slowdown).

I ran the script with:
```
vshcmd: > cd $HOME
vshcmd: > rm -rf $HOME/testing-reproduction-script
vshcmd: > newdir $HOME/testing-reproduction-script
vshcmd: > git clone $HOME/gcc-source gcc_src
vshcmd: > parentdir=$HOME/testing-reproduction-script $HOME/Snappy/reproduce.sh
```


```
BM_UFlat/3/1 5.3
BM_UFlat/3/2 7.14286
BM_UFlat/4/2 2.59319
BM_UFlat/5/1 2.86533
BM_UFlat/5/2 5.10708
BM_UValidate/3/2 2.08333
BM_UValidate/5/2 2.41758
BM_UIOVecSource/1/2 4.21903
BM_UIOVecSource/5/2 5.44218
BM_UIOVecSource/6/2 4.21348
BM_UIOVecSource/7/1 -3.6036
BM_UIOVecSource/7/2 6.84039
BM_UIOVecSource/8/2 3.86905
BM_UIOVecSource/9/2 2.90987
BM_UIOVecSource/11/2 5
BM_UIOVecSink/0 21.3523
BM_UFlatSink/3/1 9.58904
BM_UFlatSink/3/2 10.1449
BM_UFlatSink/5/1 3.17919
BM_ZFlat/1/2 4.54959
BM_ZFlat/5/1 2.73973
BM_ZFlat/5/2 6.31579
BM_ZFlat/6/1 -2.35294
BM_ZFlat/6/2 3.9548
BM_ZFlat/7/1 -3.15315
BM_ZFlat/7/2 5.51948
BM_ZFlat/8/2 3.99202
BM_ZFlat/9/2 3.25145
BM_ZFlat/11/2 5.83942
BM_ZFlatAll/2 3.57955
```

[Bug target/119108] [15 Regression] AArch64 Commit 'vect: Force alignment peeling ...' (r15-6807-g68326d5d1a593d) causes regression in Snappy workload for -mcpu=neoverse-v2.

2025-03-11 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119108

--- Comment #9 from Matthew Malcomson  ---
(In reply to Tamar Christina from comment #8)
> Ok, so having looked at this I'm not sure the compiler is at fault here.
> 
> Similar to the SVN case the snappy code is misaligning the loads
> intentionally and  loading 64-bits at a time from the 8-bit pointer:

... 

> So I think this is a case where the compiler can't do anything. (I also
> think that the C code uses UB similar to SVN, they misalign the byte array
> to 4-bytes but load 8-bytes at a time. They get lucky that the vector code
> is never entered).

...

> 
> The could would be beneficial if they:
> 
> 1. added restrict to the functions, as eg in `FindMatchLengthPlain` values
> manually vectorized anyway so aliasing must not be a problem
> 2. they have a simple scalar loop variant that's left up to the vectorizer
> to vectorize.  This would actually give them faster code and allow e.g. SVE
> codegen.


Thanks for looking into it Tamar!

Few questions (some just because I want to make sure I understand -- some more
on topic ;-)

Just to understand:
- What SVN case are you referencing?
- How is this UB?  The UNALIGNED_LOAD64 seems to use `memcpy`, and they provide
a relevant limit on the reads of 8 bytes at a time.

More relevant to the issue:
- I tried by adding `__restrict__` to `s1` and `s2` in `FindMatchLengthPlain`
and replacing the function with a plain loop.  I saw a significant slowdown. 
Is  your point that this would allow the compiler to do something about the
code even though it may not be better right now?  Or did you mean inline the
loop or something.  (N.b. didn't double-check the codegen of that function --
just ran the benchmark naively again -- so if there was any obvious adjustment
in flags or the like I should make I didn't make it ;-)

[Bug target/119108] [15 Regression] AArch64 Commit 'vect: Force alignment peeling ...' (r15-6807-g68326d5d1a593d) causes regression in Snappy workload for -mcpu=neoverse-v2.

2025-03-05 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119108

--- Comment #3 from Matthew Malcomson  ---
I only looked into VecSource/5/2, and unfortunately I looked into it on an
internal setup that compiles slightly differently.

In that slightly different compilation I noticed that `FindMatchLengthPlain`
was affected by the patch, and perf pointed to extra branch mispredictions on
the changed code.  This was particularly noticeable in that different
compilation since `FindMatchLengthPlain` was not inlined.

Am currently looking to reproduce that finding with upstream sources so it's
more useful than hearsay.

[Bug target/119108] [15 Regression] AArch64 Commit 'vect: Force alignment peeling ...' (r15-6807-g68326d5d1a593d) causes regression in Snappy workload for -mcpu=neoverse-v2.

2025-03-05 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119108

--- Comment #7 from Matthew Malcomson  ---
FWIW I have managed to figure out what the difference between my internal build
and the upstream one was -- my reproduction script has the line
`-DCMAKE_BUILD_TYPE=Release` in it and the local build that I did some
performance analysis on does not.

>From looking at the build logs it seems the only real difference due to this
difference in flags is that `-DNDEBUG` is passed to the compiler.  So things
still got optimised -- though obviously this is not the best for a benchmark
run.


However it does seem somewhat useful that without the abvoe `cmake` argument I
can see the (now not inlined) `FindMatchLengthPlain` function change and start
to take up a much greater proportion in the perf statistics with 68326d5d.

[Bug libgomp/119588] New: Possible improvement in locking strategies for libgomp

2025-04-02 Thread matmal01 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119588

Bug ID: 119588
   Summary: Possible improvement in locking strategies for libgomp
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libgomp
  Assignee: unassigned at gcc dot gnu.org
  Reporter: matmal01 at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Created attachment 60960
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60960&action=edit
Demonstrating locking differences

Hello,

Summary is:  I'm proposing that we implement the "hypercube-embedded tree"
locking strategy that LLVM libomp uses by default in libgomp.
Would appreciate feedback on whether this would be welcome and/or feasible.

Below contains the observations I've made to come to that suggestion.

Apologies for taking my time between asking on IRC and raising the PR.

--
We've seen on some internal workloads (NVPL BLAS running GEMM routine on a
small
matrix) that the overhead of a `#pragma omp parallel` statement when running
with a high number of cores (72 or 144) is much higher with the libgomp
implementation than with LLVM's libomp.

In a program which has both some work that can be handled with high parallelism
(so OMP is running with many threads) and a large number of small pieces of
work
that need to be performed with low overhead, this has been seen to cause a
significant overhead when accumulated.

I'm attaching a benchmark for just the creation of a `#pragma omp parallel`
region (around an `asm` statement so the region doesn't get optimised away). 
We
can see that with many threads libgomp scales worse than llvm's libomp.

When compiled with the below:
#+begin_example
  vshcmd: > ${gcc_install_path}/bin/g++ -O3 -fopenmp OpenMP-reproducer.cpp -o
bench.gcc.x
  vshcmd: > ${clang_install_path}/bin/clang++ -O3 -fopenmp
OpenMP-reproducer.cpp -o bench.clang.x
  lego-c2-qs-56:openmp-parallel-gomp-slow [04:02:41] $
lego-c2-qs-56:openmp-parallel-gomp-slow [04:02:44] $ 
#+end_example


Numbers I've observed are such showing that at 144 threads the cost of just the
barrier is much higher with GNU than with LLVM (N.b. this is on an AArch64
machine
with 144 cores):

#+begin_example
  vshcmd: > bench_gcc () {
  vshcmd: > LD_LIBRARY_PATH=${gcc_install_path}/lib64 ./bench.gcc.x
  vshcmd: > }
  vshcmd: > bench_clang () {
  vshcmd: > LD_LIBRARY_PATH=${clang_install_path}/lib ./bench.clang.x
  vshcmd: > }
  vshcmd: > three_times () {
  vshcmd: > for i in 1 2 3; do
  vshcmd: > $1
  vshcmd: > done
  vshcmd: > }
  vshcmd: > high_thread_counts () {
  vshcmd: > for num_threads in 72 144; do
  vshcmd: > export OMP_NUM_THREADS=$num_threads
  vshcmd: > echo " NUM = $num_threads"
  vshcmd: > OMP_PROC_BIND=true OMP_WAIT_POLICY=active three_times $1
  vshcmd: > done
  vshcmd: > }
  > > lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $ > >
lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $ > > > >
lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $ > > > > > >
lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:02] $ 
  vshcmd: > # Without any specification of locking mechanisms, clang approx
thrice performance of GCC.
  vshcmd: > high_thread_counts bench_gcc
   NUM = 72
  creation maxthr:72 nthr:72 min_time:10.694 us max_time:11.181 us
avg_time:10.839 us stddev:23.127 us
  creation maxthr:72 nthr:72 min_time:10.214 us max_time:10.567 us
avg_time:10.335 us stddev:11.986 us
  creation maxthr:72 nthr:72 min_time:10.147 us max_time:10.615 us
avg_time:10.357 us stddev:19.212 us
   NUM = 144
  creation maxthr:144 nthr:144 min_time:31.421 us max_time:32.003 us
avg_time:31.735 us stddev:31.332 us
  creation maxthr:144 nthr:144 min_time:30.592 us max_time:31.953 us
avg_time:31.352 us stddev:132.466 us
  creation maxthr:144 nthr:144 min_time:31.089 us max_time:31.953 us
avg_time:31.640 us stddev:60.002 us
  lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:05] $ 
  vshcmd: > high_thread_counts bench_clang
   NUM = 72
  creation maxthr:72 nthr:72 min_time:8.574 us max_time:9.006 us avg_time:8.877
us stddev:17.170 us
  creation maxthr:72 nthr:72 min_time:8.601 us max_time:8.749 us avg_time:8.686
us stddev:3.635 us
  creation maxthr:72 nthr:72 min_time:8.206 us max_time:8.471 us avg_time:8.421
us stddev:6.070 us
   NUM = 144
  creation maxthr:144 nthr:144 min_time:9.958 us max_time:11.293 us
avg_time:10.388 us stddev:133.078 us
  creation maxthr:144 nthr:144 min_time:9.685 us max_time:10.618 us
avg_time:10.232 us stddev:83.710 us
  creation maxthr:144 nthr:144 min_time:9.132 us max_time:9.783 us
avg_time:9.434 us stddev:42.769 us
  lego-c2-qs-56:openmp-parallel-gomp-slow [04:37:06] $ 
#+end_example


I believe the difference to be the locking algorithm used.  There are
environment
variables that the LLVM libomp uses to adjust locking strateg