[Bug target/89838] [ARC] ICE building glibc testsuite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89838

Vineet Gupta changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #4 from Vineet Gupta ---
closing per fix pointed by Claudiu !
[Bug target/92845] [ARC] gcc not generating hardware compare instruction FDCMP for -mcpu=hs38_linux
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92845

Vineet Gupta changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #2 from Vineet Gupta ---
Addressed via

2019-12-12 48f13fb118fe [ARC] Use hardware support for double-precision compare instructions.
[Bug target/92846] [ARC] floating point compares not generating Invalid Operand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92846

Vineet Gupta changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED
            Summary|[ARC] gloating point        |[ARC] floating point
                   |compares not generating     |compares not generating
                   |Invalid Operand             |Invalid Operand

--- Comment #4 from Vineet Gupta ---
Resolved via

2019-12-12 fbf8314b0a8d [ARC] generate signaling FDCMPF for hard float comparisons
[Bug c/100363] New: gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

            Bug ID: 100363
           Summary: gcc generating wider load/store than warranted at -O3
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vgupta at synopsys dot com
  Target Milestone: ---

Created attachment 50722
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50722&action=edit
test case with an additional nop to annotate codegen

In the Linux kernel's initramfs gzip inflate code, an inner copy loop using
unsigned short pointers (src/dst) is compiled to wider accesses of 8 or 16
bytes at a time (vs. 2 bytes at a time), causing extra/unintended bytes to be
copied and leading to corruption of inflated files on target.

This showed up on the upstream v5.6 Linux kernel built for ARC (which defaults
to -O3). The issue doesn't happen at -O2.

The full test case is attached, but the gist of it is:

lib/zlib_inflate/inffast.c

    if (dist > 2) {
        unsigned short *sfrom;

        sfrom = (unsigned short *)(from);
        loops = len >> 1;
        do
            *sout++ = *sfrom++;
        while (--loops);
        out = (unsigned char *)sout;
        from = (unsigned char *)sfrom;
    }
    ...

@sfrom and @sout are unsigned short pointers and are thus expected to work on
2 bytes at a time. However, at -O3 gcc is generating wide loads and stores
(8-byte LDD/STD on ARCv2, 16-byte LDR q0 on aarch64).

For aarch64, it seems code is generated for 16-byte accesses as well as 2-byte
ones, and I haven't verified whether it skips the 16-byte code based on size
etc., but the code is generated nonetheless. For ARC, the 8-byte loop is
certainly executed, causing the corruption described above.

The issue was originally seen with mainline gcc 10.2 (again on both ARC and
aarch64) at -O3, and I can confirm it exists in gcc 9.3 as well.

The attached preprocessed source file is from an ARC Linux build (but it builds
for aarch64 too, since none of the arch-specific functions are used here).
[Bug middle-end/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #3 from Vineet Gupta ---
Created attachment 50723
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50723&action=edit
preprocessed source file (with extra nop annotation)
[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #6 from Vineet Gupta ---
(In reply to Linus Torvalds from comment #4)
> (In reply to Andrew Pinski from comment #1)
> > The loop gets vectorized, I don't see the problem really.
>
> See
>
>   https://github.com/foss-for-synopsys-dwc-arc-processors/toolchain/issues/372
>
> and in particular the comment
>
>   "In the first 8-byte copy, src and dst overlap"
>
> so apparently gcc has decided that they can't overlap, despite the two
> pointers being literally generated from the same base pointer.

Exactly:

> But I don't read arc assembly, so I'll have to take Vineet's word for it.

fwiw:
LDD.a  [base, off] is 8-byte load with pre-incr  : eff addr = base + offset
STD.ab [base, off] is 8-byte store with post-incr: eff addr = base

> Vineet, have you been able to generate a smaller test-case?

No I'm afraid not.
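The two addressing modes described above can be sketched in C as follows. This is a rough model built only from the one-line descriptions in the comment, not from the ARCv2 manual. The helper names `ldd_a`, `std_ab` and `wide_copy_loop` are hypothetical, and it assumes the base register is written back with base + offset in both cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of "LDD.a reg,[base,off]": pre-increment.
 * The effective address is base + off, and base is updated to it. */
static uint64_t ldd_a(const unsigned char **base, ptrdiff_t off)
{
    uint64_t v;
    assert(base && *base);
    *base += off;                    /* update first ... */
    memcpy(&v, *base, sizeof v);     /* ... then load 8 bytes from the new base */
    return v;
}

/* Model of "STD.ab reg,[base,off]": post-increment.
 * The effective address is the old base; base is advanced afterwards. */
static void std_ab(uint64_t v, unsigned char **base, ptrdiff_t off)
{
    assert(base && *base);
    memcpy(*base, &v, sizeof v);     /* store 8 bytes at the current base ... */
    *base += off;                    /* ... then advance */
}

/* One ldd.a/std.ab pair per iteration. Because the load pre-increments,
 * the first bytes read are at src_base + 8, not at src_base. */
static void wide_copy_loop(unsigned char *dst, const unsigned char *src_base,
                           unsigned iterations)
{
    while (iterations--) {
        uint64_t v = ldd_a(&src_base, 8);
        std_ab(v, &dst, 8);
    }
}
```

Note the asymmetry: the source register has to start 8 bytes before the first byte to be read, whereas the destination register points directly at the first byte to be written.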
[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #12 from Vineet Gupta ---
Created attachment 50742
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50742&action=edit
kernel patch as proposed on comment #7
[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #13 from Vineet Gupta ---
Sorry the workaround proposed by Alexander doesn't seem to cure it (patch
attached), outcome is the same

        mov     lp_count,r13;5  #, bnd.65
        lp      @.L201  ; lp_count:@.L50->@.L201#,
        .align 2
.L50:
# ../lib/zlib_inflate/inffast.c:288:                    PUP(sout) = PUP(sfrom);
        ldd.a   r18,[r21,8]     # MEM[base: _496, offset: 0B], MEM[base: _496, offset: 0B]
# ../lib/zlib_inflate/inffast.c:288:                    PUP(sout) = PUP(sfrom);
        std.ab  r18,[r22,8]     # MEM[base: vectp_prephitmp.73_741, offset: 0B], MEM[base: _496, offset: 0B]
        .align 2
.L201:
        ; ZOL_END, begins @.L50 #
[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #15 from Vineet Gupta ---
(In reply to Linus Torvalds from comment #14)
> (In reply to Vineet Gupta from comment #13)
> > Sorry the workaround proposed by Alexander doesn't seem to cure it (patch
> > attached), outcome is the same
>
> Vineet - it's not the ldd/std that is necessarily buggy, it's the earlier
> tests of the address that guard that vectorized path.
>
> So your quoted parts of the code generation aren't necessarily the
> problematic ones.

/me slaps myself. How can I be so stupid.

> Did you actually test the code and check whether it has the same issue?
> Maybe it changed the address limit guards before that ldd/std?

The problem is indeed gone. I still need to analyze the assembly fully to see
how it prevents the bad case.

For example, I'm still not comfortable seeing the loop entered with the
following values and doing 8-byte ldd/std when we know it should only do
2 bytes at a time.

   r21 = 0xbf178036 (pre-increment so 0x3e will be first src)
   r22 = 0xbf1780b2
   LPC = 4

   80d9a360:    lp      12      ;80d9a36c
   80d9a364:    ldd.a   r18r19,[r21,8]
   80d9a368:    std.ab  r18r19,[r22,8]

> I also sent you a separate patch to test if just upgrading to a newer
> version of the zlib code helps. Although that may be buggy for other
> reasons, it's not like I actually tested the end result.. But it would be
> interesting to hear if that one works for you (again, ldd/std might be a
> valid end result of trying to vectorize that code assuming the aliasing
> tests are done correctly in the vectorized loop headers).

Thx for that. And this seems to boot as well.
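As a worked check of the register values quoted above, the following sketch computes the byte ranges the loop would read and write. It assumes each of the LPC iterations moves 8 bytes and that the load pre-increments by 8 as stated in the comment. The type `struct span` and the helpers `load_span`, `store_span` and `spans_overlap` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open address range [lo, hi). */
struct span { uint32_t lo, hi; };

/* Bytes read by `lpc` iterations of "ldd.a rX,[r21,8]":
 * the first load is at r21 + 8 because of the pre-increment. */
static struct span load_span(uint32_t r21, unsigned lpc)
{
    struct span s;
    assert(lpc > 0);
    s.lo = r21 + 8;
    s.hi = s.lo + 8u * lpc;
    return s;
}

/* Bytes written by `lpc` iterations of "std.ab rX,[r22,8]":
 * the first store is at r22 itself (post-increment). */
static struct span store_span(uint32_t r22, unsigned lpc)
{
    struct span s;
    assert(lpc > 0);
    s.lo = r22;
    s.hi = s.lo + 8u * lpc;
    return s;
}

/* Nonzero if the two half-open ranges share at least one byte. */
static int spans_overlap(struct span a, struct span b)
{
    return a.lo < b.hi && b.lo < a.hi;
}
```

With r21 = 0xbf178036, r22 = 0xbf1780b2 and LPC = 4, this gives a read range of [0xbf17803e, 0xbf17805e) and a write range of [0xbf1780b2, 0xbf1780d2). The two ranges start 0x74 bytes apart and do not overlap, so for this particular entry the wide copy would move the same bytes as the 2-byte loop. That is consistent with the point that the ldd/std pair is only a problem when the guarding address checks let overlapping ranges through; it says nothing about other entries into the loop.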
[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #18 from Vineet Gupta ---
(In reply to Richard Biener from comment #9)
> (In reply to Linus Torvalds from comment #8)
> > (In reply to Alexander Monakov from comment #7)
> > >
> > > Most likely the issue is that sout/sfrom are misaligned at runtime, while
> > > the vectorized code somewhere relies on them being sufficiently aligned
> > > for a 'short'.
> >
> > They absolutely are.
> >
> > And we build the kernel with -Wno-strict-aliasing exactly to make sure the
> > compiler doesn't think that "oh, I can make aliasing decisions based on type
> > information".
> >
> > Because we have those kinds of issues all over, and we know which
> > architectures support unaligned loads etc, and all the tricks with
> > "memcpy()" and unions make for entirely unreadable code.
> >
> > So please fix the aliasing logic to not be type-based when people explicitly
> > tell you not to do that.
> >
> >            Linus
>
> Note alignment has nothing to do with strict-aliasing (-fno-strict-aliasing
> you mean btw).
>
> One thing we do is (I'm not 50% sure this explains the observed issue) assume
> that if you have two accesses with type 'short' and they are aligned
> according to this type then they will not partly overlap. Note this has
> nothing to do with C strict aliasing rules but is basic pointer math when
> you know lower zero bits.

OK, given that the source code has type short, the compiler will assume these
pointers are short-aligned and thus won't partly overlap for short accesses.

But then the code actually generated by the loop vectorizer assumes they are
8 bytes apart, since 8-byte accesses are what it is generating.

> I suggest to try the fix suggested in comment#7 and report back if that
> fixes the observed issue.
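The "basic pointer math" argument can be checked by brute force. The sketch below is an illustration under stated assumptions, not GCC's actual dependence analysis: it treats addresses as plain integers in a small window, and the helpers `partly_overlap` and `count_partial` are hypothetical names. Two accesses of the same size partly overlap when their addresses differ by a nonzero amount smaller than the access size.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if accesses [a, a+size) and [b, b+size) share some bytes
 * without being the very same access. */
static int partly_overlap(uintptr_t a, uintptr_t b, uintptr_t size)
{
    uintptr_t d = a > b ? a - b : b - a;
    return d != 0 && d < size;
}

/* Count partly overlapping pairs among all addresses in [0, 64)
 * that are multiples of `align`, for accesses of `size` bytes. */
static unsigned count_partial(uintptr_t align, uintptr_t size)
{
    unsigned hits = 0;
    assert(align > 0);
    for (uintptr_t a = 0; a < 64; a += align)
        for (uintptr_t b = 0; b < 64; b += align)
            hits += (unsigned)partly_overlap(a, b, size);
    return hits;
}
```

In this model, `count_partial(2, 2)` is zero: 2-byte accesses at 2-byte-aligned addresses are either identical or disjoint, which is the assumption described for 'short'. `count_partial(1, 2)` is nonzero, which covers the case where the pointers are not actually aligned at runtime, and `count_partial(2, 8)` is also nonzero, so 2-byte alignment alone does not rule out partial overlap once the accesses are 8 bytes wide.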