[Bug rtl-optimization/111376] New: missed optimization of one bit test on MIPS32r1

2023-09-11 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

Bug ID: 111376
   Summary: missed optimization of one bit test on MIPS32r1
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lis8215 at gmail dot com
  Target Milestone: ---

Created attachment 55879
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55879&action=edit
Silly patch to enable SLL+BLTZ/BGEZ

Currently, for testing bits above the 14th, the following instructions are emitted:

 LUI $t1, 0x1000     # 0x10000000
 AND $t0, $t1, $t0
 BEQ/BNE $t0, $Lxx

However, there is a shorter and faster alternative: just shift the bit of
interest into the sign bit and branch with BLTZ/BGEZ.
The code above can be replaced with:

 SLL $t0, $t0, 3
 BGEZ/BLTZ $t0, $Lxx

Not sure if it can be applied to MIPS64 without EXT/INS instructions,
or to older MIPS revisions (I..V).
But for MIPS32 it helps reduce code size by removing roughly 1 insn per 700,
evaluated on the Linux kernel and Python 3.11.
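
For reference, a minimal C testcase of the kind of source that triggers this
pattern (my own sketch, not taken from the report) could be:

#include <stdint.h>

/* hypothetical testcase: testing bit 28 of a word (above bit 14),
   which currently needs LUI+AND+BEQ/BNE on MIPS32r1 */
int test_bit28 (uint32_t x)
{
    if (x & (1u << 28))
        return 1;
    return 0;
}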

[Bug rtl-optimization/111378] New: Missed optimization for comparing with exact_log2 constants

2023-09-11 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111378

Bug ID: 111378
   Summary: Missed optimization for comparing with exact_log2
constants
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lis8215 at gmail dot com
  Target Milestone: ---

The simple example below produces suboptimal code on many targets where an
exact_log2 constant can't be represented as an immediate operand
(confirmed on MIPS/PPC64/SPARC/RISC-V).

#include <stdint.h>

extern void do_something(char* p);
extern void do_something_other(char* p);

void test(char* p, uint32_t ch)
{
    if (ch < 0x10000)
    {
        do_something(p);
    }
    else /* ch >= 0x10000 */
    {
        do_something_other(p);
    }
}

However, instead of comparing directly with the constant, we can use a shift and
a compare against zero:
e.g. (ch < 0x10000) can be transformed into ((ch >> 16) == 0), which is usually
shorter & faster on many targets.
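
For illustration, the hand-rewritten form (my own sketch, reusing the
declarations from the example above) would be:

void test_shifted(char* p, uint32_t ch)
{
    /* (ch < 0x10000) rewritten as a shift and a compare with zero */
    if ((ch >> 16) == 0)
        do_something(p);
    else
        do_something_other(p);
}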

AFAIK the condition appears rarely in real-world code: 20-30 occurrences per
million asm instructions. Fun fact: many of them are related to Unicode
transformations.

[Bug middle-end/111384] New: missed optimization: GCC adds extra any extend when storing subreg#0 multiple times

2023-09-12 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111384

Bug ID: 111384
   Summary: missed optimization: GCC adds extra any extend when
storing subreg#0 multiple times
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lis8215 at gmail dot com
  Target Milestone: ---

Simple example:

#include <stdint.h>

void store_hi_twice(uint32_t src, uint16_t *dst1, uint16_t *dst2)
{
    *dst1 = src;
    *dst2 = src;
}

shows that GCC can't optimize out the unnecessary zero extension of src's low
half when it is stored two or more times. Many targets are affected, although
x86-64 is not.

[Bug rtl-optimization/111384] missed optimization: GCC adds extra any extend when storing subreg#0 multiple times

2023-09-12 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111384

--- Comment #2 from Siarhei Volkau  ---
Here is what godbolt shows with -O2 -fomit-frame-pointer.

ARM:
    uxth    r0, r0  @ << zero extend
    strh    r0, [r1]
    strh    r0, [r2]
    bx      lr

ARM64:
    and     w0, w0, 65535   @ << zero extend
    strh    w0, [x1]
    strh    w0, [x2]
    ret

MIPS64:
    andi    $4,$4,0xffff    @ << zero extend
    sh      $4,0($5)
    jr      $31
    sh      $4,0($6)

MRISC32:
    shuf    r1, r1, #2888   @ << zero extend
    sth     r1, [r2]
    sth     r1, [r3]
ret

RISC-V:
    slli    a0,a0,16    @ << zero extend
    srli    a0,a0,16    @ << zero extend
    sh      a0,0(a1)
    sh      a0,0(a2)
    ret

RISC-V (64-bit):
    slli    a0,a0,48    @ << zero extend
    srli    a0,a0,48    @ << zero extend
    sh      a0,0(a1)
    sh      a0,0(a2)
    ret

Xtensa ESP32:
    entry   sp, 32
    extui   a2, a2, 0, 16   @ << zero extend
    s16i    a2, a3, 0
    s16i    a2, a4, 0
    retw.n

Loongarch64:
    bstrpick.w  $r4,$r4,15,0    @ << zero extend
    st.h    $r4,$r5,0
    st.h    $r4,$r6,0
    jr      $r1

MIPS:
    andi    $4,$4,0xffff    @ << zero extend
    sh      $4,0($5)
    jr      $31
    sh      $4,0($6)

SH:
    extu.w  r4,r4   @ << zero extend
    mov.w   r4,@r5
    rts
    mov.w   r4,@r6


The other targets available at godbolt (x86-64/Power/Power64/s390) are unaffected.

[Bug middle-end/111626] New: missed optimization combining offset of array member in struct with offset inside the array

2023-09-28 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111626

Bug ID: 111626
   Summary: missed optimization combining offset of array member
in struct with offset inside the array
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lis8215 at gmail dot com
  Target Milestone: ---

The simple code:

#include <stdint.h>

struct some_struct {
    uint32_t some_member;
    uint32_t arr[4][16];
};

uint32_t fn(const struct some_struct *arr, int idx)
{
    return arr->arr[1][idx];
}

showcases suboptimal code generation on some platforms, including RISC-V and
MIPS (32 & 64 bit), even with `some_member` commented out.

GCC emits:
    addi    a1,a1,16
    slli    a1,a1,2
    add     a0,a0,a1
    lw      a0,4(a0)
    ret

while Clang does a better job:
    slli    a1, a1, 2
    add     a0, a0, a1
    lw      a0, 68(a0)
    ret
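
For illustration, the folded addressing that Clang uses corresponds roughly to
this hand-written C (my own sketch; the constant 68 assumes the struct layout
above: 4 bytes for some_member plus 64 bytes for arr[0]):

uint32_t fn_folded(const struct some_struct *s, int idx)
{
    /* combine the constant offsets: 4 (some_member) + 64 (arr[0]) = 68 */
    const char *base = (const char *)s;
    return *(const uint32_t *)(base + 68 + (unsigned)idx * 4);
}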

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-04 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #3 from Siarhei Volkau  ---
I know that the patch breaks condmove cases; that's why it is silly.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-06 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #6 from Siarhei Volkau  ---
Well, it works mostly well.
However, it still has issues that my patch addressed:
 1) It doesn't work for -Os: most likely a costing issue.
 2) It breaks condmoves, as mine does. I have no idea how to avoid that though.
 3) It overlaps the preferable ANDI+BEQ/BNE cases (which don't break condmoves).

I think it will be okay once 1 and 3 are fixed.

PS: tested by applying the patch to GCC 11; I will try with upstream this
weekend.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-07 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #8 from Siarhei Volkau  ---
Created attachment 58377
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58377&action=edit
condmove testcase

Tested with current GCC master branch:

- Working with -Os is confirmed.
- The condmove issue is present in GCC 11 but not in current master. Even for
GCC 11 it is a very rare case, although I found one relatively simple to
reproduce: an excerpt from Python 3.8.x, reduced as much as I could.
Compilation flags tested: {-O2|-Os} -mips32 -DNDEBUG -mbranch-cost={1|10}

So, in my opinion, the patch you propose is perfectly fine.
The condmove issue no longer seems relevant.

[Bug middle-end/111835] New: Suboptimal codegen: zero extended load instead of sign extended one

2023-10-16 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111835

Bug ID: 111835
   Summary: Suboptimal codegen: zero extended load instead of sign
extended one
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lis8215 at gmail dot com
  Target Milestone: ---

In this simplified example:

#include <stdint.h>

int test (const uint8_t * src, uint8_t * dst)
{
    int8_t tmp = (int8_t)*src;
    *dst = tmp;
    return tmp;
}

GCC prefers a load with zero extension over the more appropriate sign-extending
load, and then needs an explicit sign extension to form the return value.

I know there are a lot of bugs related to zero/sign extension, but I believe
this is a rare special case; it reproduces in any GCC version available at
godbolt and on any architecture except x86-64.

[Bug rtl-optimization/104387] aarch64: Redundant SXTH for “bag of bits” moves

2023-10-31 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104387

--- Comment #4 from Siarhei Volkau  ---
*** Bug 111384 has been marked as a duplicate of this bug. ***

[Bug rtl-optimization/111384] missed optimization: GCC adds extra any extend when storing subreg#0 multiple times

2023-10-31 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111384

Siarhei Volkau  changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 Status|NEW |RESOLVED

--- Comment #5 from Siarhei Volkau  ---
Dup of bug 104387.

*** This bug has been marked as a duplicate of bug 104387 ***

[Bug rtl-optimization/111835] Suboptimal codegen: zero extended load instead of sign extended one

2023-10-31 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111835

--- Comment #3 from Siarhei Volkau  ---
I don't think that it is duplicate of the bug 104387 because there's only one
store.
And this bug is simply disappears if we change the source code a bit.
e.g.
 - change (int8_t)*src; to *(int8_t*)src;
or change argument uint8_t * dst to int8_t * dst

But if we have multiple stores, extension will remain in any condition.
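
For clarity, the two rewrites mentioned above look like this (my own sketch of
the workarounds described in the comment, not a fix in GCC):

#include <stdint.h>

/* variant 1: load through an int8_t pointer instead of casting the value */
int test_a (const uint8_t * src, uint8_t * dst)
{
    int8_t tmp = *(const int8_t *)src;
    *dst = (uint8_t)tmp;
    return tmp;
}

/* variant 2: change the destination argument to int8_t * */
int test_b (const uint8_t * src, int8_t * dst)
{
    int8_t tmp = (int8_t)*src;
    *dst = tmp;
    return tmp;
}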

[Bug middle-end/112398] New: Suboptimal code generation for xor pattern on subword data

2023-11-05 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112398

Bug ID: 112398
   Summary: Suboptimal code generation for xor pattern on subword
data
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lis8215 at gmail dot com
  Target Milestone: ---

These minimal examples showcase the issue:

#include <stdint.h>

uint8_t neg8 (const uint8_t *src)
{
    return ~*src;
    // or return *src ^ 0xff;
}

uint16_t neg16 (const uint16_t *src)
{
    return ~*src;
    // or return *src ^ 0xffff;
}

GCC transforms the xor here into not + zero_extend, which isn't the best choice.
I guess the combiner should try the xor pattern instead of not + zero_extend, as
it might be cheaper.

[Bug target/112398] Suboptimal code generation for xor pattern on subword data

2023-11-05 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112398

--- Comment #3 from Siarhei Volkau  ---
Well, let's rewrite it this way:

void neg8 (uint8_t *restrict dst, const uint8_t *restrict src)
{
    uint8_t work = ~*src; // or *src ^ 0xff;
    dst[0] = (work >> 4) | (work << 4);
}

Wherever the upper bits have to be zero, it is cheaper to use xor; otherwise
we're relying on techniques for eliminating the redundant zero_extend, and at
least on MIPS (prior to R2) and RISC-V, GCC emits the zero_extend instruction.

MIPS, neg8:
neg8:
    lbu     $2,0($5)
    nop
    nor     $2,$0,$2
    andi    $3,$2,0x00ff
    srl     $3,$3,4
    sll     $2,$2,4
    or      $2,$2,$3
    jr      $31
    sb      $2,0($4)

RISC-V, neg8:
    lbu     a5,0(a1)
    not     a5,a5
    andi    a4,a5,0xff
    srli    a4,a4,4
    slli    a5,a5,4
    or      a4,a4,a5
    sb      a4,0(a0)
    ret

Some other RISCs also emit the zero_extend, but I'm not sure they have a cheaper
xor alternative (S390, SH, Xtensa).

[Bug rtl-optimization/112474] New: MIPS: missed optimization for assigning HI reg to zero

2023-11-10 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112474

Bug ID: 112474
   Summary: MIPS: missed optimization for assigning HI reg to zero
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lis8215 at gmail dot com
  Target Milestone: ---

Created attachment 56550
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56550&action=edit
the patch

At the moment GCC emits a move of $0 into a GPR and then a move to HI (mthi)
from that register, as part of a DI/TI MADD operation.

It is feasible to avoid this copy when the intermediate register isn't used
any more, and rework the RTL to emit only `mthi $0`.

So the following output:
...
move $3, $0
mthi $3
; reg dead $3
...

can simply be rewritten as:
...
move $3, $0 ; << will be removed by DCE later
mthi $0
...

A silly patch which performs this optimization is attached.

Thanks.

[Bug rtl-optimization/112474] MIPS: missed optimization for assigning HI reg to zero

2023-11-10 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112474

--- Comment #1 from Siarhei Volkau  ---
Minimal example to showcase the issue:

#include <stdint.h>

uint64_t mthi_example(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint64_t ret;
    ret = (uint64_t)a * b + (uint64_t)c * d + 1u;
    return ret;
}

compile command:
  mipsel-*-gcc -O2 -mips32

[Bug rtl-optimization/60749] combine is overly cautious when operating on volatile memory references

2023-05-26 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60749

Siarhei Volkau  changed:

   What|Removed |Added

 CC||lis8215 at gmail dot com

--- Comment #2 from Siarhei Volkau  ---
Created attachment 55167
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55167&action=edit
allow combine ld/st of volatile mem with any_extend op

Is anyone bothered by this? As an embedded engineer, I have been sadly watching
this long-standing issue.

I can propose a quick patch which enables combining volatile mem ld/st with
any_extend in most cases. And it seems that platform-specific test results
remain the same with it (arm/aarch64/mips were tested).

Posting it in the hope it can help anyone who needs it.
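
As an illustration of the kind of code affected (my own hypothetical example,
not taken from the bug), a volatile load followed by an extension looks like
this:

#include <stdint.h>

/* hypothetical example: reading an 8-bit memory-mapped register;
   the zero_extend of the volatile load is currently not combined
   into a single extending-load instruction on targets that have one */
uint32_t read_mmio_reg (volatile uint8_t *reg)
{
    return *reg;
}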

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-14 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #12 from Siarhei Volkau  ---
Most likely it's because of a data dependency, and not the direct cost of shift
operations on LoongArch, although I can't find information to prove that.
So I guess it still might give a performance benefit in cases where the
scheduler can put some instruction(s) between the SLL and the BGEZ.

Since you have access to hardware, you could measure the performance of two
variants:
1) SLL+BGEZ
2) SLL+NOT+BGEZ
If their performance is equal, then I'm correct and the scheduling automaton for
GS464 seems to need fixing.

From my side, I can confirm that SLL+BGEZ is faster than LUI+AND+BEQ on Ingenic
XBurst 1 cores.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-15 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #15 from Siarhei Volkau  ---
Created attachment 58437
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58437&action=edit
application to test performance of shift

Here is the test application (MIPS32 specific) I wrote.

It allows detecting the execution cycles and any extra pipeline stalls for SLL,
if they take place.

for XBurst 1 (jz4725b) result is the following:

`SLL to use latency test` execution median: 168417 ns, min: 168416 ns
`SLL to use latency test with nop` execution median: 196250 ns, min: 196166 ns

`SLL to branch latency test` execution median: 196250 ns, min: 196166 ns
`SLL to branch latency test with nop` execution median: 224000 ns, min: 224000
ns

`SLL by 7 to use latency test` execution median: 168417 ns, min: 168416 ns
`SLL by 15 to use latency test` execution median: 168417 ns, min: 168416 ns
`SLL by 23 to use latency test` execution median: 168417 ns, min: 168416 ns
`SLL by 31 to use latency test` execution median: 168417 ns, min: 168416 ns

`LUI>AND>BEQZ reference test` execution median: 196250 ns, min: 196166 ns
`SLL>BGEZ reference test` execution median: 168417 ns, min: 168416 ns



What it means:

The `SLL to use latency test` at 168417 ns and `.. with nop` at 196250 ns
mean that there are no extra stall cycles between SLL and a subsequent use by an
ALU operation.

The `SLL to branch latency test` and `.. with nop` results
mean that there are no extra stall cycles between SLL and a subsequent use by a
branch operation.

The `SLL by N` results mean that SLL execution time doesn't depend on the shift
amount.

And finally, the reference test results show that the SLL>BGEZ approach is
faster than LUI>AND>BEQZ.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-15 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #16 from Siarhei Volkau  ---
Might it be that LoongArch have register reuse dependency?

I observed similar behavior on XBurst with load/store/reuse pattern:

e.g. this code
LW $v0, 0($t1)# Xburst load latency is 4 but it has bypass 
SW $v0, 0($t2)# to subsequent store operation, thus no stall here
ADD $v0, $t1, $t2 # but it stalls here, because of register reuse
  # until LW op is not completed.

[Bug rtl-optimization/115505] New: missing optimization: thumb1 use ldmia/stmia for load store DI/DF data when possible

2024-06-15 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115505

Bug ID: 115505
   Summary: missing optimization: thumb1 use ldmia/stmia for load
store DI/DF data when possible
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lis8215 at gmail dot com
  Target Milestone: ---

Created attachment 58438
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58438&action=edit
possible solution patch

At the moment GCC emits two ldr/str instructions for DImode/DFmode loads and
stores. However, there is a trick: use ldmia/stmia when the address register is
not used any more (dead) or is reused.

I don't know whether arm and/or thumb2 are affected as well.
A patch with a possible solution for thumb1 is attached.

Comparing code size with the patch, for v6-m/nofp:
       libgcc:  -52 bytes / -0.10%
Newlib's libc:  -68 bytes / -0.03%
         libm:  -96 bytes / -0.10%
    libstdc++: -140 bytes / -0.02%
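
For reference, a minimal C testcase of the kind of access involved (my own
sketch, not taken from the report):

#include <stdint.h>

/* hypothetical example: a 64-bit (DImode) load where the address register
   dies after the access, so on thumb1 a single ldmia could replace the
   two ldr instructions currently emitted */
uint64_t load_u64 (const uint64_t *p)
{
    return *p;
}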

[Bug target/115921] New: Missed optimization: and->ashift might be cheaper than ashift->and on typical RISC targets

2024-07-14 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115921

Bug ID: 115921
   Summary: Missed optimization: and->ashift might be cheaper than
ashift->and on typical RISC targets
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lis8215 at gmail dot com
  Target Milestone: ---

At the moment GCC prefers the 'ashift first' flavor of the pattern.

However, that can end up emitting expensive constants for the subsequent AND
operation.
It might be cheaper to do the AND operation first, since there is a chance to
match a variant of the AND operation that accepts an immediate.
Example:

target_wide_uint_t test_ashift_and (target_wide_uint_t x)
{
  return (x & 0x3f) << 12;
}

godbolt results are the following:

[Xtensa ESP32-S3 gcc 12.2.0 (-O3)]
test_ashift_and:
    entry   sp, 32
    l32r    a8, .LC0
    slli    a2, a2, 12
    and     a2, a2, a8
    retw.n
; the constant (.LC0) is not shown in the output

[SPARC gcc 14.1.0 (-O3)]
test_ashift_and:
    sethi   %hi(258048), %g1
    sll     %o0, 12, %o0
    jmp     %o7+8
     and    %o0, %g1, %o0

[sh gcc 14.1.0 (-O3)]
_test_ashift_and:
mov r4,r0
shll2   r0
extu.b  r0,r0
shll8   r0
rts 
shll2   r0

[s390x gcc 14.1.0 (-O3)]
test_ashift_and:
    larl    %r5,.L4
    sllg    %r2,%r2,12
    ng      %r2,.L5-.L4(%r5)
    br      %r14
.L4:
.L5:
.quad   258048

[RISC-V (64-bit) gcc 14.1.0 (-O3)]
test_ashift_and:
    li      a5,258048
    slli    a0,a0,12
    and     a0,a0,a5
    ret

[mips (el) gcc 14.1.0 (-O3)]
test_ashift_and:
    li      $2,196608       # 0x30000
    sll     $4,$4,12
    ori     $2,$2,0xf000
    jr      $31
    and     $2,$4,$2

[mips64 (el) gcc 14.1.0 (-O3)]
test_ashift_and:
    li      $2,196608       # 0x30000
    dsll    $4,$4,12
    ori     $2,$2,0xf000
    jr      $31
    and     $2,$4,$2

[loongarch64 gcc 14.1.0 (-O3)]
test_ashift_and:
lu12i.w $r12,258048>>12 # 0x3f000
slli.d  $r4,$r4,12
and $r4,$r4,$r12
jr  $r1

However, shifting by 33 gives:
[mips64 (el) gcc 14.1.0 (-O3, ashift by 33)]
test_ashift_and:
    andi    $2,$4,0x3f
    jr      $31
    dsll    $2,$2,33

[SPARC64 gcc 14.1.0 (-O3, ashift by 33)]:
test_ashift_and:
and %o0, 63, %o0
jmp %o7+8
 sllx   %o0, 33, %o0


It seems that RISC-V (32-bit) handles this on trunk (14.1.0 does not):
[RISC-V (32-bit) gcc (trunk) (-O3)]
test_ashift_and:
    andi    a0,a0,63
    slli    a0,a0,12
    ret

while RV64 does not.

While this situation appears rarely in general, it occurs 85 times in the pcre2
matching routine, which is ~2% of that routine's overall code size (on mips32).

Also, it might be profitable to match any bitwise operator here, e.g. OR and XOR
in addition to AND.

[Bug target/115922] New: Missed optimization: MIPS: clear bit 15

2024-07-14 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115922

Bug ID: 115922
   Summary: Missed optimization: MIPS: clear bit 15
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lis8215 at gmail dot com
  Target Milestone: ---

Created attachment 58659
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58659&action=edit
silly patch designed for mips32 mips32r2

Simple testcase:

#define MASK (~(0x8000U))

host_width_type_t test(host_width_type_t x)
{
    return x & MASK;
}

Currently, for clearing bit 15, GCC emits something like:
test:
    li      $2,-65536       # 0xffff0000
    addiu   $2,$2,32767     # 0xffff7fff
    and     $2,$4,$2

while it's cheaper to use:
    ori     $2,$4,32768     # 0x8000
    xori    $2,$2,32768

Any mask in the range ~0x8000 .. ~0xfffe seems profitable, even for MIPS32r2+
where the INS instruction can be used to clear a group of bits.

Such a pattern appears rarely, and mostly in low-level software, e.g. the Linux
kernel. For the Linux kernel it shows ~40 matches per million insns.

It might also be profitable for RISC-V; not tested.

[Bug target/115921] Missed optimization: and->ashift might be cheaper than ashift->and on typical RISC targets

2024-07-16 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115921

--- Comment #1 from Siarhei Volkau  ---
Also take into account examples like this:

uint32_t high_const_and_compare(uint32_t x)
{
    if ((x & 0x7000) == 0x3000)
        return do_some();
    return do_other();
}

It might be profitable to use a right shift first there, to lower the constants.
Currently, even if you do the optimization manually, GCC throws it away.
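
A sketch of the manual shift-first rewrite referred to above (my own
illustration; do_some/do_other are assumed to be declared elsewhere):

uint32_t high_const_and_compare_manual(uint32_t x)
{
    /* shift the field down first so the compared constants become small */
    if (((x >> 12) & 0x7) == 0x3)
        return do_some();
    return do_other();
}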

[Bug target/70557] uint64_t zeroing on 32-bit hardware

2024-12-24 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70557

Siarhei Volkau  changed:

   What|Removed |Added

 CC||lis8215 at gmail dot com

--- Comment #9 from Siarhei Volkau  ---
Created attachment 59964
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59964&action=edit
RV patch

I am observing the same issue in GCC 14.2, but for RV32.

For example:

#include <stdint.h>

void clear64(uint64_t *ull)
{
    *ull = 0;
}

RV32 emits:
    li      a5,0
    li      a6,0
    sw      a5,0(a0)
    sw      a6,4(a0)
    ret

Hopefully the fix is trivial (a one-character change).

[Bug target/70557] uint64_t zeroing on 32-bit hardware

2024-12-24 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70557

--- Comment #10 from Siarhei Volkau  ---
Created attachment 59965
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59965&action=edit
MIPS patch

Ditto for 32-bit MIPS.

MIPS emits:
    move    $3,$0
    move    $2,$0
    sw      $3,4($4)
    jr      $31
    sw      $2,0($4)

[Bug target/70557] uint64_t zeroing on 32-bit hardware

2025-03-11 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70557

--- Comment #12 from Siarhei Volkau  ---
Hi Jeffrey,

Thanks for your interest in those patches. Unfortunately, I'm not sure that I
can, or will, go through all the steps required to make these patches ready for
review. I have no experience with the RV32 ecosystem yet, so I am unable to
perform regression testing properly. Even for MIPS I'm unaware of every existing
combination of flags to test; MIPS32r2 is just where I'm experimenting.

BR,
Siarhei

[Bug target/70557] uint64_t zeroing on 32-bit hardware

2025-03-11 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70557

--- Comment #13 from Siarhei Volkau  ---
Moreover, I think the patches only cover a limited set of cases.
E.g. if only the upper or lower part of a DImode memory location is to be set to
zero, the patch won't help.

It seems feasible to add a special code path for zero-register promotion during
the regcprop pass.
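
As an illustration of such an uncovered case (my own hypothetical example, not
taken from the bug):

#include <stdint.h>

/* hypothetical example: only the upper 32-bit half of the 64-bit object
   needs a zero store; on 32-bit targets the zero is still materialized in
   a GPR instead of reusing the hardware zero register */
void set_low32 (uint64_t *p, uint32_t x)
{
    *p = x;
}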