[Bug rtl-optimization/111376] New: missed optimization of one bit test on MIPS32r1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

Bug ID: 111376
Summary: missed optimization of one bit test on MIPS32r1
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: lis8215 at gmail dot com
Target Milestone: ---

Created attachment 55879
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55879&action=edit
Silly patch to enable SLL+BLTZ/BGEZ

Currently, for testing bits above the 14th, the following instructions are emitted:

    LUI     $t1, 0x1000     # 0x10000000
    AND     $t0, $t1, $t0
    BEQ/BNE $t0, $Lxx

However, there is a shorter and faster alternative: just shift the bit of interest into the sign bit and branch with BLTZ/BGEZ. The code above can be replaced with:

    SLL       $t0, $t0, 3
    BGEZ/BLTZ $t0, $Lxx

Not sure if it can be applied to MIPS64 without the EXT/INS instructions, or to older MIPS revisions (I..V). But for MIPS32 it helps reduce code size by removing roughly 1 insn per ~700, evaluated on the Linux kernel and Python 3.11.
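For reference, a minimal test case of my own that exhibits the pattern (bit 28 is chosen here to match the LUI constant above; do_work() is just a placeholder to keep the branch alive):

    #include <stdint.h>

    extern void do_work (void);   /* placeholder */

    /* A single-bit test whose mask does not fit an ANDI immediate. */
    void test_bit28 (uint32_t x)
    {
        if (x & 0x10000000u)
            do_work ();
    }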
[Bug rtl-optimization/111378] New: Missed optimization for comparing with exact_log2 constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111378

Bug ID: 111378
Summary: Missed optimization for comparing with exact_log2 constants
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: lis8215 at gmail dot com
Target Milestone: ---

The simple example below produces suboptimal code on many targets where an exact_log2 constant can't be represented as an immediate operand (confirmed on MIPS/PPC64/SPARC/RISC-V):

    extern void do_something(char* p);
    extern void do_something_other(char* p);

    void test(char* p, uint32_t ch)
    {
        if (ch < 0x10000) {
            do_something(p);
        } else { /* ch >= 0x10000 */
            do_something_other(p);
        }
    }

However, instead of comparing directly with the constant, we can use a shift and a compare against zero: e.g. (ch < 0x10000) can be transformed into ((ch >> 16) == 0), which is usually shorter and faster on many targets. The condition appears rarely in real-world code AFAIK - 20-30 occurrences per million asm instructions. Fun fact: many of them are related to Unicode transformations.
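For illustration, the shift-and-compare form suggested above, written out by hand (same behaviour as test(), reusing the declarations from the example; it assumes the constant really is 0x10000 as reconstructed above):

    void test_rewritten(char* p, uint32_t ch)
    {
        if ((ch >> 16) == 0) {      /* equivalent to ch < 0x10000 */
            do_something(p);
        } else {
            do_something_other(p);
        }
    }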
[Bug middle-end/111384] New: missed optimization: GCC adds extra any extend when storing subreg#0 multiple times
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111384

Bug ID: 111384
Summary: missed optimization: GCC adds extra any extend when storing subreg#0 multiple times
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: lis8215 at gmail dot com
Target Milestone: ---

A simple example:

    void store_hi_twice(uint32_t src, uint16_t *dst1, uint16_t *dst2)
    {
        *dst1 = src;
        *dst2 = src;
    }

shows that GCC cannot eliminate the unnecessary zero extension of src's low half when it is stored two or more times. Many targets are affected, although x86-64 is not.
[Bug rtl-optimization/111384] missed optimization: GCC adds extra any extend when storing subreg#0 multiple times
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111384

--- Comment #2 from Siarhei Volkau ---
Well, here is what godbolt says with -O2 -fomit-frame-pointer.

ARM:
    uxth    r0, r0          @ << zero extend
    strh    r0, [r1]
    strh    r0, [r2]
    bx      lr

ARM64:
    and     w0, w0, 65535   @ << zero extend
    strh    w0, [x1]
    strh    w0, [x2]
    ret

MIPS64:
    andi    $4,$4,0xffff    @ << zero extend
    sh      $4,0($5)
    jr      $31
    sh      $4,0($6)

MRISC32:
    shuf    r1, r1, #2888   @ << zero extend
    sth     r1, [r2]
    sth     r1, [r3]
    ret

RISC-V:
    slli    a0,a0,16        @ << zero extend
    srli    a0,a0,16        @ << zero extend
    sh      a0,0(a1)
    sh      a0,0(a2)
    ret

RISC-V (64-bit):
    slli    a0,a0,48        @ << zero extend
    srli    a0,a0,48        @ << zero extend
    sh      a0,0(a1)
    sh      a0,0(a2)
    ret

Xtensa ESP32:
    entry   sp, 32
    extui   a2, a2, 0, 16   @ << zero extend
    s16i    a2, a3, 0
    s16i    a2, a4, 0
    retw.n

Loongarch64:
    bstrpick.w $r4,$r4,15,0 @ << zero extend
    st.h    $r4,$r5,0
    st.h    $r4,$r6,0
    jr      $r1

MIPS:
    andi    $4,$4,0xffff    @ << zero extend
    sh      $4,0($5)
    jr      $31
    sh      $4,0($6)

SH:
    extu.w  r4,r4           @ << zero extend
    mov.w   r4,@r5
    rts
    mov.w   r4,@r6

Others available at godbolt (x86-64/Power/Power64/s390) are unaffected.
[Bug middle-end/111626] New: missed optimization combining offset of array member in struct with offset inside the array
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111626

Bug ID: 111626
Summary: missed optimization combining offset of array member in struct with offset inside the array
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: lis8215 at gmail dot com
Target Milestone: ---

The simple code:

    struct some_struct {
        uint32_t some_member;
        uint32_t arr[4][16];
    };

    uint32_t fn(const struct some_struct *arr, int idx)
    {
        return arr->arr[1][idx];
    }

showcases suboptimal code generation on some platforms, including RISC-V and MIPS (32- and 64-bit), even with `some_member` commented out. GCC emits:

    addi    a1,a1,16
    slli    a1,a1,2
    add     a0,a0,a1
    lw      a0,4(a0)
    ret

while Clang does a better job:

    slli    a1, a1, 2
    add     a0, a0, a1
    lw      a0, 68(a0)
    ret
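For clarity, the address arithmetic both compilers are computing (my own breakdown): &arr->arr[1][idx] = arr + 4 (some_member) + 1*16*4 (row 1) + 4*idx = arr + 68 + 4*idx. Clang folds the whole constant 68 into the load displacement, while GCC keeps the +16 attached to the index and folds only the remaining 4 into the load.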
[Bug target/111376] missed optimization of one bit test on MIPS32r1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #3 from Siarhei Volkau ---
I know that the patch breaks condmove cases; that's why it is silly.
[Bug target/111376] missed optimization of one bit test on MIPS32r1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #6 from Siarhei Volkau ---
Well, it works mostly well. However, it still has issues, addressed in my patch:

1) Doesn't work for -Os: highly likely a costing issue.
2) Breaks condmoves, as mine does. I have no idea how to avoid that, though.
3) Overlaps the preferable ANDI+BEQ/BNE cases (as that sequence doesn't break condmoves).

I think it will be okay once 1 and 3 are fixed.

PS: tested by applying the patch on GCC 11; will try with upstream this weekend.
[Bug target/111376] missed optimization of one bit test on MIPS32r1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #8 from Siarhei Volkau ---
Created attachment 58377
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58377&action=edit
condmove testcase

Tested with the current GCC master branch:
- Works with -Os: confirmed.
- Condmove issue is present in GCC 11 but not in current master.

Even for GCC 11 it is a very rare case, although I found one that is relatively simple to reproduce: it is an excerpt from Python 3.8.x, reduced as much as I could.

Compilation flags tested: {-O2|-Os} -mips32 -DNDEBUG -mbranch-cost={1|10}

So, in my opinion, the patch you propose is perfectly fine. The condmove issue seems not relevant anymore.
[Bug middle-end/111835] New: Suboptimal codegen: zero extended load instead of sign extended one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111835

Bug ID: 111835
Summary: Suboptimal codegen: zero extended load instead of sign extended one
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: lis8215 at gmail dot com
Target Milestone: ---

In this simplified example:

    int test (const uint8_t * src, uint8_t * dst)
    {
        int8_t tmp = (int8_t)*src;
        *dst = tmp;
        return tmp;
    }

GCC prefers to use a load with zero extension instead of the more rational sign-extended load. It then needs to do an explicit sign extension to produce the return value. I know there are a lot of bugs related to zero/sign extension, but I guess this is a rare special case, and it reproduces in any GCC version available at godbolt and on any architecture except x86-64.
[Bug rtl-optimization/104387] aarch64: Redundant SXTH for “bag of bits” moves
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104387

--- Comment #4 from Siarhei Volkau ---
*** Bug 111384 has been marked as a duplicate of this bug. ***
[Bug rtl-optimization/111384] missed optimization: GCC adds extra any extend when storing subreg#0 multiple times
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111384

Siarhei Volkau changed:

           What       |Removed    |Added
   ------------------------------------------
   Resolution         |---        |DUPLICATE
   Status             |NEW        |RESOLVED

--- Comment #5 from Siarhei Volkau ---
Dup of bug 104387.

*** This bug has been marked as a duplicate of bug 104387 ***
[Bug rtl-optimization/111835] Suboptimal codegen: zero extended load instead of sign extended one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111835

--- Comment #3 from Siarhei Volkau ---
I don't think that it is a duplicate of bug 104387, because there's only one store here. And this bug simply disappears if we change the source code a bit, e.g.:
- change (int8_t)*src; to *(int8_t*)src; or
- change the argument uint8_t * dst to int8_t * dst.

But if we have multiple stores, the extension remains under any condition.
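For reference, the two variants described above written out in full (my own expansion of the original test() from the bug report; const is kept on the cast to stay warning-free):

    #include <stdint.h>

    int test_a (const uint8_t * src, uint8_t * dst)
    {
        int8_t tmp = *(const int8_t *)src;   /* reinterpret the pointer */
        *dst = tmp;
        return tmp;
    }

    int test_b (const uint8_t * src, int8_t * dst)   /* int8_t* destination */
    {
        int8_t tmp = (int8_t)*src;
        *dst = tmp;
        return tmp;
    }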
[Bug middle-end/112398] New: Suboptimal code generation for xor pattern on subword data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112398

Bug ID: 112398
Summary: Suboptimal code generation for xor pattern on subword data
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: lis8215 at gmail dot com
Target Milestone: ---

These minimal examples showcase the issue:

    uint8_t neg8 (const uint8_t *src)
    {
        return ~*src;   // or return *src ^ 0xff;
    }

    uint16_t neg16 (const uint16_t *src)
    {
        return ~*src;   // or return *src ^ 0xffff;
    }

GCC transforms the xor here into not + zero_extend, which isn't the best choice. I guess the combiner has to try the xor pattern instead of not + zero_extend, as it might be cheaper.
[Bug target/112398] Suboptimal code generation for xor pattern on subword data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112398

--- Comment #3 from Siarhei Volkau ---
Well, let's rewrite it this way:

    void neg8 (uint8_t *restrict dst, const uint8_t *restrict src)
    {
        uint8_t work = ~*src; // or *src ^ 0xff;
        dst[0] = (work >> 4) | (work << 4);
    }

Wherever the upper bits have to be zero it is cheaper to use xor; otherwise we're relying on techniques for eliminating the redundant zero_extend, and at least on MIPS (prior to R2) and RISC-V GCC emits the zero_extend instruction.

MIPS, neg8:

    neg8:
        lbu     $2,0($5)
        nop
        nor     $2,$0,$2
        andi    $3,$2,0x00ff
        srl     $3,$3,4
        sll     $2,$2,4
        or      $2,$2,$3
        jr      $31
        sb      $2,0($4)

RISC-V, neg8:

        lbu     a5,0(a1)
        not     a5,a5
        andi    a4,a5,0xff
        srli    a4,a4,4
        slli    a5,a5,4
        or      a4,a4,a5
        sb      a4,0(a0)
        ret

Some other RISCs also emit a zero_extend, but I'm not sure they have a cheaper xor alternative (S390, SH, Xtensa).
[Bug rtl-optimization/112474] New: MIPS: missed optimization for assigning HI reg to zero
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112474

Bug ID: 112474
Summary: MIPS: missed optimization for assigning HI reg to zero
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: lis8215 at gmail dot com
Target Milestone: ---

Created attachment 56550
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56550&action=edit
the patch

At the moment GCC emits a move of $0 into a GPR and then a move to HI (mthi) from that register as part of a DI/TI MADD operation. It is feasible to avoid this copy when the intermediate register isn't used anymore, rewording the RTL to emit only `mthi $0`. So the following output:

    ...
    move $3, $0
    mthi $3         ; reg dead $3
    ...

can simply be reworded as:

    ...
    move $3, $0     ; << will be removed by DCE later
    mthi $0
    ...

A silly patch doing this optimization is provided. Thanks.
[Bug rtl-optimization/112474] MIPS: missed optimization for assigning HI reg to zero
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112474

--- Comment #1 from Siarhei Volkau ---
Minimal example to showcase the issue:

    #include <stdint.h>

    uint64_t mthi_example(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
    {
        uint64_t ret;
        ret = (uint64_t)a * b + (uint64_t)c * d + 1u;
        return ret;
    }

compile command: mipsel-*-gcc -O2 -mips32
[Bug rtl-optimization/60749] combine is overly cautious when operating on volatile memory references
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60749

Siarhei Volkau changed:

           What       |Removed    |Added
   ------------------------------------------
   CC                 |           |lis8215 at gmail dot com

--- Comment #2 from Siarhei Volkau ---
Created attachment 55167
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55167&action=edit
allow combine ld/st of volatile mem with any_extend op

Is anyone bothered by this? As an embedded engineer, I am sadly looking at this long-standing issue. I can propose a quick patch which enables combining volatile memory loads/stores with an any_extend operation for most cases. And it seems that the platform-specific test results remain the same with it (arm/aarch64/mips were tested). Posting it in the hope it can help anyone who needs it.
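For context, a minimal sketch of the kind of access involved (my own illustration, not taken from the attachment): the zero extension of the volatile byte load below typically stays a separate instruction, because combine declines to merge the extension into the volatile memory access.

    #include <stdint.h>

    uint32_t read_status (volatile uint8_t *reg)
    {
        return *reg;   /* volatile QImode load + separate zero_extend */
    }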
[Bug target/111376] missed optimization of one bit test on MIPS32r1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #12 from Siarhei Volkau ---
Highly likely it's because of a data dependency, and not the direct cost of shift operations on LoongArch, although I can't find information to prove that. So, I guess it still might give a performance benefit in cases where the scheduler can put some instruction(s) between the SLL and the BGEZ.

Since you have access to the hardware, you can measure the performance of two variants:
1) SLL+BGEZ
2) SLL+NOT+BGEZ

If their performance is equal then I'm correct, and the scheduling automaton for GS464 seems to need fixing.

From my side I can confirm that SLL+BGEZ is faster than LUI+AND+BEQ on Ingenic XBurst 1 cores.
[Bug target/111376] missed optimization of one bit test on MIPS32r1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #15 from Siarhei Volkau ---
Created attachment 58437
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58437&action=edit
application to test performance of shift

Here is the test application (MIPS32 specific) I wrote. It allows detecting execution cycles and extra pipeline stalls for SLL, if they take place.

For XBurst 1 (jz4725b) the results are the following:

`SLL to use latency test`             execution median: 168417 ns, min: 168416 ns
`SLL to use latency test with nop`    execution median: 196250 ns, min: 196166 ns
`SLL to branch latency test`          execution median: 196250 ns, min: 196166 ns
`SLL to branch latency test with nop` execution median: 224000 ns, min: 224000 ns
`SLL by 7 to use latency test`        execution median: 168417 ns, min: 168416 ns
`SLL by 15 to use latency test`       execution median: 168417 ns, min: 168416 ns
`SLL by 23 to use latency test`       execution median: 168417 ns, min: 168416 ns
`SLL by 31 to use latency test`       execution median: 168417 ns, min: 168416 ns
`LUI>AND>BEQZ reference test`         execution median: 196250 ns, min: 196166 ns
`SLL>BGEZ reference test`             execution median: 168417 ns, min: 168416 ns

And what this means:

`SLL to use latency test` at 168417 ns versus `.. with nop` at 196250 ns means there are no extra stall cycles between SLL and a further use by an ALU operation.

The `SLL to branch latency test` and `.. with nop` results mean there are no extra stall cycles between SLL and a further use by branch operations.

The `SLL by N` results mean that SLL execution time doesn't depend on the shift amount.

And finally, the reference test results show that the SLL>BGEZ approach is faster than LUI>AND>BEQZ.
[Bug target/111376] missed optimization of one bit test on MIPS32r1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #16 from Siarhei Volkau ---
Might it be that LoongArch has a register reuse dependency? I observed similar behavior on XBurst with a load/store/reuse pattern, e.g. this code:

    LW  $v0, 0($t1)     # XBurst load latency is 4, but it has a bypass
    SW  $v0, 0($t2)     # to the subsequent store operation, thus no stall here
    ADD $v0, $t1, $t2   # but it stalls here, because of register reuse,
                        # until the LW op has completed.
[Bug rtl-optimization/115505] New: missing optimization: thumb1 use ldmia/stmia for load store DI/DF data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115505

Bug ID: 115505
Summary: missing optimization: thumb1 use ldmia/stmia for load store DI/DF data when possible
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: lis8215 at gmail dot com
Target Milestone: ---

Created attachment 58438
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58438&action=edit
possible solution patch

At the moment GCC emits two ldr/str instructions for DImode/DFmode loads and stores. However, there's a trick to use ldmia/stmia when the address register is not used anymore (dead) or is reused. I don't know whether arm and/or thumb2 are affected as well. A patch with a possible solution for thumb1 is provided.

Comparing code size with the patch, for v6-m/nofp:

    libgcc:         -52 bytes / -0.10%
    Newlib's libc:  -68 bytes / -0.03%
    libm:           -96 bytes / -0.10%
    libstdc++:     -140 bytes / -0.02%
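For reference, a minimal case of my own where the trick should apply (the incoming pointer is dead after the access, so the base register may be clobbered by the multiple-register form):

    #include <stdint.h>

    uint64_t load64 (const uint64_t *p)      /* p is dead after the load  */
    {
        return *p;                           /* candidate for ldmia       */
    }

    void store64 (uint64_t *p, uint64_t v)   /* p is dead after the store */
    {
        *p = v;                              /* candidate for stmia       */
    }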
[Bug target/115921] New: Missed optimization: and->ashift might be cheaper than ashift->and on typical RISC targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115921

Bug ID: 115921
Summary: Missed optimization: and->ashift might be cheaper than ashift->and on typical RISC targets
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: lis8215 at gmail dot com
Target Milestone: ---

At the moment GCC prefers the 'ashift first' flavor of the pattern. However, it may end up emitting expensive constants for the subsequent AND operation. It might be cheaper to do the AND operation first, since there's a chance to match a variant of the AND operation which accepts an immediate.

Example:

    target_wide_uint_t test_ashift_and (target_wide_uint_t x)
    {
        return (x & 0x3f) << 12;
    }

godbolt results are the following:

[Xtensa ESP32-S3 gcc 12.2.0 (-O3)]
test_ashift_and:
        entry   sp, 32
        l32r    a8, .LC0
        slli    a2, a2, 12
        and     a2, a2, a8
        retw.n
; missed constant in output

[SPARC gcc 14.1.0 (-O3)]
test_ashift_and:
        sethi   %hi(258048), %g1
        sll     %o0, 12, %o0
        jmp     %o7+8
        and     %o0, %g1, %o0

[sh gcc 14.1.0 (-O3)]
_test_ashift_and:
        mov     r4,r0
        shll2   r0
        extu.b  r0,r0
        shll8   r0
        rts
        shll2   r0

[s390x gcc 14.1.0 (-O3)]
test_ashift_and:
        larl    %r5,.L4
        sllg    %r2,%r2,12
        ng      %r2,.L5-.L4(%r5)
        br      %r14
.L4:
.L5:
        .quad   258048

[RISC-V (64-bit) gcc 14.1.0 (-O3)]
test_ashift_and:
        li      a5,258048
        slli    a0,a0,12
        and     a0,a0,a5
        ret

[mips (el) gcc 14.1.0 (-O3)]
test_ashift_and:
        li      $2,196608       # 0x30000
        sll     $4,$4,12
        ori     $2,$2,0xf000
        jr      $31
        and     $2,$4,$2

[mips64 (el) gcc 14.1.0 (-O3)]
test_ashift_and:
        li      $2,196608       # 0x30000
        dsll    $4,$4,12
        ori     $2,$2,0xf000
        jr      $31
        and     $2,$4,$2

[loongarch64 gcc 14.1.0 (-O3)]
test_ashift_and:
        lu12i.w $r12,258048>>12 # 0x3f000
        slli.d  $r4,$r4,12
        and     $r4,$r4,$r12
        jr      $r1

However, shifting by 33 gives:

[mips64 (el) gcc 14.1.0 (-O3, ashift by 33)]
test_ashift_and:
        andi    $2,$4,0x3f
        jr      $31
        dsll    $2,$2,33

[SPARC64 gcc 14.1.0 (-O3, ashift by 33)]
test_ashift_and:
        and     %o0, 63, %o0
        jmp     %o7+8
        sllx    %o0, 33, %o0

It seems like RISC-V (32-bit) is aware of this on trunk (14.1.0 is not):

[RISC-V (32-bit) gcc (trunk) (-O3)]
test_ashift_and:
        andi    a0,a0,63
        slli    a0,a0,12
        ret

while RV64 is not so good.

While this situation appears rarely in general, it appears 85 times in the pcre2 matching routine, which is ~2% of that routine's overall code size (on mips32). Also, it might be profitable to match any bitwise operator here, e.g. OR and XOR in addition to AND.
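For reference, the identity involved (my own note): (x & 0x3f) << 12 == (x << 12) & (0x3f << 12) == (x << 12) & 0x3f000, and 0x3f000 == 258048 is exactly the constant that shows up in the listings above. It has to be materialized with an extra instruction or a constant-pool load on most of these targets, while 0x3f fits a small immediate everywhere.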
[Bug target/115922] New: Missed optimization: MIPS: clear bit 15
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115922

Bug ID: 115922
Summary: Missed optimization: MIPS: clear bit 15
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: lis8215 at gmail dot com
Target Milestone: ---

Created attachment 58659
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58659&action=edit
silly patch designed for mips32 mips32r2

Simple testcase:

    #define MASK (~(0x8000U))

    host_width_type_t test(host_width_type_t x)
    {
        return x & MASK;
    }

Now, for clearing bit 15, GCC emits something like:

    test:
        li      $2,-65536       # 0xffff0000
        addiu   $2,$2,32767     # 0xffff7fff
        and     $2,$4,$2

while it's cheaper to use:

        ori     $2,$4,32768     # 0x8000
        xori    $2,$2,32768

Any mask in the range ~0x8000 .. ~0xfffe seems profitable, even for MIPS32r2+ where the INS instruction can be used to clear a group of bits. Such a pattern appears rarely and mostly in low-level software, e.g. the Linux kernel; for the Linux kernel it shows ~40 matches per million insns. Might also be profitable for RISC-V, not tested.
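For clarity, the identity the ori/xori sequence relies on (my own phrasing): setting the bit first makes the following xor clear it unconditionally, so x & ~(1u << 15) == (x | (1u << 15)) ^ (1u << 15), and both constants fit a 16-bit immediate, unlike the mask ~0x8000 itself.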
[Bug target/115921] Missed optimization: and->ashift might be cheaper than ashift->and on typical RISC targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115921

--- Comment #1 from Siarhei Volkau ---
Also take into account examples like this:

    uint32_t high_const_and_compare(uint32_t x)
    {
        if ((x & 0x7000) == 0x3000)
            return do_some();
        return do_other();
    }

It might be profitable to use a right shift first there to lower the constants. Currently, even if you do the optimization manually, GCC throws it away.
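For illustration, the manual rewrite referred to above, as I read it (do_some()/do_other() are assumed to be declared elsewhere, as in the original example):

    uint32_t high_const_and_compare_manual(uint32_t x)
    {
        if (((x >> 12) & 0x7) == 0x3)   /* same test, smaller constants */
            return do_some();
        return do_other();
    }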
[Bug target/70557] uint64_t zeroing on 32-bit hardware
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70557

Siarhei Volkau changed:

           What       |Removed    |Added
   ------------------------------------------
   CC                 |           |lis8215 at gmail dot com

--- Comment #9 from Siarhei Volkau ---
Created attachment 59964
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59964&action=edit
RV patch

Observing the same issue in GCC 14.2, but for RV32. For example:

    void clear64(uint64_t *ull)
    {
        *ull = 0;
    }

RV32 emits:

    li      a5,0
    li      a6,0
    sw      a5,0(a0)
    sw      a6,4(a0)
    ret

Hopefully, the fix is trivial (one symbol).
[Bug target/70557] uint64_t zeroing on 32-bit hardware
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70557

--- Comment #10 from Siarhei Volkau ---
Created attachment 59965
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59965&action=edit
MIPS patch

Ditto for 32-bit MIPS. MIPS emits:

    move    $3,$0
    move    $2,$0
    sw      $3,4($4)
    jr      $31
    sw      $2,0($4)
[Bug target/70557] uint64_t zeroing on 32-bit hardware
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70557

--- Comment #12 from Siarhei Volkau ---
Hi Jeffrey,

Thanks for your interest in those patches. Unfortunately, I'm not sure that I can and will go through all the required steps to make these patches ready for review. I have no experience with the RV32 ecosystem yet, and thus am unable to perform regression testing properly. Even for MIPS I'm unaware of every existing combination of flags to test; MIPS32r2 is just what I'm experimenting with.

BR,
Siarhei
[Bug target/70557] uint64_t zeroing on 32-bit hardware
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70557

--- Comment #13 from Siarhei Volkau ---
Moreover, I think the patches only handle a limited set of the possible cases. E.g. if only the upper or lower part of a DImode memory location is to be set to zero, the patch won't help. It seems feasible to make a special code path for zero-register promotion during the regcprop pass.
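For illustration, a hypothetical example of such a case (my own, not from the comment): only half of the stored value is zero, so a patch that matches whole-DImode zero stores would not trigger, yet the zero half could still be stored straight from the zero register.

    #include <stdint.h>

    void store_one_shifted(uint64_t *ull)
    {
        *ull = 0x100000000ULL;   /* low word is zero, high word is 1 */
    }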