[Bug target/80479] [7/8 Regression] strcmp() produces valgrind errors on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80479

jreiser at bitwagon dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jreiser at bitwagon dot com

--- Comment #12 from jreiser at bitwagon dot com ---
Working well with valgrind(memcheck) might be worth more than a slight increase in speed. How much faster [measured] is the inline version compared with calling the strcmp() closed subroutine, and over what distributions of inputs?

I see the inline version uses 10 registers (3,4,7,8,9,10,26,28,30,31) and at least 332 bytes of instructions, assuming at least one instruction at .L10 [not shown] and 7 repetitions of the block at .L22 (for bytes 9 through 64 in 8-byte chunks). At first glance that seems to be expensive.

Could much the same speed be obtained by re-coding the strcmp() closed subroutine to use the technique of the inlining? Then valgrind(memcheck) could intercept and re-direct the whole subroutine easily [by name], avoiding tedious analysis.

"addi 31,31,1" at .L11+8 is dead.

The opcode 'xor.' might use less energy (no carry chain) than 'subf.'
[Bug target/80479] [7/8 Regression] strcmp() produces valgrind errors on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80479

--- Comment #14 from jreiser at bitwagon dot com ---
Here's how to retain the increased speed (and save around 300 bytes per call) while enabling valgrind happiness. Make a closed subroutine __gcc_strcmp_ppc64le whose calling sequence is:

    la r3,arg1      // address of first string
    la r4,arg2      // address of second string
    ldbrx r5,0,r3   // first 8 bytes of arg1, big endian
    ldbrx r6,0,r4   // first 8 bytes of arg2, big endian
    bl __gcc_strcmp_ppc64le

Put this subroutine in archive library libgcc_s.a only, and not in shared library libgcc_s.so. Then the linkage for 'bl' is direct, avoiding PLT (ProgramLinkageTable), and the time for 'bl' is hidden by cache latency for 'ldbrx'. The return 'blr' often is free, but may cost 1 cycle if it immediately follows a conditional branch that tests for termination.

Valgrind(memcheck) can be happy because it can intercept and re-direct the entire routine by name, thus avoiding having to analyze 'cmpb'.

The simplest implementation of __gcc_strcmp_ppc64le is just "b strcmp", because arg1 and arg2 have not been incremented before the call. Otherwise the two "addi r,r,8" probably can fit into unused superscalar ALU slots early in the subroutine, or the code can just remember that the addresses always are 8 behind.

There can be multiple named entry points, each specialized differently, such as for known alignment of operands, etc.

Notes: ldbrx and lwbrx are functional for non-aligned addresses. The UPX (de-)compressor for executables uses those opcodes, and they work correctly on 64-bit PPC970FX (PowerMac8,2) and 32-bit 7447A (PowerMac10,1), both running Debian 8 (jessie). The hardware documentation warns that the implementation may be significantly slower than a regular load. Trapping to operating system emulation (always, or only for unaligned address, or ...) is an option.
From the viewpoint of chip design, there can be a great temptation to implement only the 32-bit lwbrx, but always trap for 64-bit ldbrx. The 32-bit lwbrx has notable use cases for the network functions htonl and ntohl, while the 64-bit ldbrx has lacked such high-profile clients.
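
As an illustration of why the calling sequence in this comment loads the first 8 bytes byte-reversed: on a little-endian machine, a big-endian load puts the first byte in memory into the most significant position, so one unsigned 64-bit comparison orders the two 8-byte chunks the same way a byte-by-byte comparison would. The following is a minimal portable C sketch of that idea only. The names sketch_load_be64 and sketch_cmp8 are hypothetical, it does not handle the NUL terminator, and it assumes the caller guarantees 8 readable bytes at each address. It is not the code GCC emits.

```c
#include <stdint.h>

/* Portable stand-in for what ldbrx yields on little-endian ppc64:
   read 8 bytes at p (any alignment) so that the first byte in memory
   becomes the most significant byte of the result.
   Hypothetical helper, not part of libgcc. */
uint64_t sketch_load_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Compare the first 8 bytes of two buffers in memory order.
   Returns a negative, zero, or positive value, like memcmp.
   Assumption: both buffers have at least 8 readable bytes;
   NUL termination is deliberately ignored in this sketch. */
int sketch_cmp8(const unsigned char *a, const unsigned char *b)
{
    uint64_t x = sketch_load_be64(a);
    uint64_t y = sketch_load_be64(b);
    return (x > y) - (x < y);
}
```

A real strcmp additionally has to locate the terminating NUL inside each chunk, which is the part of the inline expansion that uses 'cmpb'.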
[Bug libgcc/66874] New: RFE: x86_64_fallback_frame_state more robust
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66874

            Bug ID: 66874
           Summary: RFE: x86_64_fallback_frame_state more robust
           Product: gcc
           Version: 5.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgcc
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jreiser at bitwagon dot com
  Target Milestone: ---

In libgcc/config/i386/linux-unwind.h function x86_64_fallback_frame_state() please check the value of pc before accessing memory in the statement:
-----
  unsigned char *pc = context->ra;
  // snip
  if (*(unsigned char *)(pc+0) == 0x48
      && *(unsigned long long *)(pc+1) == RT_SIGRETURN_SYSCALL)
-----
I have seen pc values of 0, 2, 0x, etc due to missing or incorrect debug info, particularly when the code that is being unwound was compiled with no frame pointer, or was compiled by other compilers. The result is SIGSEGV, which is a major disappointment.

I suggest a check in the spirit of:
   if ((unsigned long)pc < 4096)
      return _URC_END_OF_STACK;
or similar. Obviously this is heuristic, but it is much better than SIGSEGV.
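
The suggested guard can be tried in isolation with a small standalone sketch. Everything named sketch_* or SKETCH_* below is hypothetical and exists only for illustration: the enum stands in for the unwinder's reason codes from unwind.h, and the 4096 threshold is the heuristic from the report, which assumes the lowest page of the address space is never mapped. This is not the libgcc code itself.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the unwinder's reason codes; the real
   code would return _URC_END_OF_STACK from unwind.h instead. */
enum sketch_reason {
    SKETCH_CONTINUE,      /* pc looks plausible, go on and inspect it */
    SKETCH_END_OF_STACK   /* pc is implausible, stop unwinding        */
};

/* Heuristic threshold taken from the report: addresses in the first
   4096 bytes are assumed never to hold code. */
#define SKETCH_MIN_PLAUSIBLE_PC ((uintptr_t)4096)

/* Decide whether it is reasonable to dereference pc at all.
   This only filters out tiny values such as 0 or 2; it cannot prove
   that a larger address is actually mapped. */
enum sketch_reason sketch_check_pc(const void *pc)
{
    if ((uintptr_t)pc < SKETCH_MIN_PLAUSIBLE_PC)
        return SKETCH_END_OF_STACK;
    return SKETCH_CONTINUE;
}
```

In the fallback routine, a check of this kind would have to run before the first read through pc, so that a bogus return address ends the unwind instead of faulting.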