[Bug target/80479] [7/8 Regression] strcmp() produces valgrind errors on ppc64le

2017-04-25 Thread jreiser at bitwagon dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80479

jreiser at bitwagon dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jreiser at bitwagon dot com

--- Comment #12 from jreiser at bitwagon dot com ---
Working well with valgrind(memcheck) might be worth more than a slight increase
in speed.  How much faster [measured] is the inline version than a call to the
closed subroutine strcmp(), and over what distributions of inputs?  I
see the inline version use 10 registers (3,4,7,8,9,10,26,28,30,31) and at least
332 bytes of instructions, assuming at least one instruction at .L10 [not
shown] and 7 repetitions of the block at .L22 (for bytes 9 through 64 in 8-byte
chunks).  At first glance that seems expensive.  Could much the same
speed be obtained by re-coding the strcmp() closed subroutine to use the
technique of the inlining?  Then valgrind(memcheck) could intercept and
re-direct the whole subroutine easily [by name], avoiding tedious analysis.

"addi 31,31,1" at .L11+8 is dead.

The opcode 'xor.' might use less energy (no carry chain) than 'subf.'
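
For a pure equality test the two opcodes are interchangeable: the XOR of two
words is zero exactly when their difference is zero.  A minimal C sketch of
that equivalence follows; the function names are illustrative only and do not
come from the generated code, and the sketch says nothing about energy use.

```c
#include <stdint.h>

/* Equality test in the style of 'subf.': the difference is zero. */
static int equal_by_sub(uint64_t a, uint64_t b)
{
    return (uint64_t)(a - b) == 0;
}

/* Equality test in the style of 'xor.': no bit position differs. */
static int equal_by_xor(uint64_t a, uint64_t b)
{
    return (a ^ b) == 0;
}
```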

[Bug target/80479] [7/8 Regression] strcmp() produces valgrind errors on ppc64le

2017-04-26 Thread jreiser at bitwagon dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80479

--- Comment #14 from jreiser at bitwagon dot com ---
Here's how to retain the increased speed (and save around 300 bytes per call
site) while keeping valgrind happy.

Make a closed subroutine __gcc_strcmp_ppc64le whose calling sequence is:
  la    r3,arg1     // address of first string
  la    r4,arg2     // address of second string
  ldbrx r5,0,r3     // first 8 bytes of arg1, big endian
  ldbrx r6,0,r4     // first 8 bytes of arg2, big endian
  bl    __gcc_strcmp_ppc64le
Put this subroutine in archive library libgcc_s.a only, and not in shared
library libgcc_s.so.  Then the linkage for 'bl' is direct, avoiding the PLT
(Procedure Linkage Table), and the time for 'bl' is hidden by cache latency for
'ldbrx'.  The return 'blr' often is free, but may cost 1 cycle if it
immediately follows a conditional branch that tests for termination. 
Valgrind(memcheck) can be happy because it can intercept and re-direct the
entire routine by name, thus avoiding having to analyze 'cmpb'.
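
Why a byte-reversed load helps on a little-endian target can be seen in
portable C: once 8 bytes are assembled with the first byte most significant,
a single unsigned 64-bit comparison orders them the same way a byte-by-byte
comparison would.  The sketch below is only an illustration of that point.
The names load_be64 and cmp8 are hypothetical, and the code ignores the
NUL-terminator check that a real strcmp must also perform.

```c
#include <stdint.h>

/* Assemble 8 bytes into a 64-bit value, first byte most significant.
   This mimics the value 'ldbrx' yields on a little-endian machine. */
static uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Compare exactly 8 bytes; returns <0, 0, or >0 in the manner of memcmp. */
static int cmp8(const unsigned char *a, const unsigned char *b)
{
    uint64_t x = load_be64(a), y = load_be64(b);
    return (x > y) - (x < y);
}
```

With a plain little-endian load the last byte would be the most significant
one, so the integer comparison would not match string order.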

The simplest implementation of __gcc_strcmp_ppc64le is just "b strcmp", because
arg1 and arg2 have not been incremented before the call.  Otherwise the two
"addi r,r,8" probably can fit into unused superscalar ALU slots early in the
subroutine, or the code can just remember that the addresses always are 8
behind.

There can be multiple named entry points, each specialized differently, such as
for known alignment of operands, etc.

Notes: ldbrx and lwbrx are functional for non-aligned addresses.  The UPX
(de-)compressor for executables uses those opcodes, and they work correctly on
64-bit PPC970FX (PowerMac8,2) and 32-bit 7447A (PowerMac10,1), both running
Debian 8 (jessie).  The hardware documentation warns that the implementation
may be significantly slower than a regular load.  Trapping to operating system
emulation (always, or only for unaligned address, or ...) is an option.  From
the viewpoint of chip design, there can be a great temptation to implement only
the 32-bit lwbrx, but always trap for 64-bit ldbrx.  The 32-bit lwbrx has
notable use cases for the network functions htonl and ntohl, while the 64-bit
ldbrx has lacked such high-profile clients.

[Bug libgcc/66874] New: RFE: x86_64_fallback_frame_state more robust

2015-07-14 Thread jreiser at bitwagon dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66874

            Bug ID: 66874
           Summary: RFE: x86_64_fallback_frame_state more robust
           Product: gcc
           Version: 5.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgcc
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jreiser at bitwagon dot com
  Target Milestone: ---

In libgcc/config/i386/linux-unwind.h function x86_64_fallback_frame_state()
please check the value of pc before accessing memory in the statement:
-
  unsigned char *pc = context->ra;
  // snip
  if (*(unsigned char *)(pc+0) == 0x48
  && *(unsigned long long *)(pc+1) == RT_SIGRETURN_SYSCALL)
-
I have seen pc values of 0, 2, 0x, etc due to missing or incorrect
debug info, particularly when the code that is being unwound was compiled with
no frame pointer, or was compiled by other compilers.  The result is SIGSEGV,
which is a major disappointment.

I suggest a check in the spirit of:
if ((unsigned long)pc < 4096)
 return _URC_END_OF_STACK;
or similar.  Obviously this is heuristic, but it is much better than SIGSEGV.
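
A self-contained sketch of such a guard follows.  The helper name
pc_is_plausible and the macro MIN_PLAUSIBLE_PC are hypothetical, and the 4096
threshold is the heuristic proposed above (assuming the lowest page is never
mapped), not a value taken from libgcc.

```c
#include <stdint.h>

/* Heuristic threshold from the suggestion above: addresses in the
   first 4096 bytes are assumed never to hold code. */
#define MIN_PLAUSIBLE_PC 4096u

/* Returns nonzero if pc lies at or above the threshold.  This only
   rules out the obviously bad values; it does not prove that the
   address is mapped. */
static int pc_is_plausible(const void *pc)
{
    return (uintptr_t)pc >= MIN_PLAUSIBLE_PC;
}
```

In the unwinder, a zero result would lead to "return _URC_END_OF_STACK;"
before either dereference of pc.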