Re: Writing a dot product that vectorizes without -fassociative-math -fno-signed-zeros -fno-trapping-math

2015-06-08 Thread Richard Biener
On Thu, Jun 4, 2015 at 1:17 PM, Thomas Koenig  wrote:
> Hello world,
>
> Assume I want to calculate a dot product,
>
> s = sum(a[i]*b[i], i=1..n)
>
> The order of summation in this case should be arbitrary.
>
> Currently, the way to do this is to write out an explicit loop
> (either by the user or by the compiler, such as for a DOT_PRODUCT)
> and specify the options (for the whole translation unit) that
> allow associative math.
>
> Could there be a more fine-grained approach which can
> say 'yes, you can use associative math on this particular expression'
> to enable automatic vectorization of, for example, DOT_PRODUCT?

There isn't currently a way to do this (apart from a hack to outline
the loop to a function and stick a -fassociative-math option attribute
on it...).  A full middle-end solution would be to either have alternate
tree codes for associatable ops or flags on the expression tree
(see the undefined-overflow branch work / discussion on both alternatives -
they both have downsides).  Of course there are very many more
options that would need similar handling...
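The outlining hack mentioned above can be sketched as follows. This is an illustration, not a supported interface: how reliably GCC's optimize attribute applies math options like this has varied across versions, and the function/variable names are invented for the example.

```c
#include <stddef.h>

/* Sketch of the outlining hack: put the reduction loop in its own
   function and attach an optimize attribute so that only this loop is
   compiled with associative math.  Whether the attribute actually
   enables vectorization here depends on the GCC version.  */
__attribute__((optimize("-ffast-math")))
static double dot_product (const double *a, const double *b, size_t n)
{
  double s = 0.0;
  for (size_t i = 0; i < n; i++)
    s += a[i] * b[i];   /* reassociation (and thus vectorization) allowed */
  return s;
}
```

Note that this only works at function granularity, which is exactly the limitation the thread is about.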

An alternative would be to make the option changing more localized
by say, a new wrapping GENERIC tree (like we have PAREN_EXPR
for the reverse effect).  So

  REGION_EXPR (reassoc,  GENERIC code ...)

and then lower the 'reassoc' (or whatever bits we invent later) during
gimplification to flags on the GIMPLE stmt (avoiding the flags on the
GENERIC expression trees) or perform the outlining to a function
there.

Another alternative is to allow this kind of flag-changing only on
loops and use ANNOTATE_EXPR to say (this loop has ops
that are associatable across iterations) and thus only have an
effect on loop optimizations (in this case vectorization).

All of this is quite some work (see the unfinished no-undefined-overflow
work).

Richard.

>
> Regards
>
> Thomas


Re: [patch] fix _OBJC_Module defined but not used warning

2015-06-08 Thread Iain Sandoe
Hi Aldy,

On 7 Jun 2015, at 12:37, Aldy Hernandez wrote:

> On 06/07/2015 06:19 AM, Andreas Schwab wrote:
>> Another fallout:
>> 
>> FAIL: obj-c++.dg/try-catch-5.mm -fgnu-runtime (test for excess errors)
>> Excess errors:
>> : warning: '_OBJC_Module' defined but not used [-Wunused-variable]
> 
> check_global_declarations is called for more symbols now.  All the defined 
> but not used errors I've seen in development have been legitimate.  For 
> tests, the tests should be fixed.  For built-ins such as these, does the 
> attached fix the problem?
> 
> It is up to the objc maintainers, we can either fix this with the attached 
> patch,

The current patch is OK.

> or setting DECL_IN_SYSTEM_HEADER.

This seems a better long-term idea; however, I would prefer to go through all 
the cases where it would be applicable (including for the NeXT runtime) and 
apply that change as a coherent patch.  At the moment dealing with the NeXT 
stuff is a bit hampered by pr66448.

thanks,
Iain



Proposal for merging scalar-storage-order branch into mainline

2015-06-08 Thread Eric Botcazou
Hi,

I'd like to propose merging the scalar-storage-order branch that I have been 
maintaining for a couple of years into mainline.  Original announcement at:
  https://gcc.gnu.org/ml/gcc/2013-05/msg00249.html

It implements an attribute (C/C++/Ada only) that makes it possible to specify 
the storage order (aka endianness) of scalar components of aggregate types;
for example, you can declare a structure with big-endian SSO containing only 
scalar fields and it will have the same representation in memory on x86 and on 
PowerPC or SPARC.  Nesting of structures with different SSO is also supported.
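
For reference, usage of the attribute looks roughly like this (spelling as implemented on the branch; the struct layout here is an invented example). The attribute applies to the whole aggregate type, and every scalar field is then stored in the requested order:

```c
#include <stdint.h>
#include <string.h>

/* Every scalar field of this struct is stored big-endian regardless of
   the target's native byte order, so its in-memory representation is
   the same on x86 and on PowerPC or SPARC.  */
struct be_header {
  uint32_t magic;
  uint16_t version;
  uint16_t flags;
} __attribute__((scalar_storage_order("big-endian")));
```

Reading a field back through the struct still yields the logical value; only the bytes in memory are reversed relative to a little-endian host.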

The feature has been present in the GCC-based compilers released by AdaCore 
for a few more years and the users generally find it very useful (some of them 
even asked why we hadn't implemented it earlier).

As the initial plan was to maintain it in AdaCore's tree until it reached a 
sufficient level of maturity, the implementation was designed to be relatively 
light and maintainable, with the following basic principle: specifying the 
same SSO as that of the target machine is equivalent to specifying no SSO.
This principle holds for the entire implementation, which means that only the 
reverse SSO is tracked, which in turn means that the target machine must be 
uniform wrt endianness (e.g. PDP endianness is not supported).

Only GENERIC is extended (one flag on aggregate types and one flag on some 
_REF nodes) by using the following guidelines:

   The overall strategy is to preserve the invariant that every scalar in
   memory is associated with a single storage order, i.e. all accesses to
   this scalar are done with the same storage order.  This invariant makes
   it possible to factor out the storage order in most transformations, as
   only the address and/or the value (in target order) matter for them.
   But, of course, the storage order must be preserved when the accesses
   themselves are rewritten or transformed.

GIMPLE proper and RTL are not changed.  The byte swapping operations are made 
explicit during RTL expansion and use the bswap patterns of the target machine 
if present.
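
What the expander does for a reverse-order access amounts to a native load plus an explicit byte swap. A hand-written illustration (not actual expander output) for a 32-bit big-endian scalar:

```c
#include <stdint.h>
#include <string.h>

/* Illustration of the lowering: a reverse-SSO load is a plain
   native-order load followed by a byte swap, for which the expander
   can use the target's bswap pattern (__builtin_bswap32 stands in for
   that pattern here).  The swap is only needed when the native order
   differs from the field's storage order.  */
static uint32_t load_be32 (const void *p)
{
  uint32_t raw;
  memcpy (&raw, p, sizeof raw);       /* plain load */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  raw = __builtin_bswap32 (raw);      /* explicit byte swap */
#endif
  return raw;
}
```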

The bulk of the implementation is in the FEs (sanity checks, propagation, etc) 
and the RTL expander (+ varasm.c for aggregate literals).  RTL optimizers are 
not changed.  GIMPLE optimizers are minimally changed: they can either punt if 
they see _REF nodes with reverse SSO or choose to locally maintain the SSO.

Again this was designed with maintainability and simplicity in mind, so no  
attempt was made at generating optimal code.  At the same time the support is 
transparent for most GIMPLE optimizers so there is no definitive blocker 
towards this goal if it is deemed worth pursuing for this kind of feature.

-- 
Eric Botcazou


ARM's Changing Call Used Registers Causes Weird Bugs

2015-06-08 Thread lin zuojian
Hi,
in arm.c 
static void
arm_conditional_register_usage (void)
...
  if (TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP)
{
  /* VFPv3 registers are disabled when earlier VFP
 versions are selected due to the definition of
 LAST_VFP_REGNUM.  */
  for (regno = FIRST_VFP_REGNUM;
   regno <= LAST_VFP_REGNUM; ++ regno)
{
  fixed_regs[regno] = 0;
  call_used_regs[regno] = regno < FIRST_VFP_REGNUM + 16
|| regno >= FIRST_VFP_REGNUM + 32;
}
}

These lines change the call-used registers when compiler flags
like -mfpu=neon are used.
That causes weird bugs. Consider the situation on Android ARM:
I have a shared object that is supposed to run on a NEON CPU,
so -mfpu=neon is added. But the system is not compiled with this
flag. So when calling the system's libraries, my code risks using
the clobbered d8-d15.
The example will be:
while (true) {
struct my_struct s = {0}; // my_struct is 8 bytes long.
call_system_library...
}
In this example, d8 is used to initialize s to zero. The assembly
code looks like:
push {d8} // because d8 is not call used.
// the loop header
vmov.i32 d8, #0
// the loop body
vstr d8, &s
bl system_library
b loop_body

d8 is clobbered after the branch-and-link into the system library, so
the second loop iteration initializes s to a random value, which causes
a crash.

So I am forced to remove -mfpu=neon for compatibility. My
question is whether the gcc code shown above conforms to the ARM
standard. If so, why does ARM define such a weird standard?
--
Lin Zuojian


Re: Proposal for merging scalar-storage-order branch into mainline

2015-06-08 Thread Andrew Pinski
On Mon, Jun 8, 2015 at 4:05 PM, Eric Botcazou  wrote:
> Hi,
>
> I'd like to propose merging the scalar-storage-order branch that I have been
> maintaining for a couple of years into mainline.  Original announcement at:
>   https://gcc.gnu.org/ml/gcc/2013-05/msg00249.html
>
> It implements an attribute (C/C++/Ada only) that makes it possible to specify
> the storage order (aka endianness) of scalar components of aggregate types;
> for example, you can declare a structure with big-endian SSO containing only
> scalar fields and it will have the same representation in memory on x86 and on
> PowerPC or SPARC.  Nesting of structures with different SSO is also supported.
>
> The feature has been present in the GCC-based compilers released by AdaCore
> for a few more years and the users generally find it very useful (some of them
> even asked why we hadn't implemented it earlier).
>
> As the initial plan was to maintain it in AdaCore's tree until it reached a
> sufficient level of maturity, the implementation was designed to be relatively
> light and maintainable, with the following basic principle: specifying the
> same SSO as that of the target machine is equivalent to specifying no SSO.
> This principle holds for the entire implementation, which means that only the
> reverse SSO is tracked, which in turn means that the target machine must be
> uniform wrt endianness (e.g. PDP endianness is not supported).
>
> Only GENERIC is extended (one flag on aggregate types and one flag on some
> _REF nodes) by using the following guidelines:
>
>The overall strategy is to preserve the invariant that every scalar in
>memory is associated with a single storage order, i.e. all accesses to
>this scalar are done with the same storage order.  This invariant makes
>it possible to factor out the storage order in most transformations, as
>only the address and/or the value (in target order) matter for them.
>But, of course, the storage order must be preserved when the accesses
>themselves are rewritten or transformed.
>
> GIMPLE proper and RTL are not changed.  The byte swapping operations are made
> explicit during RTL expansion and use the bswap patterns of the target machine
> if present.

The only problem I see with this implementation is that the RTL level
optimizers are not always up to removing the byteswaps.
GCSE is very weak on the RTL level compared to PRE on the gimple level.

Thanks,
Andrew Pinski

>
> The bulk of the implementation is in the FEs (sanity checks, propagation, etc)
> and the RTL expander (+ varasm.c for aggregate literals).  RTL optimizers are
> not changed.  GIMPLE optimizers are minimally changed: they can either punt if
> they see _REF nodes with reverse SSO or choose to locally maintain the SSO.
>
> Again this was designed with maintainability and simplicity in mind, so no
> attempt was made at generating optimal code.  At the same time the support is
> transparent for most GIMPLE optimizers so there is no definitive blocker
> towards this goal if it is deemed worth pursuing for this kind of feature.
>
> --
> Eric Botcazou


Re: Proposal for merging scalar-storage-order branch into mainline

2015-06-08 Thread Andrew Pinski
On Mon, Jun 8, 2015 at 4:19 PM, Andrew Pinski  wrote:
> On Mon, Jun 8, 2015 at 4:05 PM, Eric Botcazou  wrote:
>> Hi,
>>
>> I'd like to propose merging the scalar-storage-order branch that I have been
>> maintaining for a couple of years into mainline.  Original announcement at:
>>   https://gcc.gnu.org/ml/gcc/2013-05/msg00249.html
>>
>> It implements an attribute (C/C++/Ada only) that makes it possible to specify
>> the storage order (aka endianness) of scalar components of aggregate types;
>> for example, you can declare a structure with big-endian SSO containing only
>> scalar fields and it will have the same representation in memory on x86 and on
>> PowerPC or SPARC.  Nesting of structures with different SSO is also supported.
>>
>> The feature has been present in the GCC-based compilers released by AdaCore
>> for a few more years and the users generally find it very useful (some of them
>> even asked why we hadn't implemented it earlier).
>>
>> As the initial plan was to maintain it in AdaCore's tree until it reached a
>> sufficient level of maturity, the implementation was designed to be relatively
>> light and maintainable, with the following basic principle: specifying the
>> same SSO as that of the target machine is equivalent to specifying no SSO.
>> This principle holds for the entire implementation, which means that only the
>> reverse SSO is tracked, which in turn means that the target machine must be
>> uniform wrt endianness (e.g. PDP endianness is not supported).
>>
>> Only GENERIC is extended (one flag on aggregate types and one flag on some
>> _REF nodes) by using the following guidelines:
>>
>>The overall strategy is to preserve the invariant that every scalar in
>>memory is associated with a single storage order, i.e. all accesses to
>>this scalar are done with the same storage order.  This invariant makes
>>it possible to factor out the storage order in most transformations, as
>>only the address and/or the value (in target order) matter for them.
>>But, of course, the storage order must be preserved when the accesses
>>themselves are rewritten or transformed.
>>
>> GIMPLE proper and RTL are not changed.  The byte swapping operations are made
>> explicit during RTL expansion and use the bswap patterns of the target machine
>> if present.
>
> The only problem I see with this implementation is that the RTL level
> optimizers are not always up to removing the byteswaps.
> GCSE is very weak on the RTL level compared to PRE on the gimple level.


Oh, and I see a case where we would want to remove byte swaps at the IPA
level: when we can see that the variable's value does not escape.

Thanks,
Andrew

>
> Thanks,
> Andrew Pinski
>
>>
>> The bulk of the implementation is in the FEs (sanity checks, propagation, etc)
>> and the RTL expander (+ varasm.c for aggregate literals).  RTL optimizers are
>> not changed.  GIMPLE optimizers are minimally changed: they can either punt if
>> they see _REF nodes with reverse SSO or choose to locally maintain the SSO.
>>
>> Again this was designed with maintainability and simplicity in mind, so no
>> attempt was made at generating optimal code.  At the same time the support is
>> transparent for most GIMPLE optimizers so there is no definitive blocker
>> towards this goal if it is deemed worth pursuing for this kind of feature.
>>
>> --
>> Eric Botcazou


Re: ARM's Changing Call Used Registers Causes Weird Bugs

2015-06-08 Thread Ramana Radhakrishnan
On Mon, Jun 8, 2015 at 9:07 AM, lin zuojian  wrote:
> Hi,
> in arm.c
> static void
> arm_conditional_register_usage (void)
> ...
>   if (TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP)
> {
>   /* VFPv3 registers are disabled when earlier VFP
>  versions are selected due to the definition of
>  LAST_VFP_REGNUM.  */
>   for (regno = FIRST_VFP_REGNUM;
>regno <= LAST_VFP_REGNUM; ++ regno)
> {
>   fixed_regs[regno] = 0;
>   call_used_regs[regno] = regno < FIRST_VFP_REGNUM + 16
> || regno >= FIRST_VFP_REGNUM + 32;
> }
> }
>
> these lines will change the called used registers, when using
> compiler flags like: -mfpu=neon.
> That causes weird bugs. Consider the situation in Android ARM
> architecture: I have a shared object supposed to run in a neon cpu,
> and -mfpu=neon added. But the system is not compiled using this
> flag. So when calling the system's library,  my code will risk using
> the clobbered d8-d16

No, you are misunderstanding this - because of the packed nature of
the various S and D registers in the VFP register file, we need to use
register numbering with respect to the "S" register file. Thus
FIRST_VFP_REGNUM + 16 is the correct boundary check: S0-S15 are
call-clobbered (mapping to D0-D7, Q0-Q3), and S16-S31 are marked
callee-saved, i.e. not call_used.

> The example will be:
> while (true) {
> struct my_struct s = {0}; // my_struct is 8 bytes long.
> call_system_library...
> }
> in this example. d8 is used to initialize s to zero. The assembly
> code like:
> push {d8} // because d8 is not call used.
> // the loop header
> vmov.i32 d8, #0
> // the loop body
> vstr d8, &s
> bl system_library
> b loop_body
>
> d8 is clobbered after branch link to system library, so the second
> loop will initialize s to random value, which causes crash.
>
> So I am forced to remove the -mfpu=neon for compatibility. My
> question is whether the gcc code show above confront to ARM
> standard. If so, why ARM make such a weird standard.

No, you do not need to remove the option for any compatibility. The
ABI has been carefully designed precisely to allow this sort of usage,
mixing code built with -mfpu=neon and -mfpu=vfpv3-d16. The failure you
describe indicates that something else is broken in your system
library, or that your system library function is not obeying the ABI
and is clobbering d8.

If you have an actual reproducible issue, please report it on bugzilla
following the rules for reporting bugs by producing a standalone
testcase, as documented at https://gcc.gnu.org/bugs/

Thanks,
Ramana


> --
> Lin Zuojian


Re: Proposal for merging scalar-storage-order branch into mainline

2015-06-08 Thread Richard Biener
On Mon, Jun 8, 2015 at 10:05 AM, Eric Botcazou  wrote:
> Hi,
>
> I'd like to propose merging the scalar-storage-order branch that I have been
> maintaining for a couple of years into mainline.  Original announcement at:
>   https://gcc.gnu.org/ml/gcc/2013-05/msg00249.html
>
> It implements an attribute (C/C++/Ada only) that makes it possible to specify
> the storage order (aka endianness) of scalar components of aggregate types;
> for example, you can declare a structure with big-endian SSO containing only
> scalar fields and it will have the same representation in memory on x86 and on
> PowerPC or SPARC.  Nesting of structures with different SSO is also supported.
>
> The feature has been present in the GCC-based compilers released by AdaCore
> for a few more years and the users generally find it very useful (some of them
> even asked why we hadn't implemented it earlier).
>
> As the initial plan was to maintain it in AdaCore's tree until it reached a
> sufficient level of maturity, the implementation was designed to be relatively
> light and maintainable, with the following basic principle: specifying the
> same SSO as that of the target machine is equivalent to specifying no SSO.
> This principle holds for the entire implementation, which means that only the
> reverse SSO is tracked, which in turn means that the target machine must be
> uniform wrt endianness (e.g. PDP endianness is not supported).
>
> Only GENERIC is extended (one flag on aggregate types and one flag on some
> _REF nodes) by using the following guidelines:
>
>The overall strategy is to preserve the invariant that every scalar in
>memory is associated with a single storage order, i.e. all accesses to
>this scalar are done with the same storage order.  This invariant makes
>it possible to factor out the storage order in most transformations, as
>only the address and/or the value (in target order) matter for them.
>But, of course, the storage order must be preserved when the accesses
>themselves are rewritten or transformed.
>
> GIMPLE proper and RTL are not changed.  The byte swapping operations are made
> explicit during RTL expansion and use the bswap patterns of the target machine
> if present.

What's the reason to not expose the byte swapping operations earlier, like on
GIMPLE?  (or even on GENERIC?)

> The bulk of the implementation is in the FEs (sanity checks, propagation, etc)

What frontends are affected?

Thanks,
Richard.

> and the RTL expander (+ varasm.c for aggregate literals).  RTL optimizers are
> not changed.  GIMPLE optimizers are minimally changed: they can either punt if
> they see _REF nodes with reverse SSO or choose to locally maintain the SSO.
>
> Again this was designed with maintainability and simplicity in mind, so no
> attempt was made at generating optimal code.  At the same time the support is
> transparent for most GIMPLE optimizers so there is no definitive blocker
> towards this goal if it is deemed worth pursuing for this kind of feature.
>
> --
> Eric Botcazou


Re: devbranches: ambigous characterisation of branches

2015-06-08 Thread Jonathan Wakely
On 7 June 2015 at 21:28, Wolfgang Hospital wrote:
> in the repository contents description at
> , numerous branch names are
> listed as inactive, with some further comments. Right at the start there is
> the longest list of such names, followed by "These branches have been merged
> into the mainline.". Without "preceding" or "following",
> or at least leading dash or a trailing colon, I'm at a loss whether that
> refers to the branches named before or after.

Before.

It's an HTML definition list, of the form:

term
term
term
    definition
term
    definition
term
    definition

It seems fairly clear to me from the indentation, and from the fact that
every set of branch names is followed by an indented description.


Re: [patch] fix _OBJC_Module defined but not used warning

2015-06-08 Thread Aldy Hernandez

On 06/08/2015 04:03 AM, Iain Sandoe wrote:
> Hi Aldy,
>
> On 7 Jun 2015, at 12:37, Aldy Hernandez wrote:
>
>> On 06/07/2015 06:19 AM, Andreas Schwab wrote:
>>> Another fallout:
>>>
>>> FAIL: obj-c++.dg/try-catch-5.mm -fgnu-runtime (test for excess errors)
>>> Excess errors:
>>> : warning: '_OBJC_Module' defined but not used [-Wunused-variable]
>>
>> check_global_declarations is called for more symbols now.  All the defined
>> but not used errors I've seen in development have been legitimate.  For
>> tests, the tests should be fixed.  For built-ins such as these, does the
>> attached fix the problem?
>>
>> It is up to the objc maintainers, we can either fix this with the attached
>> patch,
>
> The current patch is OK.

Committed.

>> or setting DECL_IN_SYSTEM_HEADER.
>
> This seems a better long-term idea; however, I would prefer to go through all
> the cases where it would be applicable (including for the NeXT runtime) and
> apply that change as a coherent patch.  At the moment dealing with the NeXT
> stuff is a bit hampered by pr66448.

On my list next.

Aldy



Re: Proposal for merging scalar-storage-order branch into mainline

2015-06-08 Thread Mark Wielaard
On Mon, 2015-06-08 at 10:05 +0200, Eric Botcazou wrote:
> It implements an attribute (C/C++/Ada only) that makes it possible to specify 
> the storage order (aka endianness) of scalar components of aggregate types;
> for example, you can declare a structure with big-endian SSO containing only 
> scalar fields and it will have the same representation in memory on x86 and on
> PowerPC or SPARC.  Nesting of structures with different SSO is also supported.

How is this represented in DWARF?

I am sorry, I normally use the git mirror and this branch doesn't seem
to be there, and I don't know how to get the svn branch. So I don't know
whether this question is easily answered just by looking at the
dwarf2out.c changes. If so, my apologies, and please just point me
at the patch or commit.

Thanks,

Mark


Re: Proposal for merging scalar-storage-order branch into mainline

2015-06-08 Thread Andreas Schwab
Mark Wielaard  writes:

> I am sorry, I normally use the git mirror and this branch doesn't seem
> to be there and I don't know how to get the svn branch.

http://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/remotes/scalar-storage-order

Andreas.

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


Re: Static Chain Register on iOS AArch64

2015-06-08 Thread Richard Henderson

On 06/06/2015 06:24 AM, Richard Earnshaw wrote:
> That's going to make it impossible to implement Go closures on AArch32,
> then, since the only call-clobbered register not used for parameter
> passing is r12 (ip) and that can be clobbered by function calls.

No, because r12 is only clobbered by plt stubs, and go closures never use
them.  They're *already working* on aarch32.

r~



Re: Static Chain Register on iOS AArch64

2015-06-08 Thread Richard Earnshaw
On 08/06/15 17:58, Richard Henderson wrote:
> On 06/06/2015 06:24 AM, Richard Earnshaw wrote:
>> That's going to make it impossible to implement Go closures on AArch32,
>> then, since the only call-clobbered register not used for parameter
>> passing is r12 (ip) and that can be clobbered by function calls.
> 
> No, because r12 is only clobbered by plt stubs, and go closures never
> use them.  They're *already working* on aarch32.
> 
> 
> r~
> 

r12 can *also* be clobbered by interworking calls or calls that span
more than the branch range of a call instruction.  Rare, but possible.

R.


Re: Static Chain Register on iOS AArch64

2015-06-08 Thread Richard Henderson

On 06/08/2015 10:00 AM, Richard Earnshaw wrote:
> r12 can *also* be clobbered by interworking calls or calls that span
> more than the branch range of a call instruction.  Rare, but possible.

I can only presume from this that nested functions are not reliable now for
very large programs, unless you somehow force the nested function to be
within N bytes of the parent function: a direct call to a static function,
but with a static chain to its parent frame.

That said, most go closures are already called indirectly.  It's rare, but
possible, for optimization to see through the construction of a closure and
produce a direct call.

It ought to be possible to modify the aarch32 backend to force a call to be
indirect, and thus not be subject to branch islands, whenever the
SYMBOL_REF_DECL has DECL_STATIC_CHAIN set?

r~


Re: Builtin/headers: Constant arguments and adding extra entry points.

2015-06-08 Thread Richard Henderson

On 06/04/2015 12:35 PM, Ondřej Bílka wrote:
> char *strchr_c(char *x, unsigned long u);
> #define strchr(x,c) \
> (__builtin_constant_p(c) ? strchr_c (x, c * (~0ULL / 255)) : strchr (x,c))

Certainly not a universal win, especially for 64-bit RISC.  This constant can
be just as expensive to construct as the original multiplication.

Consider PPC64, where 4 insns are required to form this kind of replicated
64-bit constant, and 3 insns are required to replicate C.

Then there's other RISC for which replicating C is easily done in parallel
with the initial alignment checks.

r~
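
For concreteness: ~0ULL / 255 is 0x0101010101010101, and multiplying it by a byte value broadcasts that byte into every byte of the word. Richard's point is that materializing either the magic constant or the replicated value already costs several instructions on some RISC targets. The broadcast itself:

```c
#include <stdint.h>

/* Broadcast a byte into all eight bytes of a 64-bit word:
   ~0 / 255 == 0x0101010101010101, so c * that constant replicates c
   into every byte lane.  */
static uint64_t repeat_byte (unsigned char c)
{
  return (uint64_t) c * (~(uint64_t) 0 / 255);
}
```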


Re: Builtin/headers: Constant arguments and adding extra entry points.

2015-06-08 Thread Ondřej Bílka
On Mon, Jun 08, 2015 at 01:55:47PM -0700, Richard Henderson wrote:
> On 06/04/2015 12:35 PM, Ondřej Bílka wrote:
> >char *strchr_c(char *x, unsigned long u);
> >#define strchr(x,c) \
> >(__builtin_constant_p(c) ? strchr_c (x, c * (~0ULL / 255)) : strchr (x,c))
> >
> 
> Certainly not a universal win, especially for 64-bit RISC.  This
> constant can be just as expensive to construct as the original
> multiplication.
> 
> Consider PPC64, where 4 insns are required to form this kind of
> replicated 64-bit constant, and 3 insns are required to replicate C.
> 
> Then there's other RISC for which replicating C is easily done in
> parallel with the initial alignment checks.
> 
That's another problem: these transformations depend on the platform, so
you need to maintain a table somewhere of what is profitable and what is not.

As these functions go, it's better than you suggest. Users frequently call
strchr in a loop, so there is potential for savings: around 75% of strchr
calls happen within 128 cycles of the previous one, which is evidence of
that use case.

A second saving would be in the header checks. A good way looks to be an
initial check that s % 4096 < 4096 - 32 to avoid a page fault. There could
also be a separate entry point for when gcc can prove that there are 32
more bytes allocated after s.

I have a todo project to add an interface which transforms
while (s = strchr (s + 1, 'c')) into something like

struct strchr_state *strchrp = strchr_init (s, 'c');
while (s = strchr_next (strchrp))

to avoid the overhead of repeated calls; an inline strchr_next would first
check a mask against, say, the current 16 bytes and only make a libcall if
the character isn't there.
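
A minimal sketch of what such an iterator interface could look like. All names here are hypothetical, and this version simply wraps strchr rather than implementing the inline mask-check fast path described above:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical strchr_init/strchr_next interface: capture the search
   state once, then resume after the previous match on each call.  A
   real implementation would inline a check of the next ~16 bytes
   before falling back to a library call.  */
struct strchr_state {
  const char *pos;
  int c;
};

static struct strchr_state *strchr_init (const char *s, int c)
{
  struct strchr_state *st = malloc (sizeof *st);
  st->pos = s;
  st->c = c;
  return st;
}

static const char *strchr_next (struct strchr_state *st)
{
  const char *hit = strchr (st->pos, st->c);
  if (hit)
    st->pos = hit + 1;   /* next call resumes after this match */
  return hit;
}
```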