Re: Problem cropping up in Value Range Propagation

2020-08-24 Thread Richard Biener via Gcc
On Tue, Aug 11, 2020 at 6:15 AM Gary Oblock via Gcc  wrote:
>
> I'm trying to debug a problem cropping up in value range propagation.
> Ironically, I probably own an original 1995 copy of the paper it's
> based on, but that's not going to be much help since I'm lost in the
> weeds.  It's running on GIMPLE statements generated by my structure
> reorg optimization.
>
> Here's the GIMPLE dump:
>
> Function max_of_y (max_of_y, funcdef_no=1, decl_uid=4391, cgraph_uid=2, symbol_order=20) (executed once)
>
> max_of_y (unsigned long data, size_t len)
> {
>   double value;
>   double result;
>   size_t i;
>
>[local count: 118111600]:
>   field_arry_addr_14 = _reorg_base_var_type_t.y;
>   index_15 = (sizetype) data_27(D);
>   offset_16 = index_15 * 8;
>   field_addr_17 = field_arry_addr_14 + offset_16;
>   field_val_temp_13 = MEM  [(void *)field_addr_17];
>   result_8 = field_val_temp_13;
>   goto ; [100.00%]
>
>[local count: 955630225]:
>   _1 = i_3 * 16;
>   PPI_rhs1_cast_18 = (unsigned long) data_27(D);
>   PPI_rhs2_cast_19 = (unsigned long) _1;
>   PtrPlusInt_Adj_20 = PPI_rhs2_cast_19 / 16;
>   PtrPlusInt_21 = PPI_rhs1_cast_18 + PtrPlusInt_Adj_20;
>   dedangled_27 = (unsigned long) PtrPlusInt_21;
>   field_arry_addr_23 = _reorg_base_var_type_t.y;
>   index_24 = (sizetype) dedangled_27;
>   offset_25 = index_24 * 8;
>   field_addr_26 = field_arry_addr_23 + offset_25;
>   field_val_temp_22 = MEM  [(void *)field_addr_26];
>   value_11 = field_val_temp_22;
>   if (result_5 < value_11)
> goto ; [50.00%]
>   else
> goto ; [50.00%]
>
>[local count: 477815112]:
>
>[local count: 955630225]:
>   # result_4 = PHI 
>   i_12 = i_3 + 1;
>
>[local count: 1073741824]:
>   # i_3 = PHI <1(2), i_12(5)>
>   # result_5 = PHI 
>   if (i_3 < len_9(D))
> goto ; [89.00%]
>   else
> goto ; [11.00%]
>
>[local count: 118111600]:
>   # result_10 = PHI 
>   return result_10;
> }
>
> The failure in VRP is occurring on
>
> offset_16 = data_27(D) * 8;
>
> which comes from the two adjacent statements above
>
>   index_15 = (sizetype) data_27(D);
>   offset_16 = index_15 * 8;
>
> being merged together.
>
> Note, the types of index_15 and offset_16 are sizetype and data_27 is
> unsigned long.
> The error message is:
>
> internal compiler error: tree check: expected class ‘type’, have 
> ‘exceptional’ (error_mark) in to_wide,

This means the SSA name looked at has been released and should no longer
be referenced from the IL.
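
The usual safe order when a transformation retires a defining statement is
to rewrite all uses first and only then remove the statement and release
its names.  A minimal sketch (not your pass's actual code; it assumes the
replacement value new_val is already computed and that the usual pass
includes - gimple-iterator.h, tree-cfg.h, tree-ssanames.h - are in scope):

/* Retire the statement defining OLD_NAME after redirecting its uses
   to NEW_VAL.  Releasing OLD_NAME while uses remain in the IL is
   exactly what later makes VRP trip over an error_mark_node.  */
static void
retire_def (gimple *stmt, tree old_name, tree new_val)
{
  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);

  replace_uses_by (old_name, new_val);  /* first: no dangling uses */
  gsi_remove (&gsi, true);              /* then: drop the statement */
  release_defs (stmt);                  /* finally: release the SSA names */
}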

> Things only start to look broken in value_range::lower_bound in
> value-range.cc when
>
> return wi::to_wide (t);
>
> is passed error_mark_node in t. It's getting it from m_min just above.
> My observation is that m_min is not always error_mark_node; in fact, I'm
> inclined to think you need to use set_varying to even get this to happen.
>
> Note, the ssa_propagation_engine processed the statement "offset_16 =
> data..."  multiple times before failing on it. What oh what is
> happening and how in the heck did I cause it???
>
> Please, somebody throw me a life preserver on this.
>
> Thanks,
>
> Gary
>
>


Re: GCC Plugins and global_options

2020-08-24 Thread Richard Biener via Gcc
On Thu, Aug 13, 2020 at 10:39 AM Jakub Jelinek via Gcc  wrote:
>
> Hi!
>
> Any time somebody adds or removes an option in some *.opt file (which e.g.
> on the 10 branch after branching off 11 happened 5 times already), many
> offsets in the global_options variable change.  It is true we don't guarantee
> ABI stability for plugins, but we change the most often used data structures
> on the release branches only very rarely, so the option changes are the
> most problematic ones for the ABI stability of plugins.
>
> Annobin uses a way to remap accesses to some of the global_options.x_* by
> looking them up in the cl_options array where we have
> offsetof (struct gcc_options, x_flag_lto)
> etc. remembered, but sadly doesn't do it for all options (e.g. some flag_*
> etc. option accesses may be hidden in various macros like POINTER_SIZE),
> and more importantly some struct gcc_options offsets are not covered at all.
> E.g. there is no offsetof (struct gcc_options, x_optimize),
> offsetof (struct gcc_options, x_flag_sanitize) etc.  Those are usually:
> Variable
> int optimize
> in the *.opt files.
>
> So, couldn't our opt*.awk scripts generate another array that would either
> cover just the offsets not covered in struct cl_options that a plugin could
> use to remap struct global_options offsets at runtime, which would include
> e.g. the offsetof value and the name of the variable and perhaps sizeof for
> verification purposes?
> Or couldn't we in plugin/include/ install a modified version of options.h
> that instead of all the:
> #define flag_opts_finished global_options.x_flag_opts_finished
> will do:
> #define flag_opts_finished gcc_lookup_option (flag_opts_finished)
> where lookup_option would be a macro that does something like:
> __attribute__((__pure__))
> void *gcc_lookup_option_2 (unsigned short, const char *, unsigned short);
> template <typename T>
> T &gcc_lookup_option_1 (unsigned short offset, const char *name)
> {
>   T *ptr = static_cast <T *> (gcc_lookup_option_2 (offset, name, sizeof (T)));
>   return *ptr;
> }
> #define lookup_option(var) \
>   gcc_lookup_option_1  \
> (offsetof (struct gcc_options, x_##var), #var, \
>  sizeof (global_options.x_##var))
> where the gcc_lookup_option_2 function would lookup the variable in an
> opt*.awk generated table, containing entries like:
>   "ix86_stack_protector_guard_offset", NULL, NULL, NULL, NULL, NULL, NULL, 
> NULL,
>   "ix86_stack_protector_guard_reg", "", NULL, NULL,
>   "recip_mask", NULL, NULL, NULL,
> ...
> As struct gcc_options is around 5KB now, that table would need 5K entries,
> NULL and "" stand for no variable starts here, and "" additionally says that
> padding starts here.
> So, if no options have changed since the plugin has been built, it would be
> very cheap, it would just verify that at the given offset the table contains
> the corresponding string (i.e. non-NULL and strcmp == 0 and that the size
> matches (that size - 1 following entries are NULL and then there is
> non-NULL)). If not, it would keep looking around (one loop that looks in
> both directions in the table, so first it would check offsets -1 and +1 from
> the original, then -2 and +2, etc.).
> And would gcc_unreachable () if it can't find it in the table, or can find
> it, but the size has changed.
>
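
A minimal sketch of what that lookup could look like (the option_names
table and num_option_entries below are hypothetical - nothing generates
them today):

/* Hypothetical opt*.awk-generated table: one entry per byte of
   struct gcc_options; NULL = no variable starts here, "" = padding.  */
extern const char *const option_names[];
extern const size_t num_option_entries;  /* == sizeof (struct gcc_options) */

static bool
entry_matches (size_t off, const char *name, size_t size)
{
  if (!option_names[off] || strcmp (option_names[off], name) != 0)
    return false;
  /* Size check: none of the next SIZE - 1 slots may start a variable.  */
  for (size_t i = 1; i < size && off + i < num_option_entries; ++i)
    if (option_names[off + i])
      return false;
  return true;
}

void *
gcc_lookup_option_2 (unsigned short offset, const char *name,
                     unsigned short size)
{
  /* Search outwards from the offset the plugin was compiled with.  */
  for (size_t d = 0; d < num_option_entries; ++d)
    {
      if (offset >= d && entry_matches (offset - d, name, size))
        return (char *) &global_options + (offset - d);
      if (d != 0 && offset + d < num_option_entries
          && entry_matches (offset + d, name, size))
        return (char *) &global_options + (offset + d);
    }
  gcc_unreachable ();  /* the option is gone or its size changed */
}
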
> If that is unacceptable, at least having a table with variables not covered
> in struct cl_options offsets would allow the plugin to do it itself
> (basically by constructing a remapping table: original offsetof (struct
> gcc_options, XXX) remaps to offset ABC, with some value like (unsigned short)
> -1 to signal it is gone and that there should be an assertion failure).
>
> Thoughts on this?

I'd say we ignore this since we do not provide any ABI stability guarantees.
Instead we maybe want to "export" the genchecksum result and embed
it into plugin objects and refuse to load plugins that were not built against
the very same ABI [unless --force is given]?
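
On the plugin side, such a check could look like the sketch below - the
exported gcc_abi_checksum symbol and the baked-in built_against_checksum
array are hypothetical, nothing like them exists today:

#include "gcc-plugin.h"
#include "plugin-version.h"

int plugin_is_GPL_compatible;

/* Hypothetical checksum exported by the compiler (the genchecksum result).  */
extern const unsigned char gcc_abi_checksum[16];
/* Copied from the compiler's checksum when the plugin is built.  */
static const unsigned char built_against_checksum[16] = { /* ... */ };

int
plugin_init (struct plugin_name_args *info,
             struct plugin_gcc_version *version)
{
  if (!plugin_default_version_check (version, &gcc_version))
    return 1;                   /* existing coarse version check */

  if (memcmp (gcc_abi_checksum, built_against_checksum, 16) != 0)
    {
      fprintf (stderr, "plugin %s was built against a different GCC ABI\n",
               info->base_name);
      return 1;                 /* refuse to load (unless some --force flag) */
    }
  return 0;
}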

Richard.

> Jakub
>


Re: RFC: -fno-share-inlines

2020-08-24 Thread Allan Sandfeld Jensen
On Monday, 24 August 2020 08:52:04 CEST Richard Biener wrote:
> On Mon, Aug 10, 2020 at 9:36 AM Allan Sandfeld Jensen wrote:
> > Following the previous discussion, this is a proposal for a patch that
> > adds the flag -fno-share-inlines that can be used when compiling
> > singular source files with a different set of flags than the rest of
> > the project.
> > 
> > It basically turns off comdat for inline functions, as if you compiled
> > without support for 'weak' symbols. Turning them all into "static"
> > functions, even if that wouldn't normally be possible for that type of
> > function. Not sure if it breaks anything, which is why I am not sending
> > it to the patch list.
> > 
> > I also considered alternatively to turn the comdat generation off later
> > during assembler production to ensure all processing and optimization of
> > comdat functions would occur as normal.
> 
> We already have -fvisibility-inlines-hidden so maybe call it
> -fvisibility-inlines-static?
That could make sense.

> Does this option also imply 'static' vtables?
> 
I don't think so. It affects function declarations. It probably also affects
virtual functions, but as far as I know the place I changed only affects the
functions, not the virtual table. So there could still be a duplicated virtual
table with the wrong methods; whether it gets used globally then depends on the
linking. I am not sure I dare change that: having different virtual tables
for the same class depending on where it is allocated sounds like a way to
break some things.

I wouldn't use virtual tables anyway when dealing with mixed architectures.

Best regards
Allan




Re: Question about IPA-PTA and build_alias

2020-08-24 Thread Richard Biener via Gcc
On Mon, Aug 17, 2020 at 3:22 PM Erick Ochoa wrote:
>
> Hello,
>
> I'm looking to understand better the points-to analysis (IPA-PTA) and
> the alias analysis (build_alias).
>
> How is the information produced by IPA-PTA consumed?
>
> Are alias sets in build_alias computed by the intersections of the
> points_to_set(s) (computed by IPA-PTA)?
>
> My intuition tells me that it could be relatively simple to move
> build_alias to be a SIMPLE_IPA_PASS performed just after IPA-PTA, but I
> do not have enough experience in GCC to tell if this is correct. What
> could be some difficulties which I am not seeing? (Either move, or
> create a new IPA-ALIAS SIMPLE_IPA_PASS.) This pass would have the same
> sensitivity as IPA-PTA { flow-insensitive, context-insensitive,
> field-sensitive } because the alias sets could be computed by the
> intersection of points-to-sets.

Both IPA-PTA and build_alias do the same thing: they build PTA constraint
sets, solve them and attach points-to info to SSA names.  Just IPA-PTA
does this for the whole TU while build_alias does it one function at a time.

So I guess I do not understand your question.

Richard.

>
> Thanks!


Re: Peephole optimisation: isWhitespace()

2020-08-24 Thread Richard Biener via Gcc
On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak  wrote:
>
> "Allan Sandfeld Jensen"  wrote:
>
> > On Friday, 14 August 2020 18:43:12 CEST Stefan Kanthak wrote:
> >> Hi @ll,
> >>
> >> in his ACM queue article ,
> >> Matt Godbolt used the function
> >>
> >> | bool isWhitespace(char c)
> >> | {
> >> |
> >> | return c == ' '
> >> |
> >> |   || c == '\r'
> >> |   || c == '\n'
> >> |   || c == '\t';
> >> |
> >> | }
> >>
> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
> >> processors (see ):
> >> |    xor    eax, eax          ; result = false
> >> |    cmp    dil, 32           ; is c > 32
> >> |    ja     .L4               ; if so, exit with false
> >> |    movabs rax, 4294977024   ; rax = 0x100002600
> >> |    shrx   rax, rax, rdi     ; rax >>= c
> >> |    and    eax, 1            ; result = rax & 1
> >> |
> >> |.L4:
> >> |    ret
> >>
> > No it doesn't. As your example shows, if you took the time to read it, it is
> > what gcc emits when generating code to run on a _haswell_ architecture.
>
> Matt's article does NOT specify the architecture for THIS example.
> He specified it for another example he named "(q)":
>
> | When targeting the Haswell microarchitecture, GCC 8.2 compiles this code
> | to the assembly in (q) (https://godbolt.org/z/acm19_bits):
>
> What about CAREFUL reading?
>
> > If you remove -march=haswell from the command line you get:
> >
> >    xor eax, eax
> >    cmp dil, 32
> >    ja  .L1
> >    movabs  rax, 4294977024
> >    mov ecx, edi
> >    shr rax, cl
> >    and eax, 1
> >
> > It uses one mov more, but no shrx.
>
> The SHRX is NOT the point here; it's the avoidable conditional branch that
> matters!

Whether or not the conditional branch sequence is faster depends on whether
the branch is well-predicted which very much depends on the data you
feed the isWhitespace function with but I guess since this is the
c == ' ' test it _will_ be a well-predicted branch which means the
conditional branch sequence will be usually faster.  The proposed
change turns the control into a data dependence which constrains
instruction scheduling and retirement.  Indeed a mispredicted branch
will likely be more costly.

x86 CPUs do not perform data speculation.

Richard.

>  mov ecx, edi
>  movabs  rax, 4294977024
>  shr rax, cl
>  xor edi, edi
>  cmp ecx, 33
>  setb    dil
>  and eax, edi
>
> Stefan
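
For reference, a C-level rendering of that branch-free sequence (a sketch;
the compiler may of course emit something else for it):

#include <stdbool.h>

/* Branch-free variant: combine the range check and the bit test with '&'
   instead of a conditional branch.  The 'c & 63' merely keeps the shift
   well-defined for bytes above 63; it does not change the result for any
   value the range check accepts.  */
bool isWhitespace_branchfree (unsigned char c)
{
  unsigned long long mask = 1ULL << ' ' | 1ULL << '\r'
                          | 1ULL << '\n' | 1ULL << '\t';
  return (c <= 32) & ((mask >> (c & 63)) & 1);
}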


Re: Question about IPA-PTA and build_alias

2020-08-24 Thread Erick Ochoa




On 24/08/2020 09:40, Richard Biener wrote:

On Mon, Aug 17, 2020 at 3:22 PM Erick Ochoa wrote:


Hello,

I'm looking to understand better the points-to analysis (IPA-PTA) and
the alias analysis (build_alias).

How is the information produced by IPA-PTA consumed?

Are alias sets in build_alias computed by the intersections of the
points_to_set(s) (computed by IPA-PTA)?

My intuition tells me that it could be relatively simple to move
build_alias to be a SIMPLE_IPA_PASS performed just after IPA-PTA, but I
do not have enough experience in GCC to tell if this is correct. What
could be some difficulties which I am not seeing? (Either move, or
create a new IPA-ALIAS SIMPLE_IPA_PASS.) This pass would have the same
sensitivity as IPA-PTA { flow-insensitive, context-insensitive,
field-sensitive } because the alias sets could be computed by the
intersection of points-to-sets.


Both IPA-PTA and build_alias do the same thing: they build PTA constraint
sets, solve them and attach points-to info to SSA names.  Just IPA-PTA
does this for the whole TU while build_alias does it one function at a time.

So I guess I do not understand your question.


Hi Richard,

I'm just trying to imagine what a data-layout optimization would look 
like if instead of using the type-escape analysis we used the points-to 
analysis to find out which variables/memory locations escape and what 
that would mean for the transformation itself.


One of the things that I think would be needed is alias sets. I thought
that build_alias was building alias sets, but I was mistaken. However,
computing the alias sets should not be too difficult.


Also, continuing to imagine what a data-layout optimization would look
like in GCC: since IPA-PTA is a SIMPLE_IPA_PASS, and if alias sets are
indeed needed, I was asking how a SIMPLE_IPA_PASS that computes alias
sets just after IPA-PTA would be received (as opposed to a full IPA pass).






Richard.



Thanks!


Re: Peephole optimisation: isWhitespace()

2020-08-24 Thread Alexander Monakov via Gcc
On Mon, 24 Aug 2020, Richard Biener via Gcc wrote:

> Whether or not the conditional branch sequence is faster depends on whether
> the branch is well-predicted which very much depends on the data you
> feed the isWhitespace function with but I guess since this is the
> c == ' ' test it _will_ be a well-predicted branch which means the
> conditional branch sequence will be usually faster.  The proposed
> change turns the control into a data dependence which constrains
> instruction scheduling and retirement.  Indeed a mispredicted branch
> will likely be more costly.

There's also the question of how the caller uses the return value. In all
likelihood, the caller branches on it, so making isWhitespace branchless
just moves the misprediction cost to the caller.

On x86, we should be aiming to produce the BT instruction. GIMPLE reassoc
nicely transforms multiple branches into a bit test, but unfortunately it
uses right shift, while RTL matches for a left shift, but not right.
With hand-written code it's easy to make GCC produce BT as desired:

void is_ws_cb(unsigned char c, void f(void))
{
unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
if (c <= 32 && (mask & (1ll << c))) f();
}
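
For contrast, the right-shift form - roughly the shape reassoc produces at
the GIMPLE level, and the one RTL does not match to BT - would be (sketch):

#include <stdbool.h>

bool is_ws_rshift(unsigned char c)
{
    unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
    return c <= 32 && ((mask >> c) & 1);
}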

Re: Question about Gimple Variables named D.[0-9]*

2020-08-24 Thread Richard Biener via Gcc
On Thu, Aug 20, 2020 at 11:51 AM Erick Ochoa wrote:
>
> Hello,
>
> I am looking at the dump for the build_alias pass. I see a lot of
> variables with the naming convention D.[0-9]* in the points-to sets
> being printed.
>
> When I compile with
>
> -fdump-tree-all-all
>
> I can see that the suffix D.[0-9]* is appended to some gimple variables.
> I initially imagined that variables in the points-to variable set could
> map to a variable declaration in gimple, but this does not seem to be
> the case. I have confirmed this by searching for some known variable
> name in the points-to set and finding no matches in the gimple code; the
> other way around also seems to be true.
>
> Are these variables just constraint variables used to solve the
> points-to analysis? In other words, the variables in points-to sets
> printed out in build_alias do not have a simple map to variables in
> gimple. The only relation is that the intersection between to points-to
> set for variable A with the points-to set of variable B will yield an
> is_alias(A, B) relationship. Is the above true?

The points-to sets in SSA_NAME_POINTER_INFO record DECL_UIDs
which are those printed as D.[0-9]* which is appended to all variables
if you dump with -uid.  The points-to set "names" in the points-to dumps
are internal names created for the constraint variables - most of the time
based on the program variable names but only loosely coupled.

The translation between constraint variables and program variables is
done in set_uids_in_ptset.

Richard.

>
> Thanks!
>
>


Re: Peephole optimisation: isWhitespace()

2020-08-24 Thread Stefan Kanthak
"Richard Biener"  wrote:

> On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak wrote:
>>
>> "Allan Sandfeld Jensen"  wrote:
>>
>> > On Friday, 14 August 2020 18:43:12 CEST Stefan Kanthak wrote:
>> >> Hi @ll,
>> >>
>> >> in his ACM queue article ,
>> >> Matt Godbolt used the function
>> >>
>> >> | bool isWhitespace(char c)
>> >> | {
>> >> |
>> >> | return c == ' '
>> >> |
>> >> |   || c == '\r'
>> >> |   || c == '\n'
>> >> |   || c == '\t';
>> >> |
>> >> | }
>> >>
>> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> >> processors (see ):
>> >> |    xor    eax, eax          ; result = false
>> >> |    cmp    dil, 32           ; is c > 32
>> >> |    ja     .L4               ; if so, exit with false
>> >> |    movabs rax, 4294977024   ; rax = 0x100002600
>> >> |    shrx   rax, rax, rdi     ; rax >>= c
>> >> |    and    eax, 1            ; result = rax & 1
>> >> |
>> >> |.L4:
>> >> |    ret

[...]
 
> Whether or not the conditional branch sequence is faster depends on whether
> the branch is well-predicted which very much depends on the data you
> feed the isWhitespace function with

Correct.

> but I guess since this is the c == ' ' test it _will_ be a well-predicted 
> branch

Also correct, but you miss a point: the typical use case is

while (isWhitespace(*ptr)) ptr++;

> which means the conditional branch sequence will be usually faster.

And this is wrong!
The (well-predicted) branch is usually NOT taken, so both code variants
usually execute the same 6 or 7 instructions (with one exception).

> The proposed change turns the control into a data dependence which
> constrains instruction scheduling and retirement.

It doesn't matter: the branch has the same data dependency too!

> Indeed a mispredicted branch will likely be more costly.

And no branch is even better: the branch predictor has a limited capacity,
so every removed branch instruction can help improve its efficiency.

> x86 CPUs do not perform data speculation.

>>  mov ecx, edi
>>  movabs  rax, 4294977024
>>  shr rax, cl
>>  xor edi, edi
>>  cmp ecx, 33
>>  setb    dil
>>  and eax, edi

I already presented measured numbers: with random data, the branch-free
code is faster, with ordered data the original code.

Left column 1 billion sequential characters
for (int i=10; i; --i) ...(i);
right column 1 billion random characters, in cycles per character:

GCC:           2.4    3.4
branch-free:   3.0    2.5

Now perform a linear interpolation and find the break-even point at
p=0.4, with p=0 for ordered data and p=1 for random data, or just use
the average of these numbers: 2.9 cycles vs. 2.75 cycles.
That's small, but measurable!
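
A minimal harness in that spirit (a sketch, not the code behind the numbers
above; assumes x86 and uses __rdtsc for the cycle count):

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <x86intrin.h>

enum { N = 1000000000 };  /* 1 billion characters, as above */

static bool isWhitespace (char c)
{
  return c == ' ' || c == '\r' || c == '\n' || c == '\t';
}

int main (void)
{
  char *buf = (char *) malloc (N);
  if (!buf)
    return 1;
  for (long i = 0; i < N; ++i)
    buf[i] = (char) rand ();  /* random case; use (char) i for the ordered case */

  unsigned long long t0 = __rdtsc ();
  unsigned long long hits = 0;
  for (long i = 0; i < N; ++i)
    hits += isWhitespace (buf[i]);
  unsigned long long t1 = __rdtsc ();

  printf ("%.2f cycles/char, %llu whitespace\n", (double) (t1 - t0) / N, hits);
  free (buf);
  return 0;
}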

Stefan


Re: Peephole optimisation: isWhitespace()

2020-08-24 Thread Richard Biener via Gcc
On Mon, Aug 24, 2020 at 1:22 PM Stefan Kanthak  wrote:
>
> "Richard Biener"  wrote:
>
> > On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak wrote:
> >>
> >> "Allan Sandfeld Jensen"  wrote:
> >>
> >> > On Friday, 14 August 2020 18:43:12 CEST Stefan Kanthak wrote:
> >> >> Hi @ll,
> >> >>
> >> >> in his ACM queue article ,
> >> >> Matt Godbolt used the function
> >> >>
> >> >> | bool isWhitespace(char c)
> >> >> | {
> >> >> |
> >> >> | return c == ' '
> >> >> |
> >> >> |   || c == '\r'
> >> >> |   || c == '\n'
> >> >> |   || c == '\t';
> >> >> |
> >> >> | }
> >> >>
> >> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
> >> >> processors (see ):
> >> >> |    xor    eax, eax          ; result = false
> >> >> |    cmp    dil, 32           ; is c > 32
> >> >> |    ja     .L4               ; if so, exit with false
> >> >> |    movabs rax, 4294977024   ; rax = 0x100002600
> >> >> |    shrx   rax, rax, rdi     ; rax >>= c
> >> >> |    and    eax, 1            ; result = rax & 1
> >> >> |
> >> >> |.L4:
> >> >> |    ret
>
> [...]
>
> > Whether or not the conditional branch sequence is faster depends on whether
> > the branch is well-predicted which very much depends on the data you
> > feed the isWhitespace function with
>
> Correct.
>
> > but I guess since this is the c == ' ' test it _will_ be a well-predicted 
> > branch
>
> Also correct, but you miss a point: the typical use case is
>
> while (isWhitespace(*ptr)) ptr++;
>
> > which means the conditional branch sequence will be usually faster.
>
> And this is wrong!
> The (well-predicted) branch is usually NOT taken, so both code variants
> usually execute the same 6 or 7 instructions (with one exception).

Whether or not the branch is predicted taken does not matter; what
matters is that the continuation is not data dependent on the branch
target computation and thus can execute in parallel to it.

> > The proposed change turns the control into a data dependence which
> > constrains instruction scheduling and retirement.
>
> It doesn't matter: the branch has the same data dependency too!
>
> > Indeed a mispredicted branch will likely be more costly.
>
> And no branch is even better: the branch predictor has a limited capacity,
> so every removed branch instruction can help improve its efficiency.
>
> > x86 CPUs do not perform data speculation.
>
> >>  mov ecx, edi
> >>  movabs  rax, 4294977024
> >>  shr rax, cl
> >>  xor edi, edi
> >>  cmp ecx, 33
> >>  setb    dil
> >>  and eax, edi
>
> I already presented measured numbers: with random data, the branch-free
> code is faster, with ordered data the original code.
>
> Left column 1 billion sequential characters
> for (int i=10; i; --i) ...(i);
> right column 1 billion random characters, in cycles per character:

I guess feeding it Real Text (TM) is the only relevant benchmark,
doing sth like

  for (;;)
 cnt[isWhitespace(*ptr++)]++;

> GCC:           2.4    3.4
> branch-free:   3.0    2.5

I'd call that inconclusive data - you also failed to show your test data
is somehow relevant.  We do know that mispredicted branches are bad.
You show well-predicted branches are good.  By simple statistics
singling out 4 out of 255 values will make the branches well-predicted.

> Now perform a linear interpolation and find the break-even point at
> p=0.4, with p=0 for ordered data and p=1 for random data, or just use
> the average of these numbers: 2.9 cycles vs. 2.75 cycles.
> That's small, but measurable!


Re: Peephole optimisation: isWhitespace()

2020-08-24 Thread Stefan Kanthak
"Richard Biener"  wrote:


> On Mon, Aug 24, 2020 at 1:22 PM Stefan Kanthak wrote:
>>
>> "Richard Biener"  wrote:
>>
>> > On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak wrote:
>> >>
>> >> "Allan Sandfeld Jensen"  wrote:
>> >>
>> >> > On Friday, 14 August 2020 18:43:12 CEST Stefan Kanthak wrote:

[...]

> Whether or not the branch is predicted taken does not matter; what
> matters is that the continuation is not data dependent on the branch
> target computation and thus can execute in parallel to it.

My benchmark shows that this doesn't matter!

>> > The proposed change turns the control into a data dependence which
>> > constrains instruction scheduling and retirement.
>>
>> It doesn't matter: the branch has the same data dependency too!
>>
>> > Indeed a mispredicted branch will likely be more costly.
>>
>> And no branch is even better: the branch predictor has a limited capacity,
>> so every removed branch instruction can help improve its efficiency.
>>
>> > x86 CPUs do not perform data speculation.
>>
>> >>  mov ecx, edi
>> >>  movabs  rax, 4294977024
>> >>  shr rax, cl
>> >>  xor edi, edi
>> >>  cmp ecx, 33
>> >>  setb    dil
>> >>  and eax, edi
>>
>> I already presented measured numbers: with random data, the branch-free
>> code is faster, with ordered data the original code.
>>
>> Left column 1 billion sequential characters
>> for (int i=10; i; --i) ...(i);
>> right column 1 billion random characters, in cycles per character:
> 
> I guess feeding it Real Text (TM) is the only relevant benchmark,
> doing sth like
> 
>  for (;;)
> cnt[isWhitespace(*ptr++)]++;

I approximated that using a PRNG...

>> GCC:           2.4    3.4
>> branch-free:   3.0    2.5
> 
> I'd call that inconclusive data - you also failed to show your test data
> is somehow relevant.

Since nobody can predict real-world data, all test data are somehow
irrelevant. I thus call your argument a NULL argument.

> We do know that mispredicted branches are bad.
> You show well-predicted branches are good.

Wrong: I show that no branches are still better.

> By simple statistics singling out 4 out of 255 values will make the
> branches well-predicted.

Your statistic is wrong:
1. the branch singles out 224 of 256 values, i.e. 7/8 of all data;
2. whitespace lies in the 1/8 which is not singled out.

>> Now perform a linear interpolation and find the break-even point at
>> p=0.4, with p=0 for ordered data and p=1 for random data, or just use
>> the average of these numbers: 2.9 cycles vs. 2.75 cycles.
>> That's small, but measurable!

Stefan


Alloca documentation

2020-08-24 Thread Philip R Brenan via Gcc
Hi *GCC*:

Given the example for *alloca*() at:

https://www.gnu.org/software/libc/manual/html_mono/libc.html#Alloca-Example

May I recommend the code below, which portrays the use of alloca() more
completely in that it demonstrates that it is exiting the containing
function, not the {} of the first for loop, that frees the storage
allocated by alloca()?

#include <alloca.h>
#include <stdio.h>

int main(void)
 {const int N = 26;
  char *(p[N]);
  for(int i = 0; i < N; ++i) {*(p[i] = alloca(1)) = 'a'+i;}
  for(int i = 0; i < N; ++i) putchar(*(p[i]));
  return 0;
 }

// abcdefghijklmnopqrstuvwxyz


-- 
Thanks,

Phil 

Philip R Brenan 


Re: Alloca documentation

2020-08-24 Thread Jonathan Wakely via Gcc
On Mon, 24 Aug 2020 at 15:39, Philip R Brenan via Gcc  wrote:
>
> Hi *GCC*:
>
> Given the example for *alloca*() at:
>
> https://www.gnu.org/software/libc/manual/html_mono/libc.html#Alloca-Example
>
> May I recommend the code below, which portrays the use of alloca() more
> completely in that it demonstrates that it is exiting the containing
> function, not the {} of the first for loop, that frees the storage
> allocated by alloca()?

That documentation is part of glibc, not GCC.

See https://www.gnu.org/software/libc


Re: Clobber REG_CC only for some constraint alternatives?

2020-08-24 Thread Jeff Law via Gcc
On Thu, 2020-08-20 at 21:36 +0530, Senthil Kumar Selvaraj via Gcc wrote:
> Pip Cet writes:
> 
> > On Tue, Aug 18, 2020 at 6:52 AM Senthil Kumar Selvaraj wrote:
> > > > recognize such insns, but as it stands that define_insn would
> > > > recognize the incorrect insn:
> > > > 
> > > > [(set (reg:QI 0) (const_int 0))
> > > >  (clobber (scratch:CC))]
> > > 
> > > get_cc_reg_clobber_rtx also looks at the insn itself (i.e. whatever
> > > the erstwhile NOTICE_UPDATE_CC hook did), so if the cc attribute is LDI,
> > > and operands[1] is const0_rtx, it would return cc_reg_rtx as the clobber 
> > > reg.
> > > 
> > > AFAICT, some passes after reload verify if operands match constraints,
> > > and that would work in this case - I'm not sure if the pattern you
> > > mentioned could show up, outside of wrong splitters/peepholes in the md 
> > > file.
> > 
> > I don't think they can, but it's still another case of lying to GCC.
> > 
> > At the very least, output_movqi should assert that it isn't asked to
> > produce code for this situation.
> 
> I agree.
> > > Another approach would be to conservatively use a 'c' constraint and
> > > clobber REG_CC for all reg-reg moves.
> > 
> > I'd prefer splitting the constraint alternatives to have one clear-r0
> > alternative, an ldi alternative, and a clear -r1_31 alternative.
> > 
> > As for define_subst, is it really worth it? If I'm reading the
> > documentation correctly, it's not powerful enough to deal with scratch
> > operands on its own, so we'd probably need three or four variants of
> > define_subst just to handle those cases.
> 
> Is this related to cmpelim not looking at REG_CC clobber if there are
> other (clobber scratch..) preceding it?
> > I'm probably missing something obvious, but what's the reason for
> > keeping the CC-clobbering post-reload splitter when we already have a
> > CC-setting one? Are there significant post-reload optimizations that
> > can deal with a clobber but not an arbitrary assignment to CC? If so,
> > wouldn't it be better to fix those optimizations?
> 
> The post-reload splitter introduces the clobber. The wiki
> suggests that approach if most insns clobber REG_CC, perhaps because of
> the missed optimizations you describe below?
If most patterns set/clobber the flags, then yes, it's slightly better to only
expose them after reload.   Various passes that directly grub through RTL rather
than using helpers like single_set will optimize things better.

That's the approach I've taken with the H8 conversion, which is basically working
at this point, and when I have the time I walk through various minor codegen
inefficiencies.

Jeff


