Dear all,

Another benefit of the new if converter that perhaps I neglected to 
mention/explain...

TL;DR: for some source code that can reasonably be expected to exist in
"real-world code", when the per-iteration true/false values of the condition
of a given "if" in a given loop are very well-clustered, the code produced
by the new converter runs _much_ faster on the same inputs than the code
produced by the old converter, provided write-back caching is in effect.

The long explanation follows.



In the case of a loop with a "half-hammock" that looks something like this:

  if (C[index])  A[index] = foo(bar);
  // important: no else here

... or, equivalently as far as this problem is concerned:

  if (C[index])  ; // empty "then" section
  else           B[index] = foo(bar);

... the latter of which is semantically equivalent to:

  if (! C[index])  B[index] = foo(bar);
  // important: no else here


... the old if converter does something that may massively damage performance.

Basically, the old if converter converts...

  if (C[index])  A[index] = foo(bar);
  // important: no else here

... to the equivalent of:

  __compiler_temp = foo(bar);
  A[index] = C[index] ? __compiler_temp : A[index];
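
For concreteness, here is a minimal sketch -- my own illustration with
hypothetical names (N, A, C, foo, bar), not the converter's literal output --
of what that transformation looks like once it is placed back inside a loop
the vectorizer might target:

  /* Sketch of the old conversion inside a loop; not literal converter
     output.  Note that A[index] is read and then stored on *every*
     iteration, regardless of the value of C[index].  */
  extern int foo (int);

  void old_conversion_sketch (int *A, const int *C, long N, int bar)
  {
    for (long index = 0; index < N; ++index)
      {
        int __compiler_temp = foo (bar);
        A[index] = C[index] ? __compiler_temp : A[index];
      }
  }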


For now, let's assume the preceding conversion is valid even in the face of
multithreading, since multithreading bugs introduced by an "optimization"
are a whole other ball of wax from what this message is about; for now,
let's assume that all of A[] is thread-local and that no nasty, sneaky
pointer-passing has occurred.

The problem is this: the compiler cannot, in the general case, predict what
the values of C[] will be at runtime.  Therefore, it cannot [again, in the
general case] conclude either "this is definitely worth it" or "this is
definitely _not_ worth it".  All the compiler can do statically, without
profiling information, is to assume a probability of roughly 50% that any
given element of C[] is true, in which case -- under an assumption of
vectorization -- the vectorization factor is going to make the
transformation worthwhile.

However: what if the values of C[] are mostly false, not true?  For such
cases, the old if converter yields code that may cause a big performance
degradation due to the if conversion, even in the presence of vectorization.
If we assume that the CPU hardware does not check whether a write to an
address actually changes the contents, then each execution of
"A[index] = C[index] ? foo(bar) : A[index];" causes a write to occur *_no
matter what the value of "C[index]" is/was_*.  Now, instead of reading the
whole of A[] and writing only a tiny fraction of it, the program is reading
all of A[] and also (re)writing at least almost all of A[].  It may even
write all of A[] when the probability of the relevant elements of C[] being
true is _not_ 100%, due to the cache-line granularity of writes: once every
cache line from A[] has been modified, all of A[] will be written back.
For example, with 64-byte cache lines and 8-byte elements, a single true
value in each group of eight consecutive elements is enough to dirty every
cache line of A[].

The preceding problem could be somewhat ameliorated by profiling, provided
that the data you run through your program while profiling is a good
representation of the data run through the same program by "real
executions", but there is no need for that extra work or extra qualification
given the existence of the new if converter.  Furthermore, the profiling
approach to "fixing" this problem with the old converter would only result
in a binary decision -- "do convert this if" vs. "don't convert this if" --
so in cases where the decision is to do/retain the conversion, the converted
code is still going to rewrite the whole array.  The new converter's
conversion, OTOH, can produce better performance than the old converter's
conversion in cases where the elements of C[] in our example are very
clustered: in other words, the _overall_ probability can still be close [or
equal] to the hardest-to-deal-with case of 50%, yet there is a large degree
of clustering of the "true" values and the "false" values -- for example,
all the "false" values come first in C[].  In this case, even if a
profiling-based approach using the old converter decides to do/keep the
conversion, there are still lots of wasted writes that the new converter
would avoid, assuming at least one level of write-back cache in the relevant
data path.
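
To make the clustering scenario concrete, here is a hypothetical
initialization of C[] -- a sketch with made-up names (N, C), not taken from
any real benchmark -- in which the overall probability of "true" is exactly
50% yet the values are perfectly clustered:

  /* Hypothetical data: 50% of the elements are true overall, but all the
     "false" values come first and all the "true" values come last.  */
  for (long index = 0; index < N; ++index)
    C[index] = (index >= N / 2);

With data like this, the old conversion still rewrites all of A[], whereas
the new conversion leaves the first half of A[] untouched [modulo at most
one cache line at the boundary].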

The key to understanding how/why the new converter's resulting code is
better than that of the old converter is this: the new converter uses a
scratchpad to "throw away" useless writes.  This not only fixes problems
with speculative writes through a null pointer that the pre-conversion code
never actually performs, it also fixes the above-described potential
performance problem, at least on architectures with a write-back data cache,
which AFAIK covers most/all modern high-speed CPUs.
The new converter converts something like (same example as one of the above):

  if (C[index])  A[index] = foo(bar);
  // important: no else here

... into something like:

  /* the type of the scalar goes here */ * pointer = C[index] ? &A[index] : &scratchpad;
  __compiler_temp = foo(bar);
  *pointer = C[index] ? __compiler_temp : scratchpad;

... so that in the event of C[] being mostly false, most of the reads are
_from_ the scratchpad and most of the writes are _to_ the scratchpad.  This
not only probably* saves a lot of the potentially-massive number of wasted
writes caused by the old conversion rewriting all of A[] no matter what the
values of C[]'s elements are, it also saves a lot of reads when there is
enough clustering of false values in C[] that entire cache lines' worth of
A[] never need to be read from main RAM at all.  Caching and the "chunking
effect" of cache lines, besides determining how much reading is saved by the
new conversion relative to the old one, are also the reason I needed to
qualify "probably* saves most/all the [...] wasted writes": it depends on
how well or how poorly the values in C[] are clustered.  In the case that is
best for the new converter, the new converter saves at least almost all the
waste of the old converter's code on both the read path _and_ the write path
[modulo at most one cache line of both read and write].
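
To put the new conversion back into loop context, here is a rough sketch --
my own illustration under simplifying assumptions, not the converter's
literal output -- with the scratchpad modeled as a single local variable
and with hypothetical names (N, A, C, foo, bar):

  /* Sketch of the new conversion inside a loop; not literal converter
     output.  "false" iterations read from and write to the scratchpad, so
     the corresponding cache lines of A[] are neither read nor dirtied
     [given enough clustering].  */
  extern int foo (int);

  void new_conversion_sketch (int *A, const int *C, long N, int bar)
  {
    int scratchpad = 0;   /* absorbs the writes from the "false" iterations */

    for (long index = 0; index < N; ++index)
      {
        int *pointer = C[index] ? &A[index] : &scratchpad;
        int __compiler_temp = foo (bar);
        *pointer = C[index] ? __compiler_temp : scratchpad;
      }
  }

When C[index] is false, both the read and the write hit the scratchpad,
which stays resident in cache, rather than touching A[index].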


I hope the above helps to explain why the new converter is worth merging to
trunk, even though for now it still has performance regressions which we
expect to fix before the first GCC 6.x is tagged for release.


Regards,

Abe
