On Tue, Oct 15, 2013 at 2:10 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Tue, Oct 15, 2013 at 1:12 AM, Mike Stump <mikest...@comcast.net> wrote:
>> So, here is a comparison of the time required to do a make -j15 of a
>> --disable-bootstrap --enable-checking=none --enable-languages=c,c++ style
>> compiler.  The base compiler is a --enable-checking=none
>> --enable-languages=c,c++,lto style compiler, which is
>> 1b2bf75690af8115739ebba710a44d05388c7a1a (aka trunk@202797) from git.  The
>> wide branch compiler is 4529820913813b810860784382f975ea8e6be61d (aka
>> wide-int@203462) from git.  The software compiled in both cases is the
>> base compiler described above.
>>
>> Net result, around a 2.6% regression in user time, and 0.4% in elapsed
>> time.  The raw data is below, just in case one is interested.  This is on
>> an Ubuntu 12.04.3 system with 12GB of RAM and 8 cores.
>
> Btw, more interesting are testcases that put a heavy load on the alias
> machinery, like (many) (nested) loops with a lot of memory references.
> Like the testcase in PR39326.  If you profile that you will see some
> of the double_int routines high in the profile, which means that on the
> branch the wide_int routines should start to show up.
>
> I didn't expect visible differences for a bootstrap, but you proved me
> wrong :(  Btw, with parallel make a single file getting a lot slower can
> be masked by parallelism completely, so I take timings with -j
> with a grain of salt.
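Something of the following shape is what I have in mind.  This is a
made-up sketch, not the actual PR39326 testcase; the point is just many
distinct memory references inside nested loops, so every pair of
references has to be disambiguated and the offset/extent computations
in get_ref_base_and_extent run over and over:

/* Hypothetical stress test in the spirit of PR39326: nested loops
   over many distinct memory references.  Names and sizes are
   invented for illustration.  */

#define N 64

struct s
{
  int a[N], b[N], c[N], d[N];
};

void
stress (struct s *p, struct s *q, int n)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      {
        p->a[i] += q->b[j];
        p->b[i] += q->c[j];
        p->c[i] += q->d[j];
        p->d[i] += q->a[j];
      }
}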
For example, for get_ref_base_and_extent the adds to bit_offset (even
though initially of addr_wide_int kind) end up unoptimized, exposing

  if (len_822 > 2)
    goto <bb 96>;
  else
    goto <bb 94>;

<bb 94>:
  xprecision_819 = (unsigned int) D.54901_818;
  if (xprecision_819 > 127)
    goto <bb 96>;
  else
    goto <bb 95>;

<bb 95>:
  D.54899_838 = D.54922_816->base.u.bits.unsigned_flag;
  D.54900_839 = (signop) D.54899_838;
  len_840 = wi::force_to_size (&MEM[(struct wide_int_ref_storage *)&yi].scratch,
                               val_823, len_822, xprecision_819, 128,
                               D.54900_839);

<bb 96>:
  # val_1543 = PHI <val_823(93), &MEM[(struct wide_int_ref_storage *)&yi].scratch(95), val_823(94)>
  # len_1542 = PHI <2(93), len_840(95), len_822(94)>
  MEM[(struct generic_wide_int *)&yi].val = val_1543;
  MEM[(struct generic_wide_int *)&yi].len = len_1542;
  MEM[(struct generic_wide_int *)&yi].precision = 128;
  D.54871_813 = wi::add_large (&MEM[(struct fixed_wide_int_storage *)&D.54875].D.43191.val,
                               &MEM[(const struct fixed_wide_int_storage *)&bit_offset].val,
                               D.54872_808, val_1543, len_1542, 128, 1, 0B);
  MEM[(unsigned int *)&D.54875 + 24B] = D.54871_813;
  __builtin_memcpy (&bit_offset, &D.54875, 28);
  goto <bb 284> (<L141>);

One issue you can clearly see is that too many of the temporaries (like
here the wide_int_ref yi that is created for the tree) end up being
addressable.  That's because their data is embedded and passed to
add_large by address (instead of what you'd call "ref" storage,
referring to storage elsewhere), which is because of the
canonicalization mismatch between tree, wide-int and RTX, I guess.

Not sure where the memcpy comes from in the above code.  It seems that

  bit_offset += TREE_OPERAND (exp, 2);

builds a temporary bit_offset + TREE_OPERAND (exp, 2) that is then
copied to bit_offset, and this copy cannot be elided.

That said, how do cc1 binary sizes compare branch vs. trunk at the
last merge point?

Richard.
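P.S.  A minimal C++ sketch of the two issues above.  The types and the
worker here are made up for illustration, not the actual wide-int API,
but they show the shape of the problem: embedded storage passed to an
out-of-line worker by address makes the temporary addressable, and +=
expands to a temporary plus a structure copy:

/* Toy fixed-size integer with embedded storage, modeled loosely on
   fixed_wide_int_storage.  All names here are invented.  */

typedef unsigned long long hwi;   /* stands in for HOST_WIDE_INT */

struct fixed_int
{
  hwi val[2];          /* embedded storage: &val escapes to the worker */
  unsigned int len;
};

/* Out-of-line worker taking pointers, like wi::add_large.  Passing the
   embedded arrays by address is what makes the caller's temporaries
   addressable, so they live in memory instead of registers.  */
static unsigned int
add_worker (hwi *result, const hwi *a, unsigned int alen,
            const hwi *b, unsigned int blen)
{
  hwi carry = 0;
  for (unsigned int i = 0; i < 2; i++)
    {
      hwi ai = i < alen ? a[i] : 0;
      hwi bi = i < blen ? b[i] : 0;
      hwi s = ai + bi + carry;
      carry = carry ? (s <= ai) : (s < ai);
      result[i] = s;
    }
  return 2;
}

static fixed_int
operator+ (const fixed_int &a, const fixed_int &b)
{
  fixed_int r;         /* the temporary (D.54875 in the dump above) */
  r.len = add_worker (r.val, a.val, a.len, b.val, b.len);
  return r;
}

static fixed_int &
operator+= (fixed_int &a, const fixed_int &b)
{
  /* a += b expands to a + b followed by a structure copy; the copy is
     the __builtin_memcpy in the dump, and it cannot be elided because
     the temporary's address escaped into add_worker.  */
  a = a + b;
  return a;
}

With real "ref" storage pointing at the tree's existing words, yi would
presumably not need an addressable scratch copy at all.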