On Tue, Oct 15, 2013 at 2:10 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Tue, Oct 15, 2013 at 1:12 AM, Mike Stump <mikest...@comcast.net> wrote:
>> So, here is a comparison of the time required to do a make -j15 of a
>> --disable-bootstrap --enable-checking=none --enable-languages=c,c++ style
>> compiler.  The base compiler is a --enable-checking=none
>> --enable-languages=c,c++,lto style compiler, which is
>> 1b2bf75690af8115739ebba710a44d05388c7a1a (aka trunk@202797) from git.  The
>> wide branch compiler is 4529820913813b810860784382f975ea8e6be61d (aka
>> wide-int@203462) from git.  The software compiled in both cases is the
>> base compiler described above.
>>
>> Net result, around a 2.6% regression in user time, and 0.4% in elapsed
>> time.  The raw data is below, just in case one is interested.  This is on
>> an Ubuntu 12.04.3 system with 12GB of RAM and 8 cores.
>
> Btw, more interesting are testcases that put a heavy load on the alias
> machinery, like (many) (nested) loops with a lot of memory references.
> Like the testcase in PR39326.  If you profile that you will see some
> of the double_int routines high in the profile, which means that on the
> branch the wide_int routines should start to show up.
>
> I didn't expect visible differences for a bootstrap, but you proved me
> wrong :(  Btw, with parallel make a single file getting a lot slower can
> be masked by parallelism completely, so I take timings with -j
> with a grain of salt.
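Something of the following shape is what I have in mind.  This is a
made-up sketch, not the actual PR39326 testcase; the point is just many
distinct memory references inside nested loops, so every pair of
references has to be disambiguated and the offset/extent computations
in get_ref_base_and_extent run over and over:

/* Hypothetical stress test in the spirit of PR39326: nested loops
   over many distinct memory references.  Names and sizes are
   invented for illustration.  */

#define N 64

struct s
{
  int a[N], b[N], c[N], d[N];
};

void
stress (struct s *p, struct s *q, int n)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      {
        p->a[i] += q->b[j];
        p->b[i] += q->c[j];
        p->c[i] += q->d[j];
        p->d[i] += q->a[j];
      }
}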
For example, for get_ref_base_and_extent the adds to bit_offset (even
though initially of addr_wide_int kind) end up unoptimized, exposing

  if (len_822 > 2)
    goto <bb 96>;
  else
    goto <bb 94>;

<bb 94>:
  xprecision_819 = (unsigned int) D.54901_818;
  if (xprecision_819 > 127)
    goto <bb 96>;
  else
    goto <bb 95>;

<bb 95>:
  D.54899_838 = D.54922_816->base.u.bits.unsigned_flag;
  D.54900_839 = (signop) D.54899_838;
  len_840 = wi::force_to_size (&MEM[(struct wide_int_ref_storage *)&yi].scratch,
                               val_823, len_822, xprecision_819, 128,
                               D.54900_839);

<bb 96>:
  # val_1543 = PHI <val_823(93), &MEM[(struct wide_int_ref_storage *)&yi].scratch(95), val_823(94)>
  # len_1542 = PHI <2(93), len_840(95), len_822(94)>
  MEM[(struct generic_wide_int *)&yi].val = val_1543;
  MEM[(struct generic_wide_int *)&yi].len = len_1542;
  MEM[(struct generic_wide_int *)&yi].precision = 128;
  D.54871_813 = wi::add_large (&MEM[(struct fixed_wide_int_storage *)&D.54875].D.43191.val,
                               &MEM[(const struct fixed_wide_int_storage *)&bit_offset].val,
                               D.54872_808, val_1543, len_1542, 128, 1, 0B);
  MEM[(unsigned int *)&D.54875 + 24B] = D.54871_813;
  __builtin_memcpy (&bit_offset, &D.54875, 28);
  goto <bb 284> (<L141>);

One issue you can clearly see is that too many of the temporaries (like
here the wide_int_ref yi that is created for the tree) end up being
addressable.  That's because their data is embedded and passed to
add_large by address (instead of what you'd call "ref" storage,
referring to storage elsewhere), which is because of the
canonicalization mismatch between tree, wide-int and RTX, I guess.

Not sure where the memcpy comes from in the above code.  It seems that

  bit_offset += TREE_OPERAND (exp, 2);

builds a temporary bit_offset + TREE_OPERAND (exp, 2) that is then
copied to bit_offset, and this copy cannot be elided.

That said, how do cc1 binary sizes compare branch vs. trunk at the
last merge point?

Richard.
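P.S.  A minimal C++ sketch of the two issues above.  The types and the
worker here are made up for illustration, not the actual wide-int API,
but they show the shape of the problem: embedded storage passed to an
out-of-line worker by address makes the temporary addressable, and +=
expands to a temporary plus a structure copy:

/* Toy fixed-size integer with embedded storage, modeled loosely on
   fixed_wide_int_storage.  All names here are invented.  */

typedef unsigned long long hwi;   /* stands in for HOST_WIDE_INT */

struct fixed_int
{
  hwi val[2];          /* embedded storage: &val escapes to the worker */
  unsigned int len;
};

/* Out-of-line worker taking pointers, like wi::add_large.  Passing the
   embedded arrays by address is what makes the caller's temporaries
   addressable, so they live in memory instead of registers.  */
static unsigned int
add_worker (hwi *result, const hwi *a, unsigned int alen,
            const hwi *b, unsigned int blen)
{
  hwi carry = 0;
  for (unsigned int i = 0; i < 2; i++)
    {
      hwi ai = i < alen ? a[i] : 0;
      hwi bi = i < blen ? b[i] : 0;
      hwi s = ai + bi + carry;
      carry = carry ? (s <= ai) : (s < ai);
      result[i] = s;
    }
  return 2;
}

static fixed_int
operator+ (const fixed_int &a, const fixed_int &b)
{
  fixed_int r;         /* the temporary (D.54875 in the dump above) */
  r.len = add_worker (r.val, a.val, a.len, b.val, b.len);
  return r;
}

static fixed_int &
operator+= (fixed_int &a, const fixed_int &b)
{
  /* a += b expands to a + b followed by a structure copy; the copy is
     the __builtin_memcpy in the dump, and it cannot be elided because
     the temporary's address escaped into add_worker.  */
  a = a + b;
  return a;
}

With real "ref" storage pointing at the tree's existing words, yi would
presumably not need an addressable scratch copy at all.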