On Mon, 10 Jul 2023, Jan Hubicka wrote:
> Hi,
> over the weekend I found that the vectorizer is missing
> scale_loop_profile for epilogues. It already adjusts loop_info to set
> max iterations, so adding it was easy. However, it now predicts the
> first loop to iterate at most once (which is too much; I suppose it
> forgets to divide by the epilogue unrolling factor) and the second never.
> >
> > The -O2 cost model doesn't want to do epilogues:
> >
> >   /* If using the "very cheap" model. reject cases in which we'd keep
> >      a copy of the scalar code (even if we might be able to vectorize it).  */
> >   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> >       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> >           || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> >           || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> >     {
> >       if (dump_enabled_p ())
> >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >                          "some scalar iterations would need to be peeled\n");
> >       return 0;
> >     }
> >
> > it's because of the code size increase.
>
> I know, however -O2 is not -Os and here the tradeoffs of
> performance/code size seem a lot better than other code expanding
> things we do at -O2 (such as the unrolling 3 times).
> I think we set the very cheap cost model very conservatively in order to
> get -ftree-vectorize enabled with -O2 and there is some room for finding
> the right balance.
>
> I get:
>
> jan@localhost:~> cat t.c
> int a[99];
> __attribute((noipa, weak))
> void
> test()
> {
>   for (int i = 0 ; i < 99; i++)
>     a[i]++;
> }
> void
> main()
> {
>   for (int j = 0; j < 10000000; j++)
>     test();
> }
> jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out
>
> real 0m0.529s
> user 0m0.528s
> sys 0m0.000s
>
> jan@localhost:~> gcc -O2 t.c ; time ./a.out
>
> real 0m0.427s
> user 0m0.426s
> sys 0m0.000s
> jan@localhost:~> gcc -O3 t.c ; time ./a.out
>
> real 0m0.136s
> user 0m0.135s
> sys 0m0.000s
> jan@localhost:~> clang -O2 t.c ; time ./a.out
> <warnings>
>
> real 0m0.116s
> user 0m0.116s
> sys 0m0.000s
>
> Code size (of function test):
>   gcc -O2 -fno-unroll-loops    17 bytes
>   gcc -O2                      29 bytes
>   gcc -O3                      50 bytes
>   clang -O2                   510 bytes
>
> So unrolling is 70% code size growth for 23% speedup.
> Vectorizing is 294% code size growth for 388% speedup.
> Clang does 3000% code size growth for 456% speedup.
> >
> > That's clearly much larger code. On x86 we're also fighting with
> > large instruction encodings here, in particular EVEX for AVX512 is
> > "bad" here. We hardly get more than two instructions decoded per
> > cycle due to their size.
>
> Agreed, I found it surprising clang does that much complete unrolling
> at -O2. However vectorizing and not unrolling here seems like it may be
> a better default for -O2 than what we do currently...

I was also playing with AVX512 fully masked loops here, which avoid
the epilogue, but due to the instruction encoding size that doesn't
usually win. I agree that size isn't everything, at least for -O2.

Richard.
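
The size and speed percentages quoted above can be recomputed from the
raw measurements. Below is a minimal sketch in C; the helper names
growth_percent and speedup_percent are hypothetical and not part of GCC.
Read as growth over the baseline, 17 -> 29 bytes is about 70% and
0.529s -> 0.427s is roughly 23-24%. The 294%/388% and 3000%/456% figures
appear to be ratios to the baseline (50/17, 0.529/0.136, 510/17,
0.529/0.116) rather than growth over it.

```c
#include <assert.h>

/* Hypothetical helper: percentage by which VALUE exceeds BASELINE.
   For example 17 -> 29 bytes gives roughly 70.6.  */
double
growth_percent (double baseline, double value)
{
  assert (baseline > 0.0);
  return (value / baseline - 1.0) * 100.0;
}

/* Hypothetical helper: percentage speedup computed from two run times,
   i.e. how much more work is done per unit of time.  A result of 0
   means "same speed", 100 means "twice as fast".  */
double
speedup_percent (double baseline_time, double new_time)
{
  assert (new_time > 0.0);
  return (baseline_time / new_time - 1.0) * 100.0;
}
```

With the numbers from the transcript, growth_percent (17, 50) and
speedup_percent (0.529, 0.136) give the -O3 figures on the same "growth
over baseline" scale as the unrolling line.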