On Fri, Jan 3, 2014 at 12:59 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Fri, Jan 03, 2014 at 12:25:00PM +0100, Uros Bizjak wrote:
>> I am testing a patch that removes "max_align" part from ix86_data_alignment.
>
> That looks like unnecessary pessimization.  Note the hunk in question is
> guarded with opt, which means it is an optimization rather than ABI issue,
> it can increase alignment, but the compiler can only assume the increased
> alignment if the symbol is not public or if it is public, but can't be
> preempted by another TU's definition.  Even in that case it can be
> worthwhile to increase the alignment, say if doing versioning for alignment,
> or say just doing some AVX/AVX2/AVX512F version of memcpy/memset that can
> handle it faster if it is sufficiently aligned by testing it at runtime.
>
> So, IMHO we shouldn't drop it, just improve it.
> Perhaps:
>
>   int max_align = optimize_size ? BITS_PER_WORD
>                                 : MIN (512, MAX_OFILE_ALIGNMENT);
>
>   if (opt
>       && AGGREGATE_TYPE_P (type)
>       && TYPE_SIZE (type)
>       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
>       && align < max_align)
>     {
>       int this_align = 256;
>       for (this_align = 256; this_align <= max_align; this_align *= 2)
>         if (TREE_INT_CST_LOW (TYPE_SIZE (type)) < (unsigned) this_align
>             && !TREE_INT_CST_HIGH (TYPE_SIZE (type)))
>           break;
>         else if (align < this_align)
>           align = this_align;
>     }
>
> which will handle both the 256 bit alignment for >= 256 bit objects,
> 512 bit alignment for >= 512 bit objects and will be prepared for the
> future.
>
> 128 bit I think doesn't need to be handled, DATA_ALIGNMENT has been
> using 256-bit test already since it's introduction in 1998:
> http://gcc.gnu.org/ml/gcc-bugs/1998-03/msg00011.html

Thanks for the pointer, there is indeed the recommendation in
optimization manual [1], section 3.6.4, where it is said:

--quote--
Misaligned data access can incur significant performance penalties.
This is particularly true for cache line
splits. The size of a cache line is 64 bytes in the Pentium 4 and
other recent Intel processors, including
processors based on Intel Core microarchitecture.
An access to data unaligned on 64-byte boundary leads to two memory
accesses and requires several
μops to be executed (instead of one). Accesses that span 64-byte
boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on
machines with longer pipelines.

...

A 64-byte or greater data structure or array should be aligned so that
its base address is a multiple of 64.
Sorting data in decreasing size order is one heuristic for assisting
with natural alignment. As long as 16-
byte boundaries (and cache lines) are never crossed, natural alignment
is not strictly necessary (though
it is an easy way to enforce this).
--/quote--

So, this part has nothing to do with AVX512, but with cache line
width. And we do have a --param "l1-cache-line-size=64", detected with
-march=native that could come handy here.

This part should be rewritten (and commented) with the information
above in mind.

[1] 
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

Uros.

Reply via email to