[Bug target/65871] bzhi builtin/intrinsic wrongly assumes bzhi instruction doesn't set the ZF flag
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65871

Yann Collet changed:
 What |Removed |Added
 CC   |        |yann.collet.73 at gmail dot com

--- Comment #14 from Yann Collet ---
Is gcc with -mbmi2 currently able to automatically generate a bzhi instruction when it detects an "X & ((1 << Y) - 1)" sequence, as suggested by James Almer? If so, are there any examples available?
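As a minimal sketch of the question (function names are mine; _bzhi_u32 is the BMI2 intrinsic from immintrin.h):

/* gcc -O2 -mbmi2 -S bzhi_test.c */
#include <stdint.h>
#include <immintrin.h>   /* _bzhi_u32, requires -mbmi2 */

/* The pattern in question: keep the low Y bits of X. Note the shift
   form assumes y < 32, while bzhi itself is defined for any index. */
uint32_t mask_shift(uint32_t x, uint32_t y)
{
    return x & ((1u << y) - 1);
}

/* Explicit intrinsic form, for comparing the generated code. */
uint32_t mask_bzhi(uint32_t x, uint32_t y)
{
    return _bzhi_u32(x, y);
}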
[Bug c/82802] New: Potential UBSAN error with pointer difference (32-bits mode)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82802

Bug ID: 82802
Summary: Potential UBSAN error with pointer difference (32-bits mode)
Product: gcc
Version: 7.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yann.collet.73 at gmail dot com
Target Milestone: ---

As part of our CI test suite, we compile and run fuzzer tests every day. The UBSAN test has been failing for some time now; I suspect it's related to our provider having updated the gcc version at some point.

The failure happens in this situation: presuming we have 2 pointers with highPtr > lowPtr, both associated with the same object (one is the upper limit, the other a cursor into the object), requesting the distance in 32-bits mode with `highPtr - lowPtr` generates this UBSAN error:

runtime error: signed integer overflow: -2147452928 - 1879078921 cannot be represented in type 'int'

The values of these pointers are:
highPtr : 0x80007800
lowPtr  : 0x70018AAB

As can be seen, there is no overflow: highPtr > lowPtr, and the distance is ~256 MB, well within the limits of ptrdiff_t in 32-bits mode. Nonetheless, UBSAN considers it an error, likely because highPtr crosses the 0x80000000 threshold. I suspect the pointer addresses are converted into the `int` type *before* the subtraction, which leads to UBSAN's conclusion. The same code on clang doesn't trigger any error.
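A hypothetical reduction of the situation (names and types are mine, not from the CI code):

#include <stddef.h>

/* Both pointers point into the same object, with highPtr above the
   0x80000000 boundary in a 32-bit process. The subtraction is
   well-defined C, yet UBSAN reports a signed 32-bit overflow. */
ptrdiff_t distance(const char* highPtr, const char* lowPtr)
{
    return highPtr - lowPtr;
}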
[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Yann Collet changed:
 What |Removed |Added
 CC   |        |yann.collet.73 at gmail dot com

--- Comment #9 from Yann Collet ---
While the issue can be easily fixed from an LZ4 perspective, the main topic here is to analyze a GCC 4.9+ vectorizer choice. The piece of code that it tried to optimize can be summarized as follows (once all the garbage is removed):

static void LZ4_copy8(void* dstPtr, const void* srcPtr)
{
    *(U64*)dstPtr = *(U64*)srcPtr;
}

Pretty simple. Let's assume for the rest of the post that both pointers are correctly aligned, so that's no longer a problem. Looking at the generated assembly, we see that GCC produces a MOVDQA instruction for it:

> movdqa (%rdi,%rax,1),%xmm0
> $rdi=0x7fffea4b53e6
> $rax=0x0

This seems wrong on 2 levels:

- The function only wants to copy 8 bytes. MOVDQA works on a full SSE register, which is 16 bytes. This spells trouble, if only for buffer boundary checks: the algorithm uses 8 bytes because it knows it can safely read/write that size without crossing buffer limits. With 16 bytes, there is no such guarantee.

- MOVDQA requires both positions to be aligned. I read that as SSE-size aligned, which means 16-byte aligned. But they are not; these pointers are only supposed to be 8-byte aligned.

(A bit off topic, but from a general perspective, I don't understand the use of MOVDQA, which requires such a strong alignment condition, when MOVDQU is also available, works fine at any memory address, and suffers no performance penalty on aligned memory addresses. MOVDQU looks like a better choice in all circumstances.)

Anyway, the core of the issue is the point above: this is just an 8-byte copy operation, and replacing it with a 16-byte one looks suspicious. Maybe it deserves a look.
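For what it's worth, a memcpy-based variant (my sketch, not necessarily LZ4's actual fix) would express the same 8-byte copy without asserting any pointer alignment to the compiler, so the vectorizer has no basis for assuming 16-byte alignment:

#include <string.h>

typedef unsigned long long U64;

/* Same 8-byte copy, but through memcpy: the compiler can no longer
   derive an alignment guarantee from a U64* cast. */
static void LZ4_copy8_memcpy(void* dstPtr, const void* srcPtr)
{
    memcpy(dstPtr, srcPtr, sizeof(U64));
}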
[Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

Bug ID: 67435
Summary: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
Product: gcc
Version: 4.8.4
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yann.collet.73 at gmail dot com
Target Milestone: ---

I'm seeing some weird effect with gcc (tested version: 4.8.4).

I've got a performance-oriented code base, which runs pretty fast. Its speed depends for a large part on inlining many small functions. There is no inline statement; all functions are either normal or static. Automatic inlining decisions are solely within the compiler's realm, which has worked fine so far (the functions to inline are very small, typically 1 to 5 lines).

Since inlining across multiple .c files is difficult (-flto is not yet widely available), I've kept a lot of small functions in a single `.c` file, in which I'm also developing a codec and its associated decoder. It's "relatively" large by my standards (about ~2000 lines, although a lot of them are mere comments and blank lines), but breaking it into smaller parts opens new problems, so I would prefer to avoid that if possible.

Encoder and decoder are related, since they are inverse operations. But from a programming perspective, they are completely separated, sharing nothing in common except a few typedefs and very low-level functions (such as reading from an unaligned memory position).

The strange effect is this one: I recently added a new function fnew to the encoder side. It's a new "entry point". It's not used nor called from anywhere within the .c file. The simple fact that it exists makes the performance of the decoder function fdec drop substantially, by more than 20%, which is way too much to be ignored. Keep in mind that encoding and decoding operations are completely separated; they share almost nothing, save some minor typedefs (u32, u16 and such) and associated operations (read/write).

When the new encoding function fnew is defined as static, performance of the decoder fdec increases back to normal. Since fnew isn't called from the .c file, I guess it's the same as if it were not there (dead code elimination). If static fnew is then called from the encoder side, performance of fdec remains good. But as soon as fnew is modified, fdec performance drops substantially again.

Presuming fnew's modifications crossed a threshold, I increased the following gcc parameter: --param max-inline-insns-auto=60 (by default, its value is supposed to be 40). And it worked: performance of fdec is now back to normal. But I guess this game will continue forever with each little modification of fnew or anything similar, requiring further tweaks to some customized advanced parameter, so I want to avoid that.

I tried another variant: adding another completely useless function, just to play with. Its content is strictly a copy-paste of fnew, but the name of the function is obviously different, so let's call it wtf. When wtf exists (on top of fnew), it doesn't matter whether fnew is static or not, nor what the value of max-inline-insns-auto is: performance of fdec is just back to normal. Even though wtf is not used nor called from anywhere... :'(

All these effects look plain weird. There is no logical reason for some little modification in function fnew to have a knock-on effect on the completely unrelated function fdec, whose only relation is being in the same file.

I'm trying to understand what could be going on, in order to develop the codec more reliably. For the time being, any modification in function A can have large ripple effects (positive or negative) on a completely unrelated function B, making each step a tedious process with a random outcome. A developer's nightmare.
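To make the description concrete, here is a hypothetical skeleton of the file layout (fnew and fdec are the names from the report; the bodies are illustrative filler, not the actual codec):

#include <string.h>

typedef unsigned int u32;

/* Tiny shared helper, expected to be inlined automatically. */
static u32 read32(const void* p)
{
    u32 v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Decoder: the function whose performance regresses. */
int fdec(void* dst, const void* src)
{
    (void)dst;
    return (int)read32(src);   /* stand-in for the hot decoding loop */
}

/* New encoder entry point: never called from this file, yet its mere
   presence (when non-static) degrades fdec by more than 20%. */
int fnew(const void* src)
{
    return (int)read32(src);
}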
[Bug c/67435] Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #4 from Yann Collet ---
> Gcc also tries to limit code growth for the unit also which might be
> something you are seeing.

Yes, that could be the case. Is there some information available somewhere on such a unit-level limit? Specifically, I'm wondering if splitting the file into 2 would help. But since that's a fairly large and difficult task, I'm really looking for hints that it's the right solution before starting in that direction.

> you can use -fdump-ipa-inline to look at gcc's inline decisions in detail.
> You can also try -Winline

Sure, I will try them.

> Do you see similar effects with 4.9.3 or 5.2?

I have difficulties installing multiple versions of gcc on the same dev system. I will try again when I've got time. But anyway, that's not the sole issue: my users have the compilers they have, meaning I can't target only the latest version, since >90% of users won't have it. I don't intend to support gcc 1.2 either, but there is a middle ground to find. If I can have a solution which works with gcc 4.6 / 4.8, without relying on new features from 5.x, then it's a better solution.
[Bug c/67435] Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #5 from Yann Collet ---
Complementary information:

-Winline: does not output anything (is that normal?)

-fdump-ipa-inline: produces several large files, the interesting one being 1.5 MB long. That's a huge dump to analyze. Nonetheless, I had a deeper look directly at the function whose speed is affected. Looking at both the slow and fast versions, I could spot *no difference* regarding inlining decisions. From what I can tell, the dump files seem strictly identical. (Note: there could be differences somewhere else that I did not spot.)

Since then, it has also been suggested that this effect could be related to something else: instruction cache line alignment.
[Bug c/67435] Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #6 from Yann Collet ---
The issue seems in fact related to _instruction alignment_. More precisely, to the alignment of some critical loop. That's basically why adding some code to the file just "pushes" some other code into another position, potentially a less favorable one (hence the appearance of "random impact").

The following GCC command-line option saved the day: -falign-loops=32

Note that -falign-loops=16 doesn't work. I suspect 16 might be the default value, but can't be sure. I also suspect that -falign-loops=32 is primarily useful for Broadwell cpus.

Now, the problem is, `-falign-loops=32` is a gcc-only command line parameter. It seems not possible to apply this optimization from within the source file, such as by using:

#pragma GCC optimize ("align-loops=32")

or, targeting the function:

__attribute__((optimize("align-loops=32")))

Neither of these alternatives works.
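For completeness, a compilable sketch combining both attempted forms (the function body is illustrative only):

/* Both source-level forms tried above; neither appears to affect
   loop alignment in practice. */
#pragma GCC optimize ("align-loops=32")

__attribute__((optimize("align-loops=32")))
void hot_function(char* p, int n)
{
    int i;
    for (i = 0; i < n; i++)   /* the critical loop one would want aligned */
        p[i]++;
}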
[Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : critical loop instruction alignment)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #8 from Yann Collet ---
Thanks for the link. It's a very good read, and indeed completely in line with my recent experience. The recommended solution seems to be the same: "-falign-loops=32".

The article also mentions that the issue is valid for Sandy Bridge cpus. This broadens the scope: it's not just about Broadwell, but also Haswell, Ivy Bridge and Sandy Bridge; all new cpus from Intel since 2011. That looks like a large enough installed base to care about. However, for some reason, in the table provided, both Sandy Bridge and Haswell get a default loop alignment value of 16, not 32. Is there a reason for that choice?

> Optimizing for just one specific model will negatively affect performance on
> an other.

Well, this issue is apparently important for more than one architecture. Moreover, being aligned on 32 implies being aligned on 16 too, so it doesn't introduce a drawback for older siblings.

Since then, I could find a few other complaints about the same issue. One example here: https://software.intel.com/en-us/forums/topic/479392 and a close cousin here: http://stackoverflow.com/questions/9881002/is-this-a-gcc-bug-when-using-falign-loops-option

This last one introduces a good question: while it's possible to use "-falign-loops=32" to set the preference for the whole program, it seems not possible to set it precisely for a single loop. It looks like a good feature request, as this loop-alignment issue can have a pretty large impact on performance (~20%), but only matters for a few selected critical loops. The programmer is typically in a good position to know which loops matter the most. Hence, we don't necessarily need *all* loops to be 32-byte aligned, just a handful of them. Less precise but still great would be the ability to set this optimization parameter for a function or a section of code. But my experiments seem to show that using #pragma or __attribute__ with align-loops does not work, as if the optimization setting were simply ignored.
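One speculative per-loop workaround, not discussed in this thread: emit an assembler alignment directive just before the loop via inline asm. This is GAS-specific, and the compiler remains free to place code between the directive and the loop head, so it's a sketch rather than a reliable solution:

/* Speculative: request 32-byte (2^5) alignment right before the loop. */
void hot_loop(char* p, int n)
{
    int i;
    __asm__ volatile (".p2align 5");
    for (i = 0; i < n; i++)
        p[i]++;
}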
[Bug c/67435] Feature request: Implement align-loops attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #10 from Yann Collet ---
> there already is an aligned attribute for functions, variables and fields,

Sure, but none of them relates to aligning the start of a hot instruction loop. Aligning the function instead looks like a poor proxy.

> there are also drawbacks to high alignment values

Yes. I could verify that using -falign-loops=32 on a larger code base produces drawbacks: not just larger code size, but worse speed too. This makes it all the more relevant to be able to select which loops should be aligned, instead of relying on a single program-wide compilation flag.
[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #18 from Yann Collet ---
This issue makes me wonder: how does one efficiently access unaligned memory?

The case in point is ARM cpus. They don't support SSE/AVX, so they seem unaffected by this specific issue, but the issue forces writing the source code in a certain way, to remain compatible with the vectorizer's assumptions. Therefore, for portable code, the question becomes: how to write code that is both portable and efficient on both targets?

Since apparently writing:

u32 = *(U32*)ptr;

is forbidden if ptr is not guaranteed to be aligned on a 4-byte boundary, as the compiler will then be authorized to assume ptr is properly aligned, how does one efficiently load 4 bytes from memory at an unaligned position? I know 3 ways:

1) byte by byte: safe, but slow ==> not efficient

2) using memcpy: memcpy(&u32, ptr, sizeof(u32)); It works. It's safe, and on x86/x64 it's correctly translated into a single mov instruction, so it's also fast. Alas, on ARM targets, this gets translated into a much more complex/cautious sequence, depending on optimization settings. This is not a small difference: at -O3 settings, we get a x2 performance difference; at -O2 settings, it becomes x5 (the unaligned code being slower).

3) using the __packed instruction: basically features the same benefits and problems as the memcpy() method above.

The problem is therefore with newer ARM cpus, which efficiently support unaligned memory. Reaching that performance is not possible using memcpy() nor __packed; it seems the only way to get it is to write:

u32 = *(U32*)ptr;

The difference in performance is really huge; in fact it totally changes the application, so it can't be ignored.

The question is: is there a way to access this performance without violating the principle stated in this thread, namely that it's not authorized to write u32 = *(U32*)ptr; if ptr is not guaranteed to be properly aligned on a 4-byte boundary?
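As a reference point, here is method 1 spelled out as a sketch (assuming little-endian byte order, which a real implementation would have to handle explicitly):

#include <stdint.h>

/* Byte-by-byte load: safe at any address, but usually the slowest
   of the three methods above. */
static uint32_t read32_bytes(const uint8_t* p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}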
[Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

Bug ID: 67366
Summary: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
Product: gcc
Version: 4.8.2
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yann.collet.73 at gmail dot com
Target Milestone: ---

Accessing unaligned memory positions used to be forbidden on ARM cpus, but since ARMv6 (quite a few years ago by now), this operation is supported. However, GCC 4.5 - 4.6 - 4.7 - 4.8 seem to generate sub-optimal code for it on these targets.

In theory, it's illegal to issue a direct statement such as:

u32 read32(const void* ptr) { return *(const u32*)ptr; }

if ptr is not properly aligned. There are 2 work-arounds that I know of. The first is to use the `packed` attribute, which is not portable (compiler specific). The second and better one is to use memcpy():

u32 read32(const void* ptr)
{
    u32 v;
    memcpy(&v, ptr, sizeof(v));
    return v;
}

This version is portable and safe. It also works very well on multiple platforms, such as x86/x64, PPC, or ARM64, being reduced to an optimal assembly sequence (a single instruction). Unfortunately, GCC 4.5 - 4.6 - 4.7 - 4.8 generate suboptimal assembly for this function on ARMv6 or ARMv7:

read32(void const*):
        ldr r0, [r0]        @ unaligned
        sub sp, sp, #8
        str r0, [sp, #4]    @ unaligned
        ldr r0, [sp, #4]
        add sp, sp, #8
        bx lr

This is in stark contrast with clang, which generates much more efficient assembly:

read32(void const*):        @ @read32(void const*)
        ldr r0, [r0]
        bx lr

(The assembly can be generated and displayed using a simple tool: https://goo.gl/7FWDB8)

It's not that gcc is unaware of the cpu's unaligned memory access capability, since it does use it (`ldr r0, [r0]`), but it then loses a lot of time on useless operations on a discardable temporary variable, storing data onto the stack just to read it again. Inlining does not save the day. -O3 helps reduce the impact, but it's still large.

In a recent exercise comparing efficient vs inefficient memory access on ARMv6 and ARMv7, the measured difference was very large: up to 6x faster at -O2 settings. See: http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html It's definitely too large a difference to be ignored.

As a consequence, to preserve performance, source code must try a bunch of possibilities depending on target and compiler, if not version. In some circumstances (gcc with ARMv6, or gcc <= 4.5), it's even necessary to write illegal code (see the 1st version above) to reach optimal performance. This looks like a waste of energy, and a recipe for bugs, especially compared to clang, which generates clean code in all circumstances for all targets.

Considering the huge performance difference such an improvement could make, is this something the gcc team would like to look into?

Regards
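For reference, a sketch of the first workaround mentioned above, the gcc-specific `packed` attribute (type and function names are mine):

#include <stdint.h>

/* A packed struct removes the alignment guarantee normally implied by
   uint32_t, so gcc emits an access that is legal at any address. */
typedef struct { uint32_t v; } __attribute__((packed)) unalign32;

static uint32_t read32_packed(const void* ptr)
{
    return ((const unalign32*)ptr)->v;
}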