[Bug target/65871] bzhi builtin/intrinsic wrongly assumes bzhi instruction doesn't set the ZF flag
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65871

Yann Collet changed:
 What |Removed |Added
 CC   |        |yann.collet.73 at gmail dot com

--- Comment #14 from Yann Collet ---
Is gcc with -mbmi2 currently able to automatically generate a bzhi instruction when it detects an "X & ((1 << Y) - 1)" sequence, as suggested by James Almer? If so, are there any examples available?
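As a minimal sketch of the question (function names are mine; _bzhi_u32 is the BMI2 intrinsic from immintrin.h):

/* gcc -O2 -mbmi2 -S bzhi_test.c */
#include <stdint.h>
#include <immintrin.h>   /* _bzhi_u32, requires -mbmi2 */

/* The pattern in question: keep the low Y bits of X. Note the shift
   form assumes y < 32, while bzhi itself is defined for any index. */
uint32_t mask_shift(uint32_t x, uint32_t y)
{
    return x & ((1u << y) - 1);
}

/* Explicit intrinsic form, for comparing the generated code. */
uint32_t mask_bzhi(uint32_t x, uint32_t y)
{
    return _bzhi_u32(x, y);
}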
[Bug c/82802] New: Potential UBSAN error with pointer difference (32-bits mode)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82802

Bug ID: 82802
Summary: Potential UBSAN error with pointer difference (32-bits mode)
Product: gcc
Version: 7.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yann.collet.73 at gmail dot com
Target Milestone: ---

As part of our CI test suite, we compile and run fuzzer tests every day. The UBSAN test has been failing for some time now; I suspect it's related to our provider having updated the gcc version at some point.

The failure happens in this situation: presuming we have 2 pointers with highPtr > lowPtr, both associated with the same object (one is the upper limit, the other a cursor into the object), requesting the distance in 32-bits mode with `highPtr - lowPtr` generates this UBSAN error:

runtime error: signed integer overflow: -2147452928 - 1879078921 cannot be represented in type 'int'

The values of these pointers are:
highPtr : 0x80007800
lowPtr  : 0x70018AAB

As can be seen, there is no overflow: highPtr > lowPtr, and the distance is ~256 MB, well within the limits of ptrdiff_t in 32-bits mode. Nonetheless, UBSAN considers it an error, likely because highPtr crosses the 0x80000000 threshold. I suspect the pointer addresses are converted into the `int` type *before* the subtraction, which leads to UBSAN's conclusion. The same code on clang doesn't trigger any error.
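A hypothetical reduction of the situation (names and types are mine, not from the CI code):

#include <stddef.h>

/* Both pointers point into the same object, with highPtr above the
   0x80000000 boundary in a 32-bit process. The subtraction is
   well-defined C, yet UBSAN reports a signed 32-bit overflow. */
ptrdiff_t distance(const char* highPtr, const char* lowPtr)
{
    return highPtr - lowPtr;
}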
[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Yann Collet changed:
 What |Removed |Added
 CC   |        |yann.collet.73 at gmail dot com

--- Comment #9 from Yann Collet ---
While the issue can be easily fixed from an LZ4 perspective, the main topic here is to analyze a GCC 4.9+ vectorizer choice. The piece of code that it tried to optimize can be summarized as follows (once all the garbage is removed):

static void LZ4_copy8(void* dstPtr, const void* srcPtr)
{
    *(U64*)dstPtr = *(U64*)srcPtr;
}

Pretty simple. Let's assume for the rest of the post that both pointers are correctly aligned, so that's no longer a problem. Looking at the generated assembly, we see that GCC produces a MOVDQA instruction for it:

> movdqa (%rdi,%rax,1),%xmm0
> $rdi=0x7fffea4b53e6
> $rax=0x0

This seems wrong on 2 levels:

- The function only wants to copy 8 bytes. MOVDQA works on a full SSE register, which is 16 bytes. This spells trouble, if only for buffer boundary checks: the algorithm uses 8 bytes because it knows it can safely read/write that size without crossing buffer limits. With 16 bytes, there is no such guarantee.

- MOVDQA requires both positions to be aligned. I read that as SSE-size aligned, which means 16-byte aligned. But they are not; these pointers are only supposed to be 8-byte aligned.

(A bit off topic, but from a general perspective, I don't understand the use of MOVDQA, which requires such a strong alignment condition, when MOVDQU is also available, works fine at any memory address, and suffers no performance penalty on aligned memory addresses. MOVDQU looks like a better choice in all circumstances.)

Anyway, the core of the issue is the point above: this is just an 8-byte copy operation, and replacing it with a 16-byte one looks suspicious. Maybe it deserves a look.
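For what it's worth, a memcpy-based variant (my sketch, not necessarily LZ4's actual fix) would express the same 8-byte copy without asserting any pointer alignment to the compiler, so the vectorizer has no basis for assuming 16-byte alignment:

#include <string.h>

typedef unsigned long long U64;

/* Same 8-byte copy, but through memcpy: the compiler can no longer
   derive an alignment guarantee from a U64* cast. */
static void LZ4_copy8_memcpy(void* dstPtr, const void* srcPtr)
{
    memcpy(dstPtr, srcPtr, sizeof(U64));
}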
[Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

Bug ID: 67435
Summary: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
Product: gcc
Version: 4.8.4
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yann.collet.73 at gmail dot com
Target Milestone: ---

I'm seeing some weird effect with gcc (tested version: 4.8.4).

I've got a performance-oriented code base, which runs pretty fast. Its speed depends for a large part on inlining many small functions. There is no inline statement; all functions are either normal or static. Automatic inlining decisions are solely within the compiler's realm, which has worked fine so far (the functions to inline are very small, typically 1 to 5 lines).

Since inlining across multiple .c files is difficult (-flto is not yet widely available), I've kept a lot of small functions in a single `.c` file, in which I'm also developing a codec and its associated decoder. It's "relatively" large by my standards (about ~2000 lines, although a lot of them are mere comments and blank lines), but breaking it into smaller parts opens new problems, so I would prefer to avoid that if possible.

Encoder and decoder are related, since they are inverse operations. But from a programming perspective, they are completely separated, sharing nothing in common except a few typedefs and very low-level functions (such as reading from an unaligned memory position).

The strange effect is this one: I recently added a new function fnew to the encoder side. It's a new "entry point". It's not used nor called from anywhere within the .c file. The simple fact that it exists makes the performance of the decoder function fdec drop substantially, by more than 20%, which is way too much to be ignored. Keep in mind that encoding and decoding operations are completely separated; they share almost nothing, save some minor typedefs (u32, u16 and such) and associated operations (read/write).

When the new encoding function fnew is defined as static, performance of the decoder fdec increases back to normal. Since fnew isn't called from the .c file, I guess it's the same as if it were not there (dead code elimination). If static fnew is then called from the encoder side, performance of fdec remains good. But as soon as fnew is modified, fdec performance drops substantially again.

Presuming fnew's modifications crossed a threshold, I increased the following gcc parameter: --param max-inline-insns-auto=60 (by default, its value is supposed to be 40). And it worked: performance of fdec is now back to normal. But I guess this game will continue forever with each little modification of fnew or anything similar, requiring further tweaks to some customized advanced parameter, so I want to avoid that.

I tried another variant: adding another completely useless function, just to play with. Its content is strictly a copy-paste of fnew, but the name of the function is obviously different, so let's call it wtf. When wtf exists (on top of fnew), it doesn't matter whether fnew is static or not, nor what the value of max-inline-insns-auto is: performance of fdec is just back to normal. Even though wtf is not used nor called from anywhere... :'(

All these effects look plain weird. There is no logical reason for some little modification in function fnew to have a knock-on effect on the completely unrelated function fdec, whose only relation is being in the same file.

I'm trying to understand what could be going on, in order to develop the codec more reliably. For the time being, any modification in function A can have large ripple effects (positive or negative) on a completely unrelated function B, making each step a tedious process with a random outcome. A developer's nightmare.
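To make the description concrete, here is a hypothetical skeleton of the file layout (fnew and fdec are the names from the report; the bodies are illustrative filler, not the actual codec):

#include <string.h>

typedef unsigned int u32;

/* Tiny shared helper, expected to be inlined automatically. */
static u32 read32(const void* p)
{
    u32 v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Decoder: the function whose performance regresses. */
int fdec(void* dst, const void* src)
{
    (void)dst;
    return (int)read32(src);   /* stand-in for the hot decoding loop */
}

/* New encoder entry point: never called from this file, yet its mere
   presence (when non-static) degrades fdec by more than 20%. */
int fnew(const void* src)
{
    return (int)read32(src);
}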
[Bug c/67435] Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #4 from Yann Collet ---
> Gcc also tries to limit code growth for the unit also which might be
> something you are seeing.

Yes, that could be the case. Is there some information available somewhere on such a unit-level limit? Specifically, I'm wondering if splitting the file into 2 would help. But since that's a fairly large and difficult task, I'm really looking for hints that it's the right solution before starting in that direction.

> you can use -fdump-ipa-inline to look at gcc's inline decisions in detail.
> You can also try -Winline

Sure, I will try them.

> Do you see similar effects with 4.9.3 or 5.2?

I have difficulties installing multiple versions of gcc on the same dev system. I will try again when I've got time. But anyway, that's not the sole issue: my users have the compilers they have, meaning I can't target only the latest version, since >90% of users won't have it. I don't intend to support gcc 1.2 either, but there is a middle ground to find. If I can have a solution which works with gcc 4.6 / 4.8, without relying on new features from 5.x, then it's a better solution.
[Bug c/67435] Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #5 from Yann Collet ---
Complementary information:

-Winline: does not output anything (is that normal?)

-fdump-ipa-inline: produces several large files, the interesting one being 1.5 MB long. That's a huge dump to analyze. Nonetheless, I had a deeper look directly at the function whose speed is affected. Looking at both the slow and fast versions, I could spot *no difference* regarding inlining decisions. From what I can tell, the dump files seem strictly identical. (Note: there could be differences somewhere else that I did not spot.)

Since then, it has also been suggested that this effect could be related to something else: instruction cache line alignment.
[Bug c/67435] Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #6 from Yann Collet ---
The issue seems in fact related to _instruction alignment_. More precisely, to the alignment of some critical loop. That's basically why adding some code to the file just "pushes" some other code into another position, potentially a less favorable one (hence the appearance of "random impact").

The following GCC command-line option saved the day: -falign-loops=32

Note that -falign-loops=16 doesn't work. I suspect 16 might be the default value, but can't be sure. I also suspect that -falign-loops=32 is primarily useful for Broadwell cpus.

Now, the problem is, `-falign-loops=32` is a gcc-only command line parameter. It seems not possible to apply this optimization from within the source file, such as by using:

#pragma GCC optimize ("align-loops=32")

or, targeting the function:

__attribute__((optimize("align-loops=32")))

Neither of these alternatives works.
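For completeness, a compilable sketch combining both attempted forms (the function body is illustrative only):

/* Both source-level forms tried above; neither appears to affect
   loop alignment in practice. */
#pragma GCC optimize ("align-loops=32")

__attribute__((optimize("align-loops=32")))
void hot_function(char* p, int n)
{
    int i;
    for (i = 0; i < n; i++)   /* the critical loop one would want aligned */
        p[i]++;
}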
[Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : critical loop instruction alignment)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #8 from Yann Collet ---
Thanks for the link. It's a very good read, and indeed completely in line with my recent experience. The recommended solution seems to be the same: "-falign-loops=32".

The article also mentions that the issue is valid for Sandy Bridge cpus. This broadens the scope: it's not just about Broadwell, but also Haswell, Ivy Bridge and Sandy Bridge; all new cpus from Intel since 2011. That looks like a large enough installed base to care about. However, for some reason, in the table provided, both Sandy Bridge and Haswell get a default loop alignment value of 16, not 32. Is there a reason for that choice?

> Optimizing for just one specific model will negatively affect performance on
> an other.

Well, this issue is apparently important for more than one architecture. Moreover, being aligned on 32 implies being aligned on 16 too, so it doesn't introduce a drawback for older siblings.

Since then, I could find a few other complaints about the same issue. One example here: https://software.intel.com/en-us/forums/topic/479392 and a close cousin here: http://stackoverflow.com/questions/9881002/is-this-a-gcc-bug-when-using-falign-loops-option

This last one introduces a good question: while it's possible to use "-falign-loops=32" to set the preference for the whole program, it seems not possible to set it precisely for a single loop. It looks like a good feature request, as this loop-alignment issue can have a pretty large impact on performance (~20%), but only matters for a few selected critical loops. The programmer is typically in a good position to know which loops matter the most. Hence, we don't necessarily need *all* loops to be 32-byte aligned, just a handful of them. Less precise but still great would be the ability to set this optimization parameter for a function or a section of code. But my experiments seem to show that using #pragma or __attribute__ with align-loops does not work, as if the optimization setting were simply ignored.
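One speculative per-loop workaround, not discussed in this thread: emit an assembler alignment directive just before the loop via inline asm. This is GAS-specific, and the compiler remains free to place code between the directive and the loop head, so it's a sketch rather than a reliable solution:

/* Speculative: request 32-byte (2^5) alignment right before the loop. */
void hot_loop(char* p, int n)
{
    int i;
    __asm__ volatile (".p2align 5");
    for (i = 0; i < n; i++)
        p[i]++;
}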
[Bug c/67435] Feature request: Implement align-loops attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #10 from Yann Collet ---
> there already is an aligned attribute for functions, variables and fields,

Sure, but none of them relates to aligning the start of a hot instruction loop. Aligning the function instead looks like a poor proxy.

> there are also drawbacks to high alignment values

Yes. I could verify that using -falign-loops=32 on a larger code base produces drawbacks: not just larger code size, but worse speed too. This makes it all the more relevant to be able to select which loops should be aligned, instead of relying on a single program-wide compilation flag.
[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #18 from Yann Collet ---
This issue makes me wonder: how does one efficiently access unaligned memory?

The case in point is ARM cpus. They don't support SSE/AVX, so they seem unaffected by this specific issue, but the issue forces writing the source code in a certain way, to remain compatible with the vectorizer's assumptions. Therefore, for portable code, the question becomes: how to write code that is both portable and efficient on both targets?

Since apparently writing:

u32 = *(U32*)ptr;

is forbidden if ptr is not guaranteed to be aligned on a 4-byte boundary, as the compiler will then be authorized to assume ptr is properly aligned, how does one efficiently load 4 bytes from memory at an unaligned position? I know 3 ways:

1) byte by byte: safe, but slow ==> not efficient

2) using memcpy: memcpy(&u32, ptr, sizeof(u32)); It works. It's safe, and on x86/x64 it's correctly translated into a single mov instruction, so it's also fast. Alas, on ARM targets, this gets translated into a much more complex/cautious sequence, depending on optimization settings. This is not a small difference: at -O3 settings, we get a x2 performance difference; at -O2 settings, it becomes x5 (the unaligned code being slower).

3) using the __packed instruction: basically features the same benefits and problems as the memcpy() method above.

The problem is therefore with newer ARM cpus, which efficiently support unaligned memory. Reaching that performance is not possible using memcpy() nor __packed; it seems the only way to get it is to write:

u32 = *(U32*)ptr;

The difference in performance is really huge; in fact it totally changes the application, so it can't be ignored.

The question is: is there a way to access this performance without violating the principle stated in this thread, namely that it's not authorized to write u32 = *(U32*)ptr; if ptr is not guaranteed to be properly aligned on a 4-byte boundary?
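As a reference point, here is method 1 spelled out as a sketch (assuming little-endian byte order, which a real implementation would have to handle explicitly):

#include <stdint.h>

/* Byte-by-byte load: safe at any address, but usually the slowest
   of the three methods above. */
static uint32_t read32_bytes(const uint8_t* p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}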
[Bug c/67366] New: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

Bug ID: 67366
Summary: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
Product: gcc
Version: 4.8.2
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yann.collet.73 at gmail dot com
Target Milestone: ---

Accessing unaligned memory positions used to be forbidden on ARM cpus, but since ARMv6 (quite a few years ago by now), this operation is supported. However, GCC 4.5 - 4.6 - 4.7 - 4.8 seem to generate sub-optimal code for it on these targets.

In theory, it's illegal to issue a direct statement such as:

u32 read32(const void* ptr) { return *(const u32*)ptr; }

if ptr is not properly aligned. There are 2 work-arounds that I know of. The first is to use the `packed` attribute, which is not portable (compiler specific). The second and better one is to use memcpy():

u32 read32(const void* ptr)
{
    u32 v;
    memcpy(&v, ptr, sizeof(v));
    return v;
}

This version is portable and safe. It also works very well on multiple platforms, such as x86/x64, PPC, or ARM64, being reduced to an optimal assembly sequence (a single instruction). Unfortunately, GCC 4.5 - 4.6 - 4.7 - 4.8 generate suboptimal assembly for this function on ARMv6 or ARMv7:

read32(void const*):
        ldr r0, [r0]        @ unaligned
        sub sp, sp, #8
        str r0, [sp, #4]    @ unaligned
        ldr r0, [sp, #4]
        add sp, sp, #8
        bx lr

This is in stark contrast with clang, which generates much more efficient assembly:

read32(void const*):        @ @read32(void const*)
        ldr r0, [r0]
        bx lr

(The assembly can be generated and displayed using a simple tool: https://goo.gl/7FWDB8)

It's not that gcc is unaware of the cpu's unaligned memory access capability, since it does use it (`ldr r0, [r0]`), but it then loses a lot of time on useless operations on a discardable temporary variable, storing data onto the stack just to read it again. Inlining does not save the day. -O3 helps reduce the impact, but it's still large.

In a recent exercise comparing efficient vs inefficient memory access on ARMv6 and ARMv7, the measured difference was very large: up to 6x faster at -O2 settings. See: http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html It's definitely too large a difference to be ignored.

As a consequence, to preserve performance, source code must try a bunch of possibilities depending on target and compiler, if not version. In some circumstances (gcc with ARMv6, or gcc <= 4.5), it's even necessary to write illegal code (see the 1st version above) to reach optimal performance. This looks like a waste of energy, and a recipe for bugs, especially compared to clang, which generates clean code in all circumstances for all targets.

Considering the huge performance difference such an improvement could make, is this something the gcc team would like to look into?

Regards
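For reference, a sketch of the first workaround mentioned above, the gcc-specific `packed` attribute (type and function names are mine):

#include <stdint.h>

/* A packed struct removes the alignment guarantee normally implied by
   uint32_t, so gcc emits an access that is legal at any address. */
typedef struct { uint32_t v; } __attribute__((packed)) unalign32;

static uint32_t read32_packed(const void* ptr)
{
    return ((const unalign32*)ptr)->v;
}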