https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #18 from Yann Collet <yann.collet.73 at gmail dot com> ---
This issue makes me wonder : how to efficiently access unaligned memory ?


The case in point is ARM cpu.
They don't support SSE/AVX, so they seem unaffected by this specific issue,
but this issue force writing the source code in a certain way, to remain
compatible with vectorizer assumtion.
Therefore, for portable code, it becomes an issue :
how to write a code which is both portable and efficient on both targets ?


Since apparently, writing : u32 = *(U32*)ptr;
is forbidden if ptr is not guaranteed to be aligned on 4-bytes boundaries
as the compiler will then be authorized to assume ptr is properly aligned,
how to efficiently load 4-bytes from memory at unaligned position ?

I know 3 ways :

1) byte by byte : secure, but slow ==> not efficient

2) using memcpy : memcpy(&u32, ptr, sizeof(u32));
It works. It's safe, and on x86/x64 it's correctly translated into a single mov
instruction, so it's also fast.
Alas, on ARM target, this get translated into much more complex /cautious
sequence, depending on optimization settings.
This is not a small difference :
at -O3 settings, we get a x2 performance difference.
at -O2 settings, it becomes x5 (unaligned code is slower).

3) using __packed instruction : Basically, feature the same benefits and
problems than memcpy() method above


The problem is therefore for newer ARM CPU, which efficiently support unaligned
memory.
Accessing this performance is not possible using memcpy() nor __packed.
And it seems the only way to get it is to do : u32 = *(U32*)ptr;

The difference in performance is really huge, in fact it totally changes the
application, so it can't be ignored.


The question is :
Is there a way to access this performance without violating the principle which
has been stated into this thread, 
that is : it's not authorized to write : u32 = *(U32*)ptr; if ptr is not
guaranteed to be properly aligned on 4-bytes boundaries.

Reply via email to