On Fri, Feb 19, 2016 at 7:24 AM, Phil Ruffwind <r...@rufflewind.com> wrote:
> Hello all,
>
> I am trying to analyze the optimized results of following code.  The
> intent is to unpack a 64-bit integer into a struct containing eight
> 8-bit integers.  The optimized result was very promising at first, but
> I then discovered that whenever the unpacking function gets inlined
> into another function, the optimization no longer works.
>
>     #include <stdint.h>
>     #include <string.h>
>
>     /* a struct of eight 8-bit integers */
>     struct alpha {
>         int8_t a;
>         int8_t b;
>         ...
>         int8_t h;
>     };
>
>     struct alpha unpack(uint64_t x)
>     {
>         struct alpha r;
>         memcpy(&r, &x, 8);
>         return r;
>     }
>
>     struct alpha wrapper(uint64_t y)
>     {
>         return unpack(y);
>     }
>
> The code was compiled with gcc 5.3.0 on Linux 4.4.1 with -O3 on x86-64.
>
> The `unpack` function optimizes fine.  It produces the following
> assembly as expected:
>
>     mov rax, rdi
>     ret
>
> Given that `wrapper` is a trivial wrapper around `unpack`, I would
> expect the same.  But in reality this is what I got from gcc:
>
>     mov eax, edi
>     xor ecx, ecx
>     mov esi, edi
>     shr ax, 8
>     mov cl, dil
>     shr esi, 24
>     mov ch, al
>     mov rax, rdi
>     movzx edx, sil
>     and eax, 16711680
>     and rcx, -16711681
>     sal rdx, 24
>     movabs rsi, -4278190081
>     or rcx, rax
>     mov rax, rcx
>     movabs rcx, -1095216660481
>     and rax, rsi
>     or rax, rdx
>     movabs rdx, 1095216660480
>     and rdx, rdi
>     and rax, rcx
>     movabs rcx, -280375465082881
>     or rax, rdx
>     movabs rdx, 280375465082880
>     and rdx, rdi
>     and rax, rcx
>     movabs rcx, -71776119061217281
>     or rax, rdx
>     movabs rdx, 71776119061217280
>     and rdx, rdi
>     and rax, rcx
>     shr rdi, 56
>     or rax, rdx
>     sal rdi, 56
>     movabs rdx, 72057594037927935
>     and rax, rdx
>     or rax, rdi
>     ret
>
> This seems quite strange.  Somehow the inlining process seems to have
> screwed up the potential optimizations.  Is there some way to prevent
> this from happening short of disabling inlining?  Or perhaps there is
> a better way to write this code so that gcc would optimize it more
> predictably?

It seems to be SRA (scalar replacement of aggregates) "optimizing"
the copy it sees in unpack ():

unpack (uint64_t x)
{
  struct alpha r;
  struct alpha D.2276;
  long unsigned int _2;

  <bb 2>:
  _2 = x_6(D);
  MEM[(char * {ref-all})&r] = x_6(D);
  D.2276 = r;  <--  this one
  r ={v} {CLOBBER};
  return D.2276;

}

when inlined into wrapper:

wrapper (uint64_t y)
{
  struct alpha D.2286;
  struct alpha r;
  struct alpha D.2279;

  <bb 2>:
  MEM[(char * {ref-all})&r] = y_2(D);
  D.2286 = r;
  r ={v} {CLOBBER};
  D.2279 = D.2286;
  return D.2279;

}

While this results in removing the redundant aggregate D.2286, it
also results in implementing the copy byte-wise (for no good reason).
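
One possible way around this (an untested sketch, not a verified
fix; unpack2 and alpha_bits are illustrative names) is to make the
copy gcc sees a scalar rather than an aggregate, e.g. by punning
through a union, using struct alpha as defined above:

    #include <stdint.h>

    /* Untested sketch: the store and the copy both go through a
       uint64_t, which SRA should have no reason to split
       byte-wise.  Whether gcc 5.3.0 then keeps the single
       "mov rax, rdi" after inlining is an assumption, not
       something verified here. */
    union alpha_bits {
        struct alpha s;
        uint64_t u;
    };

    struct alpha unpack2(uint64_t x)
    {
        union alpha_bits b;
        b.u = x;       /* one 8-byte scalar store */
        return b.s;    /* aggregate copy of the same 8 bytes */
    }

Checking the tree dumps (-fdump-tree-all) would show whether the
copy actually stays whole.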

Note that SRA cannot simply use bigger accesses here, as struct
alpha is only aligned to 1 byte.
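
For completeness (again untested, and it changes the ABI of the
struct): raising the alignment would remove that particular
obstacle.  alpha8 below is just an illustrative name:

    #include <stdint.h>

    /* Untested sketch: force 8-byte alignment via the gcc aligned
       attribute (C11 _Alignas(8) on the first member would do the
       same).  Whether SRA then performs the copy with a single
       8-byte access is an assumption. */
    struct alpha8 {
        int8_t a;
        int8_t b;
        int8_t c;
        int8_t d;
        int8_t e;
        int8_t f;
        int8_t g;
        int8_t h;
    } __attribute__((aligned(8)));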

Richard.


> I would appreciate any advice, thanks.
>
> Phil
