https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69935

            Bug ID: 69935
           Summary: load not hoisted out of linked-list traversal loop
           Product: gcc
           Version: 5.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

(please check the component.  I guessed tree-optimization since it's
cross-architecture.)

gcc doesn't hoist the p->a load out of the loop in this linked-list function,
even though only the value from the last node is ever returned:

int traverse_loadloop(struct foo_head *ph)
{
  int a = -1;
  struct foo *p = ph->h;
  while (p) {
    a = p->a;
    p = p->n;
  }
  return a;
}

I see the same thing on godbolt with gcc 4.8 for ARM/PPC/ARM64, and with
gcc 4.5.3 for AVR.
For x86, gcc 5.3.0 -O3 on godbolt (http://goo.gl/r8vb5L) does this:

        movq    (%rdi), %rdx
        movl    $-1, %eax
        testq   %rdx, %rdx
        je      .L10
.L11:
        movl    8(%rdx), %eax     # load p->a inside the loop, not hoisted
        movq    (%rdx), %rdx
        testq   %rdx, %rdx
        jne     .L11
.L10:
        rep ret

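This is nice and compact, but less hyperthreading-friendly than it could be:
every iteration issues a load for p->a, but only the last one is needed.
(The alternative, a mov reg,reg that keeps the last non-NULL pointer so p->a
can be loaded once after the loop, doesn't even take an execution unit on
recent CPUs.)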

The load of p->a every time through the loop might also delay the p->n load by
a cycle on CPUs with only one load port, or when there's a cache-bank conflict.
Since the p->n load is on the loop-carried dependency chain, this might take
the loop from one iteration per 4 cycles to one per 5 cycles (if L1 load-use
latency is 4 cycles).

Clang hoists the load out of the loop, producing identical asm output for this
function and one with the load hoisted in the C source.  (The godbolt link has
both versions.  Also see bug 69933 which I just reported, since gcc showed a
separate branch-layout issue for the source-level hoisting version.)
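
For reference, the source-level hoisting might look roughly like the sketch
below. The struct layouts are an assumption inferred from the offsets in the
asm above (p->n at offset 0, p->a at offset 8, ph->h at offset 0), and the
name traverse_hoisted is hypothetical; the exact code is in the godbolt link.

```c
/* Assumed layouts, inferred from the x86-64 asm above:
   (%rdx) is p->n and 8(%rdx) is p->a. */
struct foo      { struct foo *n; int a; };
struct foo_head { struct foo *h; };

/* Hypothetical name.  Same result as traverse_loadloop, but p->a is
   loaded once, after the loop has found the last node. */
int traverse_hoisted(struct foo_head *ph)
{
  struct foo *p = ph->h;
  if (!p)
    return -1;          /* empty list: same result as the original */
  while (p->n)          /* only the p->n load remains in the loop */
    p = p->n;
  return p->a;          /* single load, from the last node */
}
```

The early return for an empty list keeps the -1 result of the original, and
the loop body now carries only the pointer-chasing load.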
