https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69935
Bug ID: 69935
Summary: load not hoisted out of linked-list traversal loop
Product: gcc
Version: 5.3.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

(please check the component. I guessed tree-optimization since it's cross-architecture.)

gcc doesn't hoist the p->a load out of the loop in this linked-list function:

    int traverse_loadloop(struct foo_head *ph)
    {
        int a = -1;
        struct foo *p = ph->h;
        while (p) {
            a = p->a;
            p = p->n;
        }
        return a;
    }

I checked on godbolt with gcc 4.8 on ARM/PPC/ARM64, and gcc 4.5.3 for AVR.
For x86, gcc 5.3.0 -O3 on godbolt (http://goo.gl/r8vb5L) does this:

            movq    (%rdi), %rdx
            movl    $-1, %eax
            testq   %rdx, %rdx
            je      .L10
    .L11:
            movl    8(%rdx), %eax    ; load p->a inside the loop, not hoisted
            movq    (%rdx), %rdx
            testq   %rdx, %rdx
            jne     .L11
    .L10:
            rep ret

This is nice and compact, but less hyperthreading-friendly than it could be. (The mov reg,reg alternative doesn't even take an execution unit on recent CPUs). The load of p->a every time through the loop might also delay the p->n load by a cycle on CPUs with only one load port, or when there's a cache-bank conflict. This might take the loop from one iteration per 4c to one per 5c (if L1 load-use latency is 4c).

Clang hoists the load out of the loop, producing identical asm output for this function and one with the load hoisted in the C source. (The godbolt link has both versions. Also see bug 69933 which I just reported, since gcc showed a separate branch-layout issue for the source-level hoisting version.)
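
For reference, a minimal sketch of what the source-level hoisting mentioned above could look like. The report does not include the struct definitions or the hoisted function, so everything here beyond traverse_loadloop is an assumption: the layouts of struct foo and struct foo_head are inferred from the offsets in the x86-64 asm (p->n at offset 0, p->a at offset 8), and the name traverse_hoisted is hypothetical. The version on the godbolt link may differ.

```c
#include <stddef.h>

/* Assumed layouts, inferred from the asm offsets in the report
 * (n at offset 0, a at offset 8 on x86-64). Not taken from the bug. */
struct foo {
    struct foo *n;   /* next node */
    int a;           /* payload */
};

struct foo_head {
    struct foo *h;   /* first node, or NULL for an empty list */
};

/* Original form from the report: p->a is loaded on every iteration. */
int traverse_loadloop(struct foo_head *ph)
{
    int a = -1;
    struct foo *p = ph->h;
    while (p) {
        a = p->a;
        p = p->n;
    }
    return a;
}

/* Hypothetical hand-hoisted form: the loop only chases pointers and
 * remembers the last non-NULL node; p->a is loaded once, after the loop. */
int traverse_hoisted(struct foo_head *ph)
{
    int a = -1;
    struct foo *p = ph->h;
    if (p) {
        struct foo *last;
        do {
            last = p;
            p = p->n;
        } while (p);
        a = last->a;   /* single load, outside the loop */
    }
    return a;
}
```

Both functions return the a member of the last node, or -1 for an empty list, so a compiler is free to turn the first form into the second as long as nothing else can modify the nodes during the traversal.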