On Tue, Jun 20, 2017 at 12:18 PM, Richard Biener <richard.guent...@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 10:03 AM, Uros Bizjak <ubiz...@gmail.com> wrote:
>> On Mon, Jun 19, 2017 at 7:51 PM, Jakub Jelinek <ja...@redhat.com> wrote:
>>> On Mon, Jun 19, 2017 at 11:45:13AM -0600, Jeff Law wrote:
>>>> On 06/19/2017 11:29 AM, Jakub Jelinek wrote:
>>>> >
>>>> > Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
>>>> > while it is shorter, is it actually faster or as slow as movq $0, (%rsp)
>>>> > or movl $0, (%esp) ?
>>>> Florian raised this privately to me as well.  There's a couple of issues.
>>>>
>>>> 1. Is there a performance penalty/gain for sub-word operations?  If not,
>>>>    we can improve things slightly there.  Even if it's performance
>>>>    neutral, we can probably do better on code size.
>>>
>>> CCing Uros and Honza here, I believe there are at least on x86 penalties
>>> for 2-byte, maybe for 1-byte, and then sometimes some stalls when you
>>> write or read in a different size from a recent write or read.
>>
>> Don't use orq $0, (%rsp), as this is a high-latency RMW insn.
>
> Well, but _maybe_ it's optimized because ORing 0 never changes anything?
> At least it would be nice if it would only trigger the page-fault side effect
> and then not consume other CPU resources.
It doesn't look so:

--cut here--
void __attribute__ ((noinline)) test_or (void)
{
  volatile int a;
  unsigned int n;

  for (n = 0; n < (unsigned) -1; n++)
    asm ("orl $0, %0" : "+m" (a));
}

void __attribute__ ((noinline)) test_movb (void)
{
  volatile int a;
  unsigned int n;

  for (n = 0; n < (unsigned) -1; n++)
    asm ("movb $0, %0" : "+m" (a));
}

void __attribute__ ((noinline)) test_movl (void)
{
  volatile int a;
  unsigned int n;

  for (n = 0; n < (unsigned) -1; n++)
    asm ("movl $0, %0" : "+m" (a));
}

int main ()
{
  test_or ();
  test_movb ();
  test_movl ();

  return 0;
}
--cut here--

 74,99%  a.out  a.out  [.] test_or
 12,50%  a.out  a.out  [.] test_movb
 12,50%  a.out  a.out  [.] test_movl

Uros.
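[Editorial note: read against each other, the perf shares above say the orl $0 RMW probe loop accounted for roughly six times the cycles of either plain-store loop. A trivial sanity check of that ratio, using only the percentages from the profile:]

--cut here--
#include <stdio.h>

int main (void)
{
  /* Self-time shares taken from the perf profile above.  */
  double or_pct = 74.99;
  double movb_pct = 12.50;
  double movl_pct = 12.50;

  /* The or-based probe burned ~6x the cycles of a plain store.  */
  printf ("or/movb ratio: %.2f\n", or_pct / movb_pct);  /* 6.00 */
  printf ("or/movl ratio: %.2f\n", or_pct / movl_pct);  /* 6.00 */

  return 0;
}
--cut here--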