On Thu, Nov 13, 2014 at 1:03 PM, David Wohlferd <[email protected]> wrote:
> Sorry for the (very) delayed response. I'm still looking for feedback here
> so I can fix the docs.
>
> To refresh: The topic of conversation was the (extremely) wrong explanation
> that has been in the docs since forever about how to use memory constraints
> with inline asm to avoid the performance hit of a full memory clobber.
> Trying to understand how this really works has led to some surprising
> results.
>
> Me:
>>> While I really like the idea of using memory constraints to avoid all out
>>> memory clobbers, 16 bytes is a pretty small maximum memory block, and x86
>>> only supports a max of 8. Unless there's some way to use larger sizes
>>> (say SSIZE_MAX), this feature hardly seems worth documenting.
>
> Richard:
>> I wonder how you figured out that a 12 byte clobber performs a full
>> memory clobber?
>
> Here's the code (compiled with gcc version 4.9.0 x86_64-win32-seh-rev2,
> using -m64 -O2 -fdump-final-insns):
>
> --------------------
> #include <stdio.h>
>
> #define MYSIZE 3
>
> inline void
> __stosb(unsigned char *Dest, unsigned char Data, size_t Count)
> {
> struct _reallybigstruct { char x[MYSIZE]; }
> *p = (struct _reallybigstruct *)Dest;
>
> __asm__ __volatile__ ("rep stos{b|b}"
> : "+D" (Dest), "+c" (Count), "=m" (*p)
> : [Data] "a" (Data)
> //: "memory"
> );
> }
>
> int main()
> {
> unsigned char buff[100];
> buff[5] = 'A';
>
> __stosb(buff, 'B', sizeof(buff));
> printf("%c\n", buff[5]);
> }
> --------------------
>
> In summary:
>
> 1) Create a 100 byte buffer, and set buff[5] to 'A'.
> 2) Call __stosb, which uses inline asm to overwrite all of buff with 'B'.
> 3) Use a memory constraint in __stosb to flush buff. The size of the
> memory constraint is controlled by a #define.
>
> With this, I have a simple way to test various sizes of memory constraints
> to see if the buffer gets flushed. If it *is* flushing the buffer, printing
> buff[5] after __stosb will print 'B'. If it is *not* flushing, it will
> print 'A'.
>
> Results:
> - Since buff[5] is the 6th byte in the buffer, using memory constraint
> sizes of 1, 2 & 4 (not surprisingly) all print 'A', showing that no flush
> was done.
> - Sizes of 8 and 16 print 'B', showing that the flush was done. This is
> also the expected result, since I am now flushing enough of buff to include
> buff[5].
> - The surprise comes from using a size of 3 or 5. These also print 'B'.
> WTF? If 4 doesn't flush, why does 3?
>
> I believe the answer comes from reading the RTL. The difference between
> sizes of 3 and 16 comes here:
>
> (set (mem/c:TI (plus:DI (reg/f:DI 7 sp)
> (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S16
> A128])
> (asm_operands/v:TI ("rep stos{b|b}") ("=m") 2 [
>
> (set (mem/c:BLK (plus:DI (reg/f:DI 7 sp)
> (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S3
> A128])
> (asm_operands/v:BLK ("rep stos{b|b}") ("=m") 2 [
>
> While I don't actually speak RTL, TI clearly refers to TImode. Apparently
> when using a size that exactly matches a machine mode, asm memory references
> (on i386) can flush the exact number of bytes. But for other sizes, gcc
> seems to fall back to BLK mode, which doesn't.
>
> I don't know the exact meaning of BLK on a "set" or "asm_operands." Does it
> cause a full clobber? Or just a complete clobber of buff? Attempting to
> answer that question leads us to the second bit of code:
>
> --------------------
> #include <stdio.h>
>
> #define MYSIZE 8
>
> inline void
> __stosb(unsigned char *Dest, unsigned char Data, size_t Count)
> {
> struct _reallybigstruct { char x[MYSIZE]; }
> *p = (struct _reallybigstruct *)Dest;
>
> __asm__ __volatile__ ("rep stos{b|b}"
> : "+D" (Dest), "+c" (Count), "=m" (*p)
> : [Data] "a" (Data)
> //: "memory"
> );
> }
> int main()
> {
> unsigned char buff[100], buff2[100];
> buff[5] = 'A';
> buff2[5] = 'M';
> asm("#" : : "r" (buff2));
>
> __stosb(buff, 'B', sizeof(buff));
> printf("%c %c\n", buff[5], buff2[5]);
> }
> --------------------
>
> Here I've added a buff2, and I set buff2[5] to 'M' (aka ASCII 77), which I
> also print. I still perform the memory constraint against buff, then I
> check to see if it is affecting buff2.
>
> I start by compiling this with a size of 8 and look at the -S output. If
> this is NOT doing a full clobber, gcc should be able to just print buff2[5]
> by moving 77 into the appropriate register before calling printf. And
> indeed, that's what we see.
>
> /APP
> # 17 "mem2.cpp" 1
> rep stosb
> # 0 "" 2
> /NO_APP
> movzbl 37(%rsp), %edx
> movl $77, %r8d
> leaq .LC0(%rip), %rcx
> call printf
>
> If using a size of 3 *is* causing a full memory clobber, we would expect to
> see the value getting read from memory before calling printf. And indeed,
> that's also what we see.
>
> /APP
> # 17 "mem2.cpp" 1
> rep stosb
> # 0 "" 2
> /NO_APP
> movzbl 37(%rsp), %edx
> leaq .LC0(%rip), %rcx
> movzbl 149(%rsp), %r8d
>
> I don't know the internals of gcc well enough to understand exactly why this
> is happening. But from a user's point of view, it sure looks like a memory
> clobber.
>
> As I said before, triggering a full memory clobber for anything over 16
> bytes (and most sizes under 16 bytes) makes this feature all but useless.
> So if that's really what's happening, we need to decide what to do next:
>
> 1) Can this be "fixed?"
> 2) Do we want to doc the current behavior?
> 3) Or do we just remove this section?
>
> I think it could be a nice performance win for inline asm if it could be
> made to work right, but I have no idea what might be involved in that.
> Failing that, I guess if it doesn't work and isn't going to work, I'd
> recommend removing the text for this feature.
>
> Since all 3 suggestions require a doc change, I'll just say that I'm
> prepared to start work on the doc patch as soon as someone lets me know what
> the plan is.
>
> Richard? Hans-Peter? Your thoughts?
Just from a very, very quick look, here is what you missed:
> (set (mem/c:TI (plus:DI (reg/f:DI 7 sp)
> (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S16
> A128])
> (asm_operands/v:TI ("rep stos{b|b}") ("=m") 2 [
>
> (set (mem/c:BLK (plus:DI (reg/f:DI 7 sp)
> (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S3
> A128])
> (asm_operands/v:BLK ("rep stos{b|b}") ("=m") 2 [
The memory attributes: the first one has 'S16' (size 16 bytes), the
second has 'S3' (size 3 bytes). So the size information is clearly there.
It might be that RTL alias analysis / CSE give up too early here
(we don't optimize across asm() on the GIMPLE level at all ... heh).
I didn't look at where it gives up (even though apparently it does).
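
For illustration only (this is not code from either message): a minimal
sketch of giving the asm a memory operand whose size is the whole region
being written, rather than a fixed-size struct. The helper name fill_bytes
and the memset fallback for non-x86 targets are assumptions made for this
example. It relies on a cast to a pointer to a variable-length array, which
is C only. Whether GCC then leaves unrelated memory alone is exactly the open
question in this thread, and the sketch does not settle it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): fill count bytes at dest. */
static inline void
fill_bytes(unsigned char *dest, unsigned char data, size_t count)
{
    assert(count > 0);  /* a zero-length variable-length array type is undefined */

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned char *d = dest;  /* rep stosb advances this pointer ... */
    size_t n = count;         /* ... and counts this down to zero */

    __asm__ __volatile__ ("rep stosb"
        : "+D" (d), "+c" (n),
          /* Output operand typed as unsigned char[count]: it names every
             byte the instruction writes, so no "memory" clobber is listed. */
          "=m" (*(unsigned char (*)[count]) dest)
        : "a" (data));
#else
    memset(dest, data, count);  /* portable fallback for other targets */
#endif
}
```

In the first test program this would be called as
fill_bytes(buff, 'B', sizeof(buff)) in place of __stosb. The "mem2.cpp" name
in the -S output suggests the tests were built as C++, where this cast is not
available, so the file would need to be compiled as C.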
Richard.
> Thanks,
> dw