Hello dw,

I tried my best to read through your long e-mail ... (I want to get
the kudos ;) ).  And thank you for reviewing this inline assembler -
it is much appreciated.  Most of this stuff was implemented in an "it
works for me" fashion, but the code never had a general review.

To begin with, I want to comment on your general statement about
volatile.  I disagree here.
The quote from http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html states:

If your assembler instructions access memory in an unpredictable
fashion, add ‘memory’ to the list of clobbered registers. This causes
GCC to not keep memory values cached in registers across the assembler
instruction and not optimize stores or loads to that memory. You also
should add the volatile keyword if the memory affected is not listed
in the inputs or outputs of the asm, as the ‘memory’ clobber does not
count as a side-effect of the asm. If you know how large the accessed
memory is, you can add it as input or output but if this is not known,
you should add ‘memory’.

Therefore the use of volatile is a way to minimize the number of
constraints, and it certainly doesn't hurt.  Optimizing the
constraints is of course welcome, but then we need to be sure that all
required ones are provided.
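
To make that rule concrete, here is a minimal sketch of my own (not
from winnt.h) of an asm that stores through a pointer which is not
listed in the operands - per the documentation it needs both the
"memory" clobber and volatile:

    static __inline__ void store_to (int *p, int v)
    {
      /* The store to *p is invisible in the operand lists, so "memory"
         tells gcc not to cache *p in registers across the asm, and
         __volatile__ keeps the asm alive although it has no outputs.  */
      __asm__ __volatile__ ("movl %1,(%0)" : : "r" (p), "r" (v) : "memory");
    }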


2013/3/28 dw <[email protected]>:
> I have been trying to port an x64 application compiled with MSVC to gcc.
> More or less, things have been working, but I have run into a few problems.
> Most of those problems involve the inline assembler in winnt.h.  Some
> routines are inefficient, and a number just plain return the wrong answer.
>
> IMPORTANT: I'm NOT asking anyone to fix them!

Well, thank you for taking on this job and for providing patches for
it.  Please don't forget to also provide patches for the intrinsic
implementations in the crt.  They share the same assembly code, and we
should keep them in sync with the inline versions.

> I have been repairing/improving each of these routines as I go.  So I
> suppose at this point I'm looking for someone to:
>
> a) Review what I'm proposing, and verify that it makes things better.
Well, I am doing so.
> b) Suggest any improvements.
I did so some lines above, and I may provide further suggestions later
on in this mail.
> c) Check the changes into the tree.
Well, if I am satisfied with the patches you send to this ML, then I
approve your change and the patch (with ChangeLog) goes into trunk.
Whether patches get merged back to the 2.x branch is up to Ozkan.  We
will see.

> I'm sorry this message is so long, but I felt the need to show the existing
> code, my "improved" code, and to describe what I changed and why.  Also
> someone (you know who you are) suggested doing this as one big blast might
> ultimately work better than flooding the list with a bunch of smaller posts.

Agreed, and that is fine.

> So, my first problem was with the BitScanForward routine.  Put simply, it
> was returning the wrong answer.  I raised this point in the sf forum, and
> ktietz70 promptly fixed it for me, along with the other 3 variations of
> BitScan* that had the same issue.  However, I don't think the fix is quite
> right either.  While it produces the right answer, I believe it does so in
> an unnecessarily inefficient manner.  Specifically, this code:
>
>     __CRT_INLINE BOOLEAN _BitScanForward(DWORD *Index,DWORD Mask) {
>       DWORD n;
>       __asm__ __volatile__("bsfl %0,%1" : "+r" (Mask),"=rm" (n) : :
> "memory");
>       *Index = n;
>       return Mask!=0;
>     }
>
> We see here the __volatile__ qualifier and the memory clobber.  The
> __volatile__ qualifier tells the compiler to always include the assembler in
> the executable, even if the outputs are never being used (presumably there
> is some important side effect in the asm).  It also tells the compiler not
> to move the asm around too much (whatever that means).  However, I submit
> that neither applies here.

That's not quite true.  See the explanation I provided at the top of
this mail.  If we removed volatile here, we would need to specify the
input also as an output constraint.  To avoid that, volatile is IMHO
the better choice and makes things easier to read.
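
For concreteness, the volatile-free spelling I mean would look
something like this (untested sketch):

    __CRT_INLINE BOOLEAN _BitScanForward (DWORD *Index, DWORD Mask) {
      DWORD n;
      /* Without __volatile__ the asm must stay alive through its
         operands alone, so Mask is tied in as input and output.  */
      __asm__ ("bsfl %0,%1" : "+r" (Mask), "=r" (n) : : "cc");
      *Index = n;
      return Mask != 0;
    }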

> Similarly, the "memory" clobber tells the compiler that all values stored in
> registers must be flushed to memory before executing the assembler, and will
> need to be re-loaded after the assembler is complete.  While I believe this
> is *essential* for all Interlocked* routines, I see no reason to incur that
> cost here.

Well, the "memory" clobber is essential here because we modify memory
(the stack variable).  The problem with this inline assembler - and
this applies to many of your other comments too - is that we need to
express it in a general way for *all* possible operand kinds.

> Lastly, the code describes Mask as an output, when I believe it is an input.

See the description of the meaning of "volatile" above.

> I would propose:
>
>     __CRT_INLINE BOOLEAN _BitScanForward(DWORD *Index, DWORD Mask)
>     {
>       __asm__ ("bsfl %1,%0"
>          : "=r" (*Index)
>          : "r" (Mask)
>          : "cc");
>       return Mask!=0;
>     }

Well, the "cc" clobber might be of interest for the clobber list, but
for the general case the "memory" clobber is required.  Secondly, this
code would force all operands to be copied into registers, which is in
a lot of cases just inefficient.  And volatile should be specified so
that it is clear that Mask is both an input and an output operand.
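
If you want the operand flexibility back, something like the following
keeps volatile and "memory" while letting Mask live in a register or
in memory (untested sketch):

    __CRT_INLINE BOOLEAN _BitScanForward (DWORD *Index, DWORD Mask) {
      DWORD n;
      /* "rm" lets gcc pick register or memory for Mask; the bsf
         destination has to be a register, hence "=r" for n.  */
      __asm__ __volatile__ ("bsfl %1,%0"
        : "=r" (n)
        : "rm" (Mask)
        : "memory", "cc");
      *Index = n;
      return Mask != 0;
    }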

> and similarly with the other BitScan methods:
>
>     __CRT_INLINE BOOLEAN _BitScanReverse(DWORD *Index, DWORD Mask)
>     {
>       __asm__ ("bsrl %1,%0"
>          : "=r" (*Index)
>          : "r" (Mask)
>          : "cc");
>       return Mask!=0;
>     }
>     __CRT_INLINE BOOLEAN _BitScanForward64(DWORD64 *Index, DWORD64 Mask)
>     {
>       __asm__ ("bsfq %1,%0"
>          : "=r" (*Index)
>          : "r" (Mask)
>          : "cc");
>       return Mask!=0;
>     }
>     __CRT_INLINE BOOLEAN _BitScanReverse64(DWORD64 *Index, DWORD64 Mask)
>     {
>       __asm__ ("bsrq %1,%0"
>          : "=r" (*Index)
>          : "r" (Mask)
>          : "cc");
>       return Mask!=0;
>     }

See above.

> The second thing I noticed was that the _bittest method was also returning
> the wrong answer.  The current code is:
>
>     __CRT_INLINE BOOLEAN _bittest(LONG const *Base,LONG Offset) {
>       int old = 0;
>       __asm__ __volatile__("btl %2,%1\n\tsbbl %0,%0 "
>     :"=r" (old),"=m" ((*(volatile __LONG32 *) Base))
>     :"Ir" (Offset));
>       return (BOOLEAN) (old!=0);
>     }
>
> Again with the __volatile__ qualifier.  I don't believe the volatile cast is
> necessary either.  And Base (which is a const) is listed as an output that
> is being overwritten.  Also, this seems like an unnecessarily awkward way to
> return the result.  I suppose my change to the return value *could* be a
> problem if someone is (foolishly) checking for a specific return value,
> instead of just comparing with FALSE.  My proposal:
>
>     __CRT_INLINE BOOLEAN _bittest(const LONG *Base, LONG Offset)
>     {
>       BOOLEAN old;
>       __asm__ ("btl %1,%2\n\tsbbb %0,%0"
>         : "=r" (old)
>         : "Ir" (Offset), "rm" (*Base)
>         : "cc");
>       return old;
>     }

I prefer to have an explicit comparison in the return statement here.
Secondly, old has to have the proper operand width for the "bt"
instruction, so making it BOOLEAN is only valid for the "l" variant.
And again, the "memory" clobber and volatile are necessary.  The
addition of the "cc" clobber is ok.
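
A width-neutral alternative that keeps my points would be to
materialize the carry flag with setc instead of sbb (again just an
untested sketch):

    __CRT_INLINE BOOLEAN _bittest (const LONG *Base, LONG Offset) {
      BOOLEAN old;
      /* setc always writes one byte, regardless of whether the bt was
         the "l" or the "q" variant.  "=q" requests a register with a
         byte subregister.  */
      __asm__ __volatile__ ("btl %2,%1\n\tsetc %0"
        : "=q" (old)
        : "m" (*Base), "Ir" (Offset)
        : "memory", "cc");
      return old != 0;
    }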

> Similarly for the other non-interlocked bitmap methods:
>
>     __CRT_INLINE BOOLEAN _bittestandset(LONG *Base, LONG Offset)
>     {
>       BOOLEAN old;
>       __asm__ ("btsl %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+rm" (*Base)
>         : "Ir" (Offset)
>         : "cc");
>       return old;
>     }
>     __CRT_INLINE BOOLEAN _bittestandreset(LONG *Base, LONG Offset)
>     {
>       BOOLEAN old;
>       __asm__ ("btrl %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+rm" (*Base)
>         : "Ir" (Offset)
>         : "cc");
>       return old;
>     }
>     __CRT_INLINE BOOLEAN _bittestandcomplement(LONG *Base, LONG Offset)
>     {
>       BOOLEAN old;
>       __asm__ ("btcl %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+rm" (*Base)
>         : "Ir" (Offset)
>         : "cc");
>       return old;
>     }
>     __CRT_INLINE BOOLEAN _bittest64(const LONG64 *Base, LONG64 Offset)
>     {
>       BOOLEAN old;
>       __asm__ ("btq %1,%2\n\tsbbb %0,%0"
>         : "=r" (old)
>         : "Ir" (Offset), "rm" (*Base)
>         : "cc");
>       return old;
>     }
>     __CRT_INLINE BOOLEAN _bittestandset64(LONG64 *Base, LONG64 Offset)
>     {
>       BOOLEAN old;
>       __asm__ ("btsq %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+rm" (*Base)
>         : "Ir" (Offset)
>         : "cc");
>       return old;
>     }
>     __CRT_INLINE BOOLEAN _bittestandreset64(LONG64 *Base, LONG64 Offset)
>     {
>       BOOLEAN old;
>       __asm__ ("btrq %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+rm" (*Base)
>         : "Ir" (Offset)
>         : "cc");
>       return old;
>     }
>     __CRT_INLINE BOOLEAN _bittestandcomplement64(LONG64 *Base, LONG64
> Offset)
>     {
>       BOOLEAN old;
>       __asm__ ("btcq %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+rm" (*Base)
>         : "Ir" (Offset)
>         : "cc");
>       return old;
>     }

See above.

In general it is much better practice to provide little testcases that
demonstrate the "wrong" or "inefficient" results.  These testcases can
later be used to verify that we don't break things again if somebody
else applies "fixes" to already tested code.
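
Something along these lines would already help (a hypothetical
bitscan-test.c; the file name is invented, it is not in the tree):

    /* bitscan-test.c - tiny smoke test for the _BitScan* fixes.  */
    #include <assert.h>
    #include <windows.h>

    int main (void)
    {
      DWORD idx = 0;
      /* bit 4 is the lowest set bit of 0x30.  */
      assert (_BitScanForward (&idx, 0x30) != 0 && idx == 4);
      /* bit 5 is the highest set bit of 0x30.  */
      assert (_BitScanReverse (&idx, 0x30) != 0 && idx == 5);
      /* a zero mask has to report failure.  */
      assert (_BitScanForward (&idx, 0) == 0);
      return 0;
    }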

> Having experienced problems with every asm routine I'd examined so far
> (0/5), I decided I needed to read all the asm routines in winnt.h.  So, that
> brings us to this:
>
>     __CRT_INLINE BOOLEAN _interlockedbittestandset(LONG *Base,LONG Offset) {
>       int old = 0;
>       __asm__ __volatile__("lock ; btsl %2,%1\n\tsbbl %0,%0 "
>     :"=r" (old),"=m" ((*(volatile __LONG32 *) Base))
>     :"Ir" (Offset));
>       return (BOOLEAN) (old!=0);
>     }
>
> Unlike the BitTest* and BitScan* routines above, I believe all Interlocked
> routines *must* use both volatile and the "memory" clobber (which this code
> doesn't do).  Also, rather than use the volatile cast, I have changed the
> code to be consistent with the Windows declaration by using the volatile
> qualifier on the parameter, and cleaned up the return value:

Well, as I have noted above several times, "volatile" on an assembly
statement has a different meaning.  It is a common misbelief that
"volatile" means something like "interlocked".  On a variable it is
just an indicator to the compiler that the variable/expression might
change outside of the current scope and therefore always has to be
accessed via its memory address.
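
A small illustration to keep the two meanings apart (my sketch; the
macro name is invented):

    /* C-level volatile: the object may change behind the compiler's
       back, so every access must really go through memory.  */
    volatile LONG shared_counter;

    /* asm-level volatile: the statement has side effects and must not
       be deleted or reordered away, even with no outputs.  The empty
       asm below is the classic compiler-only barrier.  */
    #define COMPILER_BARRIER() __asm__ __volatile__ ("" : : : "memory")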

That these functions don't specify the "memory" clobber seems to me
indeed something we should fix.  The "cc" clobber here is ok.  The
point about the "cc" clobber is right now not of much importance,
because gcc doesn't track condition codes across an asm statement
anyway.  But maybe a future gcc will get that feature, so it doesn't
hurt.

>     __CRT_INLINE BOOLEAN InterlockedBitTestAndSet(volatile LONG *Base, LONG
> Offset)
>     {
>       BOOLEAN old;
>       __asm__ __volatile__("lock ; btsl %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+m" (*Base)
>         : "Ir" (Offset)
>         : "memory", "cc");
>       return old;
>     }
>
> Similarly for the other InterlockedBitTest* routines:
>
>     __CRT_INLINE BOOLEAN InterlockedBitTestAndReset(volatile LONG *Base,
> LONG Offset)
>     {
>       BOOLEAN old;
>       __asm__ __volatile__("lock ; btrl %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+m" (*Base)
>         : "Ir" (Offset)
>         : "memory", "cc");
>       return old;
>     }
>     __CRT_INLINE BOOLEAN InterlockedBitTestAndComplement(volatile LONG
> *Base, LONG Bit)
>     {
>       BOOLEAN old;
>       __asm__ __volatile__("lock ; btcl %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+m" (*Base)
>         : "Ir" (Bit)
>         : "memory", "cc");
>       return old;
>     }
>     __CRT_INLINE BOOLEAN InterlockedBitTestAndSet64(volatile LONG64 *Base,
> LONG64 Offset)
>     {
>       BOOLEAN old;
>       __asm__ __volatile__("lock ; btsq %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+m" (*Base)
>         : "Ir" (Offset)
>         : "memory", "cc");
>       return old;
>     }
>     __CRT_INLINE BOOLEAN InterlockedBitTestAndReset64(volatile LONG64 *Base,
> LONG64 Offset)
>     {
>       BOOLEAN old;
>       __asm__ __volatile__("lock ; btrq %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+m" (*Base)
>         : "Ir" (Offset)
>         : "memory", "cc");
>       return old;
>     }
>     __CRT_INLINE BOOLEAN InterlockedBitTestAndComplement64(volatile LONG64
> *Base, LONG64 Offset)
>     {
>       BOOLEAN old;
>       __asm__ __volatile__("lock ; btcq %2,%1\n\tsbbb %0,%0"
>         : "=r" (old), "+m" (*Base)
>         : "Ir" (Offset)
>         : "memory", "cc");
>       return old;
>     }

The same applies to these as above.

> And while we are talking about the InterlockedBitTest* routines, there
> doesn't seem to be much consistency in the casing (ie
> _interlockedbittestandreset vs InterlockedBitTestAndReset).  While sometimes
> these are #defined to map one to the other, both InterlockedBitTestAndReset
> and _interlockedbittestandreset have (identical) function bodies.  I'm not
> sure what the standard is here, so my names may need to be adjusted.

Well, that isn't an issue of mingw-w64; it is an issue of how this
stuff was defined by MS.  So don't touch them - please don't even
think about changing the signature names.
And if you think they should change, then please provide proper MSDN
documentation to back up your assumption.

> Probably the next most serious issue I found has to do with the
> InterlockedAnd/InterlockedOr/InterlockedXor functions.  They perform the
> correct operation, but then they return the wrong value.  This code:
>
>     __CRT_INLINE LONG InterlockedAnd(LONG volatile *Destination,LONG Value)
> {
>       __asm__ __volatile__("lock ; andl %0,%1"
>     : :"r"(Value),"m"(*Destination)
>     : "memory");
>       return *Destination;
>     }
>
> returns the new value after applying the "and" (unless some other thread has
> updated the value *again* between the "and" and the read, which could very
> well happen).  What's more, MSVC programmers expect this will return the
> *old* value that was actually in the memory when the interlocked instruction
> executed.  This is harder than it sounds.  Unlike "lock ; xadd" which is
> used for InterlockedIncrement, the "lock ; and" does *not* return the
> previous value in any of the registers.  I struggled with this, because if
> you want to (read and return the value) PLUS (perform the operation) both as
> a single atomic action, the best I could come up with was:
>
>     __CRT_INLINE LONG InterlockedAnd(volatile LONG *Destination, LONG Value)
>     {
>       LONG lPrev;
>
>       do {
>          lPrev = *Destination;
>       } while (InterlockedCompareExchange(Destination, lPrev & Value, lPrev)
> != lPrev);
>
>       return lPrev;
>     }
>
> Not tidy, I know.  But, unlike the old declaration, it does dependably
> return the right answer.  And when the decision is between "pretty code" and
> "right answer"...  If someone has a cleaner answer here, speak up.
> Similarly for the other routines:

Yes, agreed.  These routines indeed need some adjustment.  I would
welcome a patch for review for the and/or/xor interlocked variants to
address these issues.
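
If somebody prefers to keep it in inline assembly, the same can be
done with a compare-exchange loop.  Just a sketch (untested) of what
such a patch could look like:

    __CRT_INLINE LONG InterlockedAnd (volatile LONG *Destination, LONG Value)
    {
      LONG prev, desired;
      __asm__ __volatile__ (
        "movl %1,%0\n"               /* prev = *Destination           */
        "1:\tmovl %0,%2\n\t"         /* desired = prev                */
        "andl %3,%2\n\t"             /* desired &= Value              */
        "lock ; cmpxchgl %2,%1\n\t"  /* try to install desired; on
                                        failure eax reloads the value */
        "jnz 1b"                     /* lost the race?  retry         */
        : "=&a" (prev), "+m" (*Destination), "=&r" (desired)
        : "r" (Value)
        : "memory", "cc");
      return prev;
    }

The early clobbers ("=&a", "=&r") matter here; without them gcc could
assign Value to one of the scratch registers.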

>     __CRT_INLINE LONG InterlockedOr(volatile LONG *Destination, LONG Value)
>     {
>       LONG lPrev;
>
>       do {
>          lPrev = *Destination;
>       } while (InterlockedCompareExchange(Destination, lPrev | Value, lPrev)
> != lPrev);
>
>       return lPrev;
>     }
>     __CRT_INLINE LONG InterlockedXor(volatile LONG *Destination, LONG Value)
>     {
>       LONG lPrev;
>
>       do {
>          lPrev = *Destination;
>       } while (InterlockedCompareExchange(Destination, lPrev ^ Value, lPrev)
> != lPrev);
>
>       return lPrev;
>     }
>     __CRT_INLINE LONG64 InterlockedAnd64(volatile LONG64 *Destination,
> LONG64 Value)
>     {
>       LONG64 lPrev;
>
>       do {
>          lPrev = *Destination;
>       } while (InterlockedCompareExchange64(Destination, lPrev & Value,
> lPrev) != lPrev);
>
>       return lPrev;
>     }
>     __CRT_INLINE LONG64 InterlockedOr64(volatile LONG64 *Destination, LONG64
> Value)
>     {
>       LONG64 lPrev;
>
>       do {
>          lPrev = *Destination;
>       } while (InterlockedCompareExchange64(Destination, lPrev | Value,
> lPrev) != lPrev);
>
>       return lPrev;
>     }
>     __CRT_INLINE LONG64 InterlockedXor64(volatile LONG64 *Destination,
> LONG64 Value)
>     {
>       LONG64 lPrev;
>
>       do {
>          lPrev = *Destination;
>       } while (InterlockedCompareExchange64(Destination, lPrev ^ Value,
> lPrev) != lPrev);
>
>       return lPrev;
>     }
>
> The next group is InterlockedExchange.  Target is declared as an input, not
> an output.  Probably harmless given the memory clobber, but still.  Also,
> instead of using __CRT_INLINE like all the other routines do, this uses
> __MINGW_INTRIN_INLINE.  I don't know what the difference is supposed to be
> here?  And when did we start using WINAPI?  Is there some standard here that
> should apply to all these routines?

It isn't the memory clobber; it is the volatile that makes sure the
input is also treated as an output.

Well, __MINGW_INTRIN_INLINE is a helper macro making sure things
always get inlined.  This special macro is necessary for cygwin, which
needs those intrinsics that way.  Our default inline macro is turned
off for cygwin, so we can't use it here.
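
For reference, if I remember the definition in _mingw.h correctly, it
expands to something like:

    #define __MINGW_INTRIN_INLINE extern __inline__ \
      __attribute__ ((__always_inline__, __gnu_inline__))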

>   __MINGW_INTRIN_INLINE LONG WINAPI InterlockedExchange(LONG volatile
> *Target,LONG Value) {
>     __asm__ __volatile__ ("lock ; xchgl %0,%1"
>         : "=r"(Value)
>         : "m"(*Target),"0"(Value)
>         : "memory");
>     return Value;
>   }
>
> My code:
>
>     __CRT_INLINE LONG InterlockedExchange(volatile LONG *Target, LONG Value)
>     {
>       __asm__ __volatile__ ("lock ; xchgl %1,%0"
>         : "+m" (*Target), "+r" (Value)
>         :
>         : "memory");
>       return Value;
>     }
>     __CRT_INLINE LONG64 InterlockedExchange64(volatile LONG64 *Target,
> LONG64 Value)
>     {
>       __asm__ __volatile("lock ; xchgq %1,%0"
>         : "+m" (*Target), "+r" (Value)
>         :
>         : "memory");
>       return Value;
>     }
>
> Which brings us to InterlockedExchangePointer.
>
>     __CRT_INLINE PVOID InterlockedExchangePointer(PVOID volatile
> *Target,PVOID Value) {
>       __asm__ __volatile("lock ; xchgq %0,%1"
>     : "=r"(Value)
>     : "m"(*Target),"0"(Value)
>     : "memory");
>       return Value;
>     }

Well, as said above, I don't see a need to change the code here;
volatile does the trick pretty well.  Removing the output operand is
IMHO even an inconsistency, and here a real bug due to the missing
"volatile".
Btw, are you aware that the "lock" prefix isn't really necessary for
the "xchg" instruction?  xchg with a memory operand gets the
lock-signal set automatically by the cpu ... but well, for consistency
and ease of reading I kept the "lock" prefix.

> Again, Target is not listed as an output:
>
>     __CRT_INLINE PVOID InterlockedExchangePointer(volatile PVOID *Target,
> PVOID Value)
>     {
>       __asm__ __volatile("lock ; xchgq %1,%0"
>         : "+m" (*Target), "+r" (Value)
>         :
>         : "memory");
>       return Value;
>     }
>
> Also, while InterlockedExchangePointer is wrapped in #ifdef _AMD64_, should
> there be a 32bit version?  Similarly for InterlockedCompareExchangePointer
> below.
>
> Next we have the InterlockedCompareExchange family.  Destination is not
> listed as an output again.  Old:

Sure, it is volatile.

>     __CRT_INLINE SHORT InterlockedCompareExchange16(SHORT volatile
> *Destination,SHORT ExChange,SHORT Comperand) {
>       SHORT prev;
>       __asm__ __volatile__("lock ; cmpxchgw %w1,%2"
>     :"=a"(prev)
>     :"q"(ExChange), "m"(*Destination), "0"(Comperand)
>     : "memory");
>       return prev;
>     }
>
> Mine:
>
>     __CRT_INLINE SHORT InterlockedCompareExchange16(volatile SHORT
> *Destination, SHORT ExChange, SHORT Comperand)
>     {
>       SHORT prev;
>       __asm__ __volatile__("lock ; cmpxchgw %w2,%1"
>         : "=a" (prev), "+m" (*Destination)
>         : "r" (ExChange), "0" (Comperand)
>         : "memory", "cc");
>       return prev;
>     }
>     __CRT_INLINE LONG InterlockedCompareExchange(volatile LONG *Destination,
> LONG ExChange, LONG Comperand)
>     {
>       LONG prev;
>       __asm__ __volatile__("lock ; cmpxchgl %2,%1"
>         : "=a" (prev), "+m" (*Destination)
>         : "r" (ExChange), "0" (Comperand)
>         : "memory", "cc");
>       return prev;
>     }
>     __CRT_INLINE LONG64 InterlockedCompareExchange64(volatile LONG64
> *Destination, LONG64 ExChange, LONG64 Comperand)
>     {
>       LONG64 prev;
>       __asm__ __volatile__("lock ; cmpxchgq %2,%1"
>         : "=a" (prev), "+m" (*Destination)
>         : "r" (ExChange), "0" (Comperand)
>         : "memory", "cc");
>       return prev;
>     }
>     __CRT_INLINE PVOID InterlockedCompareExchangePointer(volatile PVOID
> *Destination, PVOID ExChange, PVOID Comperand)
>     {
>       PVOID prev;
>       __asm__ __volatile__("lock ; cmpxchgq %2,%1"
>         : "=a" (prev), "+m" (*Destination)
>         : "r" (ExChange), "0" (Comperand)
>         : "memory", "cc");
>       return prev;
>     }
>
> For the InterlockedIncre/Decrement routines, my code is the same (other than
> formatting).  Well, that and again with the __MINGW_INTRIN_INLINE thing.

As explained above, the __MINGW_INTRIN_INLINE thing is required for
our user cygwin.  So please keep it.

>     __CRT_INLINE SHORT InterlockedIncrement16(volatile SHORT *Addend)
>     {
>       SHORT ret = 1;
>       __asm__ __volatile__("lock\n\txaddw %0,%1"
>         : "+r" (ret), "+m" (*Addend)
>         :
>         : "memory", "cc");
>       return ret + 1;
>     }
>     __CRT_INLINE SHORT InterlockedDecrement16(volatile SHORT *Addend)
>     {
>       SHORT ret = -1;
>       __asm__ __volatile__("lock\n\txaddw %0,%1"
>         : "+r" (ret), "+m" (*Addend)
>         :
>         : "memory", "cc");
>       return ret - 1;
>     }
>     __CRT_INLINE LONG64 InterlockedIncrement64(volatile LONG64 *Addend)
>     {
>       LONG64 ret = 1LL;
>       __asm__ __volatile__ ("lock\n\txaddq %0,%1"
>         : "+r" (ret), "+m" (*Addend)
>         :
>         : "memory", "cc");
>       return ret + 1LL;
>     }
>     __CRT_INLINE LONG64 InterlockedDecrement64(volatile LONG64 *Addend)
>     {
>       LONG64 ret = -1LL;
>       __asm__ __volatile__ ("lock\n\txaddq %0,%1"
>         : "+r" (ret), "+m" (*Addend)
>         :
>         : "memory", "cc");
>       return ret - 1LL;
>     }
>     __CRT_INLINE LONG InterlockedIncrement(volatile LONG *Addend)
>     {
>       LONG ret = 1;
>       __asm__ __volatile__ ("lock\n\txaddl %0,%1"
>         : "+r" (ret), "+m" (*Addend)
>         :
>         : "memory", "cc");
>       return ret + 1;
>     }
>     __CRT_INLINE LONG InterlockedDecrement(volatile LONG *Addend)
>     {
>       LONG ret = -1;
>       __asm__ __volatile__ ("lock\n\txaddl %0,%1"
>         : "+r" (ret), "+m" (*Addend)
>         :
>         : "memory", "cc");
>       return ret - 1;
>     }
>
> I also don't like the existing definitions for __faststorefence and
> MemoryBarrier:
>
>     __MINGW_INTRIN_INLINE void __faststorefence(void) {
>       __asm__ __volatile__ ("" ::: "memory");
>     }
>   __CRT_INLINE VOID MemoryBarrier(VOID)
>   {
>     LONG Barrier = 0;
>     __asm__ __volatile__("xchgl %%eax,%0 "
>       :"=r" (Barrier));
>   }
>
> Unlike the Windows version which affects the processor, this
> __faststorefence only affects the compiler.  And while any instruction with
> a lock prefix causes a fence, this MemoryBarrier code doesn't *use* a lock
> prefix (so I'm not sure it does anything at all).  Plus, there is anecdotal
> evidence that this isn't the most performant approach anyway.  How about:
>
>     __CRT_INLINE void __faststorefence(void)
>     {
>       __asm__ __volatile__ ("sfence" ::: "memory");
>     }
> #define MemoryBarrier __faststorefence

Ok, I am fine with this.  A small testcase for that would be appreciated.
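
Regarding the performance point: as far as I know, the spelling often
measured fastest for a full fence - and reportedly what MSVC emits for
__faststorefence - is a locked read-modify-write on the stack.  A
sketch (x86-64 only, untested):

    __CRT_INLINE void __faststorefence (void)
    {
      /* A locked RMW on an address already in cache serializes loads
         and stores like mfence, usually at lower cost.  */
      __asm__ __volatile__ ("lock ; orl $0,(%%rsp)" : : : "memory", "cc");
    }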

> That's almost all the asm in the file.  For the ones that are left:
>
> __stos* - I didn't muck with these, but I'd be tempted to remove the
> __volatile__ and add a "memory" clobber.
Please don't do that ...

> __readgs* - Since I'm not a kernel mode developer, I didn't muck with these
> either.  However, the fact that Offset is listed as an output rather than an
> input seems suspicious.
well ... volatile ...
> __writegs* - Still not a kernel developer, but again, no inputs.  Also, the
> format here looks backward?
hmm, well, it works as desired ...
> NtCurrentTeb, GetCurrentFiber, GetFiberData - I don't believe any of these
> need "volatile".
They need it ...

> Ok, that's it, I'm done.  Kudos to anyone who made it this far down the
> message.  I have links, docs and other justifications I can offer if needed,
> but this post is already long enough.  Ask your questions and I'll see what
> I can do.
>
> dw

Thanks
Kai
