Re: Peephole optimisation: isWhitespace()

2020-08-16 Thread Stefan Kanthak
"Nathan Sidwell"  wrote:

> On 8/14/20 12:43 PM, Stefan Kanthak wrote:
>> Hi @ll,
>> 
>> in his ACM queue article,
>> Matt Godbolt used the function
>> 
>> | bool isWhitespace(char c)
>> | {
>> | return c == ' '
>> |   || c == '\r'
>> |   || c == '\n'
>> |   || c == '\t';
>> | }
>> 
>> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> processors (see ):
>> 
>> |    xor    eax, eax          ; result = false
>> |    cmp    dil, 32           ; is c > 32?
>> |    ja     .L4               ; if so, exit with false
>> |    movabs rax, 4294977024   ; rax = 0x100002600
>> |    shrx   rax, rax, rdi     ; rax >>= c
>> |    and    eax, 1            ; result = rax & 1
>> |.L4:
>> |    ret
>> 
>> This code is anything but optimal!
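
For reference, the bitmask test in the assembly above can be written
directly in C. A minimal sketch (the function name is illustrative);
the constant 0x100002600 has bits 9, 10, 13 and 32 set, one per
accepted character:

#include <stdbool.h>

/* 0x100002600 = (1ULL << ' ') | (1ULL << '\r') | (1ULL << '\n') | (1ULL << '\t') */
bool isWhitespace_bitmask(char c)
{
    unsigned char u = (unsigned char) c;
    if (u > 32)                         /* the cmp/ja pair   */
        return false;
    return (0x100002600ULL >> u) & 1;   /* the shrx/and pair */
}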
> 
> What evidence do you have that your alternative sequence performs
> better?

45+ years experience in writing assembly code!

> Have you benchmarked it?

Of course! Did you?
I didn't include the numbers in my initial post since I don't have
a processor which supports BMI2 and thus can't run the original code.
I benchmarked the following equivalent code (input character is in
ECX instead of EDI):

GCC:                        x64:                      x86:
    xor    eax, eax         movabs rax, 100002600h    mov    eax, 2600h
    cmp    cl, 32                                     test   cl, cl
    ja     L4                                         setnz  al
    movabs rax, 100002600h  shr    rax, cl            shr    eax, cl
    shr    rax, cl          xor    edx, edx           xor    edx, edx
    and    eax, 1           cmp    ecx, 33            cmp    ecx, 33
                            setb   dl                 setb   dl
L4:                         and    eax, edx           and    eax, edx
    ret                     ret                       ret

Left column: 1 billion sequential characters, i.e.
    for (int i = 1000000000; i; --i) ...(i);
right column: 1 billion random characters. Results in cycles per character:

       sequential  random
GCC:      2.4       3.4
x64:      3.0       2.5
x86:      4.3       4.5
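
A measurement loop of that shape might look as follows in C -- a sketch
only, assuming the TSC approximates core cycles, that isWhitespace is
the routine under test (linked in separately), and ignoring the
overhead of rand() in the random case:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>                      /* __rdtsc() */

extern bool isWhitespace(char c);           /* routine under test */

int main(void)
{
    enum { N = 1000000000 };
    volatile bool sink = false;             /* keeps the calls alive */

    uint64_t t0 = __rdtsc();
    for (int i = N; i; --i)
        sink = isWhitespace((char) i);      /* sequential characters */
    uint64_t t1 = __rdtsc();
    printf("sequential: %.1f cycles/char\n", (double) (t1 - t0) / N);

    t0 = __rdtsc();
    for (int i = N; i; --i)
        sink = isWhitespace((char) rand()); /* random characters */
    t1 = __rdtsc();
    printf("random:     %.1f cycles/char\n", (double) (t1 - t0) / N);

    return 0;
}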

> (I tried, but your code doesn't assemble)

Wrong: YOUR code doesn't assemble, mine does! And it works well too.

> It is more instructions

Because I dared to show code for the old(er) i386 alias x86 processor,
not for the AMD64 alias x86_64.
I expect people commenting on assembly to understand at least a little
bit of it, to get the idea of the code I posted first, and then
eventually be able to come up with the following x86_64 code (without a
fancy SHRX, i.e. for ALL AMD64 processors, not just those supporting BMI2):

     mov    ecx, edi
     movabs rax, 100002600h   ; rax = (1 << ' ') | (1 << '\r') | (1 << '\n') | (1 << '\t')
     shr    rax, cl           ; rax >>= c
     xor    edi, edi
     cmp    ecx, 33           ; CF = c <= ' '
.if 0
     adc    edi, edi          ; edi = (c <= ' ')
.else
     setb   dil
.endif
     and    eax, edi          ; result = rax & (c <= ' ')
     ret
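
In C this branchless computation can be sketched as follows (the "& 63"
mirrors x86's implicit masking of 64-bit shift counts; the function
name is illustrative):

#include <stdbool.h>

bool isWhitespace_branchless(char c)
{
    unsigned u = (unsigned char) c;
    /* (mask >> c) & 1 picks the candidate bit; the (u <= 32) factor
       zeroes the result for characters past ' ', replacing GCC's
       conditional jump.  */
    return ((0x100002600ULL >> (u & 63)) & 1) & (u <= 32);
}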


If someone (or GCC) really insists on (ab)using SHRX:

     xor    ecx, ecx
     cmp    dil, 33           ; is !(c > 32)?
     setb   cl
     movabs rax, 4294977024   ; rax = 0x100002600
     shrx   rax, rax, rdi     ; rax >>= c
     and    eax, ecx          ; result = rax & (c <= ' ')
     ret

Now start counting: first the instructions, then the cycles lost
due to wrong branch prediction (hint: 0 for my code).

> and cannot speculate past the setnz

OUCH!
Unfortunately, AMD64 and i386 processors are not THAT primitive.

> (As I understand it, x86_64 speculates branch instructions, but
> doesn't speculate cmov -- so perversely branches are faster!)

There's no CMOV in my code.
What are you trying to demonstrate?

JFTR: how many speculatively executed instructions are discarded if
      the branch prediction is wrong?

>> The following equivalent and branchless code works on i386 too,
>> it needs neither an AMD64 processor nor the SHRX instruction,
>> which is not available on older processors:
> 
>> 
>>   mov    ecx, edi
>>   mov    eax, 2600h    ; eax = (1 << '\r') | (1 << '\n') | (1 << '\t')
>>   test   cl, cl
>>   setnz  al            ; eax |= (c != '\0')
>>   shr    eax, cl       ; eax >>= (c % ' ')
> 
> ^^ operand type mismatch on this instruction

I recommend fixing YOUR typo!
I bet you wrote "shr eax, c" instead of "shr eax, cl" there.

>>   xor    edx, edx
>>   cmp    ecx, 33       ; CF = c <= ' '
>>   adc    edx, edx      ; edx = (c <= ' ')
>>   and    eax, edx
>>   ret
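
A C rendering of that 32-bit trick, as a sketch (the function name is
illustrative): 32-bit shifts mask their count mod 32, so for
c == ' ' == 32 the shift is a no-op and bit 0 is exactly the (c != 0)
bit planted by setnz:

#include <stdbool.h>

bool isWhitespace_i386(char c)
{
    unsigned u = (unsigned char) c;
    unsigned mask = 0x2600u | (u != 0);   /* bits for '\t', '\n', '\r',
                                             plus bit 0 iff c != '\0' */
    return ((mask >> (u & 31)) & 1) & (u <= 32);
}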

Stefan


Re: Clobber REG_CC only for some constraint alternatives?

2020-08-16 Thread Pip Cet via Gcc
On Sun, Aug 16, 2020 at 12:50 AM Segher Boessenkool wrote:
> On Sat, Aug 15, 2020 at 10:18:27AM +, Pip Cet wrote:
> > > > What I'm currently doing is this:
> > > >
> > > > (define_split
> > > >   [(set (match_operand 0 "nonimmediate_operand")
> > > > (match_operand 1 "nox_general_operand"))]
> > > >   "reload_completed && mov_clobbers_cc (insn, operands)"
> > > >   [(parallel [(set (match_dup 0) (match_dup 1))
> > > >   (clobber (reg:CC REG_CC))])])
> > > >
> > > > which, if I understand correctly, introduces the parallel-clobber
> > > > expression only when necessary, at the same time as post-reload
> > > > splitters introduce REG_CC references. Is that correct?
> > >
> > > Yes.  And this will work correctly if somehow you ensure that REG_CC
> > > isn't live anywhere you introduce such a clobber.
> >
> > IIUC, the idea is that references to REG_CC, except for clobbers, are
> > only introduced in the post-reload split pass, so it cannot be live
> > before our define_split runs. Does that make sense?
>
> Yes, it does.  It has some huge restrictions (using the reg in inline
> assembler can never work reliably, for example, so you'll have to
> disallow that).

Is there any approach that doesn't suffer from that problem? My
understanding was that we need to allow reload to insert CC-clobbering
insns on this (and many other) architectures, and that there are so
many places where reload might choose to do so (including before and
after inline asm) that using the register prior to reload just isn't
possible. I'd be glad to be wrong, though :-)

Is it true that reload chooses which constraint alternative is used
for each insn? Is that information available somehow to post-reload
splits? The problem is that the "X" constraint matches whatever's
there already, and doesn't replace it with the (scratch:CC) expression
that would work, so I can't rewrite those insns not to clobber the CC
reg.

For example, here's what I currently have:

(define_expand "mov"
  [(parallel [(set (match_operand:MOVMODE 0 "nonimmediate_operand" "")
   (match_operand:MOVMODE 1 "general_operand" ""))
  (clobber (reg:CC REG_CC))])]
  ...)

(define_insn "mov_insn"
   [(set (match_operand:ALL1 0 "nonimmediate_operand" "=r,r  ,d,Qm
  ,r ,q,r,*r")
(match_operand:ALL1 1 "nox_general_operand"   "r,Y00,n Ynn,r
Y00,Qm,r,q,i"))
   (clobber (match_scratch:CC 2 "=X,c,X,c,c,X,X,c"))]
  ...)

That works, but it results in an incorrect CC clobber for, say,
register-to-register movqi. For that, I'd need something like

(define_split
  [(parallel [(set (match_operand:ALL1 0 "nonimmediate_operand")
                   (match_operand:ALL1 1 "nox_general_operand"))
              (clobber (reg:CC REG_CC))])]
  "reload_completed && REG_P (operands[0]) && REG_P (operands[1])"
  [(parallel [(set (match_dup 0)
                   (match_dup 1))
              (clobber (scratch:CC))])])

and so on, for all four constraint alternatives that don't actually
clobber REG_CC (and also for a fifth which only rarely clobbers
REG_CC). That's duplicated code, of course.
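
For what it's worth, the chosen alternative can be recovered in the
split condition itself through the recog machinery, which would
collapse those splits into one -- a sketch, reusing the
mov_clobbers_cc name from above; the alternative numbers are
hypothetical, and this sits inside a GCC back end after the usual
system/rtl/recog includes:

/* extract_constrain_insn_cached () fills the global which_alternative
   with the alternative reload settled on for INSN.  */
static bool
mov_clobbers_cc (rtx_insn *insn, rtx *operands ATTRIBUTE_UNUSED)
{
  extract_constrain_insn_cached (insn);
  /* Hypothetical numbering: alternatives 1, 3 and 4 are the ones
     whose expansion really clobbers REG_CC.  */
  return which_alternative == 1
         || which_alternative == 3
         || which_alternative == 4;
}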

All this is at least somewhat academic: the code produced isn't
drastically better after my cc conversion, but it's not usually any
worse, and I'm still looking at assembler examples that are pessimized
a little to find out how to fix them...

> And earlier passes (like combine) cannot optimise this,

I'm not sure what "this" refers to: my understanding is that the idea
is to let combine see CC-free code, or code with CC clobbers only,
which it can then optimize, and only add "real" CC references after
reload, so many optimizations should just work.

> But it is a pretty straightforward way to move from CC0 to the
> modern world!

With the limitations of the reload pass being as they are, I don't
really see a dramatically better alternative. I suppose we could
explicitly save and restore CC flags around insns emitted when
reload_in_progress is true, to simulate non-CC-clobbering add/mov insn
patterns? That sounds like it would reduce code quality a lot, unless
great care were taken to make sure all of the save/restore CC flags
insns were optimized away in later passes.

In any case, thanks for the answers so far!

Pip


Re: Peephole optimisation: isWhitespace()

2020-08-16 Thread Nathan Sidwell

On 8/16/20 9:54 AM, Stefan Kanthak wrote:

"Nathan Sidwell"  wrote:



What evidence do you have that your alternative sequence performs
better?


45+ years experience in writing assembly code!






Have you benchmarked it?


Of course! Did you?
I didn't include the numbers in my initial post since I don't have
a processor which supports BMI2 and thus can't run the original code.
I benchmarked the following equivalent code (input character is in
ECX instead of EDI):


You seem very angry about being asked for data.  As I said, I couldn't
benchmark your code, because of the incorrect assembly.


As someone with 45+ years of writing assembly, you'll be aware that
processor microarchitectures have changed dramatically over that time,
and one can very easily be misled by 'intuition'.


> Because I dared to show code for the old(er) i386 alias x86 processor,
> not for the AMD64 alias x86_64.


Which I did find bizarre -- if you're targeting an x86_64 ISA, why are you 
writing code for a different processor?


Anyway, you've made it clear you do not wish to engage in constructive
discussion.

BTW, I have come up with a sequence as short as GCC's but without the 
conditional branch.  Sadly the margin is too small to write it.


Good day, sir

nathan

--
Nathan Sidwell


gcc-11-20200816 is now available

2020-08-16 Thread GCC Administrator via Gcc
Snapshot gcc-11-20200816 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/11-20200816/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 11 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch master 
revision c99116aeeb9644ebddec653ee8b19de4d38b65bd

You'll find:

 gcc-11-20200816.tar.xz   Complete GCC

  SHA256=6496b29cd2bc905564776c30866a47ceaeb8d86952a14e76fa108997ca9d1640
  SHA1=a0f290957445ceade6610e4e8bef1f6e88dcde88

Diffs from 11-20200809 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-11
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.