> 
> Since read-modify-write is enabled for PentiumPro:
> 
> /* X86_TUNE_READ_MODIFY_WRITE: Enable use of read modify write instructions
>    such as "add $1, mem".  */
> DEF_TUNE (X86_TUNE_READ_MODIFY_WRITE, "read_modify_write",
>           ~(m_PENT | m_LAKEMONT))
> 
> should this
> 
> /* Generate "and $0,mem" and "or $-1,mem", instead of "mov $0,mem" and
>    "mov $-1,mem" with shorter encoding for TARGET_SPLIT_LONG_MOVES with
>    TARGET_READ_MODIFY_WRITE or -Oz.  */
> #define TARGET_USE_AND0_ORM1_STORE \
>   ((TARGET_SPLIT_LONG_MOVES && TARGET_READ_MODIFY_WRITE) \
>    || (optimize_insn_for_size_p () && optimize_size > 1))

I really think we are mixing performance and code size optimizations.
I may be misremembering, but I believe that on PPro

 movl $0, (%edx)

is slower than

 xorl %eax, %eax
 movl %eax, (%edx)

due to hardware limitations on decoding instructions with long encodings.
However,

 andl $0, (%edx)

is even slower than both of the above, since it is a read-modify-write
instruction while both variants above only write.  I do not think the
hardware special-cases this.
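
For the size side of the comparison, here is a small standalone C sketch that writes out the 32-bit encodings of the three sequences (the second one taken as xorl followed by a store from %eax).  The bytes are hand-assembled for illustration and do not come from GCC or from the thread; the helper names are hypothetical.  It only shows encoding length and says nothing about decode or execution speed.

```c
#include <stddef.h>

/* Hand-assembled IA-32 encodings, for illustration only.  */
static const unsigned char mov_imm_store[] = {
  0xc7, 0x02, 0x00, 0x00, 0x00, 0x00   /* movl $0, (%edx) */
};
static const unsigned char xor_then_store[] = {
  0x31, 0xc0,                          /* xorl %eax, %eax */
  0x89, 0x02                           /* movl %eax, (%edx) */
};
static const unsigned char and_imm_store[] = {
  0x83, 0x22, 0x00                     /* andl $0, (%edx) */
};

/* Hypothetical helpers returning the total encoded size in bytes.  */
size_t mov_imm_store_size (void)  { return sizeof mov_imm_store; }
size_t xor_then_store_size (void) { return sizeof xor_then_store; }
size_t and_imm_store_size (void)  { return sizeof and_imm_store; }
```

So the and form is the shortest, but it is also the only one of the three that reads memory, which is the performance concern raised above.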

The situation is different when you actually do a read-modify-write.

If read_modify_write is set we produce:

 andl $1, (%edx)

While if it is unset we will do:

 movl (%edx), %eax
 andl $1, %eax
 movl %eax, (%edx)

which schedules better on the original Pentium, provided an extra register
is available.
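
To make the mixing of the two concerns concrete, here is a rough standalone C model of the quoted condition.  It is not GCC code: the struct and function names are hypothetical stand-ins for the tuning macros, and the size-only variant is just one possible reading of the objection, not a proposed patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for GCC's tuning state; not GCC's real API.  */
struct tune_state {
  bool split_long_moves;    /* models TARGET_SPLIT_LONG_MOVES */
  bool read_modify_write;   /* models TARGET_READ_MODIFY_WRITE */
  bool insn_for_size;       /* models optimize_insn_for_size_p () */
  int optimize_size;        /* models optimize_size (> 1 for -Oz) */
};

/* The condition as written in the quoted TARGET_USE_AND0_ORM1_STORE.  */
bool use_and0_orm1_store_quoted (const struct tune_state *t)
{
  return (t->split_long_moves && t->read_modify_write)
         || (t->insn_for_size && t->optimize_size > 1);
}

/* Assumed alternative: treat the transformation purely as a size
   optimization and ignore the performance tuning flags.  */
bool use_and0_orm1_store_size_only (const struct tune_state *t)
{
  return t->insn_for_size && t->optimize_size > 1;
}
```

With both tuning flags set and no size optimization requested, the quoted condition fires while the size-only variant does not; that is the case where "and $0,mem" would be chosen even though it only helps code size.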

Honza
