On 14/07/2016 22:29, Pranith Kumar wrote:
> +            } else if (curr_mb_type == TCG_BAR_STRL &&
> +                       prev_mb_type == TCG_BAR_LDAQ) {
> +                /* Consecutive load-acquire and store-release barriers
> +                 * can be merged into one stronger SC barrier
> +                 * ldaq; strl => ld; mb; st
> +                 */
> +                args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
> +                tcg_op_remove(s, prev_op);

Is this really an optimization?  For example the processor could reorder
"st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2".  It cannot do this
if you change ldaq1/strl2 to ld1/mb/st2.

On x86, for example, a memory fence costs ~50 clock cycles, while normal
loads and stores are of course much cheaper.
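
To make the two shapes concrete, here is a minimal C11 sketch. The function
names are hypothetical and the C11 atomics only stand in for the guest
ldaq/strl ops; this is not TCG code. On x86, compilers typically emit plain
moves for the acquire load and the release store, while the seq_cst fence
turns into an mfence or a locked instruction.

```c
#include <stdatomic.h>

/* Hypothetical helper: "ldaq1; strl2" written with acquire/release.
 * Earlier stores may still be delayed past the load, and later loads
 * may still be hoisted above the store. */
int toy_acq_rel(_Atomic int *p1, _Atomic int *p2, int v)
{
    int r = atomic_load_explicit(p1, memory_order_acquire);
    atomic_store_explicit(p2, v, memory_order_release);
    return r;
}

/* Hypothetical helper: the merged form "ld1; mb; st2".
 * The full fence forbids any reordering across it, so it is
 * stronger (and costlier) than the acquire/release pair above. */
int toy_fenced(_Atomic int *p1, _Atomic int *p2, int v)
{
    int r = atomic_load_explicit(p1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(p2, v, memory_order_relaxed);
    return r;
}
```

Both functions return the same value in a single thread; they differ only in
which reorderings around them remain legal.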

Of course this is useful if your target doesn't have ldaq/strl
instructions.  In this case, however, you probably want to lower ldaq to
"ld;mb" and strl to "mb;st"; the other optimizations will then remove
the unnecessary barrier.
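
A rough sketch of that lowering, on a toy op list rather than on real TCG
ops: every name here (toy_op, toy_lower, toy_merge_barriers, the TOY_* flags)
is made up for illustration, and the barrier masks are my assumption of what
acquire and release need.

```c
#include <stddef.h>

/* Hypothetical toy IR, not QEMU's TCG data structures. */
enum toy_kind { TOY_LD, TOY_ST, TOY_LDAQ, TOY_STRL, TOY_MB };

/* Hypothetical ordering bits carried by a TOY_MB op. */
enum {
    TOY_LD_LD = 1, /* earlier loads before later loads   */
    TOY_LD_ST = 2, /* earlier loads before later stores  */
    TOY_ST_ST = 4  /* earlier stores before later stores */
};

struct toy_op {
    enum toy_kind kind;
    unsigned mask; /* only meaningful for TOY_MB */
};

/* Append one op if there is room; ops beyond cap are dropped. */
static size_t toy_emit(struct toy_op *out, size_t n, size_t cap,
                       enum toy_kind kind, unsigned mask)
{
    if (n < cap) {
        out[n].kind = kind;
        out[n].mask = mask;
        n++;
    }
    return n;
}

/* Lower ldaq to "ld; mb" and strl to "mb; st".
 * The caller should provide cap >= 2 * n. Returns the output length. */
size_t toy_lower(const struct toy_op *in, size_t n,
                 struct toy_op *out, size_t cap)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        switch (in[i].kind) {
        case TOY_LDAQ:
            m = toy_emit(out, m, cap, TOY_LD, 0);
            m = toy_emit(out, m, cap, TOY_MB, TOY_LD_LD | TOY_LD_ST);
            break;
        case TOY_STRL:
            m = toy_emit(out, m, cap, TOY_MB, TOY_LD_ST | TOY_ST_ST);
            m = toy_emit(out, m, cap, TOY_ST, 0);
            break;
        default:
            m = toy_emit(out, m, cap, in[i].kind, in[i].mask);
            break;
        }
    }
    return m;
}

/* Fold each run of adjacent barriers into one, OR-ing their masks.
 * Works in place and returns the new length. */
size_t toy_merge_barriers(struct toy_op *ops, size_t n)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (ops[i].kind == TOY_MB && m > 0 && ops[m - 1].kind == TOY_MB) {
            ops[m - 1].mask |= ops[i].mask;
        } else {
            ops[m++] = ops[i];
        }
    }
    return m;
}
```

With this, "ldaq; strl" lowers to "ld; mb; mb; st" and the generic barrier
merge leaves "ld; mb; st", so no special case for the ldaq/strl pair is
needed on such a target.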

Paolo
