On 14/07/2016 22:29, Pranith Kumar wrote:
> +            } else if (curr_mb_type == TCG_BAR_STRL &&
> +                       prev_mb_type == TCG_BAR_LDAQ) {
> +                /* Consecutive load-acquire and store-release barriers
> +                 * can be merged into one stronger SC barrier
> +                 * ldaq; strl => ld; mb; st
> +                 */
> +                args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
> +                tcg_op_remove(s, prev_op);

Is this really an optimization?  For example, the processor could reorder
"st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2".  It cannot do this
if you change ldaq1/strl2 to ld1/mb/st2.  On x86, for example, a memory
fence costs ~50 clock cycles, while normal loads and stores are of course
faster.

Of course this is useful if your target doesn't have ldaq/strl
instructions.  In this case, however, you probably want to lower ldaq to
"ld;mb" and strl to "mb;st"; the other optimizations will then remove the
unnecessary barrier.

Paolo
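
For illustration only, the lowering suggested above (ldaq to "ld;mb", strl to "mb;st") can be sketched with C11 atomics. This is not TCG code: the helper names load_acquire_lowered and store_release_lowered are hypothetical, and the C11 acquire/release fences merely stand in for whatever barrier the target backend would actually emit.

```c
#include <stdatomic.h>

/* Hypothetical helper: a load-acquire lowered to "ld; mb".
 * The plain (relaxed) load comes first; the fence after it keeps
 * later loads and stores from being reordered before the load. */
static inline int load_acquire_lowered(_Atomic int *p)
{
    int v = atomic_load_explicit(p, memory_order_relaxed); /* ld */
    atomic_thread_fence(memory_order_acquire);             /* mb */
    return v;
}

/* Hypothetical helper: a store-release lowered to "mb; st".
 * The fence comes first and keeps earlier loads and stores from
 * being reordered after the plain (relaxed) store. */
static inline void store_release_lowered(_Atomic int *p, int v)
{
    atomic_thread_fence(memory_order_release);             /* mb */
    atomic_store_explicit(p, v, memory_order_relaxed);     /* st */
}
```

With this shape, a load-acquire followed directly by a store-release produces "ld; mb; mb; st", and the two adjacent barriers are the kind of redundancy a later barrier-merging pass could collapse.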