Hello everyone,
gently pinging to bring this back to life, given the last patch I emailed.
Best regards,
-- Nuno Diegues
On Mon, Aug 24, 2015 at 12:51 PM, Nuno Diegues <n...@ist.utl.pt> wrote:
> Hello everyone,
>
> after a summer internship and some catching up on my backlog in the
> past weeks, I have finally gotten around to revising the patch
> according to the latest discussion.
>
> The changes have not caused any regressions, and the latest speedup
> results are consistent with what we had before.
>
>
> In the following I add some inline comments to the discussion, and
> after that I paste the updated patch.
>
>
>> >
>> > So let's iterate on this in parallel with the other changes that we need
>> > to get in place. I'd prefer to have some more confidence that measuring
>> > txn throughput in epochs is the best way forward.
>> >
>> > Here are a few thoughts:
>> >
>> > Why do we actually need to measure succeeded transactions? If a HW txn
>> > is just getting committed without interfering with anything else, is
>> > this really different from, say, two HW txns getting committed? Aren't
>> > we really just interested in wasted work, or delayed txns? That may
>> > help taking care of the nontxnal vs. txnal problem.
>> >
>> > Measuring committed txns during a time that might otherwise be spent by
>> > a serial txns could be useful to figure out how much other useful work a
>> > serial txn prevents. But we'd see that as well if we'd just go serial
>> > during the auto-tuning because then concurrent txns will have to wait;
>> > and in this case we could measure it in the slow path of those
>> > concurrent txns (ie, during waiting, where measurement overhead wouldn't
>> > matter as much).
>> >
>> > If a serial txn is running, concurrent txns (that wait) are free to sync
>> > and tell the serial how much cost it creates for the concurrent txns.
>> > There, txn length could matter, but we won't find out for real until
>> > after the concurrent ones have run (they could be pretty short, so we
>> > can't simply assume that the concurrent ones are as long as the serial
>> > one, so that simply the number of concurrent ones can be used to
>> > calculate delayed work).
>
>
> I understand your concern: measuring failure in a slow path rather
> than success in the fast/common path.
>
> My problem is that the approach taken in our work is tailored to
> assessing one given metric, in our case some form of throughput of
> successfully executed atomic blocks.
> However, to measure failure, we seem to need two things:
> 1) the rate of failed hardware transactions
> 2) the time spent waiting for the serial lock
>
> In short, performance will be bad when the mix of those two issues is
> bad enough. The problem is how to define a single metric that
> considers the two together; it is not obvious to me that there is
> one. Furthermore, that deviates significantly from our approach, and
> in the end --- even if there is a solution that works that way ---
> it is a different approach, not one that can be applied to our
> self-tuning framework.
>
> What we are doing right now is incrementing a counter in a (most
> likely cached) line that is single-writer and multiple-reader. Most
> often, the line will be owned by its single writer. As such, the cost
> should be negligible compared to the execution of the atomic block,
> more so when a hardware transaction was used, as the commit itself
> should take a hundred cycles or more.
>
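> To make this concrete, here is a minimal standalone sketch of that
> counter pattern (illustrative only: the names are made up, and the
> patch below uses a plain per-thread field rather than this exact
> shape):
>
> #include <cstdint>
>
> struct per_thread_stats
> {
>   // Single-writer/multi-reader: only the owning thread writes the
>   // counter, so the cache line mostly stays in that core's cache;
>   // the re-optimizing thread only reads it once per period.
>   uint64_t executed_txns;
> };
>
> // The owning thread increments stats.executed_txns on each successful
> // commit. The re-optimizing thread, once per period, sums the counters
> // over all threads, subtracts the previous total, and divides by the
> // elapsed cycles:
> static inline uint64_t
> throughput_scaled (uint64_t new_txs, uint64_t cycles)
> {
>   // Integer throughput scaled by 2^20 to keep precision without
>   // floating point; the patch uses fixed_pt_t for the same reason.
>   return (new_txs << 20) / cycles;
> }
>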
> A good way to measure this impact is to use the new "htm_notune" flag
> and compare its performance with the unchanged libitm that I used as
> a baseline above. The average results are similar, with no noticeable
> trend outside the standard deviation in either direction.
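>
> (With the patch applied, that comparison can be driven by the existing
> ITM_DEFAULT_METHOD environment variable: ITM_DEFAULT_METHOD=htm_notune
> runs with the counters but without the tuning, versus plain "htm" on
> the unpatched libitm.)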
>
>
>
>
>>
>> >
>> >>
>> >> > Also, note that the mitigation strategy for rdtsc
>> >> > short-comings that you mention in the paper is not applicable in
>> >> > general, specifically not in libitm.
>> >>
>> >>
>> >> I suppose you mean the preemption of threads inflating the cycles
>> >> measured?
>> >
>> > Yes, preemption and migration of threads (in case there's no global
>> > sync'ed TSC or similar) -- you mentioned in the paper that you pin
>> > threads to cores...
>> >
>> >> This would be similarly a problem to any time source that tries to
>> >> measure the
>> >> amount of work performed; not sure how we can avoid it in general. Any
>> >> thoughts?
>> >
>> > Not really as long as we keep depending on measuring time in a
>> > light-weight way. Measuring smaller time intervals could make it less
>> > likely that preemption happens during such an interval, though.
>> >
>
>
> Note that in this patch we use no such pinning. In fact, I have
> measured that it does not have a noticeable impact on performance.
> Maybe it would with heavy over-subscription of the cores by lots of
> threads. Still, this is not a correctness issue, so I believe that
> performing well in the common case should make this approach prevail.
>
>
>
>>
>> >>
>> >> > Another issue is that we need to implement the tuning in a portable way.
>> >> > You currently make it depend on whether the assembler knows about RTM,
>> >> > whereas the rest of the code makes this more portable and defers to
>> >> > arch-specific files. I'd prefer if we could use the tuning on other
>> >> > archs too. But for that, we need to cleanly separate generic from
>> >> > arch-specific parts. That affects magic numbers as well as things like
>> >> > rdtsc().
>> >>
>> >>
>> >> Yes, I refrained from adding new calls to the arch-specific files, to
>> >> contain the
>> >> changes mainly. But that is possible and that's part of the feedback I
>> >> was hoping
>> >> to get.
>> >
>> > OK. Let me know if you want further input regarding this.
>
>
> I have factored out this arch-specific part in the new patch. PTAL.
>
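> For archs without a cheap time source, the new hooks simply opt out.
> A minimal sketch of that fallback shape, mirroring what the patch
> below adds to the config/<arch>/target.h files:
>
> static inline uint64_t
> get_efficient_time()
> {
>   return 0;  // no cheap cycle counter available on this arch
> }
>
> static inline bool
> allows_reoptimization()
> {
>   return false;  // set_is_enabled() then keeps the tuning disabled
> }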
>
>
>>
>> >
>> >> > I'm wondering about whether it really makes sense to treat XABORT like
>> >> > conflicts and other abort reasons, instead of like capacity aborts.
>> >> > Perhaps we need to differentiate between the explicit abort codes glibc
>> >> > uses, so that we can distinguish between cases where an abort is
>> >> > supposed to signal incompatibility with txnal execution and cases where
>> >> > it's just used for lock elision (see sysdeps/unix/sysv/linux/x86/hle.h
>> >> > in current glibc):
>> >> > #define _ABORT_LOCK_BUSY 0xff
>> >> > #define _ABORT_LOCK_IS_LOCKED 0xfe
>> >> > #define _ABORT_NESTED_TRYLOCK 0xfd
>> >>
>> >>
>> >> I am not sure I follow: are you suggesting to consider other aborts
>> >> besides
>> >> those of capacity?
>> >
>> > We may want to do this with UCB, similar to capacity. Another option
>> > would be to, for now, make fixed choices regarding whether they are
>> > considered permanent or not.
>> > _ABORT_NESTED_TRYLOCK is an incompatibility with txnal execution
>> > (similar to illegal instructions and such).
>> > _ABORT_LOCK_BUSY is failing lock elision due to the lock being already
>> > acquired; thus, this is transient I'd say.
>> >
>> > Andi, what was _ABORT_LOCK_IS_LOCKED used for again?
>
>
> This seems a more or less trivial change once it is agreed upon, so I
> suggest we leave it for later, if/when this patch gets through.
>
>
>>
>> >> > > --- libitm/config/x86/target.h (revision 219316)
>> >> > > +++ libitm/config/x86/target.h (working copy)
>> >> > > @@ -126,12 +126,25 @@ htm_abort_should_retry (uint32_t begin_ret)
>> >> > > return begin_ret & _XABORT_RETRY;
>> >> > > }
>> >> > >
>> >> > > +static inline bool
>> >> > > +htm_is_capacity_abort(uint32_t begin_ret)
>> >> > > +{
>> >> > > + return begin_ret & _XABORT_CAPACITY;
>> >> > > +}
>> >> >
>> >> > Is this indeed just about capacity, or do we actually mean a more
>> >> > general situation?
>> >>
>> >> The idea of the Upper Confidence Bounds (UCB) part of the algorithm was
>> >> to detect "real" capacity aborts (i.e., permanent) vs those that are
>> >> "transient".
>> >> So in this case we really wanted to know if the abort was a capacity one
>> >> so
>> >> that we could direct the UCB algorithm and understand the "type" of
>> >> capacity
>> >> abort.
>> >
>> > OK. But do you want to single out capacity aborts as one case for which
>> > you want to detect permanent vs. transient, or do you want to generally
>> > find out whether aborts that are reported as permanent (e.g., capacity)
>> > are indeed permanent? In the latter case, a more generic name would be
>> > more appropriate. (And that's not just a naming thing, but guidance for
>> > other archs regarding what they should put in the "capacity" bucket of
>> > abort reasons...)
>> >
>> >
>
>
> As far as we know, capacity aborts are the ones that fit our
> criteria: they are identifiable as a distinct abort category, yet
> they are not necessarily persistent/deterministic. This could be
> expanded to other categories in the future, but it is not something
> we tested, as other abort types are not as widespread.
>
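> For reference, here is a floating-point sketch of the UCB decision
> among the three capacity-abort modes (the patch itself uses the
> fixed-point arithmetic type; names here are illustrative):
>
> #include <cmath>
>
> enum mode { STUBBORN, HALVEN, GIVEUP };
>
> // Classic UCB1 over three "arms": exploit the empirical reward (the
> // throughput observed while each mode was active, normalized by the
> // best throughput ever seen), plus an exploration bonus that grows
> // for arms that have been sampled rarely.
> static mode
> pick_capacity_mode (const double txns[3], const double cycles[3],
>                     double best_ever_throughput)
> {
>   double total_txns = txns[0] + txns[1] + txns[2];
>   mode best = STUBBORN;
>   double best_ucb = -1.0;
>   for (int m = 0; m < 3; m++)
>     {
>       double reward = (txns[m] / cycles[m]) / best_ever_throughput;
>       double ucb = reward
>                    + std::sqrt (2.0 * std::log (total_txns) / txns[m]);
>       if (ucb > best_ucb)
>         {
>           best_ucb = ucb;
>           best = static_cast<mode> (m);
>         }
>     }
>   return best;
> }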
>
>
>
>
> Updated patch follows:
>
> Index: util.cc
> ===================================================================
> --- util.cc (revision 219316)
> +++ util.cc (working copy)
> @@ -94,4 +94,15 @@ xrealloc (void *old, size_t size, bool separate_cl
> return r;
> }
>
> +// State for the generation is not synchronized. It assumes single thread
> +// usage at any time. It is, in fact, invoked by the thread that is
> +// re-optimizing the libitm HTM usage, so the assumption holds.
> +uint32_t
> +random_int(uint32_t max)
> +{
> + static uint32_t seed = 123456789;
> + seed = (1103515245 * seed + 12345);
> + return seed % max;
> +}
> +
> } // namespace GTM
> Index: libitm.h
> ===================================================================
> --- libitm.h (revision 219316)
> +++ libitm.h (working copy)
> @@ -101,7 +101,11 @@ typedef enum
> /* These are not part of the ABI but used for custom HTM fast paths. See
> ITM_beginTransaction and gtm_thread::begin_transaction. */
> pr_HTMRetryableAbort = 0x800000,
> - pr_HTMRetriedAfterAbort = 0x1000000
> + pr_HTMRetriedAfterAbort = 0x1000000,
> + /* Reports whether a capacity abort was detected in the custom HTM fast
> + path, so that it can be used in the abort handler in
> + gtm_thread::begin_transaction. */
> + pr_HTMCapacityAbort = 0x2000000
> } _ITM_codeProperties;
>
> /* Result from startTransaction that describes what actions to take.
> Index: method-serial.cc
> ===================================================================
> --- method-serial.cc (revision 219316)
> +++ method-serial.cc (working copy)
> @@ -223,6 +223,19 @@ struct htm_mg : public method_group
> // initially disabled.
> #ifdef USE_HTM_FASTPATH
> htm_fastpath = htm_init();
> + optimizer.optimized_mode = STUBBORN;
> + optimizer.last_cycles = get_efficient_time();
> + optimizer.last_total_txs_executed = 0;
> + optimizer.last_throughput = fixed_pt_t(0.001);
> + optimizer.last_attempts = htm_fastpath > 0 ? htm_fastpath - 1 : 1;
> + optimizer.best_ever_throughput = optimizer.last_throughput;
> + optimizer.best_ever_attempts = htm_fastpath;
> + optimizer.txns_while_stubborn = 1;
> + optimizer.cycles_while_stubborn = optimizer.INIT_CYCLES;
> + optimizer.txns_while_halven = 1;
> + optimizer.cycles_while_halven = optimizer.INIT_CYCLES;
> + optimizer.txns_while_giveup = 1;
> + optimizer.cycles_while_giveup = optimizer.INIT_CYCLES;
> #endif
> }
> virtual void fini()
> Index: retry.cc
> ===================================================================
> --- retry.cc (revision 219316)
> +++ retry.cc (working copy)
> @@ -218,7 +218,6 @@ GTM::gtm_thread::set_default_dispatch(GTM::abi_dis
> default_dispatch.store(disp, memory_order_relaxed);
> }
>
> -
> static GTM::abi_dispatch*
> parse_default_method()
> {
> @@ -254,10 +253,17 @@ parse_default_method()
> disp = GTM::dispatch_ml_wt();
> env += 5;
> }
> + else if (strncmp(env, "htm_notune", 10) == 0)
> + {
> + disp = GTM::dispatch_htm();
> + env += 10;
> + GTM::optimizer.set_is_enabled(0);
> + }
> else if (strncmp(env, "htm", 3) == 0)
> {
> disp = GTM::dispatch_htm();
> env += 3;
> + GTM::optimizer.set_is_enabled(1);
> }
> else
> goto unknown;
> Index: beginend.cc
> ===================================================================
> --- beginend.cc (revision 219316)
> +++ beginend.cc (working copy)
> @@ -39,6 +39,9 @@ unsigned GTM::gtm_thread::number_of_threads = 0;
> gtm_stmlock GTM::gtm_stmlock_array[LOCK_ARRAY_SIZE];
> atomic<gtm_version> GTM::gtm_clock;
>
> +// Holds the optimized configuration for parameters relevant to HTM execution.
> +gtm_global_optimizer GTM::optimizer;
> +
> /* ??? Move elsewhere when we figure out library initialization. */
> uint64_t GTM::gtm_spin_count_var = 1000;
>
> @@ -95,7 +98,139 @@ thread_exit_init()
> GTM_fatal("Creating thread release TLS key failed.");
> }
>
> +/* Implementations for the fixed-point arithmetic type. */
> +const uint32_t GTM::fixed_pt_t::FIXED_PT_WIDTH = 64;
> +const uint32_t GTM::fixed_pt_t::FIXED_PT_INTEGER_WIDTH = 40;
> +const uint32_t GTM::fixed_pt_t::FIXED_PT_FRAC_WIDTH =
> + GTM::fixed_pt_t::FIXED_PT_WIDTH - GTM::fixed_pt_t::FIXED_PT_INTEGER_WIDTH;
> +const fixed_pt_t GTM::fixed_pt_t::FIXED_PT_ONE = GTM::fixed_pt_t::one();
> +const fixed_pt_t GTM::fixed_pt_t::FIXED_PT_TWO =
> + GTM::fixed_pt_t::FIXED_PT_ONE.add(GTM::fixed_pt_t::FIXED_PT_ONE);
>
> +fixed_pt_t
> +GTM::fixed_pt_t::one()
> +{
> + fixed_pt_t result(0);
> + result.number = static_cast<uint64_t>(1) << FIXED_PT_FRAC_WIDTH;
> + return result;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::add(const fixed_pt_t operand) const
> +{
> + fixed_pt_t result(0);
> + result.number = this->number + operand.number;
> + return result;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::sub(const fixed_pt_t operand) const
> +{
> + fixed_pt_t result(0);
> + result.number = this->number - operand.number;
> + return result;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::mul(const fixed_pt_t operand) const
> +{
> + fixed_pt_t result(0);
> + result.number = (static_cast<__uint128_t>(this->number) *
> + static_cast<__uint128_t>(operand.number)) >> FIXED_PT_FRAC_WIDTH;
> + return result;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::div(const fixed_pt_t operand) const
> +{
> + fixed_pt_t result(0);
> + result.number =
> + (static_cast<__uint128_t>(this->number) << FIXED_PT_FRAC_WIDTH) /
> + static_cast<__uint128_t>(operand.number);
> + return result;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::sqrt() const
> +{
> + uint64_t invert = 0;
> + uint64_t iter = FIXED_PT_FRAC_WIDTH;
> + uint64_t i;
> + fixed_pt_t n = *this;
> +
> + if (n.number == 0 || n.number == FIXED_PT_ONE.number)
> + {
> + return n;
> + }
> + if (n.number < FIXED_PT_ONE.number && n.number > 6)
> + {
> + invert = 1;
> + n = FIXED_PT_ONE.div(n);
> + }
> + if (n.number > FIXED_PT_ONE.number)
> + {
> + uint64_t s = n.number;
> + iter = 0;
> + while (s > 0)
> + {
> + s >>= 2;
> + iter++;
> + }
> + }
> +
> + fixed_pt_t l(0);
> + l.number = (n.number >> 1) + 1;
> + for (i = 0; i < iter; i++)
> + {
> + l.number = l.add(n.div(l)).number >> 1;
> + }
> + if (invert)
> + {
> + return FIXED_PT_ONE.div(l);
> + }
> + return l;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::ln() const
> +{
> + const fixed_pt_t LN2 = fixed_pt_t(0.69314718055994530942);
> + const fixed_pt_t LG[7] = {
> + fixed_pt_t(6.666666666666735130e-01),
> + fixed_pt_t(3.999999999940941908e-01),
> + fixed_pt_t(2.857142874366239149e-01),
> + fixed_pt_t(2.222219843214978396e-01),
> + fixed_pt_t(1.818357216161805012e-01),
> + fixed_pt_t(1.531383769920937332e-01),
> + fixed_pt_t(1.479819860511658591e-01)
> + };
> +
> + if (this->number == 0)
> + {
> + return 0xffffffff;
> + }
> +
> + fixed_pt_t log2 = fixed_pt_t(0);
> + fixed_pt_t xi = *this;
> + while (xi.number > FIXED_PT_TWO.number)
> + {
> + xi.number >>= 1;
> + log2.number++;
> + }
> +
> + fixed_pt_t f = xi.sub(FIXED_PT_ONE);
> + fixed_pt_t s = f.div(FIXED_PT_TWO.add(f));
> + fixed_pt_t z = s.mul(s);
> + fixed_pt_t w = z.mul(z);
> + fixed_pt_t R =
> + w.mul(LG[1].add(w.mul(LG[3].add(w.mul(LG[5]))))).add(
> + z.mul(LG[0].add(w.mul(LG[2].add(
> + w.mul(LG[4].add(w.mul(LG[6]))))))));
> +
> + return LN2.mul(fixed_pt_t(log2.number)).add(
> + f.sub(s.mul(f.sub(R))));
> +}
> +
> GTM::gtm_thread::~gtm_thread()
> {
> if (nesting > 0)
> @@ -149,6 +284,180 @@ choose_code_path(uint32_t prop, abi_dispatch *disp
> return a_runInstrumentedCode;
> }
>
> +// Periodically invoked by some thread(s) that are in the HTM fall-back path.
> +// This implements the Upper Confidence Bounds and Gradient Descent to tune,
> +// respectively, how to deal with HTM capacity aborts and how many HTM
> +// retries to allow. For more details, check Diegues et al. in USENIX ICAC 2014.
> +void
> +GTM::gtm_global_optimizer::reoptimize_htm_execution()
> +{
> + // Avoid redundant re-optimization if another thread may be doing it
> + // already. This allows the threads to avoid racing to re-optimize
> + // concurrently, which is useless.
> +
> + if (unlikely(GTM::gtm_thread::serial_lock.is_write_locked()))
> + {
> + return;
> + }
> + // Eventually, some thread(s) will follow this path, even in the face of
> + // spurious short-cut exits due to the serial lock being taken.
> + GTM::gtm_thread::serial_lock.write_lock();
> +
> + // Collect the statistics obtained so far.
> + uint64_t total_txs_executed = 0;
> + gtm_thread *ptr = GTM::gtm_thread::list_of_threads;
> + for (; ptr; ptr = ptr->next_thread)
> + {
> + total_txs_executed += ptr->number_executed_txns;
> + }
> +
> + // Obtain the delta performance with respect to the last period measured.
> + // We expect the underlying architecture to provide an efficient way to
> + // measure time passed since the last re-optimization.
> + const uint64_t current_cycles = get_efficient_time();
> + const uint64_t cycles_used = current_cycles - optimizer.last_cycles;
> + const uint64_t new_txs_executed =
> + total_txs_executed - optimizer.last_total_txs_executed;
> + const fixed_pt_t current_throughput =
> + fixed_pt_t(new_txs_executed).div(fixed_pt_t(cycles_used));
> +
> + if (current_throughput.number == 0)
> + {
> + // This thread probably executed right after another one as they both
> + // found it was time to re-optimize, but did not contend on the serial
> + // lock. As such, we avoid redundant re-optimizations and short-cut out.
> + GTM::gtm_thread::serial_lock.write_unlock();
> + return;
> + }
> +
> + optimizer.last_cycles = current_cycles;
> + optimizer.last_total_txs_executed = total_txs_executed;
> + if (optimizer.optimized_mode == STUBBORN)
> + {
> + optimizer.txns_while_stubborn += new_txs_executed;
> + optimizer.cycles_while_stubborn += cycles_used;
> + }
> + else if (optimizer.optimized_mode == HALVEN)
> + {
> + optimizer.txns_while_halven += new_txs_executed;
> + optimizer.cycles_while_halven += cycles_used;
> + }
> + else
> + {
> + optimizer.txns_while_giveup += new_txs_executed;
> + optimizer.cycles_while_giveup += cycles_used;
> + }
> +
> + // Compute Upper Confidence Bounds for the mode to choose next to decide on
> + // how to deal with HTM capacity aborts.
> + // Use fixed point arithmetic types to spare floating point usage.
> + const fixed_pt_t log_sum =
> + GTM::fixed_pt_t::FIXED_PT_TWO.mul(
> + (fixed_pt_t(optimizer.txns_while_stubborn).add(
> + fixed_pt_t(optimizer.txns_while_halven)).add(
> + fixed_pt_t(optimizer.txns_while_giveup))).ln());
> + const fixed_pt_t ucb_stubborn =
> + ((fixed_pt_t(optimizer.txns_while_stubborn).div(
> + fixed_pt_t(optimizer.cycles_while_stubborn))).div(
> + optimizer.best_ever_throughput)).add(
> + (log_sum.div(fixed_pt_t(optimizer.txns_while_stubborn))).sqrt());
> + const fixed_pt_t ucb_halven =
> + ((fixed_pt_t(optimizer.txns_while_halven).div(
> + fixed_pt_t(optimizer.cycles_while_halven))).div(
> + optimizer.best_ever_throughput)).add(
> + (log_sum.div(fixed_pt_t(optimizer.txns_while_halven))).sqrt());
> + const fixed_pt_t ucb_giveup =
> + ((fixed_pt_t(optimizer.txns_while_giveup).div(
> + fixed_pt_t(optimizer.cycles_while_giveup)).div(
> + optimizer.best_ever_throughput))).add(
> + (log_sum.div(fixed_pt_t(optimizer.txns_while_giveup))).sqrt());
> +
> + if (ucb_stubborn.number > ucb_halven.number &&
> + ucb_stubborn.number > ucb_giveup.number)
> + {
> + optimizer.optimized_mode = STUBBORN;
> + }
> + else if (ucb_halven.number > ucb_giveup.number)
> + {
> + optimizer.optimized_mode = HALVEN;
> + }
> + else
> + {
> + optimizer.optimized_mode = GIVEUP;
> + }
> +
> + // Compute Gradient Descent to regulate the number of retries in HTM.
> + const fixed_pt_t LARGE_DEGRADATION = fixed_pt_t(1.40);
> + const fixed_pt_t MIN_IMPROVEMENT = fixed_pt_t(1.05);
> + const int32_t MAX_RETRIES = 20;
> + const int32_t MIN_RETRIES = 2;
> +
> + const fixed_pt_t change_for_better =
> + current_throughput.div(optimizer.last_throughput);
> + const fixed_pt_t change_for_worse =
> + optimizer.last_throughput.div(current_throughput);
> + int32_t last_attempts = optimizer.last_attempts;
> + int32_t current_attempts = htm_fastpath;
> + int32_t new_attempts = current_attempts;
> + if (unlikely(change_for_worse.number > LARGE_DEGRADATION.number))
> + {
> + htm_fastpath = optimizer.best_ever_attempts;
> + optimizer.last_throughput = current_throughput;
> + optimizer.last_attempts = current_attempts;
> + goto finish_reoptimization;
> + }
> +
> + if (unlikely(random_int(100) == 0))
> + {
> + optimizer.last_attempts = htm_fastpath;
> + optimizer.last_throughput = current_throughput;
> + htm_fastpath = random_int(MAX_RETRIES - MIN_RETRIES) + MIN_RETRIES;
> + goto finish_reoptimization;
> + }
> +
> + if (change_for_better.number > MIN_IMPROVEMENT.number)
> + {
> + new_attempts += current_attempts - last_attempts;
> + if (current_attempts == last_attempts)
> + {
> + new_attempts += random_int(2) == 0 ? -1 : 1;
> + }
> + }
> + else if (change_for_worse.number > MIN_IMPROVEMENT.number)
> + {
> + new_attempts -= current_attempts - last_attempts;
> + if (current_attempts == last_attempts)
> + {
> + new_attempts += random_int(2) == 0 ? -1 : 1;
> + }
> + }
> +
> + if (unlikely(new_attempts > MAX_RETRIES))
> + {
> + new_attempts = MAX_RETRIES;
> + }
> + else if (unlikely(new_attempts < MIN_RETRIES))
> + {
> + new_attempts = MIN_RETRIES;
> + }
> +
> + if (current_attempts != new_attempts)
> + {
> + optimizer.last_attempts = current_attempts;
> + optimizer.last_throughput = current_throughput;
> + }
> +
> + htm_fastpath = new_attempts;
> + if (unlikely(current_throughput.number > optimizer.best_ever_throughput.number))
> + {
> + optimizer.best_ever_throughput = current_throughput;
> + optimizer.best_ever_attempts = current_attempts;
> + }
> +
> + finish_reoptimization:
> + GTM::gtm_thread::serial_lock.write_unlock();
> +}
> +
> uint32_t
> GTM::gtm_thread::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb)
> {
> @@ -209,8 +518,36 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
> }
> // The transaction has aborted. Don't retry if it's unlikely that
> // retrying the transaction will be successful.
> - if (!htm_abort_should_retry(ret))
> - break;
> + if (optimizer.is_enabled)
> + {
> + // If the HTM self-tuning is enabled, then we act according to
> + // its configuration.
> + if (htm_is_capacity_abort(ret))
> + {
> + gtm_capacity_abort_mode capacity_mode =
> + optimizer.optimized_mode;
> + // Adapt the retries according to the mode. Note that the
> + // linear decrease is implicit in the outer for loop.
> + if (capacity_mode == HALVEN)
> + t = t / 2 + 1;
> + else if (capacity_mode == GIVEUP)
> + break;
> + }
> +
> + // Take the chance of having aborted to possibly re-optimize the
> + // HTM configuration. Do this as a last resort before going
> + // for the serial lock.
> + if (unlikely(tx->number_executed_txns %
> + optimizer.OPTIMIZATION_PERIOD_TXS == 0 && t == 1))
> + optimizer.reoptimize_htm_execution();
> + }
> + else
> + {
> + // Execute the fixed logic when there is no adaptation.
> + if (!htm_abort_should_retry(ret))
> + break;
> + }
> +
> // Wait until any concurrent serial-mode transactions have finished.
> // This is an empty critical section, but won't be elided.
> if (serial_lock.is_write_locked())
> @@ -248,7 +585,11 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
> // HTM fastpath aborted, and that we thus have to decide whether to retry
> // the fastpath (returning a_tryHTMFastPath) or just proceed with the
> // fallback method.
> - if (likely(htm_fastpath && (prop & pr_HTMRetryableAbort)))
> + // Even if pr_HTMRetryableAbort is not true, we still consider to retry if
> + // it was a capacity abort (stated by pr_HTMCapacityAbort) and the self
> + // tuning capabilities are enabled (stated by optimizer.is_enabled).
> + if (likely(htm_fastpath && ((prop & pr_HTMRetryableAbort)
> + || ((prop & pr_HTMCapacityAbort) && (optimizer.is_enabled)))))
> {
> tx = gtm_thr();
> if (unlikely(tx == NULL))
> @@ -266,6 +607,26 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
>
> if (--tx->restart_total > 0)
> {
> + // See above: we use the optimized configuration if enabled.
> + if (optimizer.is_enabled)
> + {
> + // See above: similarly, we adapt upon capacity aborts.
> + if (prop & pr_HTMCapacityAbort)
> + {
> + gtm_capacity_abort_mode capacity_mode = optimizer.optimized_mode;
> + if (capacity_mode == HALVEN)
> + tx->restart_total = tx->restart_total / 2 + 1;
> + else if (capacity_mode == GIVEUP)
> + goto stop_custom_htm_fastpath;
> + }
> +
> + // See above: similarly, we periodically re-optimize.
> + if (unlikely(tx->number_executed_txns %
> + optimizer.OPTIMIZATION_PERIOD_TXS == 0 &&
> + tx->restart_total == 1))
> + optimizer.reoptimize_htm_execution();
> + }
> +
> // Wait until any concurrent serial-mode transactions have finished.
> // Essentially the same code as above.
> if (serial_lock.is_write_locked())
> @@ -665,12 +1026,23 @@ _ITM_commitTransaction(void)
> if (likely(htm_fastpath && !gtm_thread::serial_lock.is_write_locked()))
> {
> htm_commit();
> + gtm_thread *tx = gtm_thr();
> + if (unlikely(tx == NULL))
> + {
> + // See above: when using the HTM fast-path, we might not have created
> + // a thread object yet.
> + tx = new gtm_thread();
> + set_gtm_thr(tx);
> + }
> + tx->number_executed_txns++;
> return;
> }
> #endif
> gtm_thread *tx = gtm_thr();
> if (!tx->trycommit ())
> tx->restart (RESTART_VALIDATE_COMMIT);
> +
> + tx->number_executed_txns++;
> }
>
> void ITM_REGPARM
> @@ -681,6 +1053,14 @@ _ITM_commitTransactionEH(void *exc_ptr)
> if (likely(htm_fastpath && !gtm_thread::serial_lock.is_write_locked()))
> {
> htm_commit();
> + gtm_thread *tx = gtm_thr();
> + if (unlikely(tx == NULL))
> + {
> + // See above.
> + tx = new gtm_thread();
> + set_gtm_thr(tx);
> + }
> + tx->number_executed_txns++;
> return;
> }
> #endif
> @@ -690,4 +1070,5 @@ _ITM_commitTransactionEH(void *exc_ptr)
> tx->eh_in_flight = exc_ptr;
> tx->restart (RESTART_VALIDATE_COMMIT);
> }
> + tx->number_executed_txns++;
> }
> Index: config/s390/target.h
> ===================================================================
> --- config/s390/target.h (revision 219316)
> +++ config/s390/target.h (working copy)
> @@ -116,12 +116,30 @@ htm_abort_should_retry (uint32_t begin_ret)
> }
>
> static inline bool
> +htm_is_capacity_abort (uint32_t begin_ret)
> +{
> + return begin_ret != _HTM_TBEGIN_PERSISTENT;
> +}
> +
> +static inline bool
> htm_transaction_active ()
> {
> __asm volatile (".machine \"all\" \n\t");
> return __builtin_tx_nesting_depth() != 0;
> }
>
> +static inline uint64_t
> +get_efficient_time()
> +{
> + return 0;
> +}
> +
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> #endif
>
> } // namespace GTM
> Index: config/arm/target.h
> ===================================================================
> --- config/arm/target.h (revision 219316)
> +++ config/arm/target.h (working copy)
> @@ -46,4 +46,10 @@ cpu_relax (void)
> __sync_synchronize ();
> }
>
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> } // namespace GTM
> Index: config/powerpc/target.h
> ===================================================================
> --- config/powerpc/target.h (revision 219316)
> +++ config/powerpc/target.h (working copy)
> @@ -74,6 +74,7 @@ cpu_relax (void)
> #define _TBEGIN_STARTED 0
> #define _TBEGIN_INDETERMINATE 1
> #define _TBEGIN_PERSISTENT 2
> +#define _TBEGIN_CAPACITY 3
>
> /* Number of retries for transient failures. */
> #define _HTM_RETRIES 10
> @@ -101,6 +102,9 @@ htm_begin (void)
> if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
> return _TBEGIN_PERSISTENT;
>
> + if (_TEXASRU_FOOTPRINT_OVERFLOW (__builtin_get_texasru ()))
> + return _TBEGIN_CAPACITY;
> +
> return _TBEGIN_INDETERMINATE;
> }
>
> @@ -128,6 +132,12 @@ htm_abort_should_retry (uint32_t begin_ret)
> return begin_ret != _TBEGIN_PERSISTENT;
> }
>
> +static inline bool
> +htm_is_capacity_abort (uint32_t begin_ret)
> +{
> + return begin_ret == _TBEGIN_CAPACITY;
> +}
> +
> /* Returns true iff a hardware transaction is currently being executed. */
> static inline bool
> htm_transaction_active (void)
> @@ -135,6 +145,36 @@ htm_transaction_active (void)
> return (_HTM_STATE (__builtin_ttest ()) == _HTM_TRANSACTIONAL);
> }
>
> +static inline uint64_t
> +get_efficient_time()
> +{
> + #if defined (__powerpc64__) || defined (__ppc64__)
> + uint64_t __tb;
> + __asm__ volatile ("mfspr %[tb], 268\n"
> + : [tb]"=r" (__tb)
> + : );
> + return __tb;
> + #else
> + register unsigned long __tbu, __tbl, __tmp;
> + __asm__ volatile (
> + "0:\n"
> + "mftbu %[tbu]\n"
> + "mftbl %[tbl]\n"
> + "mftbu %[tmp]\n"
> + "cmpw %[tbu], %[tmp]\n"
> + "bne- 0b\n"
> + : [tbu]"=r" (__tbu), [tbl]"=r" (__tbl), [tmp]"=r" (__tmp)
> + : );
> + return (( (uint64_t) __tbu << 32) | __tbl);
> + #endif /* not __powerpc64__ */
> +}
> +
> +static inline bool
> +allows_reoptimization()
> +{
> + return true;
> +}
> +
> #endif
>
> } // namespace GTM
> Index: config/alpha/target.h
> ===================================================================
> --- config/alpha/target.h (revision 219316)
> +++ config/alpha/target.h (working copy)
> @@ -41,4 +41,10 @@ cpu_relax (void)
> __asm volatile ("" : : : "memory");
> }
>
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> } // namespace GTM
> Index: config/x86/sjlj.S
> ===================================================================
> --- config/x86/sjlj.S (revision 219316)
> +++ config/x86/sjlj.S (working copy)
> @@ -59,12 +59,14 @@
> #define pr_hasNoAbort 0x08
> #define pr_HTMRetryableAbort 0x800000
> #define pr_HTMRetriedAfterAbort 0x1000000
> +#define pr_HTMCapacityAbort 0x2000000
> #define a_runInstrumentedCode 0x01
> #define a_runUninstrumentedCode 0x02
> #define a_tryHTMFastPath 0x20
>
> #define _XABORT_EXPLICIT (1 << 0)
> #define _XABORT_RETRY (1 << 1)
> +#define _XABORT_CAPACITY (1 << 3)
>
> .text
>
> @@ -111,6 +113,9 @@ SYM(_ITM_beginTransaction):
> testl $(_XABORT_RETRY|_XABORT_EXPLICIT), %eax
> jz .Lno_htm
> orl $pr_HTMRetryableAbort, %edi
> + testl $(_XABORT_CAPACITY), %eax
> + jz .Lno_htm
> + orl $pr_HTMCapacityAbort, %edi
> /* Let the C++ code handle the retry policy. */
> .Lno_htm:
> #endif
> Index: config/x86/target.h
> ===================================================================
> --- config/x86/target.h (revision 219316)
> +++ config/x86/target.h (working copy)
> @@ -126,13 +126,33 @@ htm_abort_should_retry (uint32_t begin_ret)
> return begin_ret & _XABORT_RETRY;
> }
>
> +static inline bool
> +htm_is_capacity_abort(uint32_t begin_ret)
> +{
> + return begin_ret & _XABORT_CAPACITY;
> +}
> +
> /* Returns true iff a hardware transaction is currently being executed. */
> static inline bool
> htm_transaction_active ()
> {
> return _xtest() != 0;
> }
> +
> +static inline uint64_t
> +get_efficient_time()
> +{
> + uint32_t hi, lo;
> + __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
> + return ((unsigned long long) lo) | (((unsigned long long) hi) << 32);
> +}
> +
> +static inline bool
> +allows_reoptimization()
> +{
> + return true;
> +}
> +
> #endif
>
> -
> } // namespace GTM
> Index: config/aarch64/target.h
> ===================================================================
> --- config/aarch64/target.h (revision 219316)
> +++ config/aarch64/target.h (working copy)
> @@ -42,4 +42,10 @@ cpu_relax (void)
> __asm volatile ("" : : : "memory");
> }
>
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> } // namespace GTM
> Index: config/sparc/target.h
> ===================================================================
> --- config/sparc/target.h (revision 219316)
> +++ config/sparc/target.h (working copy)
> @@ -39,4 +39,10 @@ cpu_relax (void)
> __asm volatile ("rd %%ccr, %%g0" : : : "memory");
> }
>
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> } // namespace GTM
> Index: config/sh/target.h
> ===================================================================
> --- config/sh/target.h (revision 219316)
> +++ config/sh/target.h (working copy)
> @@ -44,4 +44,10 @@ cpu_relax (void)
> __asm volatile ("" : : : "memory");
> }
>
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> } // namespace GTM
> Index: libitm_i.h
> ===================================================================
> --- libitm_i.h (revision 219316)
> +++ libitm_i.h (working copy)
> @@ -242,6 +242,9 @@ struct gtm_thread
> uint32_t restart_reason[NUM_RESTARTS];
> uint32_t restart_total;
>
> + // Keeps track of how many transactions were successfully executed.
> + uint64_t number_executed_txns;
> +
> // *** The shared part of gtm_thread starts here. ***
> // Shared state is on separate cachelines to avoid false sharing with
> // thread-local parts of gtm_thread.
> @@ -309,6 +312,94 @@ struct gtm_thread
> void commit_user_actions ();
> };
>
> +// Definition of fixed point arithmetic types.
> +// FIXED_PT_FRAC_WIDTH bits are dedicated to the fractional part, and the
> +// remaining FIXED_PT_INTEGER_WIDTH bits to the "whole" part.
> +struct fixed_pt_t
> +{
> + uint64_t number;
> +
> + fixed_pt_t(double n)
> + {
> + this->number = static_cast<uint64_t>(n * FIXED_PT_ONE.number + (n >= 0 ? 0.5 : -0.5));
> + }
> +
> + fixed_pt_t add(const fixed_pt_t operand) const;
> + fixed_pt_t sub(const fixed_pt_t operand) const;
> + fixed_pt_t mul(const fixed_pt_t operand) const;
> + fixed_pt_t div(const fixed_pt_t operand) const;
> + fixed_pt_t sqrt() const;
> + fixed_pt_t ln() const;
> +
> + static fixed_pt_t one();
> +
> + static const uint32_t FIXED_PT_WIDTH;
> + static const uint32_t FIXED_PT_INTEGER_WIDTH;
> + static const uint32_t FIXED_PT_FRAC_WIDTH;
> + static const fixed_pt_t FIXED_PT_ONE;
> + static const fixed_pt_t FIXED_PT_TWO;
> +};
> +
> +// Different ways to deal with capacity aborts in HTM execution.
> +enum gtm_capacity_abort_mode
> +{
> + STUBBORN,
> + HALVEN,
> + GIVEUP
> +};
> +
> +// Maintains the current values optimized for HTM execution and the
> +// corresponding statistics gathered for the decision-making.
> +struct gtm_global_optimizer
> +{
> + // Whether to re-optimize the HTM execution. This is parametrizable by an
> + // env variable.
> + uint32_t is_enabled;
> +
> + // Mode chosen to currently deal with capacity aborts.
> + gtm_capacity_abort_mode optimized_mode;
> + // The number of attempts chosen to currently insist on HTM execution is
> + // maintained in htm_fastpath.
> +
> + uint64_t last_cycles;
> + uint64_t last_total_txs_executed;
> +
> + fixed_pt_t last_throughput;
> + uint32_t last_attempts;
> +
> + fixed_pt_t best_ever_throughput;
> + uint32_t best_ever_attempts;
> +
> + uint64_t txns_while_stubborn;
> + uint64_t cycles_while_stubborn;
> + uint64_t txns_while_halven;
> + uint64_t cycles_while_halven;
> + uint64_t txns_while_giveup;
> + uint64_t cycles_while_giveup;
> +
> + gtm_global_optimizer() :
> + last_throughput(fixed_pt_t(0.0001)),
> + best_ever_throughput(fixed_pt_t(0.0001))
> + {
> + }
> +
> + void
> + set_is_enabled(bool arg)
> + {
> + is_enabled = allows_reoptimization() && arg;
> + }
> +
> + void reoptimize_htm_execution();
> +
> +  // Period, in number of transactions, at which we execute the
> +  // re-optimization procedure.
> + static const uint32_t OPTIMIZATION_PERIOD_TXS = 500;
> +  // The initial, fictional, last throughput is fixed_pt_t(0.001); it is
> +  // assigned directly in htm_mg::init, as fixed_pt_t has no constexpr ctor.
> + // This serves as an initialization for the cycles used so far.
> + static const uint32_t INIT_CYCLES = 100;
> +};
> +
> } // namespace GTM
>
> #include "tls.h"
> @@ -346,6 +437,9 @@ extern gtm_cacheline_mask gtm_mask_stack(gtm_cache
> // the name, avoiding complex name mangling.
> extern uint32_t htm_fastpath __asm__(UPFX "gtm_htm_fastpath");
>
> +// Maintains the optimization for HTM execution.
> +extern gtm_global_optimizer optimizer;
> +
> } // namespace GTM
>
> #endif // LIBITM_I_H
> Index: common.h
> ===================================================================
> --- common.h (revision 219316)
> +++ common.h (working copy)
> @@ -59,6 +59,11 @@ extern void * xcalloc (size_t s, bool separate_cl
> extern void * xrealloc (void *p, size_t s, bool separate_cl = false)
> __attribute__((malloc, nothrow));
>
> +// Linear congruential pseudo-random number generator, returning a value
> +// in [0, max). The same recurrence is used in glibc for generators with
> +// small state.
> +extern uint32_t random_int(uint32_t max);
> +
> } // namespace GTM