Hello everyone,
gently pinging to bring this back to life, given the last patch I emailed.
Best regards,
-- Nuno Diegues
On Mon, Aug 24, 2015 at 12:51 PM, Nuno Diegues <n...@ist.utl.pt> wrote:
> Hello everyone,
>
> after a summer internship and some catching up on my backlog in the
> past weeks, I have finally gotten around to revising the patch
> according to the latest discussion.
>
> The changes have not caused any regressions, and the latest speedup
> results are consistent with what we had before.
>
>
> In the following I add some inline comments to the discussion, and
> after that I paste the updated patch.
>
>
>> >
>> > So let's iterate on this in parallel with the other changes that we need
>> > to get in place. I'd prefer to have some more confidence that measuring
>> > txn throughput in epochs is the best way forward.
>> >
>> > Here are a few thoughts:
>> >
>> > Why do we actually need to measure succeeded transactions? If a HW txn
>> > is just getting committed without interfering with anything else, is
>> > this really different from, say, two HW txns getting committed? Aren't
>> > we really just interested in wasted work, or delayed txns? That may
>> > help taking care of the nontxnal vs. txnal problem.
>> >
>> > Measuring committed txns during a time that might otherwise be spent by
>> > a serial txns could be useful to figure out how much other useful work a
>> > serial txn prevents. But we'd see that as well if we'd just go serial
>> > during the auto-tuning because then concurrent txns will have to wait;
>> > and in this case we could measure it in the slow path of those
>> > concurrent txns (ie, during waiting, where measurement overhead wouldn't
>> > matter as much).
>> >
>> > If a serial txn is running, concurrent txns (that wait) are free to sync
>> > and tell the serial how much cost it creates for the concurrent txns.
>> > There, txn length could matter, but we won't find out for real until
>> > after the concurrent ones have run (they could be pretty short, so we
>> > can't simply assume that the concurrent ones are as long as the serial
>> > one, so that simply the number of concurrent ones can be used to
>> > calculate delayed work).
>
>
> I understand your concern: measuring failure in a slow path rather
> than success in the fast/common path.
>
> My problem is that the approach taken in our work is tailored to
> assessing one given metric, in our case some form of throughput of
> successfully executed atomic blocks.
> However, to measure failure, we seem to need two things:
> 1) the rate of failed hardware transactions
> 2) the time spent waiting for the serial lock
>
> In short, performance will be bad when the mix of those two issues is
> bad enough. The problem is how to define a single metric that
> considers the two together; it is not obvious to me that there is
> one. Furthermore, that deviates significantly from our approach, and
> in the end --- even if there is a solution that works that way ---
> it is a different approach, not one that can be applied to our
> self-tuning framework.
>
> What we are doing right now is incrementing a counter in a (most
> likely cached) line that is single-writer and multiple-reader. Most
> often, the line will be owned by its single writer. As such, the cost
> should be negligible compared to the execution of the atomic block,
> more so when a hardware transaction was used, as the commit itself
> should take a hundred cycles or more.
>
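> To make this concrete, here is a minimal standalone sketch of that
> counter pattern (illustrative only: the names are made up, and the
> patch below uses a plain per-thread field rather than this exact
> shape):
>
> #include <cstdint>
>
> struct per_thread_stats
> {
>   // Single-writer/multi-reader: only the owning thread writes the
>   // counter, so the cache line mostly stays in that core's cache;
>   // the re-optimizing thread only reads it once per period.
>   uint64_t executed_txns;
> };
>
> // The owning thread increments stats.executed_txns on each successful
> // commit. The re-optimizing thread, once per period, sums the counters
> // over all threads, subtracts the previous total, and divides by the
> // elapsed cycles:
> static inline uint64_t
> throughput_scaled (uint64_t new_txs, uint64_t cycles)
> {
>   // Integer throughput scaled by 2^20 to keep precision without
>   // floating point; the patch uses fixed_pt_t for the same reason.
>   return (new_txs << 20) / cycles;
> }
>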
> A good way to measure this impact is to use the new "htm_notune" flag
> and compare its performance with the unchanged libitm that I used as
> a baseline above. The average results are similar, with no noticeable
> trend outside the standard deviation in either direction.
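>
> (With the patch applied, that comparison can be driven by the existing
> ITM_DEFAULT_METHOD environment variable: ITM_DEFAULT_METHOD=htm_notune
> runs with the counters but without the tuning, versus plain "htm" on
> the unpatched libitm.)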
>
>
>
>
>>
>> >
>> >>
>> >> > Also, note that the mitigation strategy for rdtsc
>> >> > short-comings that you mention in the paper is not applicable in
>> >> > general, specifically not in libitm.
>> >>
>> >>
>> >> I suppose you mean the preemption of threads inflating the cycles
>> >> measured?
>> >
>> > Yes, preemption and migration of threads (in case there's no global
>> > sync'ed TSC or similar) -- you mentioned in the paper that you pin
>> > threads to cores...
>> >
>> >> This would be similarly a problem to any time source that tries to
>> >> measure the
>> >> amount of work performed; not sure how we can avoid it in general. Any
>> >> thoughts?
>> >
>> > Not really as long as we keep depending on measuring time in a
>> > light-weight way. Measuring smaller time intervals could make it less
>> > likely that preemption happens during such an interval, though.
>> >
>
>
> Note that in this patch we use no such pinning. In fact, I have
> measured that it does not have a noticeable impact on performance.
> Maybe it would with heavy over-subscription of the cores by lots of
> threads. Still, this is not a correctness issue, so I believe that
> performing well in the common case should make this approach prevail.
>
>
>
>>
>> >>
>> >> > Another issue is that we need to implement the tuning in a portable way.
>> >> > You currently make it depend on whether the assembler knows about RTM,
>> >> > whereas the rest of the code makes this more portable and defers to
>> >> > arch-specific files. I'd prefer if we could use the tuning on other
>> >> > archs too. But for that, we need to cleanly separate generic from
>> >> > arch-specific parts. That affects magic numbers as well as things like
>> >> > rdtsc().
>> >>
>> >>
>> >> Yes, I refrained from adding new calls to the arch-specific files, to
>> >> contain the
>> >> changes mainly. But that is possible and that's part of the feedback I
>> >> was hoping
>> >> to get.
>> >
>> > OK. Let me know if you want further input regarding this.
>
>
> I have factored out this arch-specific part in the new patch. PTAL.
>
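> For archs without a cheap time source, the new hooks simply opt out.
> A minimal sketch of that fallback shape, mirroring what the patch
> below adds to the config/<arch>/target.h files:
>
> static inline uint64_t
> get_efficient_time()
> {
>   return 0;  // no cheap cycle counter available on this arch
> }
>
> static inline bool
> allows_reoptimization()
> {
>   return false;  // set_is_enabled() then keeps the tuning disabled
> }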
>
>
>>
>> >
>> >> > I'm wondering about whether it really makes sense to treat XABORT like
>> >> > conflicts and other abort reasons, instead of like capacity aborts.
>> >> > Perhaps we need to differentiate between the explicit abort codes glibc
>> >> > uses, so that we can distinguish between cases where an abort is
>> >> > supposed to signal incompatibility with txnal execution and cases where
>> >> > it's just used for lock elision (see sysdeps/unix/sysv/linux/x86/hle.h
>> >> > in current glibc):
>> >> > #define _ABORT_LOCK_BUSY 0xff
>> >> > #define _ABORT_LOCK_IS_LOCKED 0xfe
>> >> > #define _ABORT_NESTED_TRYLOCK 0xfd
>> >>
>> >>
>> >> I am not sure I follow: are you suggesting to consider other aborts
>> >> besides
>> >> those of capacity?
>> >
>> > We may want to do this with UCB, similar to capacity. Another option
>> > would be to, for now, make fixed choices regarding whether they are
>> > considered permanent or not.
>> > _ABORT_NESTED_TRYLOCK is an incompatibility with txnal execution
>> > (similar to illegal instructions and such).
>> > _ABORT_LOCK_BUSY is failing lock elision due to the lock being already
>> > acquired; thus, this is transient I'd say.
>> >
>> > Andi, what was _ABORT_LOCK_IS_LOCKED used for again?
>
>
> This seems a more or less trivial change once it is agreed upon, so I
> suggest we leave it for later, if/when this patch gets through.
>
>
>>
>> >> > > --- libitm/config/x86/target.h (revision 219316)
>> >> > > +++ libitm/config/x86/target.h (working copy)
>> >> > > @@ -126,12 +126,25 @@ htm_abort_should_retry (uint32_t begin_ret)
>> >> > > return begin_ret & _XABORT_RETRY;
>> >> > > }
>> >> > >
>> >> > > +static inline bool
>> >> > > +htm_is_capacity_abort(uint32_t begin_ret)
>> >> > > +{
>> >> > > + return begin_ret & _XABORT_CAPACITY;
>> >> > > +}
>> >> >
>> >> > Is this indeed just about capacity, or do we actually mean a more
>> >> > general situation?
>> >>
>> >> The idea of the Upper Confidence Bounds (UCB) part of the algorithm was
>> >> to detect "real" capacity aborts (i.e., permanent) vs those that are
>> >> "transient".
>> >> So in this case we really wanted to know if the abort was a capacity one
>> >> so
>> >> that we could direct the UCB algorithm and understand the "type" of
>> >> capacity
>> >> abort.
>> >
>> > OK. But do you want to single out capacity aborts as one case for which
>> > you want to detect permanent vs. transient, or do you want to generally
>> > find out whether aborts that are reported as permanent (e.g., capacity)
>> > are indeed permanent? In the latter case, a more generic name would be
>> > more appropriate. (And that's not just a naming thing, but guidance for
>> > other archs regarding what they should put in the "capacity" bucket of
>> > abort reasons...)
>> >
>> >
>
>
> As far as we know, capacity aborts are the ones that fit our
> criteria: they are identifiable as a distinct abort category, yet
> they are not necessarily persistent/deterministic. This could be
> expanded to other categories in the future, but it is not something
> we tested, as other abort types are not as widespread.
>
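> For reference, here is a floating-point sketch of the UCB decision
> among the three capacity-abort modes (the patch itself uses the
> fixed-point arithmetic type; names here are illustrative):
>
> #include <cmath>
>
> enum mode { STUBBORN, HALVEN, GIVEUP };
>
> // Classic UCB1 over three "arms": exploit the empirical reward (the
> // throughput observed while each mode was active, normalized by the
> // best throughput ever seen), plus an exploration bonus that grows
> // for arms that have been sampled rarely.
> static mode
> pick_capacity_mode (const double txns[3], const double cycles[3],
>                     double best_ever_throughput)
> {
>   double total_txns = txns[0] + txns[1] + txns[2];
>   mode best = STUBBORN;
>   double best_ucb = -1.0;
>   for (int m = 0; m < 3; m++)
>     {
>       double reward = (txns[m] / cycles[m]) / best_ever_throughput;
>       double ucb = reward
>                    + std::sqrt (2.0 * std::log (total_txns) / txns[m]);
>       if (ucb > best_ucb)
>         {
>           best_ucb = ucb;
>           best = static_cast<mode> (m);
>         }
>     }
>   return best;
> }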
>
>
>
>
> Updated patch follows:
>
> Index: util.cc
> ===================================================================
> --- util.cc (revision 219316)
> +++ util.cc (working copy)
> @@ -94,4 +94,15 @@ xrealloc (void *old, size_t size, bool separate_cl
> return r;
> }
>
> +// State for the generation is not synchronized. It assumes single thread
> +// usage at any time. It is, in fact, invoked by the thread that is
> +// re-optimizing the libitm HTM usage, so the assumption holds.
> +uint32_t
> +random_int(uint32_t max)
> +{
> + static uint32_t seed = 123456789;
> + seed = (1103515245 * seed + 12345);
> + return seed % max;
> +}
> +
> } // namespace GTM
> Index: libitm.h
> ===================================================================
> --- libitm.h (revision 219316)
> +++ libitm.h (working copy)
> @@ -101,7 +101,11 @@ typedef enum
> /* These are not part of the ABI but used for custom HTM fast paths. See
> ITM_beginTransaction and gtm_thread::begin_transaction. */
> pr_HTMRetryableAbort = 0x800000,
> - pr_HTMRetriedAfterAbort = 0x1000000
> + pr_HTMRetriedAfterAbort = 0x1000000,
> + /* Reports whether a capacity abort was detected in the custom HTM fast
> + path, so that it can be used in the abort handler in
> + gtm_thread::begin_transaction. */
> + pr_HTMCapacityAbort = 0x2000000
> } _ITM_codeProperties;
>
> /* Result from startTransaction that describes what actions to take.
> Index: method-serial.cc
> ===================================================================
> --- method-serial.cc (revision 219316)
> +++ method-serial.cc (working copy)
> @@ -223,6 +223,19 @@ struct htm_mg : public method_group
> // initially disabled.
> #ifdef USE_HTM_FASTPATH
> htm_fastpath = htm_init();
> + optimizer.optimized_mode = STUBBORN;
> + optimizer.last_cycles = get_efficient_time();
> + optimizer.last_total_txs_executed = 0;
> + optimizer.last_throughput = fixed_pt_t(0.001);
> + optimizer.last_attempts = htm_fastpath > 0 ? htm_fastpath - 1 : 1;
> + optimizer.best_ever_throughput = optimizer.last_throughput;
> + optimizer.best_ever_attempts = htm_fastpath;
> + optimizer.txns_while_stubborn = 1;
> + optimizer.cycles_while_stubborn = optimizer.INIT_CYCLES;
> + optimizer.txns_while_halven = 1;
> + optimizer.cycles_while_halven = optimizer.INIT_CYCLES;
> + optimizer.txns_while_giveup = 1;
> + optimizer.cycles_while_giveup = optimizer.INIT_CYCLES;
> #endif
> }
> virtual void fini()
> Index: retry.cc
> ===================================================================
> --- retry.cc (revision 219316)
> +++ retry.cc (working copy)
> @@ -218,7 +218,6 @@ GTM::gtm_thread::set_default_dispatch(GTM::abi_dis
> default_dispatch.store(disp, memory_order_relaxed);
> }
>
> -
> static GTM::abi_dispatch*
> parse_default_method()
> {
> @@ -254,10 +253,17 @@ parse_default_method()
> disp = GTM::dispatch_ml_wt();
> env += 5;
> }
> + else if (strncmp(env, "htm_notune", 10) == 0)
> + {
> + disp = GTM::dispatch_htm();
> + env += 10;
> + GTM::optimizer.set_is_enabled(0);
> + }
> else if (strncmp(env, "htm", 3) == 0)
> {
> disp = GTM::dispatch_htm();
> env += 3;
> + GTM::optimizer.set_is_enabled(1);
> }
> else
> goto unknown;
> Index: beginend.cc
> ===================================================================
> --- beginend.cc (revision 219316)
> +++ beginend.cc (working copy)
> @@ -39,6 +39,9 @@ unsigned GTM::gtm_thread::number_of_threads = 0;
> gtm_stmlock GTM::gtm_stmlock_array[LOCK_ARRAY_SIZE];
> atomic<gtm_version> GTM::gtm_clock;
>
> +// Holds the optimized configuration for parameters relevant to HTM execution.
> +gtm_global_optimizer GTM::optimizer;
> +
> /* ??? Move elsewhere when we figure out library initialization. */
> uint64_t GTM::gtm_spin_count_var = 1000;
>
> @@ -95,7 +98,139 @@ thread_exit_init()
> GTM_fatal("Creating thread release TLS key failed.");
> }
>
> +/* Implementations for the fixed-point arithmetic type. */
> +const uint32_t GTM::fixed_pt_t::FIXED_PT_WIDTH = 64;
> +const uint32_t GTM::fixed_pt_t::FIXED_PT_INTEGER_WIDTH = 40;
> +const uint32_t GTM::fixed_pt_t::FIXED_PT_FRAC_WIDTH =
> + GTM::fixed_pt_t::FIXED_PT_WIDTH - GTM::fixed_pt_t::FIXED_PT_INTEGER_WIDTH;
> +const fixed_pt_t GTM::fixed_pt_t::FIXED_PT_ONE = GTM::fixed_pt_t::one();
> +const fixed_pt_t GTM::fixed_pt_t::FIXED_PT_TWO =
> + GTM::fixed_pt_t::FIXED_PT_ONE.add(GTM::fixed_pt_t::FIXED_PT_ONE);
>
> +fixed_pt_t
> +GTM::fixed_pt_t::one()
> +{
> + fixed_pt_t result(0);
> + result.number = static_cast<uint64_t>(1) << FIXED_PT_FRAC_WIDTH;
> + return result;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::add(const fixed_pt_t operand) const
> +{
> + fixed_pt_t result(0);
> + result.number = this->number + operand.number;
> + return result;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::sub(const fixed_pt_t operand) const
> +{
> + fixed_pt_t result(0);
> + result.number = this->number - operand.number;
> + return result;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::mul(const fixed_pt_t operand) const
> +{
> + fixed_pt_t result(0);
> + result.number = (static_cast<__uint128_t>(this->number) *
> + static_cast<__uint128_t>(operand.number)) >> FIXED_PT_FRAC_WIDTH;
> + return result;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::div(const fixed_pt_t operand) const
> +{
> + fixed_pt_t result(0);
> + result.number =
> + (static_cast<__uint128_t>(this->number) << FIXED_PT_FRAC_WIDTH) /
> + static_cast<__uint128_t>(operand.number);
> + return result;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::sqrt() const
> +{
> + uint64_t invert = 0;
> + uint64_t iter = FIXED_PT_FRAC_WIDTH;
> + uint64_t i;
> + fixed_pt_t n = *this;
> +
> + if (n.number == 0 || n.number == FIXED_PT_ONE.number)
> + {
> + return n;
> + }
> + if (n.number < FIXED_PT_ONE.number && n.number > 6)
> + {
> + invert = 1;
> + n = FIXED_PT_ONE.div(n);
> + }
> + if (n.number > FIXED_PT_ONE.number)
> + {
> + uint64_t s = n.number;
> + iter = 0;
> + while (s > 0)
> + {
> + s >>= 2;
> + iter++;
> + }
> + }
> +
> + fixed_pt_t l(0);
> + l.number = (n.number >> 1) + 1;
> + for (i = 0; i < iter; i++)
> + {
> + l.number = l.add(n.div(l)).number >> 1;
> + }
> + if (invert)
> + {
> + return FIXED_PT_ONE.div(l);
> + }
> + return l;
> +}
> +
> +fixed_pt_t
> +GTM::fixed_pt_t::ln() const
> +{
> + const fixed_pt_t LN2 = fixed_pt_t(0.69314718055994530942);
> + const fixed_pt_t LG[7] = {
> + fixed_pt_t(6.666666666666735130e-01),
> + fixed_pt_t(3.999999999940941908e-01),
> + fixed_pt_t(2.857142874366239149e-01),
> + fixed_pt_t(2.222219843214978396e-01),
> + fixed_pt_t(1.818357216161805012e-01),
> + fixed_pt_t(1.531383769920937332e-01),
> + fixed_pt_t(1.479819860511658591e-01)
> + };
> +
> + if (this->number == 0)
> + {
> + return 0xffffffff;
> + }
> +
> + fixed_pt_t log2 = fixed_pt_t(0);
> + fixed_pt_t xi = *this;
> + while (xi.number > FIXED_PT_TWO.number)
> + {
> + xi.number >>= 1;
> + log2.number++;
> + }
> +
> + fixed_pt_t f = xi.sub(FIXED_PT_ONE);
> + fixed_pt_t s = f.div(FIXED_PT_TWO.add(f));
> + fixed_pt_t z = s.mul(s);
> + fixed_pt_t w = z.mul(z);
> + fixed_pt_t R =
> + w.mul(LG[1].add(w.mul(LG[3].add(w.mul(LG[5]))))).add(
> + z.mul(LG[0].add(w.mul(LG[2].add(
> + w.mul(LG[4].add(w.mul(LG[6]))))))));
> +
> + return LN2.mul(fixed_pt_t(log2.number)).add(
> + f.sub(s.mul(f.sub(R))));
> +}
> +
> GTM::gtm_thread::~gtm_thread()
> {
> if (nesting > 0)
> @@ -149,6 +284,180 @@ choose_code_path(uint32_t prop, abi_dispatch *disp
> return a_runInstrumentedCode;
> }
>
> +// Periodically invoked by some thread(s) that are in the HTM fall-back path.
> +// This implements the Upper Confidence Bounds and Gradient Descent to tune,
> +// respectively, how to deal with HTM capacity aborts and how many HTM
> +// retries to allow. For more details, check Diegues et al. in USENIX ICAC 2014.
> +void
> +GTM::gtm_global_optimizer::reoptimize_htm_execution()
> +{
> + // Avoid redundant re-optimization if another thread may be doing it
> + // already. This allows the threads to avoid racing to re-optimize
> + // concurrently, which is useless.
> +
> + if (unlikely(GTM::gtm_thread::serial_lock.is_write_locked()))
> + {
> + return;
> + }
> + // Eventually, some thread(s) will follow this path, even in the face of
> + // spurious short-cut exits due to the serial lock being taken.
> + GTM::gtm_thread::serial_lock.write_lock();
> +
> + // Collect the statistics obtained so far.
> + uint64_t total_txs_executed = 0;
> + gtm_thread *ptr = GTM::gtm_thread::list_of_threads;
> + for (; ptr; ptr = ptr->next_thread)
> + {
> + total_txs_executed += ptr->number_executed_txns;
> + }
> +
> + // Obtain the delta performance with respect to the last period measured.
> + // We expect the underlying architecture to provide an efficient way to
> + // measure time passed since the last re-optimization.
> + const uint64_t current_cycles = get_efficient_time();
> + const uint64_t cycles_used = current_cycles - optimizer.last_cycles;
> + const uint64_t new_txs_executed =
> + total_txs_executed - optimizer.last_total_txs_executed;
> + const fixed_pt_t current_throughput =
> + fixed_pt_t(new_txs_executed).div(fixed_pt_t(cycles_used));
> +
> + if (current_throughput.number == 0)
> + {
> + // This thread probably executed right after another one as they both
> + // found it was time to re-optimize, but did not contend on the serial
> + // lock. As such, we avoid redundant re-optimizations and short-cut out.
> + GTM::gtm_thread::serial_lock.write_unlock();
> + return;
> + }
> +
> + optimizer.last_cycles = current_cycles;
> + optimizer.last_total_txs_executed = total_txs_executed;
> + if (optimizer.optimized_mode == STUBBORN)
> + {
> + optimizer.txns_while_stubborn += new_txs_executed;
> + optimizer.cycles_while_stubborn += cycles_used;
> + }
> + else if (optimizer.optimized_mode == HALVEN)
> + {
> + optimizer.txns_while_halven += new_txs_executed;
> + optimizer.cycles_while_halven += cycles_used;
> + }
> + else
> + {
> + optimizer.txns_while_giveup += new_txs_executed;
> + optimizer.cycles_while_giveup += cycles_used;
> + }
> +
> + // Compute Upper Confidence Bounds for the mode to choose next to decide on
> + // how to deal with HTM capacity aborts.
> + // Use fixed point arithmetic types to spare floating point usage.
> + const fixed_pt_t log_sum =
> + GTM::fixed_pt_t::FIXED_PT_TWO.mul(
> + (fixed_pt_t(optimizer.txns_while_stubborn).add(
> + fixed_pt_t(optimizer.txns_while_halven)).add(
> + fixed_pt_t(optimizer.txns_while_giveup))).ln());
> + const fixed_pt_t ucb_stubborn =
> + ((fixed_pt_t(optimizer.txns_while_stubborn).div(
> + fixed_pt_t(optimizer.cycles_while_stubborn))).div(
> + optimizer.best_ever_throughput)).add(
> + (log_sum.div(fixed_pt_t(optimizer.txns_while_stubborn))).sqrt());
> + const fixed_pt_t ucb_halven =
> + ((fixed_pt_t(optimizer.txns_while_halven).div(
> + fixed_pt_t(optimizer.cycles_while_halven))).div(
> + optimizer.best_ever_throughput)).add(
> + (log_sum.div(fixed_pt_t(optimizer.txns_while_halven))).sqrt());
> + const fixed_pt_t ucb_giveup =
> + ((fixed_pt_t(optimizer.txns_while_giveup).div(
> + fixed_pt_t(optimizer.cycles_while_giveup)).div(
> + optimizer.best_ever_throughput))).add(
> + (log_sum.div(fixed_pt_t(optimizer.txns_while_giveup))).sqrt());
> +
> + if (ucb_stubborn.number > ucb_halven.number &&
> + ucb_stubborn.number > ucb_giveup.number)
> + {
> + optimizer.optimized_mode = STUBBORN;
> + }
> + else if (ucb_halven.number > ucb_giveup.number)
> + {
> + optimizer.optimized_mode = HALVEN;
> + }
> + else
> + {
> + optimizer.optimized_mode = GIVEUP;
> + }
> +
> + // Compute Gradient Descent to regulate the number of retries in HTM.
> + const fixed_pt_t LARGE_DEGRADATION = fixed_pt_t(1.40);
> + const fixed_pt_t MIN_IMPROVEMENT = fixed_pt_t(1.05);
> + const int32_t MAX_RETRIES = 20;
> + const int32_t MIN_RETRIES = 2;
> +
> + const fixed_pt_t change_for_better =
> + current_throughput.div(optimizer.last_throughput);
> + const fixed_pt_t change_for_worse =
> + optimizer.last_throughput.div(current_throughput);
> + int32_t last_attempts = optimizer.last_attempts;
> + int32_t current_attempts = htm_fastpath;
> + int32_t new_attempts = current_attempts;
> + if (unlikely(change_for_worse.number > LARGE_DEGRADATION.number))
> + {
> + htm_fastpath = optimizer.best_ever_attempts;
> + optimizer.last_throughput = current_throughput;
> + optimizer.last_attempts = current_attempts;
> + goto finish_reoptimization;
> + }
> +
> + if (unlikely(random_int(100) == 0))
> + {
> + optimizer.last_attempts = htm_fastpath;
> + optimizer.last_throughput = current_throughput;
> + htm_fastpath = random_int(MAX_RETRIES - MIN_RETRIES) + MIN_RETRIES;
> + goto finish_reoptimization;
> + }
> +
> + if (change_for_better.number > MIN_IMPROVEMENT.number)
> + {
> + new_attempts += current_attempts - last_attempts;
> + if (current_attempts == last_attempts)
> + {
> + new_attempts += random_int(2) == 0 ? -1 : 1;
> + }
> + }
> + else if (change_for_worse.number > MIN_IMPROVEMENT.number)
> + {
> + new_attempts -= current_attempts - last_attempts;
> + if (current_attempts == last_attempts)
> + {
> + new_attempts += random_int(2) == 0 ? -1 : 1;
> + }
> + }
> +
> + if (unlikely(new_attempts > MAX_RETRIES))
> + {
> + new_attempts = MAX_RETRIES;
> + }
> + else if (unlikely(new_attempts < MIN_RETRIES))
> + {
> + new_attempts = MIN_RETRIES;
> + }
> +
> + if (current_attempts != new_attempts)
> + {
> + optimizer.last_attempts = current_attempts;
> + optimizer.last_throughput = current_throughput;
> + }
> +
> + htm_fastpath = new_attempts;
> + if (unlikely(current_throughput.number > optimizer.best_ever_throughput.number))
> + {
> + optimizer.best_ever_throughput = current_throughput;
> + optimizer.best_ever_attempts = current_attempts;
> + }
> +
> + finish_reoptimization:
> + GTM::gtm_thread::serial_lock.write_unlock();
> +}
> +
> uint32_t
> GTM::gtm_thread::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb)
> {
> @@ -209,8 +518,36 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
> }
> // The transaction has aborted. Don't retry if it's unlikely that
> // retrying the transaction will be successful.
> - if (!htm_abort_should_retry(ret))
> - break;
> + if (optimizer.is_enabled)
> + {
> + // If the HTM self-tuning is enabled, then we act according to
> + // its configuration.
> + if (htm_is_capacity_abort(ret))
> + {
> + gtm_capacity_abort_mode capacity_mode =
> + optimizer.optimized_mode;
> + // Adapt the retries according to the mode. Note that the
> + // linear decrease is implicit in the outer for loop.
> + if (capacity_mode == HALVEN)
> + t = t / 2 + 1;
> + else if (capacity_mode == GIVEUP)
> + break;
> + }
> +
> + // Take the chance of having aborted to possibly re-optimize the
> + // HTM configuration. Do this as a last resort before going
> + // for the serial lock.
> + if (unlikely(tx->number_executed_txns %
> + optimizer.OPTIMIZATION_PERIOD_TXS == 0 && t == 1))
> + optimizer.reoptimize_htm_execution();
> + }
> + else
> + {
> + // Execute the fixed logic when there is no adaptation.
> + if (!htm_abort_should_retry(ret))
> + break;
> + }
> +
> // Wait until any concurrent serial-mode transactions have finished.
> // This is an empty critical section, but won't be elided.
> if (serial_lock.is_write_locked())
> @@ -248,7 +585,11 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
> // HTM fastpath aborted, and that we thus have to decide whether to retry
> // the fastpath (returning a_tryHTMFastPath) or just proceed with the
> // fallback method.
> - if (likely(htm_fastpath && (prop & pr_HTMRetryableAbort)))
> + // Even if pr_HTMRetryableAbort is not true, we still consider to retry if
> + // it was a capacity abort (stated by pr_HTMCapacityAbort) and the self
> + // tuning capabilities are enabled (stated by optimizer.is_enabled).
> + if (likely(htm_fastpath && ((prop & pr_HTMRetryableAbort)
> + || ((prop & pr_HTMCapacityAbort) && (optimizer.is_enabled)))))
> {
> tx = gtm_thr();
> if (unlikely(tx == NULL))
> @@ -266,6 +607,26 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
>
> if (--tx->restart_total > 0)
> {
> + // See above: we use the optimized configuration if enabled.
> + if (optimizer.is_enabled)
> + {
> + // See above: similarly, we adapt upon capacity aborts.
> + if (prop & pr_HTMCapacityAbort)
> + {
> + gtm_capacity_abort_mode capacity_mode = optimizer.optimized_mode;
> + if (capacity_mode == HALVEN)
> + tx->restart_total = tx->restart_total / 2 + 1;
> + else if (capacity_mode == GIVEUP)
> + goto stop_custom_htm_fastpath;
> + }
> +
> + // See above: similarly, we periodically re-optimize.
> + if (unlikely(tx->number_executed_txns %
> + optimizer.OPTIMIZATION_PERIOD_TXS == 0 &&
> + tx->restart_total == 1))
> + optimizer.reoptimize_htm_execution();
> + }
> +
> // Wait until any concurrent serial-mode transactions have finished.
> // Essentially the same code as above.
> if (serial_lock.is_write_locked())
> @@ -665,12 +1026,23 @@ _ITM_commitTransaction(void)
> if (likely(htm_fastpath && !gtm_thread::serial_lock.is_write_locked()))
> {
> htm_commit();
> + gtm_thread *tx = gtm_thr();
> + if (unlikely(tx == NULL))
> + {
> + // See above: when using the HTM fast-path, we might not have created
> + // a thread object yet.
> + tx = new gtm_thread();
> + set_gtm_thr(tx);
> + }
> + tx->number_executed_txns++;
> return;
> }
> #endif
> gtm_thread *tx = gtm_thr();
> if (!tx->trycommit ())
> tx->restart (RESTART_VALIDATE_COMMIT);
> +
> + tx->number_executed_txns++;
> }
>
> void ITM_REGPARM
> @@ -681,6 +1053,14 @@ _ITM_commitTransactionEH(void *exc_ptr)
> if (likely(htm_fastpath && !gtm_thread::serial_lock.is_write_locked()))
> {
> htm_commit();
> + gtm_thread *tx = gtm_thr();
> + if (unlikely(tx == NULL))
> + {
> + // See above.
> + tx = new gtm_thread();
> + set_gtm_thr(tx);
> + }
> + tx->number_executed_txns++;
> return;
> }
> #endif
> @@ -690,4 +1070,5 @@ _ITM_commitTransactionEH(void *exc_ptr)
> tx->eh_in_flight = exc_ptr;
> tx->restart (RESTART_VALIDATE_COMMIT);
> }
> + tx->number_executed_txns++;
> }
> Index: config/s390/target.h
> ===================================================================
> --- config/s390/target.h (revision 219316)
> +++ config/s390/target.h (working copy)
> @@ -116,12 +116,30 @@ htm_abort_should_retry (uint32_t begin_ret)
> }
>
> static inline bool
> +htm_is_capacity_abort (uint32_t begin_ret)
> +{
> + return begin_ret != _HTM_TBEGIN_PERSISTENT;
> +}
> +
> +static inline bool
> htm_transaction_active ()
> {
> __asm volatile (".machine \"all\" \n\t");
> return __builtin_tx_nesting_depth() != 0;
> }
>
> +static inline uint64_t
> +get_efficient_time()
> +{
> + return 0;
> +}
> +
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> #endif
>
> } // namespace GTM
> Index: config/arm/target.h
> ===================================================================
> --- config/arm/target.h (revision 219316)
> +++ config/arm/target.h (working copy)
> @@ -46,4 +46,10 @@ cpu_relax (void)
> __sync_synchronize ();
> }
>
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> } // namespace GTM
> Index: config/powerpc/target.h
> ===================================================================
> --- config/powerpc/target.h (revision 219316)
> +++ config/powerpc/target.h (working copy)
> @@ -74,6 +74,7 @@ cpu_relax (void)
> #define _TBEGIN_STARTED 0
> #define _TBEGIN_INDETERMINATE 1
> #define _TBEGIN_PERSISTENT 2
> +#define _TBEGIN_CAPACITY 3
>
> /* Number of retries for transient failures. */
> #define _HTM_RETRIES 10
> @@ -101,6 +102,9 @@ htm_begin (void)
> if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
> return _TBEGIN_PERSISTENT;
>
> + if (_TEXASRU_FOOTPRINT_OVERFLOW (__builtin_get_texasru ()))
> + return _TBEGIN_CAPACITY;
> +
> return _TBEGIN_INDETERMINATE;
> }
>
> @@ -128,6 +132,12 @@ htm_abort_should_retry (uint32_t begin_ret)
> return begin_ret != _TBEGIN_PERSISTENT;
> }
>
> +static inline bool
> +htm_is_capacity_abort (uint32_t begin_ret)
> +{
> + return begin_ret == _TBEGIN_CAPACITY;
> +}
> +
> /* Returns true iff a hardware transaction is currently being executed. */
> static inline bool
> htm_transaction_active (void)
> @@ -135,6 +145,36 @@ htm_transaction_active (void)
> return (_HTM_STATE (__builtin_ttest ()) == _HTM_TRANSACTIONAL);
> }
>
> +static inline uint64_t
> +get_efficient_time()
> +{
> + #if defined (__powerpc64__) || defined (__ppc64__)
> + uint64_t __tb;
> + __asm__ volatile ("mfspr %[tb], 268\n"
> + : [tb]"=r" (__tb)
> + : );
> + return __tb;
> + #else
> + register unsigned long __tbu, __tbl, __tmp;
> + __asm__ volatile (
> + "0:\n"
> + "mftbu %[tbu]\n"
> + "mftbl %[tbl]\n"
> + "mftbu %[tmp]\n"
> + "cmpw %[tbu], %[tmp]\n"
> + "bne- 0b\n"
> + : [tbu]"=r" (__tbu), [tbl]"=r" (__tbl), [tmp]"=r" (__tmp)
> + : );
> + return (( (uint64_t) __tbu << 32) | __tbl);
> + #endif /* not __powerpc64__ */
> +}
> +
> +static inline bool
> +allows_reoptimization()
> +{
> + return true;
> +}
> +
> #endif
>
> } // namespace GTM
> Index: config/alpha/target.h
> ===================================================================
> --- config/alpha/target.h (revision 219316)
> +++ config/alpha/target.h (working copy)
> @@ -41,4 +41,10 @@ cpu_relax (void)
> __asm volatile ("" : : : "memory");
> }
>
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> } // namespace GTM
> Index: config/x86/sjlj.S
> ===================================================================
> --- config/x86/sjlj.S (revision 219316)
> +++ config/x86/sjlj.S (working copy)
> @@ -59,12 +59,14 @@
> #define pr_hasNoAbort 0x08
> #define pr_HTMRetryableAbort 0x800000
> #define pr_HTMRetriedAfterAbort 0x1000000
> +#define pr_HTMCapacityAbort 0x2000000
> #define a_runInstrumentedCode 0x01
> #define a_runUninstrumentedCode 0x02
> #define a_tryHTMFastPath 0x20
>
> #define _XABORT_EXPLICIT (1 << 0)
> #define _XABORT_RETRY (1 << 1)
> +#define _XABORT_CAPACITY (1 << 3)
>
> .text
>
> @@ -111,6 +113,9 @@ SYM(_ITM_beginTransaction):
> testl $(_XABORT_RETRY|_XABORT_EXPLICIT), %eax
> jz .Lno_htm
> orl $pr_HTMRetryableAbort, %edi
> + testl $(_XABORT_CAPACITY), %eax
> + jz .Lno_htm
> + orl $pr_HTMCapacityAbort, %edi
> /* Let the C++ code handle the retry policy. */
> .Lno_htm:
> #endif
> Index: config/x86/target.h
> ===================================================================
> --- config/x86/target.h (revision 219316)
> +++ config/x86/target.h (working copy)
> @@ -126,13 +126,33 @@ htm_abort_should_retry (uint32_t begin_ret)
> return begin_ret & _XABORT_RETRY;
> }
>
> +static inline bool
> +htm_is_capacity_abort(uint32_t begin_ret)
> +{
> + return begin_ret & _XABORT_CAPACITY;
> +}
> +
> /* Returns true iff a hardware transaction is currently being executed. */
> static inline bool
> htm_transaction_active ()
> {
> return _xtest() != 0;
> }
> +
> +static inline uint64_t
> +get_efficient_time()
> +{
> + uint32_t hi, lo;
> + __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
> + return ((unsigned long long) lo) | (((unsigned long long) hi) << 32);
> +}
> +
> +static inline bool
> +allows_reoptimization()
> +{
> + return true;
> +}
> +
> #endif
>
> -
> } // namespace GTM
> Index: config/aarch64/target.h
> ===================================================================
> --- config/aarch64/target.h (revision 219316)
> +++ config/aarch64/target.h (working copy)
> @@ -42,4 +42,10 @@ cpu_relax (void)
> __asm volatile ("" : : : "memory");
> }
>
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> } // namespace GTM
> Index: config/sparc/target.h
> ===================================================================
> --- config/sparc/target.h (revision 219316)
> +++ config/sparc/target.h (working copy)
> @@ -39,4 +39,10 @@ cpu_relax (void)
> __asm volatile ("rd %%ccr, %%g0" : : : "memory");
> }
>
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> } // namespace GTM
> Index: config/sh/target.h
> ===================================================================
> --- config/sh/target.h (revision 219316)
> +++ config/sh/target.h (working copy)
> @@ -44,4 +44,10 @@ cpu_relax (void)
> __asm volatile ("" : : : "memory");
> }
>
> +static inline bool
> +allows_reoptimization()
> +{
> + return false;
> +}
> +
> } // namespace GTM
> Index: libitm_i.h
> ===================================================================
> --- libitm_i.h (revision 219316)
> +++ libitm_i.h (working copy)
> @@ -242,6 +242,9 @@ struct gtm_thread
> uint32_t restart_reason[NUM_RESTARTS];
> uint32_t restart_total;
>
> + // Keeps track of how many transactions were successfully executed.
> + uint64_t number_executed_txns;
> +
> // *** The shared part of gtm_thread starts here. ***
> // Shared state is on separate cachelines to avoid false sharing with
> // thread-local parts of gtm_thread.
> @@ -309,6 +312,94 @@ struct gtm_thread
> void commit_user_actions ();
> };
>
> +// Definition of fixed point arithmetic types.
> +// FIXED_PT_FRAC_WIDTH bits are dedicated to the fractional part, and the
> +// remaining FIXED_PT_INTEGER_WIDTH bits to the "whole" part.
> +struct fixed_pt_t
> +{
> + uint64_t number;
> +
> + fixed_pt_t(double n)
> + {
> + this->number = static_cast<uint64_t>(n * FIXED_PT_ONE.number + (n >= 0 ? 0.5 : -0.5));
> + }
> +
> + fixed_pt_t add(const fixed_pt_t operand) const;
> + fixed_pt_t sub(const fixed_pt_t operand) const;
> + fixed_pt_t mul(const fixed_pt_t operand) const;
> + fixed_pt_t div(const fixed_pt_t operand) const;
> + fixed_pt_t sqrt() const;
> + fixed_pt_t ln() const;
> +
> + static fixed_pt_t one();
> +
> + static const uint32_t FIXED_PT_WIDTH;
> + static const uint32_t FIXED_PT_INTEGER_WIDTH;
> + static const uint32_t FIXED_PT_FRAC_WIDTH;
> + static const fixed_pt_t FIXED_PT_ONE;
> + static const fixed_pt_t FIXED_PT_TWO;
> +};
> +
> +// Different ways to deal with capacity aborts in HTM execution.
> +enum gtm_capacity_abort_mode
> +{
> + STUBBORN,
> + HALVEN,
> + GIVEUP
> +};
> +
> +// Maintains the current values optimized for HTM execution and the
> +// corresponding statistics gathered for the decision-making.
> +struct gtm_global_optimizer
> +{
> + // Whether to re-optimize the HTM execution. This is parametrizable by an
> + // env variable.
> + uint32_t is_enabled;
> +
> + // Mode chosen to currently deal with capacity aborts.
> + gtm_capacity_abort_mode optimized_mode;
> + // The number of attempts chosen to currently insist on HTM execution is
> + // maintained in htm_fastpath.
> +
> + uint64_t last_cycles;
> + uint64_t last_total_txs_executed;
> +
> + fixed_pt_t last_throughput;
> + uint32_t last_attempts;
> +
> + fixed_pt_t best_ever_throughput;
> + uint32_t best_ever_attempts;
> +
> + uint64_t txns_while_stubborn;
> + uint64_t cycles_while_stubborn;
> + uint64_t txns_while_halven;
> + uint64_t cycles_while_halven;
> + uint64_t txns_while_giveup;
> + uint64_t cycles_while_giveup;
> +
> + gtm_global_optimizer() :
> + last_throughput(fixed_pt_t(0.0001)),
> + best_ever_throughput(fixed_pt_t(0.0001))
> + {
> + }
> +
> + void
> + set_is_enabled(bool arg)
> + {
> + is_enabled = allows_reoptimization() && arg;
> + }
> +
> + void reoptimize_htm_execution();
> +
> +  // Period, in number of transactions, at which we execute the
> +  // re-optimization procedure.
> + static const uint32_t OPTIMIZATION_PERIOD_TXS = 500;
> +  // The initial, fictional, last throughput is fixed_pt_t(0.001); it is
> +  // assigned directly in htm_mg::init, as fixed_pt_t has no constexpr ctor.
> + // This serves as an initialization for the cycles used so far.
> + static const uint32_t INIT_CYCLES = 100;
> +};
> +
> } // namespace GTM
>
> #include "tls.h"
> @@ -346,6 +437,9 @@ extern gtm_cacheline_mask gtm_mask_stack(gtm_cache
> // the name, avoiding complex name mangling.
> extern uint32_t htm_fastpath __asm__(UPFX "gtm_htm_fastpath");
>
> +// Maintains the optimization for HTM execution.
> +extern gtm_global_optimizer optimizer;
> +
> } // namespace GTM
>
> #endif // LIBITM_I_H
> Index: common.h
> ===================================================================
> --- common.h (revision 219316)
> +++ common.h (working copy)
> @@ -59,6 +59,11 @@ extern void * xcalloc (size_t s, bool separate_cl
> extern void * xrealloc (void *p, size_t s, bool separate_cl = false)
> __attribute__((malloc, nothrow));
>
> +// Linear congruential pseudo-random number generator, returning a value
> +// in [0, max). The same recurrence is used in glibc for generators with
> +// small state.
> +extern uint32_t random_int(uint32_t max);
> +
> } // namespace GTM