Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()

2019-10-27 Thread Florian Weimer
* Mark Wielaard:

>> Current glibc versions have a thread-local fast path, which should
>> address some of these concerns.  It's still not a bump-pointer
>> allocator, but at least there are no atomics on that path.
>
> Since which version of glibc is there a thread-local fast path?

It was added in:

commit d5c3fafc4307c9b7a4c7d5cb381fcdbfad340bcc
Author: DJ Delorie 
Date:   Thu Jul 6 13:37:30 2017 -0400

Add per-thread cache to malloc

So glibc 2.26.  But it is a build-time option, enabled by default, but
it can be switched off by distributions.


Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.

2019-10-27 Thread Mark Wielaard
On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
> On Sat, Oct 26, 2019 at 01:50, Mark Wielaard  wrote:
> > For example I am not sure I can prove that in resize_worker() this
> > always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
> > In general the handling of the resizing_state field is pretty tricky.
> > What is the maximum number of concurrent threads it can handle?
> 
> The master thread waits for all workers to deregister before 
> transitioning from CLEANING to NO_RESIZING. Since the worker thread by 
> that point has registered (and the synchronization is arranged as such) 
> it will never see a transition all the way back to NO_RESIZING before 
> getting its word in. This way a worker thread doesn't "miss the boat" 
> and end up in a later resizing cycle.

Thanks. I think my confusion came from the fact that these states are
expressed by 2 STATE_BITs. I didn't realize they are distinct states
and cannot hold simultaneously. I somehow had the impression all states
could be "active" at the same time. The bit-fiddling with
atomic_fetch_xor_explicit is tricky.

So the busy loop to check for GET_ACTIVE_WORKERS in resize_master works
because the master never does a STATE_INCREMENT itself.

Are these busy loops not expensive? There are a couple of places in the
code where a thread just spins busily until all the other threads are
ready and/or the state has changed.

> The maximum number of threads is a size_t minus 2 bits, so bare minimum 
> is 16384 on a 16-bit machine (which standard C still technically 
> supports). Any modern system will allow either a billion (30 bits) or 4 
> quintillion (62 bits).

Ah, right, that was also connected to my confusion about how many bits
were used to hold the state. 1 billion threads should be enough for
anybody :)

> > The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't use
> > COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one
> > with the concurrent version, but I think we should. It is only used
> > when there are debug type units in the Dwarf. Which is not the default
> > (and some gcc hackers really don't like them) so you will probably not
> > encounter them normally. But we should probably make looking them up
> > also concurrent safe as a followup patch.
> 
> A quick grep -r shows that ITERATE and REVERSE are used for the 
> asm_symbol_tab. If iteration and insertion are not concurrent we can 
> easily add bidirectional iteration (using the same Treiber stack-like 
> structure as used for the memory management). Also COMPARE is not 
> defined to be (0) in this instance.

Yes. We would need some other solution for the libasm code. But I think
we can use the concurrent table for everything in libdw.

> > I only tested the single-threaded case. It is slightly slower, but not
> > very much. The previous patch made the code slightly faster, that
> > slight speed increase is gone now. But that also means it is about on
> > par with the old version without any concurrent improvements.
> > 
> > It does use a bit more memory though. For each CU the abbrev table
> > structure is 112 bytes larger. Given that there can be several 1000 CUs
> > in a (large) Dwarf that means a couple of hundred K of extra memory
> > use. It is probably acceptable since such files contain 100 MBs of DIE
> > data.
> 
> That extra memory comes from the state for resizing mostly, so it 
> probably could be reclaimed. Maybe it could live on the master thread's 
> stack and then be accessed via pointer by the worker threads? If it's 
> enough of an issue to consider fixing I can work out a simple patch for 
> it.

I am not too concerned by it. But I haven't done many measurements. You
probably have one of the most demanding programs using this code. Are
you seeing significant memory use?

> > But I was wondering whether next_init_block and num_initialized_blocks
> > could be shared with next_move_block and num_moved_blocks. If I
> > understand the code in resize_helper() correctly then all threads are
> > either initializing or all threads are moving. So can't we just use one
> > next_block and one num_blocks field?
> 
> The main issue is that either a) somehow the fields have to be reset 
> between phases (which means something like a double barrier, and/or a 
> "chosen" thread), or b) the second phase has to go in reverse (which 
> has the complication of juggling the difference in ending conditions, 
> and dealing with the phase transition properly).
> 
> Not impossible, just probably too complex to be worth the 16 byte gain.

Yes. Now that I have a better handle on the state changes I see that
this is trickier than I thought. It looks like just 16 bytes, but this
is per CU. Given that there can be a thousand CUs that does add up. But
probably not worth it at this point (see also above about memory usage
measurements).

Thanks,

Mark


Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.

2019-10-27 Thread Jonathon Anderson
On Sun, Oct 27, 2019 at 17:13, Mark Wielaard  wrote:
> On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
> > On Sat, Oct 26, 2019 at 01:50, Mark Wielaard wrote:
> > > For example I am not sure I can prove that in resize_worker() this
> > > always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
> > > In general the handling of the resizing_state field is pretty tricky.
> > > What is the maximum number of concurrent threads it can handle?
> >
> > The master thread waits for all workers to deregister before
> > transitioning from CLEANING to NO_RESIZING. Since the worker thread by
> > that point has registered (and the synchronization is arranged as such)
> > it will never see a transition all the way back to NO_RESIZING before
> > getting its word in. This way a worker thread doesn't "miss the boat"
> > and end up in a later resizing cycle.
>
> Thanks. I think my confusion came from the fact that these states are
> expressed by 2 STATE_BITs. I didn't realize they are distinct states
> and cannot hold simultaneously. I somehow had the impression all states
> could be "active" at the same time. The bit-fiddling with
> atomic_fetch_xor_explicit is tricky.
>
> So the busy loop to check for GET_ACTIVE_WORKERS in resize_master works
> because the master never does a STATE_INCREMENT itself.
>
> Are these busy loops not expensive? There are a couple of places in the
> code where a thread just spins busily until all the other threads are
> ready and/or the state has changed.


In theory (if the system isn't overloaded) the threads should finish 
their individual work at around the same time, so the amount of waiting 
any one thread would do (should be) very short. Also, this is only once 
per resize, which (shouldn't) happen very often anyway.


The other busy loops (in insert_helper) are also read-only with a 
single concurrent store, so they (should) spin in cache until the 
invalidation. Again, (should) be a short wait.


If it starts to be an issue we can add some exponential backoff, but I 
haven't noticed any issues in general (and, a nanosleep itself is a 
spin for suitably short durations, or so I've been told).




> > The maximum number of threads is a size_t minus 2 bits, so bare minimum
> > is 16384 on a 16-bit machine (which standard C still technically
> > supports). Any modern system will allow either a billion (30 bits) or 4
> > quintillion (62 bits).
>
> Ah, right, that was also connected to my confusion about how many bits
> were used to hold the state. 1 billion threads should be enough for
> anybody :)


> > > The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't use
> > > COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one
> > > with the concurrent version, but I think we should. It is only used
> > > when there are debug type units in the Dwarf. Which is not the default
> > > (and some gcc hackers really don't like them) so you will probably not
> > > encounter them normally. But we should probably make looking them up
> > > also concurrent safe as a followup patch.
> >
> > A quick grep -r shows that ITERATE and REVERSE are used for the
> > asm_symbol_tab. If iteration and insertion are not concurrent we can
> > easily add bidirectional iteration (using the same Treiber stack-like
> > structure as used for the memory management). Also COMPARE is not
> > defined to be (0) in this instance.
>
> Yes. We would need some other solution for the libasm code. But I think
> we can use the concurrent table for everything in libdw.

> > > I only tested the single-threaded case. It is slightly slower, but not
> > > very much. The previous patch made the code slightly faster, that
> > > slight speed increase is gone now. But that also means it is about on
> > > par with the old version without any concurrent improvements.
> > >
> > > It does use a bit more memory though. For each CU the abbrev table
> > > structure is 112 bytes larger. Given that there can be several 1000 CUs
> > > in a (large) Dwarf that means a couple of hundred K of extra memory
> > > use. It is probably acceptable since such files contain 100 MBs of DIE
> > > data.
> >
> > That extra memory comes from the state for resizing mostly, so it
> > probably could be reclaimed. Maybe it could live on the master thread's
> > stack and then be accessed via pointer by the worker threads? If it's
> > enough of an issue to consider fixing I can work out a simple patch for
> > it.
>
> I am not too concerned by it. But I haven't done many measurements. You
> probably have one of the most demanding programs using this code. Are
> you seeing significant memory use?


Depends on your definition of significant: anything less than 10MB 
would fall entirely below my radar, and the biggest thing I've tried 
has 935 CUs. We also tend to close our Dwarfs regularly; I think the 
maximum open we've used so far is one per thread.


That said, I can also envision cases (say, a gdb on elfutils on a small 
system) that might be prohibited (or at least annoyed) by this. I can 
lower the overhead to 64 bytes easily (

Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()

2019-10-27 Thread Jonathon Anderson
On Sun, Oct 27, 2019 at 09:59, Florian Weimer  wrote:
> * Mark Wielaard:
>
> > > Current glibc versions have a thread-local fast path, which should
> > > address some of these concerns.  It's still not a bump-pointer
> > > allocator, but at least there are no atomics on that path.
> >
> > Since which version of glibc is there a thread-local fast path?
>
> It was added in:
>
> commit d5c3fafc4307c9b7a4c7d5cb381fcdbfad340bcc
> Author: DJ Delorie <d...@delorie.com>
> Date:   Thu Jul 6 13:37:30 2017 -0400
>
> Add per-thread cache to malloc
>
> So glibc 2.26.  But it is a build-time option, enabled by default, but
> it can be switched off by distributions.


I doubt any non-mobile distros would disable it; the cost seems fairly 
small.


My main concern is that it seems like chunks will only enter the 
thread-local cache in the presence of free()s (since they have to enter 
the "smallbins" or "fastbins" first, and those two at a glance seem to 
be filled very lazily or on free()); since the free()s are all on 
dwarf_end this would pose an issue. I could also be entirely mistaken, 
glibc is by no means a simple piece of code.


According to the comments, there might also be a 16 byte overhead per 
allocation, which would explode the small allocations considerably.


-Jonathon



Re: [PATCH] libdw: add thread-safety to dwarf_getabbrev()

2019-10-27 Thread Florian Weimer
* Jonathon Anderson:

> On Sun, Oct 27, 2019 at 09:59, Florian Weimer  wrote:
>> * Mark Wielaard:
>> 
>>>> Current glibc versions have a thread-local fast path, which should
>>>> address some of these concerns.  It's still not a bump-pointer
>>>> allocator, but at least there are no atomics on that path.
>>> 
>>>  Since which version of glibc is there a thread-local fast path?
>> 
>> It was added in:
>> 
>> commit d5c3fafc4307c9b7a4c7d5cb381fcdbfad340bcc
>> Author: DJ Delorie <d...@delorie.com>
>> Date:   Thu Jul 6 13:37:30 2017 -0400
>> 
>> Add per-thread cache to malloc
>> 
>> So glibc 2.26.  But it is a build-time option, enabled by default, but
>> it can be switched off by distributions.
>
> I doubt any non-mobile distros would disable it, the cost seems fairly 
> small.

It increases fragmentation.  VMware's Photon distribution disables it.

> My main concern is that it seems like chunks will only enter the 
> thread-local cache in the presence of free()s (since they have to enter 
> the "smallbins" or "fastbins" first, and those two at a glance seem to 
> be filled very lazily or on free()); since the free()s are all on 
> dwarf_end this would pose an issue. I could also be entirely mistaken, 
> glibc is by no means a simple piece of code.

No, there is a prefill step if the cache is empty, where the cache is
populated with one arena allocation which is then split up.

> According to the comments, there might also be a 16 byte overhead per 
> allocation, which would explode the small allocations considerably.

Available allocatable sizes in bytes are congruent to 8 modulo 16, and
the smallest allocatable size is 24.  In general, the overhead is
8 bytes.  (All numbers are for 64-bit architectures.)