Re: [PATCH] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.

2019-11-08 Thread Mark Wielaard
Hi,

On Thu, 2019-11-07 at 09:24 -0600, Jonathon Anderson wrote:
> On Thu, Nov 7, 2019 at 12:07, Mark Wielaard  wrote:
> > Looking at the difference between the previous version and this one,
> > it incorporates my simplification of FIND and lookup functions. And
> > fixes it by making insert_helper consistently return -1 when the
> > value was already inserted (hash == hval). And it fixes an issue
> > where we were using the table entry val_ptr instead of the hashval
> > as hash (was that the typo? it didn't look harmless).
> 
> Yep, those are the changes (plus the Sig8 patch). That typo was
> harmless because hash would be overwritten before its next use (or
> just unused); now with the (hash == hval) clause it's always read, so
> the typo is fixed.

Thanks for explaining. I have now finally pushed this to master.

> Regarding the Sig8 table: I took a close look, and at the moment its 
> use is in an area that isn't thread-safe anyway (in particular, 
> __libdw_intern_next_unit). Depending on how that is parallelized there 
> might be an issue (if it's just wrapped with a separate mutex a thread 
> might "miss" a CU if it's not already in the table), but since that 
> region would need inspection at that time anyway I'm fine with either 
> option.

I still kept the code to handle the Sig8 table with the new concurrent-
safe code, since I think it is better if we use the new code always
(even in the single threaded case).

So to fix this we do need some mutex to protect the binary search tree
when calling tsearch/tfind? Or do you see other issues too?
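
For concreteness, a minimal sketch of such a guard (the tree root and
comparison function here are placeholders, not the real libdw internals):

#include <search.h>
#include <pthread.h>

/* Minimal sketch: serialize every tsearch/tfind on the search tree with
   one mutex.  tree_root and the compar argument are placeholders, not
   the actual libdw fields.  */
static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static void *tree_root;

static void *
locked_tfind (const void *key, int (*compar) (const void *, const void *))
{
  pthread_mutex_lock (&tree_lock);
  void *res = tfind (key, &tree_root, compar);
  pthread_mutex_unlock (&tree_lock);
  return res;
}

static void *
locked_tsearch (const void *key, int (*compar) (const void *, const void *))
{
  pthread_mutex_lock (&tree_lock);
  void *res = tsearch (key, &tree_root, compar);
  pthread_mutex_unlock (&tree_lock);
  return res;
}

The same lock would also have to cover any tdelete/tdestroy on that tree.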

> This isn't an issue for Dwarf_Abbrev, the worst that can happen there 
> is a duplicate alloc and parsing (as long as the DWARF doesn't have 
> erroneous abbrev entries; if it does we would need to thread-safe the 
> error handling too).

Unfortunately we don't always control the data, so bad abbrev entries
could happen. The extra alloc wouldn't really "leak" because it would
be freed with the Dwarf. So I am not too concerned about that. Is that
the worst that can happen in __libdw_getabbrev? When we goto invalid
the Dwarf_Abbrev would indeed "leak", but it isn't really lost; it will
get cleaned up when the Dwarf is destroyed.

Thanks,

Mark


Re: [PATCH] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.

2019-11-08 Thread Jonathon Anderson




On Fri, Nov 8, 2019 at 15:07, Mark Wielaard  wrote:
> Hi,
> 
> On Thu, 2019-11-07 at 09:24 -0600, Jonathon Anderson wrote:
> > On Thu, Nov 7, 2019 at 12:07, Mark Wielaard  wrote:
> > > Looking at the difference between the previous version and this one,
> > > it incorporates my simplification of FIND and lookup functions. And
> > > fixes it by making insert_helper consistently return -1 when the
> > > value was already inserted (hash == hval). And it fixes an issue
> > > where we were using the table entry val_ptr instead of the hashval
> > > as hash (was that the typo? it didn't look harmless).
> > 
> > Yep, those are the changes (plus the Sig8 patch). That typo was
> > harmless because hash would be overwritten before its next use (or
> > just unused); now with the (hash == hval) clause it's always read, so
> > the typo is fixed.
> 
> Thanks for explaining. I have now finally pushed this to master.
> 
> > Regarding the Sig8 table: I took a close look, and at the moment its
> > use is in an area that isn't thread-safe anyway (in particular,
> > __libdw_intern_next_unit). Depending on how that is parallelized there
> > might be an issue (if it's just wrapped with a separate mutex a thread
> > might "miss" a CU if it's not already in the table), but since that
> > region would need inspection at that time anyway I'm fine with either
> > option.
> 
> I still kept the code to handle the Sig8 table with the new
> concurrent-safe code, since I think it is better if we use the new code
> always (even in the single threaded case).
> 
> So to fix this we do need some mutex to protect the binary search tree
> when calling tsearch/tfind? Or do you see other issues too?


The search tree can be handled with a mutex; the issue is with 
next_{tu,cu}_offset and the general logic of the function. As an 
example: suppose two threads look up A in the Sig8 table and see that 
it's not currently present. They'll both use __libdw_intern_next_unit 
to load CUs (or units, I suppose) until they either find it or run out 
of units.


If the entirety of intern_next_unit is wrapped in a mutex, one of the 
two will load in the missing entry and finish, while the other has 
"missed" it and will keep going until no units remain. The easy 
solution is to have the threads check the table again on next_unit 
failure for whether the entry has been added, but that incurs a 
large-ish overhead for the constant reprobing. The easiest way around 
that is to add an interface on the Sig8 table that returns a "probe" on 
lookup failure that can be continued with only a handful of atomics 
(essentially saving state from the first find). The downside to this 
approach is that unit parsing is fully serialized.
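
(Purely to illustrate the shape of such a probe interface; nothing like
this exists in libdw today, and all the names below are made up:)

#include <stddef.h>
#include <stdint.h>

/* Hypothetical probe interface sketch.  A failed lookup fills in a
   probe recording where the scan stopped, so a later retry only has to
   look at entries inserted since then instead of reprobing from
   scratch.  */
typedef struct
{
  size_t idx;          /* next slot to examine on retry */
  uintptr_t hashval;   /* hash of the signature being sought */
} sig8_probe;

/* Returns the unit if present; otherwise fills *probe and returns NULL.  */
extern void *sig8_lookup_probe (void *table, uint64_t sig, sig8_probe *probe);

/* Cheap retry: examines only slots touched since *probe was taken.  */
extern void *sig8_lookup_continue (void *table, uint64_t sig, sig8_probe *probe);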


If the next_*_offset is handled with a separate mutex or as an atomic 
(say, using fetch_add), the same issue occurs, but without the mutex 
there's no guarantee that another thread isn't currently parsing and 
will write the entry soon, so the easy solution doesn't work. Since the 
Sig8 key is only known after the parsing is complete, we can't even 
insert an "in progress" entry. One solution is to allow for duplicate 
parsing (but then next_*_offset would have to be updated *after* 
Sig8_Hash_insert), another is to use a condition variable on whether 
all the units have been parsed (so threads that don't find what they're 
looking for would block until it's certain that it doesn't exist).


Both are viable directions, but neither is trivial.
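
(A rough sketch of the condition-variable direction, for concreteness;
sig8_find () stands in for the real table lookup, and none of this is
existing libdw code:)

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

static pthread_mutex_t units_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t units_done = PTHREAD_COND_INITIALIZER;
static bool all_units_parsed;      /* set once every unit has been interned */

extern void *sig8_find (uint64_t sig);   /* placeholder table lookup */

void *
sig8_find_or_wait (uint64_t sig)
{
  void *cu = sig8_find (sig);
  if (cu != NULL)
    return cu;

  pthread_mutex_lock (&units_lock);
  /* Re-check the table each time a parsing thread signals progress;
     give up only once all units are parsed and the signature is still
     absent.  */
  while ((cu = sig8_find (sig)) == NULL && !all_units_parsed)
    pthread_cond_wait (&units_done, &units_lock);
  pthread_mutex_unlock (&units_lock);
  return cu;
}

The parsing side would broadcast units_done (under units_lock) after each 
insertion and again after setting all_units_parsed.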



> > This isn't an issue for Dwarf_Abbrev, the worst that can happen there
> > is a duplicate alloc and parsing (as long as the DWARF doesn't have
> > erroneous abbrev entries; if it does we would need to thread-safe the
> > error handling too).
> 
> Unfortunately we don't always control the data, so bad abbrev entries
> could happen. The extra alloc wouldn't really "leak" because it would
> be freed with the Dwarf. So I am not too concerned about that. Is that
> the worst that can happen in __libdw_getabbrev? When we goto invalid
> the Dwarf_Abbrev would indeed "leak", but it isn't really lost; it will
> get cleaned up when the Dwarf is destroyed.


It wouldn't "leak," but it would be taking up space until the 
dwarf_end. Not that I mind (they're really small).


I'm thinking more of the case where the Abbrev_insert returns -1 (entry 
collision); in that case the new Abbrev would stick around until the 
dwarf_end.




> Thanks,
> 
> Mark




Re: [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.

2019-11-08 Thread Mark Wielaard
On Thu, 2019-11-07 at 12:40 -0600, Jonathon Anderson wrote:
> I haven't benchmarked this version, but I did benchmark the equivalent 
> earlier version (this version is almost quite literally a rebase of the 
> other). I don't have the exact results on hand; what I remember is that 
> the pthread_key method was faster (and handled the many-thread case 
> better), by maybe a factor of 1.5x-2x in parallel. In serial the 
> overhead was minimal (just an extra pointer indirection on allocations).

I just tested the single-threaded case a bit and it is not measurably
slower than the previous version, and compared to 0.177 things are
maybe ~1% slower (so probably in the noise).

A factor of 1.5x-2.0x slower in parallel does seem significant. Is that
in the case of many threads that are colliding a lot, or in general?

Thanks,

Mark


Re: [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.

2019-11-08 Thread Mark Wielaard
Hi,

On Thu, 2019-11-07 at 12:13 -0600, Jonathon Anderson wrote:
> On Thu, Nov 7, 2019 at 18:20, Mark Wielaard  wrote:
> > Do we really need this?
> > We already use __thread unconditionally in the rest of the code.
> > The usage of threads.h seems to imply we actually want C11
> > _Thread_local. Is that what you really want, or can we just use
> > __thread in libdw_alloc.c for thread_id?
> 
> We don't really need it, I just got in the habit of writing 
> thread_local (and proper C11 compat). __thread is perfectly fine
> for thread_id.

Great, removed.

> > I think if you include helgrind.h you won't get the drd.h
> > ANNOTATE_HAPPENS_BEFORE/AFTER. So do you also need to include
> > drd.h?
> 
> Not really, just another habit. Since this file only needs
> HAPPENS_*, helgrind.h is sufficient.

Thanks. drd.h include removed.

> > 
> > >  +#else
> > >  +#define ANNOTATE_HAPPENS_BEFORE(X)
> > >  +#define ANNOTATE_HAPPENS_AFTER(X)
> > >  +#endif
> > 
> > Could you explain the usage of the happens_before/after annotations in
> > this code. I must admit that I don't fully understand why/how it works
> > in this case. Specifically since realloc might change the address that
> > mem_tails points to.
> 
> Reader-writer locks ensure no "readers" are present whenever a "writer" 
> is around. In this case we use the "write" side for resizing mem_tails 
> and the "read" side when mem_tails needs to stay stable. Which is why 
> most of the time we have a read lock and then promote to a write lock 
> when we need to reallocate.
> 
> The annotations are to clean up a minor deficiency in Helgrind: for 
> whatever reason if you do writes under a read lock it reports races 
> with the writes from under the write lock (in this case, 
> __libdw_allocate and the realloc). I haven't dug deep enough to know 
> exactly why it happens, just that it does and adding this H-B arc seems 
> to fix the issue.

OK, let's keep them in for now. They are disabled by default anyway. For
now people who want a "helgrindable" libdw will need to rebuild libdw
with them enabled.
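
For anyone reading along, here is a minimal sketch of the
read-lock/promote-to-write pattern being described. mem_rwl and
mem_tails are the names from the patch; the count field, the growth
policy and the missing error handling are simplifications:

#include <pthread.h>
#include <stdlib.h>

struct libdw_memblock;                  /* per-thread allocation block */

/* Cut-down stand-in for the relevant Dwarf fields.  */
struct dwarf_mem
{
  pthread_rwlock_t mem_rwl;
  size_t mem_count;                     /* slots in mem_tails (placeholder name) */
  struct libdw_memblock **mem_tails;    /* one tail pointer per thread id */
};

static struct libdw_memblock *
thread_tail (struct dwarf_mem *dbg, size_t thread_id)
{
  pthread_rwlock_rdlock (&dbg->mem_rwl);
  if (thread_id >= dbg->mem_count)
    {
      /* No slot for this thread yet: drop the read lock, take the write
         lock and grow.  The realloc only ever runs here, so it can never
         move mem_tails while a reader is using it.  */
      pthread_rwlock_unlock (&dbg->mem_rwl);
      pthread_rwlock_wrlock (&dbg->mem_rwl);
      if (thread_id >= dbg->mem_count)  /* re-check after promotion */
        {
          size_t n = 2 * (thread_id + 1);
          dbg->mem_tails = realloc (dbg->mem_tails,
                                    n * sizeof *dbg->mem_tails);
          for (size_t i = dbg->mem_count; i < n; ++i)
            dbg->mem_tails[i] = NULL;
          dbg->mem_count = n;
        }
      pthread_rwlock_unlock (&dbg->mem_rwl);
      pthread_rwlock_rdlock (&dbg->mem_rwl);
    }
  /* The real allocation path keeps holding the read lock while it
     updates its tail block; the HAPPENS_BEFORE/AFTER pair discussed
     above brackets those writes for Helgrind's benefit.  */
  struct libdw_memblock *tail = dbg->mem_tails[thread_id];
  pthread_rwlock_unlock (&dbg->mem_rwl);
  return tail;
}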

> > >  +#define THREAD_ID_UNSET ((size_t) -1)
> > >  +static thread_local size_t thread_id = THREAD_ID_UNSET;
> > >  +static atomic_size_t next_id = ATOMIC_VAR_INIT(0);
> > 
> > OK, but maybe use static __thread size_t thread_id as explained
> > above?
> 
> Fine by me.

Done.
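
(For reference, the general shape of that lazy id assignment, matching
the lines quoted above; a sketch rather than the exact final code:)

#include <stdatomic.h>
#include <stddef.h>

#define THREAD_ID_UNSET ((size_t) -1)
static __thread size_t thread_id = THREAD_ID_UNSET;
static atomic_size_t next_id = ATOMIC_VAR_INIT (0);

static size_t
get_thread_id (void)
{
  /* First allocation on this thread: grab the next free id.  Ids are
     never reused, which is the simple monotonically increasing scheme
     discussed below.  */
  if (thread_id == THREAD_ID_UNSET)
    thread_id = atomic_fetch_add (&next_id, 1);
  return thread_id;
}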

> > O, and I now think you would then also need something for dwarf_begin
> > to reset any set thread_ids... bleah. So probably way too complicated.
> > So lets not, unless you think this is actually simple.
> 
> Which is why I didn't want to do that.
> 
> The other option was to have a sort of free list for ids, but in that 
> case the cleanup isn't great (sometime after all threads have 
> completed... if you consider detached threads things get hairy). Plus 
> it requires a fully concurrent stack or queue, which is a complicated 
> data structure itself.

Yeah, agreed, let's keep it with a simple monotonically increasing
next_id. Things need to get really big before this ever becomes a
problem. And I don't think programs will keep spawning new threads and
using Dwarfs on each of them anyway. I expect longer running processes
that do need to handle Dwarfs in a concurrent fashion to use thread
pools.

Pushed with the small changes noted above.

Thanks,

Mark


Re: [PATCH 2/2] libdw: Rewrite the memory handler to be more robust.

2019-11-08 Thread Jonathon Anderson




On Fri, Nov 8, 2019 at 17:22, Mark Wielaard  wrote:
> On Thu, 2019-11-07 at 12:40 -0600, Jonathon Anderson wrote:
> > I haven't benchmarked this version, but I did benchmark the equivalent
> > earlier version (this version is almost quite literally a rebase of
> > the other). I don't have the exact results on hand; what I remember is
> > that the pthread_key method was faster (and handled the many-thread
> > case better), by maybe a factor of 1.5x-2x in parallel. In serial the
> > overhead was minimal (just an extra pointer indirection on
> > allocations).
> 
> I just tested the single-threaded case a bit and it is not measurably
> slower than the previous version, and compared to 0.177 things are
> maybe ~1% slower (so probably in the noise).
> 
> A factor of 1.5x-2.0x slower in parallel does seem significant. Is that
> in the case of many threads that are colliding a lot, or in general?


I believe it was 64 threads colliding a lot (on the reader side of 
mem_rwl). That said, this is all based on my memory from before the 
semester started. (They may also be numbers munged out of a larger 
benchmark, so don't trust them too much).


As it happens, on our end any slowdown is entirely hidden by all the 
other work we do while reading DIEs, so it's not a critical concern. Our 
code opens a Dwarf and then uses a #pragma parallel for across the CUs 
(using a serial recursion to read the DIEs); if you want to benchmark 
it, that should suffice on a large enough example (see the sketch below).
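
If it helps, here is a self-contained sketch of that kind of benchmark
(assuming OpenMP for the parallel for; the fixed cap on the number of
CUs and the missing error handling just keep it short):

#include <fcntl.h>
#include <stdlib.h>
#include <elfutils/libdw.h>

/* Walk one DIE tree serially.  */
static void
walk (Dwarf_Die *die)
{
  Dwarf_Die child;
  if (dwarf_child (die, &child) == 0)
    do
      walk (&child);
    while (dwarf_siblingof (&child, &child) == 0);
}

int
main (int argc, char **argv)
{
  if (argc < 2)
    return 1;
  int fd = open (argv[1], O_RDONLY);
  Dwarf *dbg = dwarf_begin (fd, DWARF_C_READ);

  /* Gather the CU DIE offsets first; dwarf_nextcu itself is sequential.  */
  Dwarf_Off off = 0, next, cu_offs[4096];
  size_t hdr, n = 0;
  while (n < 4096
         && dwarf_nextcu (dbg, off, &next, &hdr, NULL, NULL, NULL) == 0)
    {
      cu_offs[n++] = off + hdr;          /* offset of the CU DIE itself */
      off = next;
    }

  /* Then read the DIEs of each CU from a parallel loop (build with
     -fopenmp).  */
  #pragma omp parallel for
  for (size_t i = 0; i < n; ++i)
    {
      Dwarf_Die cu;
      if (dwarf_offdie (dbg, cu_offs[i], &cu) != NULL)
        walk (&cu);
    }

  dwarf_end (dbg);
  return 0;
}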




> Thanks,
> 
> Mark