Deadlock from --enable-thread-safety

2023-08-04 Thread Heather McIntyre via Elfutils-devel
I've been making --enable-thread-safety (USE_LOCKS) more viable by fixing
deadlocks throughout the libelf library. This has required minimal code
changes so far. However, I've hit a snag where "__elf64_updatenull_wrlock"
calls "elf64_getchdr," which leads to "elf64_getshdr." This sequence
attempts to acquire a read lock under a write lock and fails. Similarly,
"elf64_getchdr" calls "elf_getdata," which also tries and fails to
implement a read lock.

Unfortunately, fixing this particular deadlock requires more than minimal
code changes. From what I understand, I may have a few options on how to
fix this:

1) Create copies of the getchdr and elf_getdata functions, renaming them
getchdr_wrlock and elf_getdata_wrlock. However, multiple copies of these
functions could bloat the code, which may not be ideal.
2) Some type of write lock flag to indicate when a write lock is active.
Either within the locking macro in eu-config.h or passed as an argument in
the functions.

Ultimately, I thought others may want to look into this to see if there
might be options for a better solution. Here is the backtrace:

Program received signal SIGABRT, Aborted.
0x77837aff in raise () from /lib64/libc.so.6
#0  0x77837aff in raise () from /lib64/libc.so.6
#1  0x7780aea5 in abort () from /lib64/libc.so.6
#2  0x7780ad79 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x778304c9 in __assert_perror_fail () from /lib64/libc.so.6
#4  0x77fda554 in elf64_getshdr (scn=0x40fc20) at
../../libelf/elf32_getshdr.c:292
#5  0x77fe9590 in elf64_getchdr (scn=0x40fc20) at
../../libelf/elf32_getchdr.c:45
#6  0x77fdf690 in __elf64_updatenull_wrlock (elf=0x40d880,
change_bop=0x7fffac60, shnum=30) at ../../libelf/elf32_updatenull.c:407
#7  0x77fdd83d in elf_update (elf=0x40d880, cmd=ELF_C_WRITE) at
../../libelf/elf_update.c:211
#8  0x00405d9e in process_file (fname=0x7fffbb96
"testfile-s390x-hash-both") at ../../src/elfcompress.c:1336
#9  0x00406232 in main (argc=7, argv=0x7fffb768) at
../../src/elfcompress.c:1458


Re: Deadlock from --enable-thread-safety

2023-08-04 Thread John Mellor-Crummey via Elfutils-devel
Mark,

A third option that Heather and I discussed for the routines that needed 
replication (because sometimes they are called holding a write lock and 
sometimes not) was to put each routine in an include file where the write lock 
status is expected to be defined as a macro. Then, we can define the macro one 
way (write lock held), include the file, then define the macro the other way 
(write lock not held) and include the function again. The write lock status 
would be incorporated into the function name. 

This avoids two copies of the code to maintain and calls the correct helper 
routines that either 
- use the lock already possessed or 
- acquire their own lock if the caller doesn’t have it already.

Would that strategy be acceptable to you. None of the code does anything like 
that as far as I know. 

John Mellor-Crummey

(sent from my phone)

> On Aug 4, 2023, at 8:21 AM, Heather McIntyre via Elfutils-devel 
>  wrote:
> 
> I've been making --enable-thread-safety (USE_LOCKS) more viable by fixing
> deadlocks throughout the libelf library. This has required minimal code
> changes so far. However, I've hit a snag where "__elf64_updatenull_wrlock"
> calls "elf64_getchdr," which leads to "elf64_getshdr." This sequence
> attempts to acquire a read lock under a write lock and fails. Similarly,
> "elf64_getchdr" calls "elf_getdata," which also tries and fails to
> implement a read lock.
> 
> Unfortunately, fixing this particular deadlock requires more than minimal
> code changes. From what I understand, I may have a few options on how to
> fix this:
> 
> 1) Create copies of the getchdr and elf_getdata functions, renaming them
> getchdr_wrlock and elf_getdata_wrlock. However, multiple copies of these
> functions could bloat the code, which may not be ideal.
> 2) Some type of write lock flag to indicate when a write lock is active.
> Either within the locking macro in eu-config.h or passed as an argument in
> the functions.
> 
> Ultimately, I thought others may want to look into this to see if there
> might be options for a better solution. Here is the backtrace:
> 
> Program received signal SIGABRT, Aborted.
> 0x77837aff in raise () from /lib64/libc.so.6
> #0  0x77837aff in raise () from /lib64/libc.so.6
> #1  0x7780aea5 in abort () from /lib64/libc.so.6
> #2  0x7780ad79 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
> #3  0x778304c9 in __assert_perror_fail () from /lib64/libc.so.6
> #4  0x77fda554 in elf64_getshdr (scn=0x40fc20) at
> ../../libelf/elf32_getshdr.c:292
> #5  0x77fe9590 in elf64_getchdr (scn=0x40fc20) at
> ../../libelf/elf32_getchdr.c:45
> #6  0x77fdf690 in __elf64_updatenull_wrlock (elf=0x40d880,
> change_bop=0x7fffac60, shnum=30) at ../../libelf/elf32_updatenull.c:407
> #7  0x77fdd83d in elf_update (elf=0x40d880, cmd=ELF_C_WRITE) at
> ../../libelf/elf_update.c:211
> #8  0x00405d9e in process_file (fname=0x7fffbb96
> "testfile-s390x-hash-both") at ../../src/elfcompress.c:1336
> #9  0x00406232 in main (argc=7, argv=0x7fffb768) at
> ../../src/elfcompress.c:1458