Re: Buildbot failure in Wildebeest Builder on whole buildset
Hi, On Mon, 2021-09-13 at 11:06 +0200, Mark Wielaard wrote: > On Sun, Sep 12, 2021 at 11:16:09PM +, > build...@builder.wildebeest.org wrote: > > The Buildbot has detected a new failure on builder elfutils-fedora- > > s390x while building elfutils. > > Full details are available at: > > https://builder.wildebeest.org/buildbot/#builders/10/builds/795 > > > > Buildbot URL: https://builder.wildebeest.org/buildbot/ > > > > Worker for this Build: fedora-s390x > > This is the same failure we saw on fedora-ppc64 and centos-x86_64 > yesterday. > https://builder.wildebeest.org/buildbot/#/builders/10/builds/795/steps/8/logs/test-suite_log > > I still don't understand why. In the logs we can see (for the PORT2 > server): > > [Sun Sep 12 22:56:26 2021] (1493056/1493066): recorded > buildid=a0a48245eb29786f7b6853df68ab23cb608b344b > file=/home/mjw/bb/wildebeest/elfutils-fedora- > s390x/build/tests/dwfllines mtime=1631486319 atype=ED > > But then, 2 seconds later: > [Sun Sep 12 22:56:28 2021] (1493056/1493388): searching for > buildid=a0a48245eb29786f7b6853df68ab23cb608b344b > artifacttype=debuginfo suffix= > [Sun Sep 12 22:56:28 2021] (1493056/1493388): not found > [Sun Sep 12 22:56:28 2021] (1493056/1493388): 127.0.0.1:47886 > UA:elfutils/0.185,Linux/s390x,fedora/34 XFF: GET > /buildid/a0a48245eb29786f7b6853df68ab23cb608b344b/debuginfo 404 9 > 0+2ms > > Somewhere inbetween the buildid seems to have been forgotten. But I > cannot figure out why or where. It is clearly non-deterministic since > normally the tests PASS. So the issue is triggered by this part in groom (): // delete buildids with no references in _r_de or _f_de tables; // cascades to _r_sref & _f_s records sqlite_ps buildids_del (db, "nuke orphan buildids", "delete from " BUILDIDS "_buildids " "where not exists (select 1 from " BUILDIDS "_f_de d where " BUILDIDS "_buildids.id = d.buildid) " "and not exists (select 1 from " BUILDIDS "_r_de d where " BUILDIDS "_buildids.id = d.buildid)"); buildids_del.reset().step_ok_done(); When commenting that out I can run the tests (or a simplified version using just one server and on one client request as attached) 3 times without issue. While with groom executing that part of the code the test will fail after a couple hundred cycles. Now the question is whether it is reasonable that groom removes the buildid here. Is that because of the way the test is written? Or is this a real bug where there is a bad interaction between a (partial?) scan run and a groom cycle? Cheers, Mark run-debuginfod-federation-link.sh Description: application/shellscript
[Bug debuginfod/28339] New: groom/scan race condition
https://sourceware.org/bugzilla/show_bug.cgi?id=28339 Bug ID: 28339 Summary: groom/scan race condition Product: elfutils Version: unspecified Status: NEW Severity: normal Priority: P2 Component: debuginfod Assignee: fche at redhat dot com Reporter: fche at redhat dot com CC: elfutils-devel at sourceware dot org Target Milestone: --- debuginfod's scan and groom operations (thread_main_scanner, thread_main_fts_source_paths) are intended to be mutually exclusive, as a substitute for more complicated sql transaction batching. (This is because scanning / grooming involves inserting or deleting data from multiple related tables.) The workq class that governs this in debuginfod.cxx has a problem: if the workq just becomes empty, its sole entry pulled by a scanner thread in response to a wait_front(), an 'idler' groomer thread is ALSO permitted to run, because there is no indication as to when the scanner thread operation finishes, only when it starts. Extending the workq with a counter ("fronters") to track any active scanning activity (even if the workq is empty) lets us block idlers groomers a little longer. -- You are receiving this mail because: You are on the CC list for the bug.
[patch] PR28339: debuginfod groom/scan concurrency fix
commit ce695bedf073127883bbbaab528dd1f2b0e955f1 (HEAD -> master) Author: Frank Ch. Eigler Date: Tue Sep 14 08:15:23 2021 -0400 PR28339: debuginfod: fix groom/scan race condition on just-emptied queue debuginfod's scan and groom operations (thread_main_scanner, thread_main_fts_source_paths) are intended to be mutually exclusive, as a substitute for more complicated sql transaction batching. (This is because scanning / grooming involves inserting or deleting data from multiple related tables.) The workq class that governs this in debuginfod.cxx has a problem: if the workq just becomes empty, its sole entry pulled by a scanner thread in response to a wait_front(), an 'idler' groomer thread is ALSO permitted to run, because there is no indication as to when the scanner thread operation finishes, only when it starts. Extending the workq with a counter ("fronters") to track any active scanning activity (even if the workq is empty) lets us block idlers groomers a little longer. Signed-off-by: Frank Ch. Eigler diff --git a/debuginfod/ChangeLog b/debuginfod/ChangeLog index c54598239c82..07ddad3019d9 100644 --- a/debuginfod/ChangeLog +++ b/debuginfod/ChangeLog @@ -1,3 +1,11 @@ +2021-09-14 Frank Ch. Eigler + + PRPR28339 + * debuginfod.cxx (waitq::fronters): New field. + (waitq::wait_idle): Respect it. + (waitq::done_front): New function. + (thread_main_scanner): Call it to match wait_front(). + 2021-08-28 Mark Wielaard * debuginfod.cxx (parse_opt): Turn the -d arg ":memory:" into diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx index 3269f657cc8d..6387a27ff820 100644 --- a/debuginfod/debuginfod.cxx +++ b/debuginfod/debuginfod.cxx @@ -663,10 +663,11 @@ class workq mutex mtx; condition_variable cv; bool dead; - unsigned idlers; + unsigned idlers; // number of threads busy with wait_idle / done_idle + unsigned fronters; // number of threads busy with wait_front / done_front public: - workq() { dead = false; idlers = 0; } + workq() { dead = false; idlers = 0; fronters = 0; } ~workq() {} void push_back(const Payload& p) @@ -690,10 +691,11 @@ class workq unique_lock lock(mtx); q.clear(); set_metric("thread_work_pending","role","scan", q.size()); +// NB: there may still be some live fronters cv.notify_all(); // maybe wake up waiting idlers } - // block this scanner thread until there is work to do and no active + // block this scanner thread until there is work to do and no active idler bool wait_front (Payload& p) { unique_lock lock(mtx); @@ -705,19 +707,29 @@ class workq { p = * q.begin(); q.erase (q.begin()); +fronters ++; // prevent idlers from starting awhile, even if empty q set_metric("thread_work_pending","role","scan", q.size()); -if (q.size() == 0) - cv.notify_all(); // maybe wake up waiting idlers +// NB: don't wake up idlers yet! The consumer is busy +// processing this element until it calls done_front(). return true; } } + // notify waitq that scanner thread is done with that last item + void done_front () + { +unique_lock lock(mtx); +fronters --; +if (q.size() == 0 && fronters == 0) + cv.notify_all(); // maybe wake up waiting idlers + } + // block this idler thread until there is no work to do void wait_idle () { unique_lock lock(mtx); cv.notify_all(); // maybe wake up waiting scanners -while (!dead && (q.size() != 0)) +while (!dead && ((q.size() != 0) || fronters > 0)) cv.wait(lock); idlers ++; } @@ -3145,6 +3157,8 @@ thread_main_scanner (void* arg) e.report(cerr); } + scanq.done_front(); // let idlers run + if (fts_cached || fts_executable || fts_debuginfo || fts_sourcefiles || fts_sref || fts_sdef) {} // NB: not just if a successful scan - we might have encountered -ENOSPC & failed (void) statfs_free_enough_p(db_path, "database"); // report sqlite filesystem size
Re: [patch] PR28339: debuginfod groom/scan concurrency fix
Hi Frank, On Tue, 2021-09-14 at 11:20 -0400, Frank Ch. Eigler via Elfutils-devel wrote: > commit ce695bedf073127883bbbaab528dd1f2b0e955f1 (HEAD -> master) > Author: Frank Ch. Eigler > Date: Tue Sep 14 08:15:23 2021 -0400 > > PR28339: debuginfod: fix groom/scan race condition on just-emptied queue > > debuginfod's scan and groom operations (thread_main_scanner, > thread_main_fts_source_paths) are intended to be mutually exclusive, > as a substitute for more complicated sql transaction batching. (This > is because scanning / grooming involves inserting or deleting data > from multiple related tables.) > > The workq class that governs this in debuginfod.cxx has a problem: if > the workq just becomes empty, its sole entry pulled by a scanner > thread in response to a wait_front(), an 'idler' groomer thread is > ALSO permitted to run, because there is no indication as to when the > scanner thread operation finishes, only when it starts. > > Extending the workq with a counter ("fronters") to track any active > scanning activity (even if the workq is empty) lets us block idlers > groomers a little longer. > > Signed-off-by: Frank Ch. Eigler Thanks so much for finding (and fixing!) this. It survived 3850+ cycles on my local machine. So I am hopeful this will also work on the buildbots. Please push. Cheers, Mark