Re: Buildbot failure in Wildebeest Builder on whole buildset

2021-09-14 Thread Mark Wielaard
Hi,

On Mon, 2021-09-13 at 11:06 +0200, Mark Wielaard wrote:
> On Sun, Sep 12, 2021 at 11:16:09PM +, 
> build...@builder.wildebeest.org wrote:
> > The Buildbot has detected a new failure on builder elfutils-fedora-
> > s390x while building elfutils.
> > Full details are available at:
> > https://builder.wildebeest.org/buildbot/#builders/10/builds/795
> > 
> > Buildbot URL: https://builder.wildebeest.org/buildbot/
> > 
> > Worker for this Build: fedora-s390x
> 
> This is the same failure we saw on fedora-ppc64 and centos-x86_64
> yesterday.
> https://builder.wildebeest.org/buildbot/#/builders/10/builds/795/steps/8/logs/test-suite_log
> 
> I still don't understand why. In the logs we can see (for the PORT2
> server):
> 
> [Sun Sep 12 22:56:26 2021] (1493056/1493066): recorded
> buildid=a0a48245eb29786f7b6853df68ab23cb608b344b
> file=/home/mjw/bb/wildebeest/elfutils-fedora-
> s390x/build/tests/dwfllines mtime=1631486319 atype=ED
> 
> But then, 2 seconds later:
> [Sun Sep 12 22:56:28 2021] (1493056/1493388): searching for
> buildid=a0a48245eb29786f7b6853df68ab23cb608b344b
> artifacttype=debuginfo suffix=
> [Sun Sep 12 22:56:28 2021] (1493056/1493388): not found
> [Sun Sep 12 22:56:28 2021] (1493056/1493388): 127.0.0.1:47886
> UA:elfutils/0.185,Linux/s390x,fedora/34 XFF: GET
> /buildid/a0a48245eb29786f7b6853df68ab23cb608b344b/debuginfo 404 9
> 0+2ms
> 
> Somewhere inbetween the buildid seems to have been forgotten. But I
> cannot figure out why or where. It is clearly non-deterministic since
> normally the tests PASS.

So the issue is triggered by this part in groom ():

   // delete buildids with no references in _r_de or _f_de tables;
   // cascades to _r_sref & _f_s records
   sqlite_ps buildids_del (db, "nuke orphan buildids",
   "delete from " BUILDIDS "_buildids "
   "where not exists (select 1 from " BUILDIDS "_f_de d 
where " BUILDIDS "_buildids.id = d.buildid) "
   "and not exists (select 1 from " BUILDIDS "_r_de d 
where " BUILDIDS "_buildids.id = d.buildid)");
   buildids_del.reset().step_ok_done();

When commenting that out I can run the tests (or a simplified version
using just one server and on one client request as attached) 3
times without issue. While with groom executing that part of the 
code the test will fail after a couple hundred cycles.

Now the question is whether it is reasonable that groom removes the
buildid here. Is that because of the way the test is written? Or is
this a real bug where there is a bad interaction between a (partial?)
scan run and a groom cycle?

Cheers,

Mark



run-debuginfod-federation-link.sh
Description: application/shellscript


[Bug debuginfod/28339] New: groom/scan race condition

2021-09-14 Thread fche at redhat dot com via Elfutils-devel
https://sourceware.org/bugzilla/show_bug.cgi?id=28339

Bug ID: 28339
   Summary: groom/scan race condition
   Product: elfutils
   Version: unspecified
Status: NEW
  Severity: normal
  Priority: P2
 Component: debuginfod
  Assignee: fche at redhat dot com
  Reporter: fche at redhat dot com
CC: elfutils-devel at sourceware dot org
  Target Milestone: ---

debuginfod's scan and groom operations (thread_main_scanner, 
thread_main_fts_source_paths) are intended to be mutually exclusive,
as a substitute for more complicated sql transaction batching.  (This
is because scanning / grooming involves inserting or deleting data
from multiple related tables.)

The workq class that governs this in debuginfod.cxx has a problem: if
the workq just becomes empty, its sole entry pulled by a scanner thread
in response to a wait_front(), an 'idler' groomer thread is ALSO permitted
to run, because there is no indication as to when the scanner thread
operation finishes, only when it starts.

Extending the workq with a counter ("fronters") to track any active
scanning activity (even if the workq is empty) lets us block idlers
groomers a little longer.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

[patch] PR28339: debuginfod groom/scan concurrency fix

2021-09-14 Thread Frank Ch. Eigler via Elfutils-devel


commit ce695bedf073127883bbbaab528dd1f2b0e955f1 (HEAD -> master)
Author: Frank Ch. Eigler 
Date:   Tue Sep 14 08:15:23 2021 -0400

PR28339: debuginfod: fix groom/scan race condition on just-emptied queue

debuginfod's scan and groom operations (thread_main_scanner,
thread_main_fts_source_paths) are intended to be mutually exclusive,
as a substitute for more complicated sql transaction batching.  (This
is because scanning / grooming involves inserting or deleting data
from multiple related tables.)

The workq class that governs this in debuginfod.cxx has a problem: if
the workq just becomes empty, its sole entry pulled by a scanner
thread in response to a wait_front(), an 'idler' groomer thread is
ALSO permitted to run, because there is no indication as to when the
scanner thread operation finishes, only when it starts.

Extending the workq with a counter ("fronters") to track any active
scanning activity (even if the workq is empty) lets us block idlers
groomers a little longer.

Signed-off-by: Frank Ch. Eigler 

diff --git a/debuginfod/ChangeLog b/debuginfod/ChangeLog
index c54598239c82..07ddad3019d9 100644
--- a/debuginfod/ChangeLog
+++ b/debuginfod/ChangeLog
@@ -1,3 +1,11 @@
+2021-09-14  Frank Ch. Eigler 
+
+   PRPR28339
+   * debuginfod.cxx (waitq::fronters): New field.
+   (waitq::wait_idle): Respect it.
+   (waitq::done_front): New function.
+   (thread_main_scanner): Call it to match wait_front().
+
 2021-08-28  Mark Wielaard  
 
* debuginfod.cxx (parse_opt): Turn the -d arg ":memory:" into
diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 3269f657cc8d..6387a27ff820 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -663,10 +663,11 @@ class workq
   mutex mtx;
   condition_variable cv;
   bool dead;
-  unsigned idlers;
+  unsigned idlers;   // number of threads busy with wait_idle / done_idle
+  unsigned fronters; // number of threads busy with wait_front / done_front
 
 public:
-  workq() { dead = false; idlers = 0; }
+  workq() { dead = false; idlers = 0; fronters = 0; }
   ~workq() {}
 
   void push_back(const Payload& p)
@@ -690,10 +691,11 @@ class workq
 unique_lock lock(mtx);
 q.clear();
 set_metric("thread_work_pending","role","scan", q.size());
+// NB: there may still be some live fronters
 cv.notify_all(); // maybe wake up waiting idlers
   }
 
-  // block this scanner thread until there is work to do and no active
+  // block this scanner thread until there is work to do and no active idler
   bool wait_front (Payload& p)
   {
 unique_lock lock(mtx);
@@ -705,19 +707,29 @@ class workq
   {
 p = * q.begin();
 q.erase (q.begin());
+fronters ++; // prevent idlers from starting awhile, even if empty q
 set_metric("thread_work_pending","role","scan", q.size());
-if (q.size() == 0)
-  cv.notify_all(); // maybe wake up waiting idlers
+// NB: don't wake up idlers yet!  The consumer is busy
+// processing this element until it calls done_front().
 return true;
   }
   }
 
+  // notify waitq that scanner thread is done with that last item
+  void done_front ()
+  {
+unique_lock lock(mtx);
+fronters --;
+if (q.size() == 0 && fronters == 0)
+  cv.notify_all(); // maybe wake up waiting idlers
+  }
+  
   // block this idler thread until there is no work to do
   void wait_idle ()
   {
 unique_lock lock(mtx);
 cv.notify_all(); // maybe wake up waiting scanners
-while (!dead && (q.size() != 0))
+while (!dead && ((q.size() != 0) || fronters > 0))
   cv.wait(lock);
 idlers ++;
   }
@@ -3145,6 +3157,8 @@ thread_main_scanner (void* arg)
   e.report(cerr);
 }
 
+  scanq.done_front(); // let idlers run
+  
   if (fts_cached || fts_executable || fts_debuginfo || fts_sourcefiles || 
fts_sref || fts_sdef)
 {} // NB: not just if a successful scan - we might have encountered 
-ENOSPC & failed
   (void) statfs_free_enough_p(db_path, "database"); // report sqlite 
filesystem size



Re: [patch] PR28339: debuginfod groom/scan concurrency fix

2021-09-14 Thread Mark Wielaard
Hi Frank,

On Tue, 2021-09-14 at 11:20 -0400, Frank Ch. Eigler via Elfutils-devel wrote:
> commit ce695bedf073127883bbbaab528dd1f2b0e955f1 (HEAD -> master)
> Author: Frank Ch. Eigler 
> Date:   Tue Sep 14 08:15:23 2021 -0400
> 
> PR28339: debuginfod: fix groom/scan race condition on just-emptied queue
> 
> debuginfod's scan and groom operations (thread_main_scanner,
> thread_main_fts_source_paths) are intended to be mutually exclusive,
> as a substitute for more complicated sql transaction batching.  (This
> is because scanning / grooming involves inserting or deleting data
> from multiple related tables.)
> 
> The workq class that governs this in debuginfod.cxx has a problem: if
> the workq just becomes empty, its sole entry pulled by a scanner
> thread in response to a wait_front(), an 'idler' groomer thread is
> ALSO permitted to run, because there is no indication as to when the
> scanner thread operation finishes, only when it starts.
> 
> Extending the workq with a counter ("fronters") to track any active
> scanning activity (even if the workq is empty) lets us block idlers
> groomers a little longer.
> 
> Signed-off-by: Frank Ch. Eigler 

Thanks so much for finding (and fixing!) this.
It survived 3850+ cycles on my local machine. So I am hopeful this will
also work on the buildbots. Please push.

Cheers,

Mark