[PATCH v3 0/7] debuginfod: speed up extraction from kernel debuginfo packages by 200x
From: Omar Sandoval

This is v3 of my patch series optimizing debuginfod for kernel debuginfo. v1 is here [7], v2 is here [8]. This version fixes a couple of minor bugs and adds test cases.

Changes from v2 to v3:
- Added a test case with seekable rpm and deb files.
- Added a couple of independent fixes uncovered while adding tests.
- Added a few more prometheus metrics.
- Fixed passive mode.

Patches 1 and 2 fix existing bugs that were uncovered by adding new test package files. Patch 3 is a preparatory refactor. Patch 4 makes the schema changes. Patch 5 implements the seekable xz extraction. Patch 6 populates the table of seekable entries at scan time and adds a test. Patch 7 does it for pre-existing files at request time.

Here is the background, copied and pasted from v1:

drgn [1] currently uses debuginfod with great success for debugging userspace processes. However, for debugging the Linux kernel (drgn's main use case), we have had some performance issues with debuginfod, so we intentionally avoid using it. Specifically, it sometimes takes over a minute for debuginfod to respond to queries for vmlinux and kernel modules (not including the actual download time).

The reason for the slowness is that Linux kernel debuginfo packages are very large and contain lots of files. To respond to a query for a Linux kernel debuginfo file, debuginfod has to decompress and iterate through the whole package until it finds that file. If the file is towards the end of the package, this can take a very long time. This was previously reported for vdso files [2][3], which debuginfod was able to mitigate with improved caching and prefetching. However, kernel modules are far greater in number, vary drastically by hardware and workload, and can be spread all over the package, so in practice I've still been seeing long delays. This was also discussed on the drgn issue tracker [4].

The fundamental limitation is that Linux packages, which are essentially compressed archives with extra metadata headers, don't support random access to specific files. However, the multi-threaded xz compression format does actually support random access. And, luckily, the kernel debuginfo packages on Fedora, Debian, and Ubuntu all happen to use multi-threaded xz compression!

debuginfod can take advantage of this: when it scans a package, if it is a seekable xz archive, it can save the uncompressed offset and size of each file. Then, when it needs a file, it can seek to that offset and extract it from there. This requires some understanding of the xz format and low-level liblzma code, but the speedup is massive: where the worst case was previously about 50 seconds just to find a file in a kernel debuginfo package, with this change the worst case is 0.25 seconds, a ~200x improvement! This works for both .rpm and .deb files. I tested this by requesting and verifying the digest of every file from a few kernel debuginfo rpms and debs [5].

P.S. The biggest downside of this change is that it depends on a very specific compression format that is only used by kernel packages incidentally. I think this is something we should formalize with Linux distributions: large debuginfo packages should use a seekable format. Currently, xz in multi-threaded mode is the only option, but Zstandard also has an experimental seekable format that is worth looking into [6].
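To make the xz random-access idea concrete, here is a minimal sketch (not part of this series; the helper name is made up for illustration, and error handling is elided) of how liblzma's public index API maps an uncompressed offset to the block containing it, once the Index has been decoded:

    #include <lzma.h>

    // Given a decoded Index, find the block containing the uncompressed
    // offset `target`.  On success, fills in the compressed and
    // uncompressed file offsets of that block.
    static bool
    locate_block (const lzma_index *index, uint64_t target,
                  uint64_t *comp_off, uint64_t *uncomp_off)
    {
      lzma_index_iter iter;
      lzma_index_iter_init (&iter, index);
      // lzma_index_iter_locate returns true on failure (offset out of range).
      if (lzma_index_iter_locate (&iter, target))
        return false;
      *comp_off = iter.block.compressed_file_offset;
      *uncomp_off = iter.block.uncompressed_file_offset;
      return true;
    }

Decompression then starts at that block's compressed offset and discards target - *uncomp_off bytes before reaching the requested data.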
Thanks,
Omar

1: https://github.com/osandov/drgn
2: https://sourceware.org/bugzilla/show_bug.cgi?id=29478
3: https://bugzilla.redhat.com/show_bug.cgi?id=1970578
4: https://github.com/osandov/drgn/pull/380
5: https://gist.github.com/osandov/89d521fdc6c9a07aa8bb0ebf91974346
6: https://github.com/facebook/zstd/tree/dev/contrib/seekable_format
7: https://sourceware.org/pipermail/elfutils-devel/2024q3/007191.html
8: https://sourceware.org/pipermail/elfutils-devel/2024q3/007208.html

Omar Sandoval (7):
  debuginfod: fix skipping source file
  tests/run-debuginfod-fd-prefetch-caches.sh: disable fdcache limit check
  debuginfod: factor out common code for responding from an archive
  debuginfod: add new table and views for seekable archives
  debuginfod: optimize extraction from seekable xz archives
  debuginfod: populate _r_seekable on scan
  debuginfod: populate _r_seekable on request

 configure.ac                                  |   5 +
 debuginfod/Makefile.am                        |   2 +-
 debuginfod/debuginfod.cxx                     | 928 +++---
 tests/Makefile.am                             |   4 +-
 ...pressme-seekable-xz-dbgsym_1.0-1_amd64.deb | Bin 0 -> 6288 bytes
 ...compressme-seekable-xz_1.0-1.debian.tar.xz | Bin 0 -> 1440 bytes
 .../compressme-seekable-xz_1.0-1.dsc          |  19 +
 .../compressme-seekable-xz_1.0-1_amd64.deb    | Bin 0 -> 6208 bytes
 .../compressme-seekable-xz_1.0.orig.tar.xz    | Bin 0 -> 7160 bytes
 .../compressme-seekable-xz-1.0-1.src.rpm      | Bin 0 -> 15880 bytes
 .../compressme-seekable-xz-1.0-1.x86_64.rpm   | Bin 0 -> 31873 byte
[PATCH v3 4/7] debuginfod: add new table and views for seekable archives
From: Omar Sandoval

In order to extract a file from a seekable archive, we need to know where in the uncompressed archive the file data starts and its size. Additionally, in order to populate the response headers, we need the file modification time (since we won't be able to get it from the archive metadata).

Add a new table, _r_seekable, keyed on the archive file id and entry file id and containing the size, offset, and mtime. It also contains the compression type just in case new seekable formats are supported in the future.

In order to search this table when we get a request, we need the file ids available. Add the ids to the _query_d and _query_e views, and rename them to _query_d2 and _query_e2. This schema change is backward compatible and doesn't require reindexing. _query_d2 and _query_e2 can be renamed back the next time BUILDIDS needs to be bumped.

Before this change, the database for a single kernel debuginfo RPM (kernel-debuginfo-6.9.6-200.fc40.x86_64.rpm) was about 15MB. This change increases that by about 70kB, only a 0.5% increase.

Signed-off-by: Omar Sandoval
---
 debuginfod/debuginfod.cxx | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 24702c23..b3d80090 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -265,25 +265,39 @@ static const char DEBUGINFOD_SQLITE_DDL[] =
   "foreign key (content) references " BUILDIDS "_files(id) on update cascade on delete cascade,\n"
   "primary key (content, file, mtime)\n"
   ") " WITHOUT_ROWID ";\n"
+  "create table if not exists " BUILDIDS "_r_seekable (\n" // seekable rpm contents
+  "file integer not null,\n"
+  "content integer not null,\n"
+  "type text not null,\n"
+  "size integer not null,\n"
+  "offset integer not null,\n"
+  "mtime integer not null,\n"
+  "foreign key (file) references " BUILDIDS "_files(id) on update cascade on delete cascade,\n"
+  "foreign key (content) references " BUILDIDS "_files(id) on update cascade on delete cascade,\n"
+  "primary key (file, content)\n"
+  ") " WITHOUT_ROWID ";\n"
   // create views to glue together some of the above tables, for webapi D queries
-  "create view if not exists " BUILDIDS "_query_d as \n"
+  // NB: _query_d2 and _query_e2 were added to replace _query_d and _query_e
+  // without updating BUILDIDS.  They can be renamed back the next time BUILDIDS
+  // is updated.
+  "create view if not exists " BUILDIDS "_query_d2 as \n"
   "select\n"
-  "b.hex as buildid, n.mtime, 'F' as sourcetype, f0.name as source0, n.mtime as mtime, null as source1\n"
+  "b.hex as buildid, 'F' as sourcetype, n.file as id0, f0.name as source0, n.mtime as mtime, null as id1, null as source1\n"
   "from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_f_de n\n"
   "where b.id = n.buildid and f0.id = n.file and n.debuginfo_p = 1\n"
   "union all select\n"
-  "b.hex as buildid, n.mtime, 'R' as sourcetype, f0.name as source0, n.mtime as mtime, f1.name as source1\n"
+  "b.hex as buildid, 'R' as sourcetype, n.file as id0, f0.name as source0, n.mtime as mtime, n.content as id1, f1.name as source1\n"
   "from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_files_v f1, " BUILDIDS "_r_de n\n"
   "where b.id = n.buildid and f0.id = n.file and f1.id = n.content and n.debuginfo_p = 1\n"
   ";"
  // ... and for E queries
-  "create view if not exists " BUILDIDS "_query_e as \n"
+  "create view if not exists " BUILDIDS "_query_e2 as \n"
   "select\n"
-  "b.hex as buildid, n.mtime, 'F' as sourcetype, f0.name as source0, n.mtime as mtime, null as source1\n"
+  "b.hex as buildid, 'F' as sourcetype, n.file as id0, f0.name as source0, n.mtime as mtime, null as id1, null as source1\n"
   "from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_f_de n\n"
   "where b.id = n.buildid and f0.id = n.file and n.executable_p = 1\n"
   "union all select\n"
-  "b.hex as buildid, n.mtime, 'R' as sourcetype, f0.name as source0, n.mtime as mtime, f1.name as source1\n"
+  "b.hex as buildid, 'R' as sourcetype, n.file as id0, f0.name as source0, n.mtime as mtime, n.content as id1, f1.name as source1\n"
   "from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_files_v f1, " BUILDIDS "_r_de n\n"
   "where b.id = n.buildid and f0.id = n.file and f1.id = n.content and n.executable_p = 1\n"
   ";"
@@ -2557,7 +2571,7 @@ handle_buildid (MHD_Connection* conn,
   if (atype_code == "D")
     {
       pp = new sqlite_ps (thisdb, "mhd-query-d",
-                          "select mtime, sourcetype, source0, source1 from " BUILDIDS "_query_d where buildid = ? "
+                          "select mtime, sourcetype, sour
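For a sense of how the new table is meant to be consumed at request time, here is a sketch using the raw sqlite3 C API (illustration only: debuginfod itself goes through its sqlite_ps wrapper, and "buildidsN" below stands in for the schema prefix that the BUILDIDS macro pastes in):

    #include <sqlite3.h>
    #include <cstdint>

    // Look up the seekable-extraction parameters for one (archive, entry)
    // pair.  Returns true and fills the outputs if the row exists.
    static bool
    lookup_seekable (sqlite3 *db, int64_t file_id, int64_t content_id,
                     int64_t *size, int64_t *offset, int64_t *mtime)
    {
      sqlite3_stmt *stmt;
      const char *sql =
        "select size, offset, mtime from buildidsN_r_seekable "
        "where file = ? and content = ? and type = 'xz'";
      if (sqlite3_prepare_v2 (db, sql, -1, &stmt, nullptr) != SQLITE_OK)
        return false;
      sqlite3_bind_int64 (stmt, 1, file_id);
      sqlite3_bind_int64 (stmt, 2, content_id);
      bool found = sqlite3_step (stmt) == SQLITE_ROW;
      if (found)
        {
          *size = sqlite3_column_int64 (stmt, 0);
          *offset = sqlite3_column_int64 (stmt, 1);
          *mtime = sqlite3_column_int64 (stmt, 2);
        }
      sqlite3_finalize (stmt);
      return found;
    }

The id0/id1 columns added to the views are what supply file_id and content_id here.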
[PATCH v3 5/7] debuginfod: optimize extraction from seekable xz archives
From: Omar Sandoval

The kernel debuginfo packages on Fedora, Debian, and Ubuntu, and many of their downstreams, are all compressed with xz in multi-threaded mode, which allows random access. We can use this to bypass the full archive extraction and dramatically speed up kernel debuginfo requests (from ~50 seconds in the worst case to < 0.25 seconds).

This works because multi-threaded xz compression splits up the stream into many independently compressed blocks. The stream ends with an index of blocks. So, to seek to an offset, we find the block containing that offset in the index and then decompress and throw away data until we reach the offset within the block. We can then decompress the desired amount of data, possibly from subsequent blocks. There's no high-level API in liblzma to do this, but we can do it by stitching together a few low-level APIs.

We need to pass down the file ids, then look up the size, uncompressed offset, and mtime in the _r_seekable table. Note that this table is not yet populated, so this commit has no functional change on its own.

Signed-off-by: Omar Sandoval
---
 configure.ac              |   5 +
 debuginfod/Makefile.am    |   2 +-
 debuginfod/debuginfod.cxx | 456 +-
 3 files changed, 457 insertions(+), 6 deletions(-)

diff --git a/configure.ac b/configure.ac
index 24e68d94..9c5f7e51 100644
--- a/configure.ac
+++ b/configure.ac
@@ -441,8 +441,13 @@ eu_ZIPLIB(bzlib,BZLIB,bz2,BZ2_bzdopen,bzip2)
 # We need this since bzip2 doesn't have a pkgconfig file.
 BZ2_LIB="$LIBS"
 AC_SUBST([BZ2_LIB])
+save_LIBS="$LIBS"
+LIBS=
 eu_ZIPLIB(lzma,LZMA,lzma,lzma_auto_decoder,[LZMA (xz)])
+lzma_LIBS="$LIBS"
+LIBS="$lzma_LIBS $save_LIBS"
 AS_IF([test "x$with_lzma" = xyes], [LIBLZMA="liblzma"], [LIBLZMA=""])
+AC_SUBST([lzma_LIBS])
 AC_SUBST([LIBLZMA])
 eu_ZIPLIB(zstd,ZSTD,zstd,ZSTD_decompress,[ZSTD (zst)])
 AS_IF([test "x$with_zstd" = xyes], [LIBZSTD="libzstd"], [LIBLZSTD=""])
diff --git a/debuginfod/Makefile.am b/debuginfod/Makefile.am
index b74e3673..e199dc0c 100644
--- a/debuginfod/Makefile.am
+++ b/debuginfod/Makefile.am
@@ -70,7 +70,7 @@ bin_PROGRAMS += debuginfod-find
 endif
 debuginfod_SOURCES = debuginfod.cxx
-debuginfod_LDADD = $(libdw) $(libelf) $(libeu) $(libdebuginfod) $(argp_LDADD) $(fts_LIBS) $(libmicrohttpd_LIBS) $(sqlite3_LIBS) $(libarchive_LIBS) $(rpm_LIBS) $(jsonc_LIBS) $(libcurl_LIBS) -lpthread -ldl
+debuginfod_LDADD = $(libdw) $(libelf) $(libeu) $(libdebuginfod) $(argp_LDADD) $(fts_LIBS) $(libmicrohttpd_LIBS) $(sqlite3_LIBS) $(libarchive_LIBS) $(rpm_LIBS) $(jsonc_LIBS) $(libcurl_LIBS) $(lzma_LIBS) -lpthread -ldl
 debuginfod_find_SOURCES = debuginfod-find.c
 debuginfod_find_LDADD = $(libdw) $(libelf) $(libeu) $(libdebuginfod) $(argp_LDADD) $(fts_LIBS) $(jsonc_LIBS)
diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index b3d80090..cf7f48ab 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -63,6 +63,10 @@ extern "C" {
 #undef __attribute__ /* glibc bug - rhbz 1763325 */
 #endif
+#ifdef USE_LZMA
+#include <lzma.h>
+#endif
+
 #include
 #include
 #include
@@ -1961,6 +1965,385 @@ handle_buildid_f_match (bool internal_req_t,
   return r;
 }
+
+#ifdef USE_LZMA
+struct lzma_exception: public reportable_exception
+{
+  lzma_exception(int rc, const string& msg):
+    // liblzma doesn't have a lzma_ret -> string conversion function, so just
+    // report the value.
+    reportable_exception(string ("lzma error: ") + msg + ": error " + to_string(rc)) {
+      inc_metric("error_count","lzma",to_string(rc));
+  }
+};
+
+// Neither RPM nor deb files support seeking to a specific file in the package.
+// Instead, to extract a specific file, we normally need to read the archive
+// sequentially until we find the file.  This is very slow for files at the end
+// of a large package with lots of files, like kernel debuginfo.
+//
+// However, if the compression format used in the archive supports seeking, we
+// can accelerate this.  As of July 2024, xz is the only widely-used format that
+// supports seeking, and usually only in multi-threaded mode.  Luckily, the
+// kernel-debuginfo package in Fedora and its downstreams, and the
+// linux-image-*-dbg package in Debian and its downstreams, all happen to use
+// this.
+//
+// The xz format [1] ends with an index of independently compressed blocks in
+// the stream.  In RPM and deb files, the xz stream is the last thing in the
+// file, so we assume that the xz Stream Footer is at the end of the package
+// file and do everything relative to that.  For each file in the archive, we
+// remember the size and offset of the file data in the uncompressed xz stream,
+// then we use the index to seek to that offset when we need that file.
+//
+// 1: https://xz.tukaani.org/format/xz-file-format.txt
+
+// Read the Index at the end of an xz file.
+static lzma_index*
+read_xz_index (int fd)
+{
+  off_t footer_pos = -LZMA_STREAM_HEADER_SIZE;
+  if (lseek (fd, footer
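The overall shape of reading the Index from the end of an xz file with liblzma is roughly the following standalone sketch (an illustration under the same assumptions as the comment above, not this patch's exact code; it slurps the whole Index into memory and keeps error handling minimal):

    #include <lzma.h>
    #include <unistd.h>
    #include <cstdint>
    #include <vector>

    // Sketch: decode the Stream Footer at the end of the file, then decode
    // the Index that sits immediately before it.  Returns NULL on failure.
    static lzma_index *
    read_xz_index_sketch (int fd)
    {
      off_t file_size = lseek (fd, 0, SEEK_END);
      uint8_t footer[LZMA_STREAM_HEADER_SIZE];
      if (pread (fd, footer, sizeof footer, file_size - LZMA_STREAM_HEADER_SIZE)
          != (ssize_t) sizeof footer)
        return NULL;

      lzma_stream_flags flags;
      if (lzma_stream_footer_decode (&flags, footer) != LZMA_OK)
        return NULL;

      // The footer records how far back the Index starts (backward_size).
      off_t index_pos = file_size - LZMA_STREAM_HEADER_SIZE - flags.backward_size;
      std::vector<uint8_t> buf (flags.backward_size);
      if (pread (fd, buf.data (), buf.size (), index_pos)
          != (ssize_t) buf.size ())
        return NULL;

      lzma_index *index = NULL;
      uint64_t memlimit = UINT64_MAX;
      size_t in_pos = 0;
      if (lzma_index_buffer_decode (&index, &memlimit, NULL,
                                    buf.data (), &in_pos, buf.size ())
          != LZMA_OK)
        return NULL;
      return index;  // caller frees with lzma_index_end (index, NULL)
    }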
[PATCH v3 7/7] debuginfod: populate _r_seekable on request
From: Omar Sandoval

Since the schema change adding _r_seekable was done in a backward compatible way, seekable archives that were previously scanned will not be in _r_seekable. Whenever an archive is going to be extracted to satisfy a request, check if it is seekable. If so, populate _r_seekable while extracting it so that future requests use the optimized path.

The next time that BUILDIDS is bumped, all archives will be checked at scan time. At that point, checking again will be unnecessary and this commit (including the test case modification) can be reverted.

Signed-off-by: Omar Sandoval
---
 debuginfod/debuginfod.cxx        | 76 +---
 tests/run-debuginfod-seekable.sh | 45 +++
 2 files changed, 115 insertions(+), 6 deletions(-)

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 677eca30..d8a02fb5 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -2740,6 +2740,7 @@ handle_buildid_r_match (bool internal_req_p,
     }
   // no match ... look for a seekable entry
+  bool populate_seekable = ! passive_p;
   unique_ptr<sqlite_ps> pp (new sqlite_ps (internal_req_p ? db : dbq,
                                            "rpm-seekable-query",
                                            "select type, size, offset, mtime from " BUILDIDS "_r_seekable "
@@ -2749,6 +2750,9 @@
     {
       if (rc != SQLITE_ROW)
         throw sqlite_exception(rc, "step");
+      // if we found a match in _r_seekable but we fail to extract it, don't
+      // bother populating it again
+      populate_seekable = false;
       const char* seekable_type = (const char*) sqlite3_column_text (*pp, 0);
       if (seekable_type != NULL && strcmp (seekable_type, "xz") == 0)
         {
@@ -2840,16 +2844,39 @@
       throw archive_exception(a, "cannot open archive from pipe");
     }
-  // archive traversal is in three stages, no, four stages:
-  // 1) skip entries whose names do not match the requested one
-  // 2) extract the matching entry name (set r = result)
-  // 3) extract some number of prefetched entries (just into fdcache)
-  // 4) abort any further processing
+  // If the archive was scanned in a version without _r_seekable, then we may
+  // need to populate _r_seekable now.  This can be removed the next time
+  // BUILDIDS is updated.
+  if (populate_seekable)
+    {
+      populate_seekable = is_seekable_archive (b_source0, a);
+      if (populate_seekable)
+        {
+          // NB: the names are already interned
+          pp.reset(new sqlite_ps (db, "rpm-seekable-insert2",
+                                  "insert or ignore into " BUILDIDS "_r_seekable (file, content, type, size, offset, mtime) "
+                                  "values (?, "
+                                  "(select id from " BUILDIDS "_files "
+                                  "where dirname = (select id from " BUILDIDS "_fileparts where name = ?) "
+                                  "and basename = (select id from " BUILDIDS "_fileparts where name = ?) "
+                                  "), 'xz', ?, ?, ?)"));
+        }
+    }
+
+  // archive traversal is in five stages:
+  // 1) before we find a matching entry, insert it into _r_seekable if needed or
+  //    skip it otherwise
+  // 2) extract the matching entry (set r = result).  Also insert it into
+  //    _r_seekable if needed
+  // 3) extract some number of prefetched entries (just into fdcache).  Also
+  //    insert them into _r_seekable if needed
+  // 4) if needed, insert all of the remaining entries into _r_seekable
+  // 5) abort any further processing
   struct MHD_Response* r = 0;            // will set in stage 2
   unsigned prefetch_count = internal_req_p ?
     0 : fdcache_prefetch;    // will decrement in stage 3
-  while(r == 0 || prefetch_count > 0) // stage 1, 2, or 3
+  while(r == 0 || prefetch_count > 0 || populate_seekable) // stage 1-4
     {
       if (interrupted)
         break;
@@ -2863,6 +2890,43 @@ handle_buildid_r_match (bool internal_req_p,
         continue;
       string fn = canonicalized_archive_entry_pathname (e);
+
+      if (populate_seekable)
+        {
+          string dn, bn;
+          size_t slash = fn.rfind('/');
+          if (slash == std::string::npos) {
+            dn = "";
+            bn = fn;
+          } else {
+            dn = fn.substr(0, slash);
+            bn = fn.substr(slash + 1);
+          }
+
+          int64_t seekable_size = archive_entry_size (e);
+          int64_t seekable_offset = archive_filter_bytes (a, 0);
+          time_t seekable_mtime = archive_entry_mtime (e);
+
+          pp->reset();
+          pp->bind(1, b_id0);
+          pp->bind(2, dn);
+          pp->bind(3, bn);
+          pp->bind(4, seekable_size);
+          pp->bind(5, seekable_offset);
+          pp->bind(6, seekable_mtime);
+          rc = pp->step();
+
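The values inserted above come straight from libarchive: after archive_read_next_header() returns, archive_filter_bytes(a, 0) reports how many uncompressed bytes the format reader has consumed, which is where the entry's data begins. A standalone illustration of that (a sketch, not this patch's code):

    #include <archive.h>
    #include <archive_entry.h>
    #include <cstdint>
    #include <cstdio>

    // Sketch: walk an archive and print each entry's size and its offset in
    // the *uncompressed* stream, as reported by filter 0.  These are exactly
    // the kinds of values recorded in _r_seekable.
    static void
    print_entry_offsets (struct archive *a)
    {
      struct archive_entry *e;
      while (archive_read_next_header (a, &e) == ARCHIVE_OK)
        {
          // After the header is read, the byte count of filter 0 (the
          // decompressed data fed to the format reader) is the offset where
          // this entry's data begins.
          printf ("%s size=%jd offset=%jd\n",
                  archive_entry_pathname (e),
                  (intmax_t) archive_entry_size (e),
                  (intmax_t) archive_filter_bytes (a, 0));
          archive_read_data_skip (a);
        }
    }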
[PATCH v3 6/7] debuginfod: populate _r_seekable on scan
From: Omar Sandoval

Whenever a new archive is scanned, check if it is seekable with a little liblzma magic, and populate _r_seekable if so. With this, newly scanned seekable archives will use the optimized extraction path added in the previous commit.

Also add a test case using some artificial packages.

Signed-off-by: Omar Sandoval
---
 debuginfod/debuginfod.cxx                     | 150 +-
 tests/Makefile.am                             |   4 +-
 ...pressme-seekable-xz-dbgsym_1.0-1_amd64.deb | Bin 0 -> 6288 bytes
 ...compressme-seekable-xz_1.0-1.debian.tar.xz | Bin 0 -> 1440 bytes
 .../compressme-seekable-xz_1.0-1.dsc          |  19 +++
 .../compressme-seekable-xz_1.0-1_amd64.deb    | Bin 0 -> 6208 bytes
 .../compressme-seekable-xz_1.0.orig.tar.xz    | Bin 0 -> 7160 bytes
 .../compressme-seekable-xz-1.0-1.src.rpm      | Bin 0 -> 15880 bytes
 .../compressme-seekable-xz-1.0-1.x86_64.rpm   | Bin 0 -> 31873 bytes
 ...sme-seekable-xz-debuginfo-1.0-1.x86_64.rpm | Bin 0 -> 21917 bytes
 ...e-seekable-xz-debugsource-1.0-1.x86_64.rpm | Bin 0 -> 7961 bytes
 tests/run-debuginfod-seekable.sh              | 141
 12 files changed, 309 insertions(+), 5 deletions(-)
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz-dbgsym_1.0-1_amd64.deb
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0-1.debian.tar.xz
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0-1.dsc
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0-1_amd64.deb
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0.orig.tar.xz
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-1.0-1.src.rpm
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-1.0-1.x86_64.rpm
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-debuginfo-1.0-1.x86_64.rpm
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-debugsource-1.0-1.x86_64.rpm
 create mode 100755 tests/run-debuginfod-seekable.sh

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index cf7f48ab..677eca30 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -1998,6 +1998,109 @@ struct lzma_exception: public reportable_exception
 //
 // 1: https://xz.tukaani.org/format/xz-file-format.txt
+// Return whether an archive supports seeking.
+static bool
+is_seekable_archive (const string& rps, struct archive* a)
+{
+  // Only xz supports seeking.
+  if (archive_filter_code (a, 0) != ARCHIVE_FILTER_XZ)
+    return false;
+
+  int fd = open (rps.c_str(), O_RDONLY);
+  if (fd < 0)
+    return false;
+  defer_dtor<int,int> fd_closer (fd, close);
+
+  // Seek to the xz Stream Footer.  We assume that it's the last thing in the
+  // file, which is true for RPM and deb files.
+  off_t footer_pos = -LZMA_STREAM_HEADER_SIZE;
+  if (lseek (fd, footer_pos, SEEK_END) == -1)
+    return false;
+
+  // Decode the Stream Footer.
+  uint8_t footer[LZMA_STREAM_HEADER_SIZE];
+  size_t footer_read = 0;
+  while (footer_read < sizeof (footer))
+    {
+      ssize_t bytes_read = read (fd, footer + footer_read,
+                                 sizeof (footer) - footer_read);
+      if (bytes_read < 0)
+        {
+          if (errno == EINTR)
+            continue;
+          return false;
+        }
+      if (bytes_read == 0)
+        return false;
+      footer_read += bytes_read;
+    }
+
+  lzma_stream_flags stream_flags;
+  lzma_ret ret = lzma_stream_footer_decode (&stream_flags, footer);
+  if (ret != LZMA_OK)
+    return false;
+
+  // Seek to the xz Index.
+  if (lseek (fd, footer_pos - stream_flags.backward_size, SEEK_END) == -1)
+    return false;
+
+  // Decode the Number of Records in the Index.
+  // liblzma doesn't have an API for this if you don't want to decode the
+  // whole Index, so we have to do it ourselves.
+  //
+  // We need 1 byte for the Index Indicator plus 1-9 bytes for the
+  // variable-length integer Number of Records.
+  uint8_t index[10];
+  size_t index_read = 0;
+  while (index_read == 0) {
+    ssize_t bytes_read = read (fd, index, sizeof (index));
+    if (bytes_read < 0)
+      {
+        if (errno == EINTR)
+          continue;
+        return false;
+      }
+    if (bytes_read == 0)
+      return false;
+    index_read += bytes_read;
+  }
+  // The Index Indicator must be 0.
+  if (index[0] != 0)
+    return false;
+
+  lzma_vli num_records;
+  size_t pos = 0;
+  size_t in_pos = 1;
+  while (true)
+    {
+      if (in_pos >= index_read)
+        {
+          ssize_t bytes_read = read (fd, index, sizeof (index));
+          if (bytes_read < 0)
+            {
+              if (errno == EINTR)
+                continue;
+              return false;
+            }
+          if (bytes_read == 0)
+            return false;
+          index_read = bytes_read;
+          in_pos = 0;
+
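For reference, the variable-length integer encoding being decoded here is simple: each byte carries 7 bits of the value, least-significant group first, and the 0x80 bit marks that more bytes follow (at most 9 bytes total). A minimal standalone decoder of that encoding (a sketch of the format from reference [1] above, not this patch's code):

    #include <cstddef>
    #include <cstdint>

    // Decode an xz-style VLI from buf[0..len).  Returns the number of bytes
    // consumed, or 0 on error (truncated or over-long encoding).
    static size_t
    decode_xz_vli (const uint8_t *buf, size_t len, uint64_t *value)
    {
      *value = 0;
      for (size_t i = 0; i < len && i < 9; i++)
        {
          *value |= (uint64_t) (buf[i] & 0x7f) << (7 * i);
          if ((buf[i] & 0x80) == 0)
            return i + 1;
        }
      return 0;
    }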
[PATCH v3 1/7] debuginfod: fix skipping source file
From: Omar Sandoval

dwarf_extract_source_paths explicitly skips source files that equal "<built-in>", but dwarf_filesrc may return a path like "dir/<built-in>". Check for and skip that case, too. In particular, the test debuginfod RPMs have paths like this.

However, the test cases didn't catch this because they have a bug, too: they follow symlinks, which results in double-counting every file. Fix that, too.

Signed-off-by: Omar Sandoval
---
 debuginfod/debuginfod.cxx             | 3 ++-
 tests/run-debuginfod-archive-groom.sh | 2 +-
 tests/run-debuginfod-extraction.sh    | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 305edde8..92022f3d 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -3446,7 +3446,8 @@ dwarf_extract_source_paths (Elf *elf, set<string>& debug_sourcefiles)
       if (hat == NULL)
         continue;
-      if (string(hat) == "<built-in>") // gcc intrinsics, don't bother record
+      if (string(hat) == "<built-in>"
+          || string_endswith(hat, "<built-in>")) // gcc intrinsics, don't bother record
         continue;
       string waldo;
diff --git a/tests/run-debuginfod-archive-groom.sh b/tests/run-debuginfod-archive-groom.sh
index e2c394ef..0131158f 100755
--- a/tests/run-debuginfod-archive-groom.sh
+++ b/tests/run-debuginfod-archive-groom.sh
@@ -109,7 +109,7 @@ for i in $newrpms; do
     rpm2cpio ../$i | cpio -ivd;
     cd ..;
 done
-sourcefiles=$(find -name \*\\.debug \
+sourcefiles=$(find -name \*\\.debug -type f \
              | env LD_LIBRARY_PATH=$ldpath xargs \
                ${abs_top_builddir}/src/readelf --debug-dump=decodedline \
              | grep mtime: | wc --lines)
diff --git a/tests/run-debuginfod-extraction.sh b/tests/run-debuginfod-extraction.sh
index da6b25cf..f49dc6f6 100755
--- a/tests/run-debuginfod-extraction.sh
+++ b/tests/run-debuginfod-extraction.sh
@@ -94,7 +94,7 @@ for i in $newrpms; do
     rpm2cpio ../$i | cpio -ivd;
     cd ..;
 done
-sourcefiles=$(find -name \*\\.debug \
+sourcefiles=$(find -name \*\\.debug -type f \
              | env LD_LIBRARY_PATH=$ldpath xargs \
                ${abs_top_builddir}/src/readelf --debug-dump=decodedline \
              | grep mtime: | wc --lines)
--
2.45.2
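For context on where such paths surface, here is a sketch of listing a CU's line-table source files with libdw (illustration only; a real caller also needs a Dwarf handle and CU iteration to obtain the CU DIE):

    #include <elfutils/libdw.h>
    #include <cstdio>
    #include <cstring>

    // Sketch: list the line-table source files of one CU and flag the
    // "<built-in>" pseudo-files, which may come back with a directory
    // prefix, e.g. "dir/<built-in>".
    static void
    list_sources (Dwarf_Die *cudie)
    {
      Dwarf_Files *files;
      size_t nfiles;
      if (dwarf_getsrcfiles (cudie, &files, &nfiles) != 0)
        return;
      for (size_t i = 0; i < nfiles; i++)
        {
          const char *name = dwarf_filesrc (files, i, NULL, NULL);
          if (name == NULL)
            continue;
          size_t len = strlen (name);
          bool builtin = strcmp (name, "<built-in>") == 0
            || (len >= 11 && strcmp (name + len - 11, "/<built-in>") == 0);
          printf ("%s%s\n", name, builtin ? " (skipped)" : "");
        }
    }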
[PATCH v3 3/7] debuginfod: factor out common code for responding from an archive
From: Omar Sandoval

handle_buildid_r_match has two very similar branches where it optionally extracts a section and then creates a microhttpd response. In preparation for adding a third one, factor it out into a function.

Signed-off-by: Omar Sandoval
---
 debuginfod/debuginfod.cxx | 213 +-
 1 file changed, 96 insertions(+), 117 deletions(-)

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 92022f3d..24702c23 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -1965,6 +1965,81 @@ string canonicalized_archive_entry_pathname(struct archive_entry *e)
 }
+
+// NB: takes ownership of, and may reassign, fd.
+static struct MHD_Response*
+create_buildid_r_response (int64_t b_mtime0,
+                           const string& b_source0,
+                           const string& b_source1,
+                           const string& section,
+                           const string& ima_sig,
+                           const char* tmppath,
+                           int& fd,
+                           off_t size,
+                           time_t mtime,
+                           const string& metric,
+                           const struct timespec& extract_begin)
+{
+  if (tmppath != NULL)
+    {
+      struct timespec extract_end;
+      clock_gettime (CLOCK_MONOTONIC, &extract_end);
+      double extract_time = (extract_end.tv_sec - extract_begin.tv_sec)
+        + (extract_end.tv_nsec - extract_begin.tv_nsec)/1.e9;
+      fdcache.intern(b_source0, b_source1, tmppath, size, true, extract_time);
+    }
+
+  if (!section.empty ())
+    {
+      int scn_fd = extract_section (fd, b_mtime0,
+                                    b_source0 + ":" + b_source1,
+                                    section, extract_begin);
+      close (fd);
+      if (scn_fd >= 0)
+        fd = scn_fd;
+      else
+        {
+          if (verbose)
+            obatched (clog) << "cannot find section " << section
+                            << " for archive " << b_source0
+                            << " file " << b_source1 << endl;
+          return 0;
+        }
+
+      struct stat fs;
+      if (fstat (fd, &fs) < 0)
+        {
+          close (fd);
+          throw libc_exception (errno,
+            string ("fstat ") + b_source0 + string (" ") + section);
+        }
+      size = fs.st_size;
+    }
+
+  struct MHD_Response* r = MHD_create_response_from_fd (size, fd);
+  if (r == 0)
+    {
+      if (verbose)
+        obatched(clog) << "cannot create fd-response for " << b_source0 << endl;
+      close(fd);
+    }
+  else
+    {
+      inc_metric ("http_responses_total","result",metric);
+      add_mhd_response_header (r, "Content-Type", "application/octet-stream");
+      add_mhd_response_header (r, "X-DEBUGINFOD-SIZE", to_string(size).c_str());
+      add_mhd_response_header (r, "X-DEBUGINFOD-ARCHIVE", b_source0.c_str());
+      add_mhd_response_header (r, "X-DEBUGINFOD-FILE", b_source1.c_str());
+      if(!ima_sig.empty()) add_mhd_response_header(r, "X-DEBUGINFOD-IMASIGNATURE", ima_sig.c_str());
+      add_mhd_last_modified (r, mtime);
+      if (verbose > 1)
+        obatched(clog) << "serving " << metric << " " << b_source0
+                       << " file " << b_source1
+                       << " section=" << section
+                       << " IMA signature=" << ima_sig << endl;
+      /* libmicrohttpd will close fd.
+         */
+    }
+  return r;
+}

 static struct MHD_Response*
 handle_buildid_r_match (bool internal_req_p,
@@ -2142,57 +2217,15 @@ handle_buildid_r_match (bool internal_req_p,
           break; // branch out of if "loop", to try new libarchive fetch attempt
         }
-      if (!section.empty ())
-        {
-          int scn_fd = extract_section (fd, fs.st_mtime,
-                                        b_source0 + ":" + b_source1,
-                                        section, extract_begin);
-          close (fd);
-          if (scn_fd >= 0)
-            fd = scn_fd;
-          else
-            {
-              if (verbose)
-                obatched (clog) << "cannot find section " << section
-                                << " for archive " << b_source0
-                                << " file " << b_source1 << endl;
-              return 0;
-            }
-
-          rc = fstat(fd, &fs);
-          if (rc < 0)
-            {
-              close (fd);
-              throw libc_exception (errno,
-                string ("fstat archive ") + b_source0 + string (" file ") + b_source1
-                + string (" section ") + section);
-            }
-        }
-
-      struct MHD_Response* r = MHD_create_response_from_fd (fs.st_size, fd);
+      struct MHD_Response* r = create_buildid_r_response (b_mtime, b_source0,
+                                                          b_source1, section,
+                                                          ima_sig, NULL, fd,
+                                                          f
[PATCH v3 2/7] tests/run-debuginfod-fd-prefetch-caches.sh: disable fdcache limit check
From: Omar Sandoval

Since commit acd9525e93d7 ("PR31265 - rework debuginfod archive-extract fdcache"), the fdcache limit is only applied when a new file is interned and it has been at least 10 seconds since the limit was last applied. This means that the fdcache can go over the limit temporarily.

run-debuginfod-fd-prefetch-caches.sh happens to avoid tripping over this because of lucky sizes of the files used in the test. However, adding new files for an upcoming test exposed this failure. Disable this part of the test for now.

Signed-off-by: Omar Sandoval
---
 tests/run-debuginfod-fd-prefetch-caches.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tests/run-debuginfod-fd-prefetch-caches.sh b/tests/run-debuginfod-fd-prefetch-caches.sh
index 3db78ade..90730555 100755
--- a/tests/run-debuginfod-fd-prefetch-caches.sh
+++ b/tests/run-debuginfod-fd-prefetch-caches.sh
@@ -99,6 +99,9 @@ kill $PID1
 wait $PID1
 PID1=0
+# Since we now only limit the fd cache every 10 seconds, it can temporarily go
+# over the limit. That makes this part of the test unreliable.
+if false; then
 #
 # Test mb limit on fd cache
 #
@@ -148,3 +151,4 @@ kill $PID1
 wait $PID1
 PID1=0
 exit 0
+fi
--
2.45.2
Re: [PATCH 7/9 v2] libdw: Make libdw_findcu thread-safe
Hi,

On Wed, 2024-07-17 at 18:34 -0400, Aaron Merey wrote:
> From: Heather McIntyre
>
> * libdw/libdw_findcu.c (__libdw_findcu): Use eu_tfind
> and dwarf_lock
> (__libdw_intern_next_unit): Use per-Dwarf_CU locks.
>
> Signed-off-by: Heather S. McIntyre
> Signed-off-by: Aaron Merey
> Signed-off-by: Mark Wielaard
>
> ---
> v2 changes:
> Use per-Dwarf_CU lock instead of a global lock.

Are you sure this description and the ChangeLog entry are correct? This
patch doesn't contain a change to use eu_tfind (there is already one).
And it seems to use a per-Dwarf lock, not a per-Dwarf_CU lock.

> libdw/libdw_findcu.c | 47 ++++++++++++++++++++++++++++++-----------------
> 1 file changed, 30 insertions(+), 17 deletions(-)
>
> diff --git a/libdw/libdw_findcu.c b/libdw/libdw_findcu.c
> index 72cf261c..8acff448 100644
> --- a/libdw/libdw_findcu.c
> +++ b/libdw/libdw_findcu.c
> @@ -177,6 +177,8 @@ __libdw_intern_next_unit (Dwarf *dbg, bool debug_types)
>    newp->startp = data->d_buf + newp->start;
>    newp->endp = data->d_buf + newp->end;
>    eu_search_tree_init (&newp->locs_tree);
> +  rwlock_init (newp->abbrev_lock);
> +  rwlock_init (newp->split_lock);
>
>    /* v4 debug type units have version == 4 and unit_type == DW_UT_type. */
>    if (debug_types)

Neither of these locks are used in the rest of the patch.

> @@ -243,27 +245,38 @@ __libdw_findcu (Dwarf *dbg, Dwarf_Off start, bool v4_debug_types)
>    /* Maybe we already know that CU. */
>    struct Dwarf_CU fake = { .start = start, .end = 0 };
>    struct Dwarf_CU **found = eu_tfind (&fake, tree, findcu_cb);
> +  struct Dwarf_CU *result = NULL;
>    if (found != NULL)
>      return *found;
>
> -  if (start < *next_offset)
> -    {
> -      __libdw_seterrno (DWARF_E_INVALID_DWARF);
> -      return NULL;
> -    }
> +  rwlock_wrlock (dbg->dwarf_lock);
>
> -  /* No. Then read more CUs. */
> -  while (1)
> -    {
> -      struct Dwarf_CU *newp = __libdw_intern_next_unit (dbg, v4_debug_types);
> -      if (newp == NULL)
> -        return NULL;
> -
> -      /* Is this the one we are looking for? */
> -      if (start < *next_offset || start == newp->start)
> -        return newp;
> -    }
> -  /* NOTREACHED */
> +  if (start < *next_offset)
> +    __libdw_seterrno (DWARF_E_INVALID_DWARF);
> +  else
> +    {
> +      /* No. Then read more CUs. */
> +      while (1)
> +        {
> +          struct Dwarf_CU *newp = __libdw_intern_next_unit (dbg,
> +                                                            v4_debug_types);
> +          if (newp == NULL)
> +            {
> +              result = NULL;
> +              break;
> +            }
> +
> +          /* Is this the one we are looking for? */
> +          if (start < *next_offset || start == newp->start)
> +            {
> +              result = newp;
> +              break;
> +            }
> +        }
> +    }
> +
> +  rwlock_unlock (dbg->dwarf_lock);
> +  return result;
> }
>
> struct Dwarf_CU *

This uses the global Dwarf structure lock, not a per Dwarf_CU lock.

Cheers,

Mark
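For readers following the locking discussion, the pattern under review is an optimistic lookup followed by interning under a single per-Dwarf write lock; roughly (a condensed illustration with stand-in names, not the actual libdw code):

    #include <pthread.h>

    struct cu;
    // Stand-ins for eu_tfind on the CU search tree and for
    // __libdw_intern_next_unit; not the real libdw interfaces.
    extern struct cu *lookup_cu (unsigned long start);
    // Reads and interns CUs until one covering `start` appears, or
    // returns NULL on real failure (end of data, invalid DWARF).
    extern struct cu *intern_cus_until (unsigned long start);

    static pthread_rwlock_t dwarf_lock = PTHREAD_RWLOCK_INITIALIZER;

    struct cu *
    find_cu (unsigned long start)
    {
      struct cu *result = lookup_cu (start);  // fast path, already interned
      if (result != NULL)
        return result;
      pthread_rwlock_wrlock (&dwarf_lock);
      // Re-checking lookup_cu here catches CUs interned by a racing thread
      // between the lookup above and acquiring the lock.
      result = lookup_cu (start);
      if (result == NULL)
        result = intern_cus_until (start);
      pthread_rwlock_unlock (&dwarf_lock);
      return result;
    }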
Re: [PATCH 3/9 v2] lib: Add eu_tsearch, eu_tfind, eu_tdelete and eu_tdestroy
Hi,

On Wed, 2024-07-17 at 18:34 -0400, Aaron Merey wrote:
> From: Heather McIntyre
>
> Add new struct search_tree to hold tree root and lock. Add new eu_t*
> functions for ensuring synchronized tree access.
>
> Replace tsearch, tfind, etc with eu_t* equivalents.
>
> Move the rwlock_* macros out of eu-config.h and into a new header file
> locks.h. This was done so that the rwlock_* macros can be included
> in libdwP.h without having to also include the rest of eu-config.h.
>
> Signed-off-by: Heather S. McIntyre
> Signed-off-by: Aaron Merey
> Signed-off-by: Mark Wielaard
>
> v2 changes:
>
> This patch replaces v1 03/16 and 14/16.

In this case I missed a ChangeLog entry which would have helped knowing
which changes were deliberate.

In general this looks good. Mostly the same "style" comment that local
includes should use #include "locks.h", not <locks.h> (we aren't totally
consistent here and the <> variant does work).

Besides the new search_tree (which includes a lock object) it does seem
to introduce various locks that aren't used in the rest of code. It
would be better to introduce them in a later patch where those are
actually used.

> ---
> lib/Makefile.am               |  5 ++-
> lib/eu-config.h               | 30 +
> lib/eu-search.c               | 85 +++
> lib/eu-search.h               | 64 ++
> lib/locks.h                   | 62 +
> libdw/cfi.h                   |  6 +--
> libdw/cie.c                   | 10 +++--
> libdw/dwarf_begin_elf.c       |  7 +--
> libdw/dwarf_end.c             | 17 +++
> libdw/dwarf_getcfi.c          |  5 ++-
> libdw/dwarf_getlocation.c     | 24 +-
> libdw/dwarf_getmacros.c       |  6 +--
> libdw/dwarf_getsrclines.c     |  8 ++--
> libdw/fde.c                   |  6 +--
> libdw/frame-cache.c           |  8 ++--
> libdw/libdwP.h                | 26 ---
> libdw/libdw_find_split_unit.c | 10 ++---
> libdw/libdw_findcu.c          | 18
> libdwfl/cu.c                  |  8 ++--
> libdwfl/dwfl_module.c         |  4 +-
> libdwfl/libdwflP.h            |  3 +-
> libelf/elf_begin.c            |  2 +
> libelf/elf_end.c              | 13 +++--- 
> libelf/elf_getdata_rawchunk.c | 12 ++---
> libelf/libelfP.h              | 10 +++--
> 25 files changed, 331 insertions(+), 118 deletions(-)
> create mode 100644 lib/eu-search.c
> create mode 100644 lib/eu-search.h
> create mode 100644 lib/locks.h
>
> diff --git a/lib/Makefile.am b/lib/Makefile.am
> index b3bb929f..e324c18d 100644
> --- a/lib/Makefile.am
> +++ b/lib/Makefile.am
> @@ -34,10 +34,11 @@ AM_CPPFLAGS += -I$(srcdir)/../libelf
> noinst_LIBRARIES = libeu.a
>
> libeu_a_SOURCES = xasprintf.c xstrdup.c xstrndup.c xmalloc.c next_prime.c \
> -                  crc32.c crc32_file.c \
> +                  crc32.c crc32_file.c eu-search.c \
>                    color.c error.c printversion.c
>
> noinst_HEADERS = fixedsizehash.h libeu.h system.h dynamicsizehash.h list.h \
>                  eu-config.h color.h printversion.h bpf.h \
> -                atomics.h stdatomic-fbsd.h dynamicsizehash_concurrent.h
> +                atomics.h stdatomic-fbsd.h dynamicsizehash_concurrent.h \
> +                eu-search.h locks.h
> EXTRA_DIST = dynamicsizehash.c dynamicsizehash_concurrent.c

OK.
> diff --git a/lib/eu-config.h b/lib/eu-config.h
> index feb079db..a38d75da 100644
> --- a/lib/eu-config.h
> +++ b/lib/eu-config.h
> @@ -29,35 +29,7 @@
> #ifndef EU_CONFIG_H
> #define EU_CONFIG_H 1
>
> -#ifdef USE_LOCKS
> -# include <pthread.h>
> -# include <assert.h>
> -# define rwlock_define(class,name) class pthread_rwlock_t name
> -# define once_define(class,name) class pthread_once_t name = PTHREAD_ONCE_INIT
> -# define RWLOCK_CALL(call) \
> -  ({ int _err = pthread_rwlock_ ## call; assert_perror (_err); })
> -# define ONCE_CALL(call) \
> -  ({ int _err = pthread_ ## call; assert_perror (_err); })
> -# define rwlock_init(lock) RWLOCK_CALL (init (&lock, NULL))
> -# define rwlock_fini(lock) RWLOCK_CALL (destroy (&lock))
> -# define rwlock_rdlock(lock) RWLOCK_CALL (rdlock (&lock))
> -# define rwlock_wrlock(lock) RWLOCK_CALL (wrlock (&lock))
> -# define rwlock_unlock(lock) RWLOCK_CALL (unlock (&lock))
> -# define once(once_control, init_routine) \
> -  ONCE_CALL (once (&once_control, init_routine))
> -#else
> -/* Eventually we will allow multi-threaded applications to use the
> -   libraries.  Therefore we will add the necessary locking although
> -   the macros used expand to nothing for now. */
> -# define rwlock_define(class,name) class int name
> -# define rwlock_init(lock) ((void) (lock))
> -# define rwlock_fini(lock) ((void) (lock))
> -# define rwlock_rdlock(lock) ((void) (lock))
> -# define rwlock_wrlock(lock) ((void) (lock))
> -# define rwlock_unlock(lock) ((void) (lock))
> -# define once_define(class,name)
> -# define once(once_control, init_routine) init_routine(
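The search_tree idea described in the commit message amounts to pairing a tsearch() root with an rwlock; roughly (field and wrapper names here are guesses for illustration, not the actual lib/eu-search.h definitions):

    #include <search.h>
    #include <pthread.h>

    // A tree root plus the lock that guards it.
    typedef struct
    {
      void *root;                 // tsearch/tfind root
      pthread_rwlock_t lock;      // guards root
    } search_tree;

    // Read-side lookup: multiple readers may search concurrently.
    static void *
    eu_tfind_sketch (const void *key, search_tree *tree,
                     int (*compare) (const void *, const void *))
    {
      pthread_rwlock_rdlock (&tree->lock);
      void *ret = tfind (key, &tree->root, compare);
      pthread_rwlock_unlock (&tree->lock);
      return ret;
    }

    // Write-side insert: exclusive access while the tree is modified.
    static void *
    eu_tsearch_sketch (const void *key, search_tree *tree,
                       int (*compare) (const void *, const void *))
    {
      pthread_rwlock_wrlock (&tree->lock);
      void *ret = tsearch (key, &tree->root, compare);
      pthread_rwlock_unlock (&tree->lock);
      return ret;
    }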
Re: [PATCH v3 0/7] debuginfod: speed up extraction from kernel debuginfo packages by 200x
Hi -

> This is v3 of my patch series optimizing debuginfod for kernel
> debuginfo. v1 is here [7], v2 is here [8]. This version fixes a couple
> of minor bugs and adds test cases. [...]

Thanks, LGTM, running through try-buildbots to make sure.

- FChE
Re: [PATCH v3 0/7] debuginfod: speed up extraction from kernel debuginfo packages by 200x
On Fri, Jul 19, 2024 at 01:34:48PM -0400, Frank Ch. Eigler wrote:
> Hi -
>
> > This is v3 of my patch series optimizing debuginfod for kernel
> > debuginfo. v1 is here [7], v2 is here [8]. This version fixes a couple
> > of minor bugs and adds test cases. [...]
>
> Thanks, LGTM, running through try-buildbots to make sure.

Sorry about the distcheck failures; it looks like I forgot to add the new
test files to EXTRA_DIST. I'll be sure to run distcheck next time. I'll
send v4 shortly.

Thanks,
Omar
[PATCH v4 0/7] debuginfod: speed up extraction from kernel debuginfo packages by 200x
From: Omar Sandoval

This is v4 of my patch series optimizing debuginfod for kernel debuginfo. v1 is here [1], v2 is here [2], v3 is here [3]. The only changes from v3 in this version are fixing a bogus maybe-uninitialized error on the Debian build and adding the new test files to EXTRA_DIST so that make distcheck passes.

Thanks,
Omar

1: https://sourceware.org/pipermail/elfutils-devel/2024q3/007191.html
2: https://sourceware.org/pipermail/elfutils-devel/2024q3/007208.html
3: https://sourceware.org/pipermail/elfutils-devel/2024q3/007243.html

Omar Sandoval (7):
  debuginfod: fix skipping source file
  tests/run-debuginfod-fd-prefetch-caches.sh: disable fdcache limit check
  debuginfod: factor out common code for responding from an archive
  debuginfod: add new table and views for seekable archives
  debuginfod: optimize extraction from seekable xz archives
  debuginfod: populate _r_seekable on scan
  debuginfod: populate _r_seekable on request

 configure.ac                                  |   5 +
 debuginfod/Makefile.am                        |   2 +-
 debuginfod/debuginfod.cxx                     | 923 +++---
 tests/Makefile.am                             |  13 +-
 ...pressme-seekable-xz-dbgsym_1.0-1_amd64.deb | Bin 0 -> 6288 bytes
 ...compressme-seekable-xz_1.0-1.debian.tar.xz | Bin 0 -> 1440 bytes
 .../compressme-seekable-xz_1.0-1.dsc          |  19 +
 .../compressme-seekable-xz_1.0-1_amd64.deb    | Bin 0 -> 6208 bytes
 .../compressme-seekable-xz_1.0.orig.tar.xz    | Bin 0 -> 7160 bytes
 .../compressme-seekable-xz-1.0-1.src.rpm      | Bin 0 -> 15880 bytes
 .../compressme-seekable-xz-1.0-1.x86_64.rpm   | Bin 0 -> 31873 bytes
 ...sme-seekable-xz-debuginfo-1.0-1.x86_64.rpm | Bin 0 -> 21917 bytes
 ...e-seekable-xz-debugsource-1.0-1.x86_64.rpm | Bin 0 -> 7961 bytes
 tests/run-debuginfod-archive-groom.sh         |   2 +-
 tests/run-debuginfod-extraction.sh            |   2 +-
 tests/run-debuginfod-fd-prefetch-caches.sh    |   4 +
 tests/run-debuginfod-seekable.sh              | 186
 17 files changed, 1011 insertions(+), 145 deletions(-)
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz-dbgsym_1.0-1_amd64.deb
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0-1.debian.tar.xz
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0-1.dsc
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0-1_amd64.deb
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0.orig.tar.xz
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-1.0-1.src.rpm
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-1.0-1.x86_64.rpm
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-debuginfo-1.0-1.x86_64.rpm
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-debugsource-1.0-1.x86_64.rpm
 create mode 100755 tests/run-debuginfod-seekable.sh
--
2.45.2
[PATCH v4 1/7] debuginfod: fix skipping source file
From: Omar Sandoval

dwarf_extract_source_paths explicitly skips source files that equal "<built-in>", but dwarf_filesrc may return a path like "dir/<built-in>". Check for and skip that case, too. In particular, the test debuginfod RPMs have paths like this.

However, the test cases didn't catch this because they have a bug, too: they follow symlinks, which results in double-counting every file. Fix that, too.

Signed-off-by: Omar Sandoval
---
 debuginfod/debuginfod.cxx             | 3 ++-
 tests/run-debuginfod-archive-groom.sh | 2 +-
 tests/run-debuginfod-extraction.sh    | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 305edde8..92022f3d 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -3446,7 +3446,8 @@ dwarf_extract_source_paths (Elf *elf, set<string>& debug_sourcefiles)
       if (hat == NULL)
         continue;
-      if (string(hat) == "<built-in>") // gcc intrinsics, don't bother record
+      if (string(hat) == "<built-in>"
+          || string_endswith(hat, "<built-in>")) // gcc intrinsics, don't bother record
         continue;
       string waldo;
diff --git a/tests/run-debuginfod-archive-groom.sh b/tests/run-debuginfod-archive-groom.sh
index e2c394ef..0131158f 100755
--- a/tests/run-debuginfod-archive-groom.sh
+++ b/tests/run-debuginfod-archive-groom.sh
@@ -109,7 +109,7 @@ for i in $newrpms; do
     rpm2cpio ../$i | cpio -ivd;
     cd ..;
 done
-sourcefiles=$(find -name \*\\.debug \
+sourcefiles=$(find -name \*\\.debug -type f \
             | env LD_LIBRARY_PATH=$ldpath xargs \
               ${abs_top_builddir}/src/readelf --debug-dump=decodedline \
             | grep mtime: | wc --lines)
diff --git a/tests/run-debuginfod-extraction.sh b/tests/run-debuginfod-extraction.sh
index da6b25cf..f49dc6f6 100755
--- a/tests/run-debuginfod-extraction.sh
+++ b/tests/run-debuginfod-extraction.sh
@@ -94,7 +94,7 @@ for i in $newrpms; do
     rpm2cpio ../$i | cpio -ivd;
     cd ..;
 done
-sourcefiles=$(find -name \*\\.debug \
+sourcefiles=$(find -name \*\\.debug -type f \
             | env LD_LIBRARY_PATH=$ldpath xargs \
               ${abs_top_builddir}/src/readelf --debug-dump=decodedline \
             | grep mtime: | wc --lines)
--
2.45.2
[PATCH v4 2/7] tests/run-debuginfod-fd-prefetch-caches.sh: disable fdcache limit check
From: Omar Sandoval

Since commit acd9525e93d7 ("PR31265 - rework debuginfod archive-extract fdcache"), the fdcache limit is only applied when a new file is interned and it has been at least 10 seconds since the limit was last applied. This means that the fdcache can go over the limit temporarily.

run-debuginfod-fd-prefetch-caches.sh happens to avoid tripping over this because of lucky sizes of the files used in the test. However, adding new files for an upcoming test exposed this failure. Disable this part of the test for now.

Signed-off-by: Omar Sandoval
---
 tests/run-debuginfod-fd-prefetch-caches.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tests/run-debuginfod-fd-prefetch-caches.sh b/tests/run-debuginfod-fd-prefetch-caches.sh
index 3db78ade..90730555 100755
--- a/tests/run-debuginfod-fd-prefetch-caches.sh
+++ b/tests/run-debuginfod-fd-prefetch-caches.sh
@@ -99,6 +99,9 @@ kill $PID1
 wait $PID1
 PID1=0
+# Since we now only limit the fd cache every 10 seconds, it can temporarily go
+# over the limit. That makes this part of the test unreliable.
+if false; then
 #
 # Test mb limit on fd cache
 #
@@ -148,3 +151,4 @@ kill $PID1
 wait $PID1
 PID1=0
 exit 0
+fi
--
2.45.2
[PATCH v4 4/7] debuginfod: add new table and views for seekable archives
From: Omar Sandoval

In order to extract a file from a seekable archive, we need to know where in the uncompressed archive the file data starts and its size. Additionally, in order to populate the response headers, we need the file modification time (since we won't be able to get it from the archive metadata).

Add a new table, _r_seekable, keyed on the archive file id and entry file id and containing the size, offset, and mtime. It also contains the compression type just in case new seekable formats are supported in the future.

In order to search this table when we get a request, we need the file ids available. Add the ids to the _query_d and _query_e views, and rename them to _query_d2 and _query_e2. This schema change is backward compatible and doesn't require reindexing. _query_d2 and _query_e2 can be renamed back the next time BUILDIDS needs to be bumped.

Before this change, the database for a single kernel debuginfo RPM (kernel-debuginfo-6.9.6-200.fc40.x86_64.rpm) was about 15MB. This change increases that by about 70kB, only a 0.5% increase.

Signed-off-by: Omar Sandoval
---
 debuginfod/debuginfod.cxx | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 24702c23..b3d80090 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -265,25 +265,39 @@ static const char DEBUGINFOD_SQLITE_DDL[] =
   "foreign key (content) references " BUILDIDS "_files(id) on update cascade on delete cascade,\n"
   "primary key (content, file, mtime)\n"
   ") " WITHOUT_ROWID ";\n"
+  "create table if not exists " BUILDIDS "_r_seekable (\n" // seekable rpm contents
+  "file integer not null,\n"
+  "content integer not null,\n"
+  "type text not null,\n"
+  "size integer not null,\n"
+  "offset integer not null,\n"
+  "mtime integer not null,\n"
+  "foreign key (file) references " BUILDIDS "_files(id) on update cascade on delete cascade,\n"
+  "foreign key (content) references " BUILDIDS "_files(id) on update cascade on delete cascade,\n"
+  "primary key (file, content)\n"
+  ") " WITHOUT_ROWID ";\n"
   // create views to glue together some of the above tables, for webapi D queries
-  "create view if not exists " BUILDIDS "_query_d as \n"
+  // NB: _query_d2 and _query_e2 were added to replace _query_d and _query_e
+  // without updating BUILDIDS.  They can be renamed back the next time BUILDIDS
+  // is updated.
+  "create view if not exists " BUILDIDS "_query_d2 as \n"
   "select\n"
-  "b.hex as buildid, n.mtime, 'F' as sourcetype, f0.name as source0, n.mtime as mtime, null as source1\n"
+  "b.hex as buildid, 'F' as sourcetype, n.file as id0, f0.name as source0, n.mtime as mtime, null as id1, null as source1\n"
   "from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_f_de n\n"
   "where b.id = n.buildid and f0.id = n.file and n.debuginfo_p = 1\n"
   "union all select\n"
-  "b.hex as buildid, n.mtime, 'R' as sourcetype, f0.name as source0, n.mtime as mtime, f1.name as source1\n"
+  "b.hex as buildid, 'R' as sourcetype, n.file as id0, f0.name as source0, n.mtime as mtime, n.content as id1, f1.name as source1\n"
   "from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_files_v f1, " BUILDIDS "_r_de n\n"
   "where b.id = n.buildid and f0.id = n.file and f1.id = n.content and n.debuginfo_p = 1\n"
   ";"
  // ... and for E queries
-  "create view if not exists " BUILDIDS "_query_e as \n"
+  "create view if not exists " BUILDIDS "_query_e2 as \n"
   "select\n"
-  "b.hex as buildid, n.mtime, 'F' as sourcetype, f0.name as source0, n.mtime as mtime, null as source1\n"
+  "b.hex as buildid, 'F' as sourcetype, n.file as id0, f0.name as source0, n.mtime as mtime, null as id1, null as source1\n"
   "from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_f_de n\n"
   "where b.id = n.buildid and f0.id = n.file and n.executable_p = 1\n"
   "union all select\n"
-  "b.hex as buildid, n.mtime, 'R' as sourcetype, f0.name as source0, n.mtime as mtime, f1.name as source1\n"
+  "b.hex as buildid, 'R' as sourcetype, n.file as id0, f0.name as source0, n.mtime as mtime, n.content as id1, f1.name as source1\n"
   "from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_files_v f1, " BUILDIDS "_r_de n\n"
   "where b.id = n.buildid and f0.id = n.file and f1.id = n.content and n.executable_p = 1\n"
   ";"
@@ -2557,7 +2571,7 @@ handle_buildid (MHD_Connection* conn,
   if (atype_code == "D")
     {
       pp = new sqlite_ps (thisdb, "mhd-query-d",
-                          "select mtime, sourcetype, source0, source1 from " BUILDIDS "_query_d where buildid = ? "
+                          "select mtime, sourcetype, sour
[PATCH v4 5/7] debuginfod: optimize extraction from seekable xz archives
From: Omar Sandoval

The kernel debuginfo packages on Fedora, Debian, and Ubuntu, and many of their downstreams, are all compressed with xz in multi-threaded mode, which allows random access. We can use this to bypass the full archive extraction and dramatically speed up kernel debuginfo requests (from ~50 seconds in the worst case to < 0.25 seconds).

This works because multi-threaded xz compression splits up the stream into many independently compressed blocks. The stream ends with an index of blocks. So, to seek to an offset, we find the block containing that offset in the index and then decompress and throw away data until we reach the offset within the block. We can then decompress the desired amount of data, possibly from subsequent blocks. There's no high-level API in liblzma to do this, but we can do it by stitching together a few low-level APIs.

We need to pass down the file ids, then look up the size, uncompressed offset, and mtime in the _r_seekable table. Note that this table is not yet populated, so this commit has no functional change on its own.

Signed-off-by: Omar Sandoval
---
 configure.ac              |   5 +
 debuginfod/Makefile.am    |   2 +-
 debuginfod/debuginfod.cxx | 456 +-
 3 files changed, 457 insertions(+), 6 deletions(-)

diff --git a/configure.ac b/configure.ac
index 24e68d94..9c5f7e51 100644
--- a/configure.ac
+++ b/configure.ac
@@ -441,8 +441,13 @@ eu_ZIPLIB(bzlib,BZLIB,bz2,BZ2_bzdopen,bzip2)
 # We need this since bzip2 doesn't have a pkgconfig file.
 BZ2_LIB="$LIBS"
 AC_SUBST([BZ2_LIB])
+save_LIBS="$LIBS"
+LIBS=
 eu_ZIPLIB(lzma,LZMA,lzma,lzma_auto_decoder,[LZMA (xz)])
+lzma_LIBS="$LIBS"
+LIBS="$lzma_LIBS $save_LIBS"
 AS_IF([test "x$with_lzma" = xyes], [LIBLZMA="liblzma"], [LIBLZMA=""])
+AC_SUBST([lzma_LIBS])
 AC_SUBST([LIBLZMA])
 eu_ZIPLIB(zstd,ZSTD,zstd,ZSTD_decompress,[ZSTD (zst)])
 AS_IF([test "x$with_zstd" = xyes], [LIBZSTD="libzstd"], [LIBLZSTD=""])
diff --git a/debuginfod/Makefile.am b/debuginfod/Makefile.am
index b74e3673..e199dc0c 100644
--- a/debuginfod/Makefile.am
+++ b/debuginfod/Makefile.am
@@ -70,7 +70,7 @@ bin_PROGRAMS += debuginfod-find
 endif
 debuginfod_SOURCES = debuginfod.cxx
-debuginfod_LDADD = $(libdw) $(libelf) $(libeu) $(libdebuginfod) $(argp_LDADD) $(fts_LIBS) $(libmicrohttpd_LIBS) $(sqlite3_LIBS) $(libarchive_LIBS) $(rpm_LIBS) $(jsonc_LIBS) $(libcurl_LIBS) -lpthread -ldl
+debuginfod_LDADD = $(libdw) $(libelf) $(libeu) $(libdebuginfod) $(argp_LDADD) $(fts_LIBS) $(libmicrohttpd_LIBS) $(sqlite3_LIBS) $(libarchive_LIBS) $(rpm_LIBS) $(jsonc_LIBS) $(libcurl_LIBS) $(lzma_LIBS) -lpthread -ldl
 debuginfod_find_SOURCES = debuginfod-find.c
 debuginfod_find_LDADD = $(libdw) $(libelf) $(libeu) $(libdebuginfod) $(argp_LDADD) $(fts_LIBS) $(jsonc_LIBS)
diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index b3d80090..cf7f48ab 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -63,6 +63,10 @@ extern "C" {
 #undef __attribute__ /* glibc bug - rhbz 1763325 */
 #endif
+#ifdef USE_LZMA
+#include <lzma.h>
+#endif
+
 #include
 #include
 #include
@@ -1961,6 +1965,385 @@ handle_buildid_f_match (bool internal_req_t,
   return r;
 }
+
+#ifdef USE_LZMA
+struct lzma_exception: public reportable_exception
+{
+  lzma_exception(int rc, const string& msg):
+    // liblzma doesn't have a lzma_ret -> string conversion function, so just
+    // report the value.
+    reportable_exception(string ("lzma error: ") + msg + ": error " + to_string(rc)) {
+      inc_metric("error_count","lzma",to_string(rc));
+  }
+};
+
+// Neither RPM nor deb files support seeking to a specific file in the package.
+// Instead, to extract a specific file, we normally need to read the archive
+// sequentially until we find the file.  This is very slow for files at the end
+// of a large package with lots of files, like kernel debuginfo.
+//
+// However, if the compression format used in the archive supports seeking, we
+// can accelerate this.  As of July 2024, xz is the only widely-used format that
+// supports seeking, and usually only in multi-threaded mode.  Luckily, the
+// kernel-debuginfo package in Fedora and its downstreams, and the
+// linux-image-*-dbg package in Debian and its downstreams, all happen to use
+// this.
+//
+// The xz format [1] ends with an index of independently compressed blocks in
+// the stream.  In RPM and deb files, the xz stream is the last thing in the
+// file, so we assume that the xz Stream Footer is at the end of the package
+// file and do everything relative to that.  For each file in the archive, we
+// remember the size and offset of the file data in the uncompressed xz stream,
+// then we use the index to seek to that offset when we need that file.
+//
+// 1: https://xz.tukaani.org/format/xz-file-format.txt
+
+// Read the Index at the end of an xz file.
+static lzma_index*
+read_xz_index (int fd)
+{
+  off_t footer_pos = -LZMA_STREAM_HEADER_SIZE;
+  if (lseek (fd, footer
[PATCH v4 3/7] debuginfod: factor out common code for responding from an archive
From: Omar Sandoval

handle_buildid_r_match has two very similar branches where it optionally extracts a section and then creates a microhttpd response. In preparation for adding a third one, factor it out into a function.

Signed-off-by: Omar Sandoval
---
 debuginfod/debuginfod.cxx | 213 +-
 1 file changed, 96 insertions(+), 117 deletions(-)

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 92022f3d..24702c23 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -1965,6 +1965,81 @@ string canonicalized_archive_entry_pathname(struct archive_entry *e)
 }
+
+// NB: takes ownership of, and may reassign, fd.
+static struct MHD_Response*
+create_buildid_r_response (int64_t b_mtime0,
+                           const string& b_source0,
+                           const string& b_source1,
+                           const string& section,
+                           const string& ima_sig,
+                           const char* tmppath,
+                           int& fd,
+                           off_t size,
+                           time_t mtime,
+                           const string& metric,
+                           const struct timespec& extract_begin)
+{
+  if (tmppath != NULL)
+    {
+      struct timespec extract_end;
+      clock_gettime (CLOCK_MONOTONIC, &extract_end);
+      double extract_time = (extract_end.tv_sec - extract_begin.tv_sec)
+        + (extract_end.tv_nsec - extract_begin.tv_nsec)/1.e9;
+      fdcache.intern(b_source0, b_source1, tmppath, size, true, extract_time);
+    }
+
+  if (!section.empty ())
+    {
+      int scn_fd = extract_section (fd, b_mtime0,
+                                    b_source0 + ":" + b_source1,
+                                    section, extract_begin);
+      close (fd);
+      if (scn_fd >= 0)
+        fd = scn_fd;
+      else
+        {
+          if (verbose)
+            obatched (clog) << "cannot find section " << section
+                            << " for archive " << b_source0
+                            << " file " << b_source1 << endl;
+          return 0;
+        }
+
+      struct stat fs;
+      if (fstat (fd, &fs) < 0)
+        {
+          close (fd);
+          throw libc_exception (errno,
+            string ("fstat ") + b_source0 + string (" ") + section);
+        }
+      size = fs.st_size;
+    }
+
+  struct MHD_Response* r = MHD_create_response_from_fd (size, fd);
+  if (r == 0)
+    {
+      if (verbose)
+        obatched(clog) << "cannot create fd-response for " << b_source0 << endl;
+      close(fd);
+    }
+  else
+    {
+      inc_metric ("http_responses_total","result",metric);
+      add_mhd_response_header (r, "Content-Type", "application/octet-stream");
+      add_mhd_response_header (r, "X-DEBUGINFOD-SIZE", to_string(size).c_str());
+      add_mhd_response_header (r, "X-DEBUGINFOD-ARCHIVE", b_source0.c_str());
+      add_mhd_response_header (r, "X-DEBUGINFOD-FILE", b_source1.c_str());
+      if(!ima_sig.empty()) add_mhd_response_header(r, "X-DEBUGINFOD-IMASIGNATURE", ima_sig.c_str());
+      add_mhd_last_modified (r, mtime);
+      if (verbose > 1)
+        obatched(clog) << "serving " << metric << " " << b_source0
+                       << " file " << b_source1
+                       << " section=" << section
+                       << " IMA signature=" << ima_sig << endl;
+      /* libmicrohttpd will close fd.
+    }
+  return r;
+}
 
 static struct MHD_Response*
 handle_buildid_r_match (bool internal_req_p,
@@ -2142,57 +2217,15 @@ handle_buildid_r_match (bool internal_req_p,
           break; // branch out of if "loop", to try new libarchive fetch attempt
         }
 
-      if (!section.empty ())
-        {
-          int scn_fd = extract_section (fd, fs.st_mtime,
-                                        b_source0 + ":" + b_source1,
-                                        section, extract_begin);
-          close (fd);
-          if (scn_fd >= 0)
-            fd = scn_fd;
-          else
-            {
-              if (verbose)
-                obatched (clog) << "cannot find section " << section
-                                << " for archive " << b_source0
-                                << " file " << b_source1 << endl;
-              return 0;
-            }
-
-          rc = fstat(fd, &fs);
-          if (rc < 0)
-            {
-              close (fd);
-              throw libc_exception (errno,
-                                    string ("fstat archive ") + b_source0 + string (" file ") + b_source1
-                                    + string (" section ") + section);
-            }
-        }
-
-      struct MHD_Response* r = MHD_create_response_from_fd (fs.st_size, fd);
+      struct MHD_Response* r = create_buildid_r_response (b_mtime, b_source0,
+                                                          b_source1, section,
+                                                          ima_sig, NULL, fd,
+                                                          f
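The essential contract of the new helper is fd ownership: the caller hands fd off completely. A hedged sketch of a call site, with variables as in handle_buildid_r_match; the metric label here is an example, not necessarily the exact string used at every call site:

  // After this call the caller must never close fd itself: on failure the
  // helper has already closed it; on success libmicrohttpd owns it and
  // closes it when the response is destroyed.  Note fd is passed by
  // reference and may be replaced by a section fd.
  struct MHD_Response* r
    = create_buildid_r_response (b_mtime, b_source0, b_source1, section,
                                 ima_sig,
                                 NULL,              // tmppath: already cached
                                 fd, fs.st_size, fs.st_mtime,
                                 "archive fdcache", // example metric label
                                 extract_begin);
  if (r == 0)
    return 0;          // fd is gone; nothing to clean up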
[PATCH v4 7/7] debuginfod: populate _r_seekable on request
From: Omar Sandoval

Since the schema change adding _r_seekable was done in a backward
compatible way, seekable archives that were previously scanned will not
be in _r_seekable. Whenever an archive is going to be extracted to
satisfy a request, check whether it is seekable. If so, populate
_r_seekable while extracting it so that future requests use the
optimized path.

The next time that BUILDIDS is bumped, all archives will be checked at
scan time. At that point, checking again will be unnecessary and this
commit (including the test case modification) can be reverted.

Signed-off-by: Omar Sandoval
---
 debuginfod/debuginfod.cxx        | 76 +---
 tests/run-debuginfod-seekable.sh | 45 +++
 2 files changed, 115 insertions(+), 6 deletions(-)

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 5fe2db0c..fb7873ae 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -2740,6 +2740,7 @@ handle_buildid_r_match (bool internal_req_p,
     }
 
   // no match ... look for a seekable entry
+  bool populate_seekable = ! passive_p;
   unique_ptr<sqlite_ps> pp (new sqlite_ps (internal_req_p ? db : dbq, "rpm-seekable-query",
                                            "select type, size, offset, mtime from " BUILDIDS "_r_seekable "
@@ -2749,6 +2750,9 @@
     {
       if (rc != SQLITE_ROW)
        throw sqlite_exception(rc, "step");
+      // if we found a match in _r_seekable but we fail to extract it, don't
+      // bother populating it again
+      populate_seekable = false;
       const char* seekable_type = (const char*) sqlite3_column_text (*pp, 0);
       if (seekable_type != NULL && strcmp (seekable_type, "xz") == 0)
         {
@@ -2840,16 +2844,39 @@ handle_buildid_r_match (bool internal_req_p,
       throw archive_exception(a, "cannot open archive from pipe");
     }
 
-  // archive traversal is in three stages, no, four stages:
-  // 1) skip entries whose names do not match the requested one
-  // 2) extract the matching entry name (set r = result)
-  // 3) extract some number of prefetched entries (just into fdcache)
-  // 4) abort any further processing
+  // If the archive was scanned in a version without _r_seekable, then we may
+  // need to populate _r_seekable now. This can be removed the next time
+  // BUILDIDS is updated.
+  if (populate_seekable)
+    {
+      populate_seekable = is_seekable_archive (b_source0, a);
+      if (populate_seekable)
+        {
+          // NB: the names are already interned
+          pp.reset(new sqlite_ps (db, "rpm-seekable-insert2",
+                                  "insert or ignore into " BUILDIDS "_r_seekable (file, content, type, size, offset, mtime) "
+                                  "values (?, "
+                                  "(select id from " BUILDIDS "_files "
+                                  "where dirname = (select id from " BUILDIDS "_fileparts where name = ?) "
+                                  "and basename = (select id from " BUILDIDS "_fileparts where name = ?) "
+                                  "), 'xz', ?, ?, ?)"));
+        }
+    }
+
+  // archive traversal is in five stages:
+  // 1) before we find a matching entry, insert it into _r_seekable if needed or
+  //    skip it otherwise
+  // 2) extract the matching entry (set r = result). Also insert it into
+  //    _r_seekable if needed
+  // 3) extract some number of prefetched entries (just into fdcache). Also
+  //    insert them into _r_seekable if needed
+  // 4) if needed, insert all of the remaining entries into _r_seekable
+  // 5) abort any further processing
   struct MHD_Response* r = 0;                 // will set in stage 2
   unsigned prefetch_count =
     internal_req_p ? 0 : fdcache_prefetch;   // will decrement in stage 3
-  while(r == 0 || prefetch_count > 0) // stage 1, 2, or 3
+  while(r == 0 || prefetch_count > 0 || populate_seekable) // stage 1-4
     {
       if (interrupted)
         break;
@@ -2863,6 +2890,43 @@ handle_buildid_r_match (bool internal_req_p,
         continue;
 
       string fn = canonicalized_archive_entry_pathname (e);
+
+      if (populate_seekable)
+        {
+          string dn, bn;
+          size_t slash = fn.rfind('/');
+          if (slash == std::string::npos) {
+            dn = "";
+            bn = fn;
+          } else {
+            dn = fn.substr(0, slash);
+            bn = fn.substr(slash + 1);
+          }
+
+          int64_t seekable_size = archive_entry_size (e);
+          int64_t seekable_offset = archive_filter_bytes (a, 0);
+          time_t seekable_mtime = archive_entry_mtime (e);
+
+          pp->reset();
+          pp->bind(1, b_id0);
+          pp->bind(2, dn);
+          pp->bind(3, bn);
+          pp->bind(4, seekable_size);
+          pp->bind(5, seekable_offset);
+          pp->bind(6, seekable_mtime);
+          rc = pp->step();
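The subtle part is where the recorded offset comes from. A condensed sketch of just that ingredient, where drain_archive and record_seekable_entry are invented names standing in for the inline logic above, and only the libarchive calls are real API:

#include <archive.h>
#include <archive_entry.h>

// Hypothetical stand-in for the inline INSERT into _r_seekable above.
static void record_seekable_entry (struct archive_entry *e, int64_t offset);

// Capture each entry's offset in the *uncompressed* stream while iterating.
static void
drain_archive (struct archive *a, bool populate_seekable)
{
  struct archive_entry *e;
  while (archive_read_next_header (a, &e) == ARCHIVE_OK)
    {
      if (populate_seekable)
        // Right after reading a header, archive_filter_bytes (a, 0) is the
        // number of uncompressed bytes consumed so far, i.e. the offset
        // where this entry's data begins -- the value the seekable-xz
        // extraction path later seeks to.
        record_seekable_entry (e, archive_filter_bytes (a, 0));
      // ... stages 1-3: skip non-matches, extract the match, prefetch ...
    }
}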
[PATCH v4 6/7] debuginfod: populate _r_seekable on scan
From: Omar Sandoval

Whenever a new archive is scanned, check whether it is seekable with a
little liblzma magic, and populate _r_seekable if so. With this, newly
scanned seekable archives will use the optimized extraction path added
in the previous commit.

Also add a test case using some artificial packages.

Signed-off-by: Omar Sandoval
---
 debuginfod/debuginfod.cxx                     | 145 +-
 tests/Makefile.am                             |  13 +-
 ...pressme-seekable-xz-dbgsym_1.0-1_amd64.deb | Bin 0 -> 6288 bytes
 ...compressme-seekable-xz_1.0-1.debian.tar.xz | Bin 0 -> 1440 bytes
 .../compressme-seekable-xz_1.0-1.dsc          |  19 +++
 .../compressme-seekable-xz_1.0-1_amd64.deb    | Bin 0 -> 6208 bytes
 .../compressme-seekable-xz_1.0.orig.tar.xz    | Bin 0 -> 7160 bytes
 .../compressme-seekable-xz-1.0-1.src.rpm      | Bin 0 -> 15880 bytes
 .../compressme-seekable-xz-1.0-1.x86_64.rpm   | Bin 0 -> 31873 bytes
 ...sme-seekable-xz-debuginfo-1.0-1.x86_64.rpm | Bin 0 -> 21917 bytes
 ...e-seekable-xz-debugsource-1.0-1.x86_64.rpm | Bin 0 -> 7961 bytes
 tests/run-debuginfod-seekable.sh              | 141 +
 12 files changed, 313 insertions(+), 5 deletions(-)
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz-dbgsym_1.0-1_amd64.deb
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0-1.debian.tar.xz
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0-1.dsc
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0-1_amd64.deb
 create mode 100644 tests/debuginfod-debs/seekable-xz/compressme-seekable-xz_1.0.orig.tar.xz
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-1.0-1.src.rpm
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-1.0-1.x86_64.rpm
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-debuginfo-1.0-1.x86_64.rpm
 create mode 100644 tests/debuginfod-rpms/seekable-xz/compressme-seekable-xz-debugsource-1.0-1.x86_64.rpm
 create mode 100755 tests/run-debuginfod-seekable.sh

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index cf7f48ab..5fe2db0c 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -1998,6 +1998,109 @@ struct lzma_exception: public reportable_exception
 //
 // 1: https://xz.tukaani.org/format/xz-file-format.txt
 
+// Return whether an archive supports seeking.
+static bool
+is_seekable_archive (const string& rps, struct archive* a)
+{
+  // Only xz supports seeking.
+  if (archive_filter_code (a, 0) != ARCHIVE_FILTER_XZ)
+    return false;
+
+  int fd = open (rps.c_str(), O_RDONLY);
+  if (fd < 0)
+    return false;
+  defer_dtor<int,int> fd_closer (fd, close);
+
+  // Seek to the xz Stream Footer. We assume that it's the last thing in the
+  // file, which is true for RPM and deb files.
+  off_t footer_pos = -LZMA_STREAM_HEADER_SIZE;
+  if (lseek (fd, footer_pos, SEEK_END) == -1)
+    return false;
+
+  // Decode the Stream Footer.
+  uint8_t footer[LZMA_STREAM_HEADER_SIZE];
+  size_t footer_read = 0;
+  while (footer_read < sizeof (footer))
+    {
+      ssize_t bytes_read = read (fd, footer + footer_read,
+                                 sizeof (footer) - footer_read);
+      if (bytes_read < 0)
+        {
+          if (errno == EINTR)
+            continue;
+          return false;
+        }
+      if (bytes_read == 0)
+        return false;
+      footer_read += bytes_read;
+    }
+
+  lzma_stream_flags stream_flags;
+  lzma_ret ret = lzma_stream_footer_decode (&stream_flags, footer);
+  if (ret != LZMA_OK)
+    return false;
+
+  // Seek to the xz Index.
+  if (lseek (fd, footer_pos - stream_flags.backward_size, SEEK_END) == -1)
+    return false;
+
+  // Decode the Number of Records in the Index. liblzma doesn't have an API for
+  // this if you don't want to decode the whole Index, so we have to do it
+  // ourselves.
+  //
+  // We need 1 byte for the Index Indicator plus 1-9 bytes for the
+  // variable-length integer Number of Records.
+  uint8_t index[10];
+  size_t index_read = 0;
+  while (index_read == 0) {
+    ssize_t bytes_read = read (fd, index, sizeof (index));
+    if (bytes_read < 0)
+      {
+        if (errno == EINTR)
+          continue;
+        return false;
+      }
+    if (bytes_read == 0)
+      return false;
+    index_read += bytes_read;
+  }
+  // The Index Indicator must be 0.
+  if (index[0] != 0)
+    return false;
+
+  lzma_vli num_records;
+  size_t pos = 0;
+  size_t in_pos = 1;
+  while (true)
+    {
+      if (in_pos >= index_read)
+        {
+          ssize_t bytes_read = read (fd, index, sizeof (index));
+          if (bytes_read < 0)
+            {
+              if (errno == EINTR)
+                continue;
+              return false;
+            }
+          if (bytes_read == 0)
+            return false;
+          index_read = bytes_read;
+          in_pos = 0;
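For reference, xz's variable-length integers, including the Number of Records being read by the loop above, are little-endian base-128: each byte contributes 7 bits, the 0x80 bit means another byte follows, and at most 9 bytes (63 bits) are allowed. A self-contained sketch of that decoding, where decode_vli is an illustrative name; liblzma's lzma_vli_decode does the same thing incrementally:

#include <lzma.h>
#include <stddef.h>

// Decode one xz variable-length integer from buf; returns false on
// truncated input or if the encoding exceeds the 9-byte maximum.
static bool
decode_vli (const uint8_t *buf, size_t len, lzma_vli *out)
{
  *out = 0;
  for (size_t i = 0; i < len && i < 9; i++)
    {
      *out |= (lzma_vli) (buf[i] & 0x7f) << (7 * i);
      if ((buf[i] & 0x80) == 0)
        return true;  // high bit clear: this was the last byte
    }
  return false;
}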