Re: PR29472: debuginfod metadata query

2023-07-07 Thread Mark Wielaard
Hi Frank,

And finally the debuginfod server code itself.

On Wed, 2023-04-12 at 16:31 -0400, Frank Ch. Eigler via Elfutils-devel
wrote:
> + * debuginfod.cxx (DEBUGINFOD_SQLITE_DDL): Add an index or two.
> + (metadata_maxtime_s, parse_opt): New parameter for load control.
> + (add_client_federation_headers): New refactored function.
> + (handle_metadata): New function.
> + (handler_cb): Call it for /metadata URL.  Trace it.
> + (groom): Tweak sqlite_ps object lifetimes.
> [...]
> diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
> index 5ef6cc32189b..000820fec5ea 100644
> --- a/debuginfod/debuginfod.cxx
> +++ b/debuginfod/debuginfod.cxx
> @@ -1,5 +1,5 @@
>  /* Debuginfo-over-http server.
> -   Copyright (C) 2019-2021 Red Hat, Inc.
> +   Copyright (C) 2019-2023 Red Hat, Inc.
> Copyright (C) 2021, 2022 Mark J. Wielaard 
> This file is part of elfutils.
>  
> @@ -68,6 +68,7 @@ extern "C" {
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  
>  /* If fts.h is included before config.h, its indirect inclusions may not
> @@ -127,6 +128,9 @@ using namespace std;
>  #define tid() pthread_self()
>  #endif
>  
> +#ifdef HAVE_JSON_C
> +  #include 
> +#endif
>  
>  inline bool
>  string_endswith(const string& haystack, const string& needle)
> @@ -185,7 +189,7 @@ static const char DEBUGINFOD_SQLITE_DDL[] =
>"foreign key (buildid) references " BUILDIDS "_buildids(id) on 
> update cascade on delete cascade,\n"
>"primary key (buildid, file, mtime)\n"
>") " WITHOUT_ROWID ";\n"
> -  // Index for faster delete by file identifier
> +  // Index for faster delete by file identifier and metadata searches
>"create index if not exists " BUILDIDS "_f_de_idx on " BUILDIDS "_f_de 
> (file, mtime);\n"
>"create table if not exists " BUILDIDS "_f_s (\n"
>"buildid integer not null,\n"
> @@ -211,6 +215,8 @@ static const char DEBUGINFOD_SQLITE_DDL[] =
>") " WITHOUT_ROWID ";\n"
>// Index for faster delete by archive file identifier
>"create index if not exists " BUILDIDS "_r_de_idx on " BUILDIDS "_r_de 
> (file, mtime);\n"
> +  // Index for metadata searches
> +  "create index if not exists " BUILDIDS "_r_de_idx2 on " BUILDIDS "_r_de 
> (content);\n"  
>"create table if not exists " BUILDIDS "_r_sref (\n" // outgoing dwarf 
> sourcefile references from rpm
>"buildid integer not null,\n"
>"artifactsrc integer not null,\n"

OK, and extra index on content in the _r_de table. Which is an index
into the _files table.

In general I find is hard to keep the whole sql structure in my head.
Is this described somewhere apart from just reading the whole
DEBUGINFOD_SQLITE_DDL?

> @@ -398,6 +404,9 @@ static const struct argp_option options[] =
> { "passive", ARGP_KEY_PASSIVE, NULL, 0, "Do not scan or groom, read-only 
> database.", 0 },
>  #define ARGP_KEY_DISABLE_SOURCE_SCAN 0x1009
> { "disable-source-scan", ARGP_KEY_DISABLE_SOURCE_SCAN, NULL, 0, "Do not 
> scan dwarf source info.", 0 },
> +#define ARGP_KEY_METADATA_MAXTIME 0x100A
> +   { "metadata-maxtime", ARGP_KEY_METADATA_MAXTIME, "SECONDS", 0,
> + "Number of seconds to limit metadata query run time, 0=unlimited.", 0 },
> { NULL, 0, NULL, 0, NULL, 0 },
>};

Ack. New --metadata-maxtime argument.

> @@ -452,6 +461,8 @@ static unsigned forwarded_ttl_limit = 8;
>  static bool scan_source_info = true;
>  static string tmpdir;
>  static bool passive_p = false;
> +static unsigned metadata_maxtime_s = 5;
> +

OK. Defaults to 5 seconds.

>  static void set_metric(const string& key, double value);
>  // static void inc_metric(const string& key);
> @@ -653,6 +664,9 @@ parse_opt (int key, char *arg,
>  case ARGP_KEY_DISABLE_SOURCE_SCAN:
>scan_source_info = false;
>break;
> +case ARGP_KEY_METADATA_MAXTIME:
> +  metadata_maxtime_s = (unsigned) atoi(arg);
> +  break;
>// case 'h': argp_state_help (state, stderr, 
> ARGP_HELP_LONG|ARGP_HELP_EXIT_OK);
>  default: return ARGP_ERR_UNKNOWN;
>  }

OK. Can be set by user.

> @@ -2040,6 +2054,58 @@ handle_buildid_r_match (bool internal_req_p,
>return r;
>  }
>  
> +void
> +add_client_federation_headers(debuginfod_client *client, MHD_Connection* 
> conn){
> +  // Transcribe incoming User-Agent:
> +  string ua = MHD_lookup_connection_value (conn, MHD_HEADER_KIND, 
> "User-Agent") ?: "";
> +  string ua_complete = string("User-Agent: ") + ua;
> +  debuginfod_add_http_header (client, ua_complete.c_str());
> +
> +  // Compute larger XFF:, for avoiding info loss during
> +  // federation, and for future cyclicity detection.
> +  string xff = MHD_lookup_connection_value (conn, MHD_HEADER_KIND, 
> "X-Forwarded-For") ?: "";
> +  if (xff != "")
> +xff += string(", "); // comma separated list
> +
> +  unsigned int xff_count = 0;
> +  for (auto&& i : xff){
> +if (i == ',') xff_count++;
> +  }
> +
> +  // if X-Forwarded-For: exceeds N hops,
> +  // 

Re: PR29472: debuginfod metadata query

2023-07-07 Thread Mark Wielaard
Hi Frank,

On Wed, 2023-04-12 at 16:31 -0400, Frank Ch. Eigler via Elfutils-devel
wrote:
> At long last, for your review, an updated & merged patch for $SUBJECT.
> "git diff -s" against git/master follows, code also on
> users/fche/try-pr29472e.
> 
> 
> Author: Frank Ch. Eigler 
> Date:   Tue Apr 11 23:35:25 2023 -0400
> 
> PR29472: debuginfod: add metadata query webapi, C api, client
> 
> This patch extends the debuginfod API with a "metadata query"
> operation.  It allows clients to request an enumeration of file names
> known to debuginfod servers, returning a JSON response including the
> matching buildids.  This lets clients later download debuginfo for a
> range of versions of the same named binaries, in case they need to to
> prospective work (like systemtap-based live-patching).  It also lets
> server operators implement prefetch triggering operations for popular
> but slow debuginfo slivers like kernel vdso.debug files on fedora.
> 
> Implementation requires a modern enough json-c library.
> 
> % debuginfod-find metadata file /bin/ls
> % debuginfod-find metadata glob "/usr/local/bin/c*"
> 
> Documentation and testing are included.
> 
> Signed-off-by: Ryan Goldberg 
> Signed-off-by: Frank Ch. Eigler 

Some higher level comments.

- Take my comments on the sql commands and json-c reference counting
with a grain of salt, I might not fully understand.

- It might be better to just require the json-c library than make it
all optional (when building the debuginfod code). If possible I would
like to see the debuginfod-client code be usable without json-c (the
merging might be possible by simple text manipulation, but if not lets
just require it also). Optional features are a bit of a pain imho.

- This could really use a couple of concrete examples. The interface is
really abstract (just a provide a key and value and some json will come
out). Which is good because that makes it extensible, but imho really
confusing without some concrete examples how to use it.

- Unfortunately debuginfod-find is not great for this right now,
because it outputs results into the cache only (and with a file name
that needs shell escaping). Might be a good idea to output pretty
printed json to stdout by default. Then you could show some simple
example easily. And it would make it easier for users to try some
things out quickly.

- Currently it seems we expect all results to be json arrays with the
empty array representing either no results or error. I think it might
be better to output a json object. Then you can have a object
representing (empty) results as:

{
  "paths": [ 'frob/baz', 'foo/bar/baz' ]
}

Or

{
  "results": [ { buildid: "hexhexhex",
   file: "/path/to/file",
   archive: "" } ]

Or return an error as:

{
  "error": "Server unreachable code: 554"
}

- I don't really know how to reason properly about the merging done for
results from multiple federated servers. But it seems to make it
impossible to distinguish between no results or an error fetching
results. Maybe creating one big results array as is done is the most
sane way. Have you though of other ways to represent "merged" results?

Cheers,

Mark