Bug#1003089: man-db has become prohibitively slow

Colin Watson Sun, 16 Jan 2022 20:15:16 -0800

Significant progress!  See the end of this email.

On Tue, Jan 04, 2022 at 10:21:57AM +0100, Steinar H. Gunderson wrote:
> On Mon, Jan 03, 2022 at 10:34:08PM +0000, Colin Watson wrote:
> > This part is already covered in https://bugs.debian.org/696503.  I admit
> > that this was last updated ten years ago; I really need to get back to
> > that and get it to work one way or another. :-/
> 
> Ah, well, that wasn't so easy to understand from man-db's bug list. :-)

In fact also https://bugs.debian.org/630799.

> > The seccomp sandbox is mainly so that man-db can more safely operate on
> > manual pages shipped by packaging systems that expect to run code under
> > confinement and thus can ship relatively untrusted code (such as snapd).
> > If we limited the mandb invocation in the postinst to only
> > /usr/share/man, we could reasonably assume that all pages there were
> > installed by dpkg, and those don't need the same level of care since any
> > package installed by dpkg can run code directly as root anyway.  It
> > would then be reasonable to run it under MAN_DISABLE_SECCOMP=1.
> 
> Let me see if I understand the threat model here, to see where you're coming
> from. The idea is that the user installs a snap (whose installation and
> running is nominally sandboxed), which contains a malicious man page
> containing an arbitrary execution exploit for man-db's apropos parser
> or similar? That sounds like something that is possible for the postinst to
> ignore, yes.
> 
> OTOH, if you assume packages can carry malicious man files, is it also part
> of the threat model that it could be overriding other package's documentation
> with spam and/or really bad advice (like “run this vulnerable script as
> root”)?

No, the threat model is just dealing with arbitrary-code-execution
attacks in the various tools man-db spawns that might be exploitable by
way of bad input.  I agree that on the whole this *seems* low-priority,
but parsers are often viable targets.

(I agree that bad-advice type attacks are *possible*, but "man -a"
exists, and other things about at least the snap ecosystem will
generally make it hard to override names shipped by .deb packages, so
I'm not worrying about this rather thornier problem for now.)

> TBH man-db sounds fairly safe compared to many other things, though. I've
> only skimmed the code, but as I understand it, a typical mandb run sends the
> file through gzip and a flex parser? If you find a bug in zlib or in flex,
> I would assume there are _much_ juicier targets out there than dropping a
> malicious man page into a snap -- so we're really left with the surrounding
> code that takes the output of the flex parser. Which isn't really all that
> much code. I understand these are the famous last words, of course... :-)
> 
> I understand that man(1) itself pushes things through groff, which is a
> different story -- but only for the case where you run man setuid.

It's still worth defending against arbitrary-code-execution attacks by a
malicious snap in groff even if it doesn't permit set-id escalation -
after all, snaps do promise a degree of sandboxing.  (But as you say,
groff parsing isn't relevant to mandb(8).)

> > I really don't want to make apropos less useful.  I'll have another go
> > at tackling the core problem here; but the fixes described above should
> > at least get us back to merely the situation we've had for years, rather
> > than the (as you say) much worse behaviour recently.
> 
> Yes, it would be really nice to have speedups and not just no regressions. :-)
> We keep installing more packages and man pages, so software probably needs to
> be made for larger data sets eventually. It's a bit like dpkg -- a lot of the
> design decisions made sense with 100 packages and 10k files, but when you get
> to 10x or 100x that, it starts getting a real pain point even if you just
> freeze it and don't make it any slower.
> 
> I took a look at mandb's profile in perf, and even after turning off
> libseccomp, it appears that perhaps 10–11% of its time is spent doing real
> work (decompression, lexer, malloc, character set conversion). The rest is
> kernel overhead from launching and exiting pipelines, which is a fairly steep
> price.
> 
> On the surface of it, it's not immediately obvious to me what the pipeline
> abstraction is all about. Is there a reason why you can't just open the man
> page, decompress it in one shot and then use fmemopen()? (Or similarly,
> open_memstream().) If it's bigger than, say, 16 MB uncompressed, you can
> probably just kick it from apropos.

The real issue is flexibility.  For instance:

 * man-db intentionally supports various compression formats without
   needing to link with all the possible decompression libraries out
   there.
 * Although mandb(8) currently scans the entire man page to determine
   which filters are applicable, that might be something I can change
   one day (it essentially involves making sure that pages that do weird
   stuff with tbl/eqn/etc. all have an appropriate '\" line at the top -
   that's been documented for long enough so it may in fact be true, but
   it needs checking).  A streaming approach is nice for that kind of
   thing.
 * The abstraction pays off much more in man(1), where the practical
   effect of a few more subprocesses is fairly minimal (except for a few
   pathological cases in tools such as Lintian), and the number of
   external tools we need to glue together is often quite high.  Being
   able to reuse the same code in mandb(8) has kept the bug count down.

But I agree that the current performance cost in mandb(8) is a high
price to pay for that flexibility.  Read to the end of this email before
you spend any time arguing with the above. :-)

> > Given #696503, would you mind if I repurposed this bug to be just about
> > fixing the seccomp-related performance regression?  I think the plan
> > above will let me deal with just that part of it relatively quickly, and
> > the fixes would be small enough to be viable candidates for stable.
> 
> Sure, that would be fine. It would perhaps be useful to make 696503 a bit
> clearer to the casual observer, though; i.e., link it to man-db somehow and
> make the title more about performance.

I think I'll leave that bug alone since #630799 exists and has a clear
enough title.

On Tue, Jan 04, 2022 at 10:28:50PM +0100, Steinar H. Gunderson wrote:
> On Tue, Jan 04, 2022 at 10:21:58AM +0100, Steinar H. Gunderson wrote:
> > I took a look at mandb's profile in perf, and even after turning off
> > libseccomp, it appears that perhaps 10–11% of its time is spent doing real
> > work (decompression, lexer, malloc, character set conversion). The rest is
> > kernel overhead from launching and exiting pipelines, which is a fairly 
> > steep
> > price.
>
> I made a straw man to test whether this was really true, and it turns it is.
> See the attached patch, which rips out the pipelines from mandb and replaces
> them by simple one-shot decompression buffers; it's by no means something
> that should be applied in its current state, but it reliably finds and parses
> all 24216 man pages on my system, giving identical behavior (as determined by
> “apropos ''”) to the current man-db. But it finishes in 4.7 seconds instead
> of 51, so indeed, about 9% of the time, and at a performance point where it
> is unlikely to be too painful in a full-upgrade.
>
> Of course, this is an unfair comparison because it does not do encoding
> conversion, but most pages do not need that, and in any case, adding it would
> be unlikely to get up to more than 10–11%. It also does not do error handling,
> arbitrary-size man pages (it uses fixed buffers instead of streaming; again
> completely possible to fix, or one could set upper limits), anything
> resembling good coding style, or shelling out to col(1). (I believe the
> latter is only for parsing cat pages, though, which would seem less relevant
> today.)
>
> But it does set a bar that I think should be approachable by mandb, without
> any special microoptimization, multithreading, fancy parsing algorithms
> or the likes. :-)

Thanks for your patch.

In the long term, I think the best approach with the neatest
abstractions may be to retrofit multithreading to libpipeline: threading
allows retaining most of the existing logic for pushing data through
subprocesses, and retains all the flexibility.  Of course we'd need to
benchmark any such approach to ensure that it actually makes enough of a
difference; also, some of the critical bits of libpipeline need to be
made async-signal-safe first.  I think I know roughly how to do this,
but I don't necessarily want to promise to get it done in any particular
amount of time.

So, I'm basically OK with installing an optimization until such time as
I get more blue-sky work done, since I don't want to get in the way
here.  Many of the obvious limits of this kind of approach could be
dealt with by treating it as a fast path that we use if conditions
permit: large files, cat pages, etc. could fall back to the pipeline
approach.

We definitely do need to sort out encoding conversion, though.  Although
UTF-8 has been recommended for many years, policy still allows "the
usual legacy encoding" and we've never got round to mandating UTF-8:

  $ w3m -dump https://lintian.debian.org/tags/national-encoding | grep --count 
usr/share/man
  502

Furthermore, I don't want to do heavily Debian-specific optimizations in
man-db, so even if we were prepared to fix all these cases in Debian as
a prerequisite, it wouldn't really help much.  As a result I do think we
need to figure out a way to get manconv to work with this optimized
approach without breaking existing code or (ideally) duplicating the
whole rather fiddly thing.

I've been experimenting with a few things, and I think the most elegant
way to do this is in fact to add another layer of abstraction!  This is
a much cheaper one, though.  If instead of returning a pipeline we have
decompress_open return a new tagged union type, we can clone the simple
pipeline_{read,peek}* functions on top of that, and convert everything
over to use those rather than talking to libpipeline directly; the
functions that specifically need pipeline support (mainly in man(1), but
also things like the cat page case in lexgrog.l) can ask to drop down to
a lower level.

It's then fairly straightforward to implement the same interface using
an alternative in-memory decompression method, and this is transparent
to calling code that doesn't explicitly ask for the underlying pipeline.
The only minor difficulty is arranging to do in-memory encoding
conversion, but that's just slightly ugly rather than fundamentally
hard.

This all turned out to be remarkably easy.  As soon as I fixed all the
compiler errors, "mandb -c" worked first time and full accessdb dumps
showed no differences from the 2.9.4 baseline.  I'd made a small mistake
that I had to fix to stop "man -K" crashing, but so far that seems to
have been it.

Would you care to have a look at this?

  https://gitlab.com/cjwatson/man-db/-/merge_requests/2

There's probably still room for improvement, but unlikely to be much
more than a factor of two or so at this point, and I think this should
get us comfortably back to the point where it's no longer annoying
people during upgrades.

-- 
Colin Watson (he/him)                              [cjwat...@debian.org]

Bug#1003089: man-db has become prohibitively slow

Reply via email to