On Tue, Aug 06, 2019 at 03:53:00PM +0000, Niels Thykier wrote:
> On Wed, 31 Jul 2019 19:41:51 +0200 Robert Luberda <rob...@debian.org> wrote:
> > While working on my manpages-pl source package, I've noticed that the
> > dh_installman step takes more time to execute than all other build steps 
> > together.
> > 
> > This poor performance is caused by recoding all (i.e. about 1500 in case
> > of manpages-pl) man pages into UTF-8, which is pretty much useless in case
> > of my package, because the pages are in UTF-8 already.
> > 
> > It would be nice if dh_installman could have some option to disable
> > recoding, or if it could at least filter the manpages to recode with
> > `isutf8 -l' or a similar command. (I've just checked that
> > 'isutf8 -l debian/tmp/usr/share/man/pl/man*/*' inside the package is really
> > quick to determine that all files are in UTF-8).
> 
> Is there some way to trivially detect if the manpages need re-encoding
> (without pulling moreutils as dependency or re-implementing the relevant
> code in Perl)?  Like some troff-ish rune in the early part of the file
> that says "this file is definitely UTF-8" or something like that?

man itself has code to do that, but it's not trivial and I'd hate to see
it reimplemented in more places.

The actual recoding bit of "man --recode" is already practically a no-op
if both source and target are UTF-8: it takes less than a millisecond to
decide that the page is likely to be UTF-8, and then it just passes the
source through to the target.  However, based on a quick estimate from
strace output for a small page, about 98% of the wallclock time is spent
on process setup (initial memory allocation, parsing the configuration
file, checking the manpath, and such).  So I think a far more obvious
optimisation target would be to add a mode to man where it could recode
a batch of pages rather than just one at a time, in order that we'd only
have to incur that setup cost once (or at least once per xargs batch).
Making that parallel would take some more work, but honestly, since this
approach would probably give something like a 40x speedup, I'm not sure
that'd be necessary.
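
As a rough back-of-envelope check on that figure, here is a minimal
sketch of the arithmetic.  It assumes the 98% setup share from the
strace estimate holds for every page, that per-page work is unchanged
by batching, and it uses the roughly 1500 pages of manpages-pl; the
function name and parameters are purely illustrative and not part of
man or debhelper.

```python
def batched_speedup(setup_fraction: float, pages: int, batches: int = 1) -> float:
    """Estimate the speedup from paying process setup once per batch
    instead of once per page.

    Times are normalised so that one single-page invocation costs 1.0,
    of which setup_fraction is setup and the rest is real per-page work.
    """
    if not 0.0 <= setup_fraction <= 1.0:
        raise ValueError("setup_fraction must be between 0 and 1")
    if pages < 1 or not 1 <= batches <= pages:
        raise ValueError("need 1 <= batches <= pages")

    work_fraction = 1.0 - setup_fraction
    # Today: one process per page, so setup is paid for every page.
    total_before = pages * 1.0
    # Batched: setup is paid once per batch, per-page work is unchanged.
    total_after = batches * setup_fraction + pages * work_fraction
    return total_before / total_after


if __name__ == "__main__":
    # Assumed inputs: 98% setup share, about 1500 pages.
    print(f"single batch: {batched_speedup(0.98, 1500):.1f}x")
    print(f"ten batches:  {batched_speedup(0.98, 1500, batches=10):.1f}x")
```

Under those assumptions a single batch comes out in the high 40s and
ten xargs-style batches in the high 30s, so "something like 40x" is
about the right ballpark.  The estimate is only as good as the 98%
figure, which came from one small page.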

Does this make sense to you?  If so, do you have any opinions on the
interface?  (I'm open to it being a new program rather than having to
stuff even more complexity into man's command-line interface, which
would also make it easy to detect whether the new interface is
available.)

-- 
Colin Watson                                       [cjwat...@debian.org]
