Re: completion very slow with gigantic list

Chet Ramey Thu, 11 Jan 2024 14:21:11 -0800

On 1/10/24 12:28 AM, Eric Wong wrote:

Hi, I noticed bash struggles with gigantic completion lists
(100k items of ~70 chars each)


It's reproducible with both LANG+LC_ALL set to en_US.UTF-8 and C,
so it's not just locales slowing things down.

This happens on the up-to-date `devel' branch
(commit 584a2b4c9e11bd713030916d9d832602891733d7),
but I first noticed this on Debian oldstable (5.1.4)

strcoll and strlen seem to be at the top of profiles, and
mregister_free when building devel with default options...
ltrace reveals it's doing strlen repeatedly on the entire
(100k items * 70 chars each = ~7MB)


OK. Let's look at what happens. Let's say the text you're trying to
complete is "a" (or "").

You generate a huge list of strings and store that list into a string,
which the shell has to split into individual words (there's only one) and
run through word expansion, as part of running compgen -W on it.

Since you have to return just words from the list, you need to run
strcmp/strcoll against each member of the list to figure out which ones
to output to store in COMPREPLY. At least that's all done in a subshell.

So you end up with every substring in $wordlist as part of $COMPREPLY.

Then you hand that list to readline, which runs through it to find the
longest common substring of all the matches. This is where you get a
ton of strlen calls, since mbrtowc needs to know the maximum number of
bytes to examine when converting strings to potentially multibyte
characters. Room for improvement here, but strlen is pretty cheap.
Still, it would reduce the number of calls.
(lib/readline/complete.c:compute_lcd_of_matches()).

You end up with a list of matches that readline wants: LCD in matches[0],
the matches in the rest of the match list, NULL terminated. Now you have
to postprocess it.

You have to remove duplicate matches, since you didn't tell readline to
keep them. To make that easier, you have to sort them (since you didn't
tell readline not to). That's where you get the calls to strcmp/strcoll --
you can't avoid them. You can't count on qsort removing duplicates for
you, so you have to run through the array again, comparing each element
against the next until they don't match, marking the ones that do for
removal -- more strcmp/strcoll calls.

Now you have the list of possible matches, and readline will either insert
the longest common substring or display the match list. If you're going to
display the match list, you have to run through the array again to
determine the longest match, do any required processing for the
completion-prefix-display-length and completion-display-width variables,
possibly color the common prefix, then print the matches, which will
call strlen() on each match to determine how much screen space the match
will take.

A lot of this is caused by readline's passing around string vectors
(char **) instead of some struct that held a string and its length. But
that's the public API, and I have higher-priority things to do than
redo readline's internal completion architecture.

You're welcome to take your shot, and make improvements where there are
improvements to be made. I'd be glad to take them.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/

OpenPGP_signature.asc
Description: OpenPGP digital signature

Re: completion very slow with gigantic list

Reply via email to