On Thu, Jan 11, 2024, 23:21 Chet Ramey <chet.ra...@case.edu> wrote: > On 1/10/24 12:28 AM, Eric Wong wrote: > > Hi, I noticed bash struggles with gigantic completion lists > > (100k items of ~70 chars each) > > > > It's reproducible with both LANG+LC_ALL set to en_US.UTF-8 and C, > > so it's not just locales slowing things down. > > > > This happens on the up-to-date `devel' branch > > (commit 584a2b4c9e11bd713030916d9d832602891733d7), > > but I first noticed this on Debian oldstable (5.1.4) > > > > strcoll and strlen seem to be at the top of profiles, and > > mregister_free when building devel with default options... > > ltrace reveals it's doing strlen repeatedly on the entire > > (100k items * 70 chars each = ~7MB) > > OK. Let's look at what happens. Let's say the text you're trying to > complete is "a" (or ""). > > You generate a huge list of strings and store that list into a string, > which the shell has to split into individual words (there's only one) and > run through word expansion, as part of running compgen -W on it. >
dunno : this list as flat text , and the matches as regex .. Since you have to return just words from the list, you need to run > strcmp/strcoll against each member of the list to figure out which ones > to output to store in COMPREPLY. At least that's all done in a subshell. > > So you end up with every substring in $wordlist as part of $COMPREPLY. > > Then you hand that list to readline, which runs through it to find the > longest common substring of all the matches. This is where you get a > ton of strlen calls, since mbrtowc needs to know the maximum number of > bytes to examine when converting strings to potentially multibyte > characters. Room for improvement here, but strlen is pretty cheap. > Still, it would reduce the number of calls. > (lib/readline/complete.c:compute_lcd_of_matches()). > > You end up with a list of matches that readline wants: LCD in matches[0], > the matches in the rest of the match list, NULL terminated. Now you have > to postprocess it. > > You have to remove duplicate matches, since you didn't tell readline to > keep them. To make that easier, you have to sort them (since you didn't > tell readline not to). That's where you get the calls to strcmp/strcoll -- > you can't avoid them. You can't count on qsort removing duplicates for > you, so you have to run through the array again, comparing each element > against the next until they don't match, marking the ones that do for > removal -- more strcmp/strcoll calls. > > Now you have the list of possible matches, and readline will either insert > the longest common substring or display the match list. If you're going to > display the match list, you have to run through the array again to > determine the longest match, do any required processing for the > completion-prefix-display-length and completion-display-width variables, > possibly color the common prefix, then print the matches, which will > call strlen() on each match to determine how much screen space the match > will take. > > A lot of this is caused by readline's passing around string vectors > (char **) instead of some struct that held a string and its length. But > that's the public API, and I have higher-priority things to do than > redo readline's internal completion architecture. > > You're welcome to take your shot, and make improvements where there are > improvements to be made. I'd be glad to take them. > > Chet > > -- > ``The lyf so short, the craft so long to lerne.'' - Chaucer > ``Ars longa, vita brevis'' - Hippocrates > Chet Ramey, UTech, CWRU c...@case.edu http://tiswww.cwru.edu/~chet/ > >