Raphael Geissert <atomo64+deb...@gmail.com> writes:

> There might indeed be a degradation on those texts where the number of
> characters is near the limit of the methods switcher.  Other than those
> cases, when the number of words in the text is greater than the number
> of words in the hash there should be a performance improvement when
> searching for every known word in the text (there is indeed a
> performance penalty because of the regex and the text traversal, but it
> is not enough to make it slower than the other method).
But why would that be the case?  There are two scaling factors here: let
n be the length of the text and w the number of corrections we have.

The current algorithm walks the text once and does a hash table lookup
for each word.  (Well, technically it probably walks the text once to
split it on whitespace and then a second time to do the hash table
lookups.)  So we do work proportional to the length of the text, or
O(n); the constant factor is roughly 2 plus the time it takes to hash a
word.

Here's the new code:

    # Check whether we should check every word in the text or
    # better look for every word we know about in the text.
    # Use 10 as the average word length in %CORRECTIONS
    if (length($text) < $#CORRECTIONS * 10) {
        for my $word (split(/\s+/, $text)) {
            if (exists $CORRECTIONS{$word}) {
                _tag($tag, $filename, $word, $CORRECTIONS{$word});
            }
        }
    } else {
        for my $word (@CORRECTIONS) {
            my $rword = qr<$word>;
            _tag($tag, $filename, $word, $CORRECTIONS{$word})
                if ($text =~ m,(^|\s)$rword(\s|$),);
        }
    }

The first branch is the current algorithm, and this code chooses it when
n is small.  If n is large, it switches to doing a scan of the full text
for each correction we have, which is work proportional to w, the number
of corrections.  I was wrong originally -- this isn't O(n^2).  Each of
the w regex scans is itself linear in the text, so it's still linear in
n, but with a factor proportional to w: O(wn), essentially.

So, for this to be faster, the time it takes to check a portion of the
text against every correction we have using a regex must be less than
the time it takes to hash a word.  This makes no sense to me.  Any
reasonable hash algorithm should be much faster than more than 100 regex
matches.

The benchmarking looks like it may just be in the noise for me:

> Without the patch:
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m8.380s
> user 0m5.596s
> sys 0m0.640s
> (this is normal as the file is being read directly from the disk, not in
> memory)
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m6.701s
> user 0m5.800s
> sys 0m0.684s
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m6.424s
> user 0m5.740s
> sys 0m0.520s
>
> This time with the patch:
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m6.613s
> user 0m5.728s
> sys 0m0.580s
> (Let's ignore this one, so that disk reads do not influence the results)
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m6.254s
> user 0m5.564s
> sys 0m0.540s
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m6.311s
> user 0m5.456s
> sys 0m0.600s

Here are my results.  Without the patch:

windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb >/dev/null
2.304u 0.420s 0:05.08 53.5%     0+0k 16304+11392io 77pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.248u 0.468s 0:02.78 97.1%     0+0k 24+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.300u 0.400s 0:02.76 97.8%     0+0k 16+11392io 0pf+0w

With the patch:

windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.264u 0.380s 0:02.75 96.0%     0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.228u 0.424s 0:02.82 93.6%     0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.144u 0.492s 0:02.79 94.2%     0+0k 0+11392io 0pf+0w

Without the patch again:

windlord:~/tmp> time
lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.192u 0.448s 0:02.80 93.9%     0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.212u 0.448s 0:02.80 94.6%     0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.192u 0.484s 0:02.83 94.3%     0+0k 0+11392io 0pf+0w

So I'm getting basically similar results either way, which I would
expect: the spelling check is just not a big enough part of what Lintian
does to make any noticeable difference in the overall running time
unless the algorithm were just horrible.  You need to benchmark
specifically the algorithms that you're changing, in isolation, to get
high enough granularity.  When I do that, the new algorithm looks much
slower for long text:

windlord:~> ./benchmark
Benchmark: timing 1000 iterations of current, new...
   current:  6 wallclock secs ( 5.11 usr +  0.00 sys =  5.11 CPU) @ 195.69/s (n=1000)
       new: 12 wallclock secs (12.39 usr +  0.01 sys = 12.40 CPU) @ 80.65/s (n=1000)
          Rate     new current
new     80.6/s      --    -59%
current  196/s    143%      --
windlord:~> ./benchmark
Benchmark: timing 1000 iterations of current, new...
   current:  6 wallclock secs ( 5.22 usr +  0.01 sys =  5.23 CPU) @ 191.20/s (n=1000)
       new: 12 wallclock secs (12.27 usr +  0.02 sys = 12.29 CPU) @ 81.37/s (n=1000)
          Rate     new current
new     81.4/s      --    -57%
current  191/s    135%      --

Attached is the benchmarking script I used, in case I made a mistake.

--
Russ Allbery (r...@debian.org)               <http://www.eyrie.org/~eagle/>
#!/usr/bin/perl
use strict;
use warnings;

use Benchmark qw(timethese cmpthese);

use lib '/home/eagle/dvl/debian/lintian/lib';
use Spelling;

our $text;
open (IN, '<', '/usr/share/common-licenses/GPL-3') or die;
{
    local $/;
    $text = <IN>;
}
close IN;

our @CORRECTIONS = keys %Spelling::CORRECTIONS;

# The current algorithm: one hash lookup per word of the text.
sub current {
    for my $word (split(/\s+/, $text)) {
        $word = lc $word;
        if (exists $Spelling::CORRECTIONS{$word}) {
            print "Found spelling error $word\n";
        }
    }
}

# The patched algorithm: one regex scan of the text per known word.
sub new {
    for my $word (@CORRECTIONS) {
        my $rword = qr<$word>;
        if ($text =~ m,(^|\s)$rword(\s|$),) {
            print "Found spelling error $word\n";
        }
    }
}

my $results = timethese(1000, {
    current => \&current,
    new     => \&new,
});
cmpthese($results);
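For what it's worth, here is a tiny self-contained sketch (not part of
the original message -- the %CORRECTIONS entries below are made-up
stand-ins for Lintian's real table) showing that, modulo case folding
and punctuation handling, the two strategies find the same misspellings
on a small sample; the disagreement is purely about cost, O(n) versus
O(wn).  Note the \Q...\E quoting is added here for safety, whereas the
patch interpolates each word into the regex directly:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for Lintian's %Spelling::CORRECTIONS table.
my %CORRECTIONS = (
    teh     => 'the',
    recieve => 'receive',
    occured => 'occurred',
);

my $text = 'You will recieve teh package once the upload has occured';

# Strategy 1 (current): walk the text once, one hash lookup per word -- O(n).
my @found_hash = sort grep { exists $CORRECTIONS{$_} } split /\s+/, $text;

# Strategy 2 (patch): one regex scan over the text per known word -- O(wn).
my @found_regex =
    sort grep { $text =~ m/(^|\s)\Q$_\E(\s|$)/ } keys %CORRECTIONS;

print "hash:  @found_hash\n";     # prints: hash:  occured recieve teh
print "regex: @found_regex\n";    # prints: regex: occured recieve teh
```

Both lists come out identical, which is why the only question worth
benchmarking is the per-word cost of each strategy.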