Hello Bruno, all,

msgmerge is slow: merging the outdated coreutils .po files takes roughly 12 minutes here. gcov shows that more than 99% of the time is spent in compareseq/diag, called from fstrcmp, which in turn is called indirectly from message_list_search_fuzzy. At the heart of the fstrcmp code, a modified, scaled Levenshtein distance is computed. It is used to match each msgid in a .po file against the set of msgids in the .pot file, looking for the best match. Let's look at this in detail:
Let L be the Levenshtein distance of two strings s1 and s2 with lengths l1 and l2, and let L' be the variant in which a replacement is not counted as one change but as two (one insertion plus one removal). fstrcmp then computes a similarity value r as

  r = (l1 + l2 - L') / (l1 + l2).

It holds that 0 <= r <= 1, and r equals 1 only for identical strings. Further, L <= L'.

With this setup, I see some possibilities for improvement:

1) Avoid computing distances that are known to be worse than the current best one. There are bounds on the Levenshtein distance that can be computed quickly: the Hamming distance is an upper bound (for strings of equal length), and the string length difference is a lower bound. We are looking for an upper bound on r, thus a lower bound on L'. The length difference bound L' >= L >= |l1 - l2| yields

  r <= (l1 + l2 - |l1 - l2|) / (l1 + l2) = 2 min{l1,l2} / (l1 + l2).

With this, computation time improves from 12 to roughly 3.5 minutes. (A sketch of this pre-check follows after the list.)

2) Abort the distance computation as soon as the result is known to be suboptimal. If the current best similarity is r0, then a better string pair must satisfy L' < (1 - r0) * (l1 + l2), so as soon as the ongoing distance computation exceeds this value, we can stop. This cuts off another 30 seconds or so. (Also sketched below.)

3) When computing the distance of one string pair, avoid computing the path (i.e., the edit script). While this may speed up the algorithm by a small constant factor, AFAIK there is no other win, and it would likely decouple the code from its use in diffutils. BTW, AFAICS these improvements do not help diffutils, because diff is always interested in the edit script that corresponds to a given string pair distance, no matter how large that distance is.

4) Avoid quadratic scaling in the total number of msgids (i.e., the number of strings in the .po file times the number of strings in the .pot file). There are several ways to do this; I'll only give a couple of ideas that could be tried out when this issue becomes pressing. With .po files the size of coreutils', I'm not sure this is needed yet.

You have two sets S1 and S2 (of msgids) and a metric for computing distances between any two members of the union of both sets (in this case the metric is L, not r). For each member of the first set (from the .po file), you search for the nearest neighbor in the second set. A mapping of the metric space into another one in which the sets can be partitioned efficiently may admit fast nearest-neighbor search algorithms, which exist even for very high-dimensional spaces. The question is whether such mappings exist and how to compute them. A simple example: map each string to its length. The condition from (1) then allows us to avoid even trying to compare very short strings against very long ones. For msgids this will probably not admit an efficient partitioning, though, as I expect many strings to be of similar length.

message_fuzzy_index_search looks like an implementation of a similar idea, but it uses some heuristic; it also exploits (1). I assume it is not used in the code path for coreutils' .po files because of that heuristic? OTOH, such an algorithm may require rethinking the (really nice, BTW!) parallelization that is currently achieved with OpenMP directives.

5) Use a different similarity measure. Not considered here.
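To make the pre-check in (1) concrete, here is a minimal self-contained sketch in C. The names bounded_similarity and best_so_far are made up for illustration; fstrcmp stands for the existing gnulib routine, and the actual patch wires the check in differently:

  #include <string.h>

  double fstrcmp (const char *s1, const char *s2);  /* existing gnulib routine */

  /* Compute fstrcmp (s1, s2), but first check whether the length-based
     upper bound  r <= 2 min{l1,l2} / (l1 + l2)  already rules out
     beating BEST_SO_FAR; if so, skip the expensive computation.  */
  static double
  bounded_similarity (const char *s1, const char *s2, double best_so_far)
  {
    size_t l1 = strlen (s1);
    size_t l2 = strlen (s2);
    size_t lmin = l1 < l2 ? l1 : l2;
    double upper;

    if (l1 + l2 == 0)
      return 1.0;               /* two empty strings are identical */

    upper = (double) (2 * lmin) / (double) (l1 + l2);
    if (upper <= best_so_far)
      return 0.0;               /* cannot beat the current best match */

    return fstrcmp (s1, s2);
  }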
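The cutoff in (2) can be demonstrated on a plain dynamic-programming formulation of L'; the real change lives in diag()/compareseq(), which use a different algorithm, and indel_distance_bounded is a made-up name (error handling omitted). The key observation is that every edit path crosses every row of the DP matrix, so the minimum of a row is a lower bound on the final distance:

  #include <stdlib.h>
  #include <string.h>

  /* Edit distance with replacements counted as two operations (the L'
     above).  If the result is guaranteed to exceed BOUND, give up early
     and return BOUND + 1 instead of the exact value.  */
  static size_t
  indel_distance_bounded (const char *s1, const char *s2, size_t bound)
  {
    size_t l1 = strlen (s1);
    size_t l2 = strlen (s2);
    size_t *row = malloc ((l2 + 1) * sizeof *row);
    size_t i, j, result;

    for (j = 0; j <= l2; j++)
      row[j] = j;                        /* "" -> prefix of s2: j insertions */

    for (i = 1; i <= l1; i++)
      {
        size_t diag = row[0];            /* D[i-1][j-1] */
        size_t row_min;

        row[0] = i;                      /* prefix of s1 -> "": i deletions */
        row_min = i;
        for (j = 1; j <= l2; j++)
          {
            size_t up = row[j];          /* D[i-1][j] */
            size_t subst = diag + (s1[i - 1] == s2[j - 1] ? 0 : 2);
            size_t del = up + 1;
            size_t ins = row[j - 1] + 1;
            size_t best = subst < del ? subst : del;

            row[j] = best < ins ? best : ins;
            diag = up;
            if (row[j] < row_min)
              row_min = row[j];
          }
        /* Every edit path to D[l1][l2] passes through row i, so its
           minimum is a lower bound on the final distance.  */
        if (row_min > bound)
          {
            free (row);
            return bound + 1;
          }
      }

    result = row[l2];
    free (row);
    return result;
  }

With r0 the best similarity so far, one would pass roughly (1 - r0) * (l1 + l2) as BOUND and discard any pair for which the function returns more than BOUND.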
I will reply to this message with a couple of patches against gnulib for (1) and (2) (on bug-gnulib), and a patch against gettext (on bug-gnu-gettext) to make use of the improvements. For coreutils, the produced merged .po files are identical before and after.

Cheers,
Ralf