Hello Bruno, all,

msgmerge is slow: merging the outdated coreutils .po files takes roughly 12 minutes here. gcov shows that more than 99% of the time is spent in compareseq/diag, called from fstrcmp, which in turn is called indirectly from message_list_search_fuzzy. At the heart of the fstrcmp code, a modified, scaled Levenshtein distance is computed. It is used to match each msgid in a .po file against the set of msgids in the .pot file, looking for the best match. Let's look at this in detail:
Let L be the Levenshtein distance of two strings s1 and s2 with lengths l1 and l2, and let L' be the variant in which a replacement is not counted as one change but as two (one insertion plus one removal). fstrcmp then computes a similarity value r as

  r = (l1 + l2 - L') / (l1 + l2).

It holds that 0 <= r <= 1, and r equals 1 only for identical strings. Further, L <= L'.

With this setup, I see some possibilities for improvement:

1) Avoid computing distances that are known to be worse than the current best one. There are bounds on the Levenshtein distance that can be computed quickly: the Hamming distance is an upper bound (for strings of equal length), and the string length difference is a lower bound. We are looking for an upper bound on r, thus a lower bound on L'. The length difference bound L' >= L >= |l1 - l2| yields

  r <= (l1 + l2 - |l1 - l2|) / (l1 + l2) = 2 min{l1,l2} / (l1 + l2).

With this, computation time improves from 12 to roughly 3.5 minutes. (A sketch of this pre-check follows after the list.)

2) Abort the distance computation as soon as the result is known to be suboptimal. If the current best similarity is r0, then a better string pair must satisfy L' < (1 - r0) * (l1 + l2), so as soon as the ongoing distance computation exceeds this value, we can stop. This cuts off another 30 seconds or so. (Also sketched below.)

3) When computing the distance of one string pair, avoid computing the path (i.e., the edit script). While this may speed up the algorithm by a small constant factor, AFAIK there is no other win, and it would likely decouple the code from its use in diffutils. BTW, AFAICS these improvements do not help diffutils, because diff is always interested in the edit script that corresponds to a given string pair distance, no matter how large that distance is.

4) Avoid quadratic scaling in the total number of msgids (i.e., the number of strings in the .po file times the number of strings in the .pot file). There are several ways to do this; I'll only give a couple of ideas that could be tried out when this issue becomes pressing. With .po files the size of coreutils', I'm not sure this is needed yet.

You have two sets S1 and S2 (of msgids) and a metric for computing distances between any two members of the union of both sets (in this case the metric is L, not r). For each member of the first set (from the .po file), you search for the nearest neighbor in the second set. A mapping of the metric space into another one in which the sets can be partitioned efficiently may admit fast nearest-neighbor search algorithms, which exist even for very high-dimensional spaces. The question is whether such mappings exist and how to compute them. A simple example: map each string to its length. The condition from (1) then allows us to avoid even trying to compare very short strings against very long ones. For msgids this will probably not admit an efficient partitioning, though, as I expect many strings to be of similar length.

message_fuzzy_index_search looks like an implementation of a similar idea, but it uses some heuristic; it also exploits (1). I assume it is not used in the code path for coreutils' .po files because of that heuristic? OTOH, such an algorithm may require rethinking the (really nice, BTW!) parallelization that is currently achieved with OpenMP directives.

5) Use a different similarity measure. Not considered here.
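To make the pre-check in (1) concrete, here is a minimal self-contained sketch in C. The names bounded_similarity and best_so_far are made up for illustration; fstrcmp stands for the existing gnulib routine, and the actual patch wires the check in differently:

  #include <string.h>

  double fstrcmp (const char *s1, const char *s2);  /* existing gnulib routine */

  /* Compute fstrcmp (s1, s2), but first check whether the length-based
     upper bound  r <= 2 min{l1,l2} / (l1 + l2)  already rules out
     beating BEST_SO_FAR; if so, skip the expensive computation.  */
  static double
  bounded_similarity (const char *s1, const char *s2, double best_so_far)
  {
    size_t l1 = strlen (s1);
    size_t l2 = strlen (s2);
    size_t lmin = l1 < l2 ? l1 : l2;
    double upper;

    if (l1 + l2 == 0)
      return 1.0;               /* two empty strings are identical */

    upper = (double) (2 * lmin) / (double) (l1 + l2);
    if (upper <= best_so_far)
      return 0.0;               /* cannot beat the current best match */

    return fstrcmp (s1, s2);
  }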
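The cutoff in (2) can be demonstrated on a plain dynamic-programming formulation of L'; the real change lives in diag()/compareseq(), which use a different algorithm, and indel_distance_bounded is a made-up name (error handling omitted). The key observation is that every edit path crosses every row of the DP matrix, so the minimum of a row is a lower bound on the final distance:

  #include <stdlib.h>
  #include <string.h>

  /* Edit distance with replacements counted as two operations (the L'
     above).  If the result is guaranteed to exceed BOUND, give up early
     and return BOUND + 1 instead of the exact value.  */
  static size_t
  indel_distance_bounded (const char *s1, const char *s2, size_t bound)
  {
    size_t l1 = strlen (s1);
    size_t l2 = strlen (s2);
    size_t *row = malloc ((l2 + 1) * sizeof *row);
    size_t i, j, result;

    for (j = 0; j <= l2; j++)
      row[j] = j;                        /* "" -> prefix of s2: j insertions */

    for (i = 1; i <= l1; i++)
      {
        size_t diag = row[0];            /* D[i-1][j-1] */
        size_t row_min;

        row[0] = i;                      /* prefix of s1 -> "": i deletions */
        row_min = i;
        for (j = 1; j <= l2; j++)
          {
            size_t up = row[j];          /* D[i-1][j] */
            size_t subst = diag + (s1[i - 1] == s2[j - 1] ? 0 : 2);
            size_t del = up + 1;
            size_t ins = row[j - 1] + 1;
            size_t best = subst < del ? subst : del;

            row[j] = best < ins ? best : ins;
            diag = up;
            if (row[j] < row_min)
              row_min = row[j];
          }
        /* Every edit path to D[l1][l2] passes through row i, so its
           minimum is a lower bound on the final distance.  */
        if (row_min > bound)
          {
            free (row);
            return bound + 1;
          }
      }

    result = row[l2];
    free (row);
    return result;
  }

With r0 the best similarity so far, one would pass roughly (1 - r0) * (l1 + l2) as BOUND and discard any pair for which the function returns more than BOUND.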
I will reply to this message with a couple of patches against gnulib for (1) and (2) (on bug-gnulib), and a patch against gettext (on bug-gnu-gettext) to make use of the improvements. For coreutils, the produced merged .po files are identical before and after.

Cheers,
Ralf