Re: msgmerge speedup: fstrcmp and diffseq improvements

Bruno Haible Mon, 15 Sep 2008 18:29:22 -0700

Hi Ralf,

> Your draft patch doesn't directly apply to the current CVS
> tree; I haven't tested and looked at it in detail yet.


It is all committed now.

> See below for a character frequency bound that is only a little more
> expensive than string length comparison (sort of a 1-gram version of
> msgl-fsearch).  Here, it gives me almost another factor of two speedup;
> I would be interested to see whether it still helps at all on top of
> your patch.

Excellent again!! Yes, for me too it nearly gives a 2x speedup:

> >   "msgmerge af.po coreutils.pot"
> >  152 sec.   before all your work,
> >   16.5 sec. with all your patches,
> >    7.6 sec. with this additional sorting.

now:   4.3 sec. with your second bound.

Since clearing and summing up an array of size 256 is a bit expensive if
both strings are small, I made it conditional on
   (xvec_length + yvec_length >= 20).
The threshold 20 was determined experimentally:

    Threshold    Time of "msgmerge af.po coreutils.pot"

         1          4.29 sec
        10          4.28 sec
        20          4.266 sec
        30          4.28 sec
        40          4.29 sec
        50          4.35 sec
        60          4.50 sec

Committed like this:


2008-09-15  Ralf Wildenhues  <[EMAIL PROTECTED]>

        * lib/fstrcmp.c (fstrcmp_bounded): Use a second, less quick upper bound
        based on character occurrence counts.

*** lib/fstrcmp.c.orig  2008-09-16 03:11:53.000000000 +0200
--- lib/fstrcmp.c       2008-09-16 03:10:03.000000000 +0200
***************
*** 141,146 ****
--- 141,195 ----
        if (upper_bound < lower_bound)
        /* Return an arbitrary value < LOWER_BOUND.  */
        return 0.0;
+ 
+ #if CHAR_BIT <= 8
+       /* When X and Y are both small, avoid the overhead of setting up an
+        array of size 256.  */
+       if (xvec_length + yvec_length >= 20)
+       {
+         /* Compute a less quick upper bound.
+            Each edit is an insertion or deletion of a character, hence
+            modifies the occurrence count of a character by 1 and leaves the
+            other occurrence counts unchanged.
+            Therefore, when starting from a sequence X and ending at a
+            sequence Y, and denoting the occurrence count of C in X with
+            OCC (X, C), with N edits,
+              sum_C | OCC (X, C) - OCC (Y, C) | <= N.
+            (Proof by induction over N.)
+            So, at the end, we will have
+              edit_count >= sum_C | OCC (X, C) - OCC (Y, C) |,
+            and hence
+              result
+                = (xvec_length + yvec_length - edit_count)
+                  / (xvec_length + yvec_length)
+                <= (xvec_length + yvec_length - sum_C | OCC(X,C) - OCC(Y,C) |)
+                   / (xvec_length + yvec_length).
+          */
+         int occ_diff[UCHAR_MAX + 1]; /* array C -> OCC(X,C) - OCC(Y,C) */
+         int sum;
+ 
+         /* Determine the occurrence counts in X.  */
+         memset (occ_diff, 0, sizeof (occ_diff));
+         for (i = xvec_length - 1; i >= 0; i--)
+           occ_diff[(unsigned char) string1[i]]++;
+         /* Subtract the occurrence counts in Y.  */
+         for (i = yvec_length - 1; i >= 0; i--)
+           occ_diff[(unsigned char) string2[i]]--;
+         /* Sum up the absolute values.  */
+         sum = 0;
+         for (i = 0; i <= UCHAR_MAX; i++)
+           {
+             int d = occ_diff[i];
+             sum += (d >= 0 ? d : -d);
+           }
+ 
+         upper_bound = 1.0 - (double) sum / (xvec_length + yvec_length);
+ 
+         if (upper_bound < lower_bound)
+           /* Return an arbitrary value < LOWER_BOUND.  */
+           return 0.0;
+       }
+ #endif
      }
  
    /* set the info for each string.  */

Re: msgmerge speedup: fstrcmp and diffseq improvements

Reply via email to