Hi Ralf, > Your draft patch doesn't directly apply to the current CVS > tree; I haven't tested and looked at it in detail yet.
It is all committed now. > See below for a character frequency bound that is only a little more > expensive than string length comparison (sort of a 1-gram version of > msgl-fsearch). Here, it gives me almost another factor of two speedup; > I would be interested to see whether it still helps at all on top of > your patch. Excellent again!! Yes, for me too it nearly gives a 2x speedup: > > "msgmerge af.po coreutils.pot" > > 152 sec. before all your work, > > 16.5 sec. with all your patches, > > 7.6 sec. with this additional sorting. now: 4.3 sec. with your second bound. Since clearing and summing up an array of size 256 is a bit expensive if both strings are small, I made it conditional on (xvec_length + yvec_length >= 20). The threshold 20 was determined experimentally: Threshold Time of "msgmerge af.po coreutils.pot" 1 4.29 sec 10 4.28 sec 20 4.266 sec 30 4.28 sec 40 4.29 sec 50 4.35 sec 60 4.50 sec Committed like this: 2008-09-15 Ralf Wildenhues <[EMAIL PROTECTED]> * lib/fstrcmp.c (fstrcmp_bounded): Use a second, less quick upper bound based on character occurrence counts. *** lib/fstrcmp.c.orig 2008-09-16 03:11:53.000000000 +0200 --- lib/fstrcmp.c 2008-09-16 03:10:03.000000000 +0200 *************** *** 141,146 **** --- 141,195 ---- if (upper_bound < lower_bound) /* Return an arbitrary value < LOWER_BOUND. */ return 0.0; + + #if CHAR_BIT <= 8 + /* When X and Y are both small, avoid the overhead of setting up an + array of size 256. */ + if (xvec_length + yvec_length >= 20) + { + /* Compute a less quick upper bound. + Each edit is an insertion or deletion of a character, hence + modifies the occurrence count of a character by 1 and leaves the + other occurrence counts unchanged. + Therefore, when starting from a sequence X and ending at a + sequence Y, and denoting the occurrence count of C in X with + OCC (X, C), with N edits, + sum_C | OCC (X, C) - OCC (Y, C) | <= N. + (Proof by induction over N.) + So, at the end, we will have + edit_count >= sum_C | OCC (X, C) - OCC (Y, C) |, + and hence + result + = (xvec_length + yvec_length - edit_count) + / (xvec_length + yvec_length) + <= (xvec_length + yvec_length - sum_C | OCC(X,C) - OCC(Y,C) |) + / (xvec_length + yvec_length). + */ + int occ_diff[UCHAR_MAX + 1]; /* array C -> OCC(X,C) - OCC(Y,C) */ + int sum; + + /* Determine the occurrence counts in X. */ + memset (occ_diff, 0, sizeof (occ_diff)); + for (i = xvec_length - 1; i >= 0; i--) + occ_diff[(unsigned char) string1[i]]++; + /* Subtract the occurrence counts in Y. */ + for (i = yvec_length - 1; i >= 0; i--) + occ_diff[(unsigned char) string2[i]]--; + /* Sum up the absolute values. */ + sum = 0; + for (i = 0; i <= UCHAR_MAX; i++) + { + int d = occ_diff[i]; + sum += (d >= 0 ? d : -d); + } + + upper_bound = 1.0 - (double) sum / (xvec_length + yvec_length); + + if (upper_bound < lower_bound) + /* Return an arbitrary value < LOWER_BOUND. */ + return 0.0; + } + #endif } /* set the info for each string. */