Re: [bug-diffutils] bug#16848: diff reports a diff on identical lines

Paul Eggert Sun, 23 Feb 2014 23:00:25 -0800

Thanks for the bug report.

I applied the attached three patches; the first is to gnulib and changesgnulib's diffseq module to handle that case better, the second is todiffutils and brings diffutils up to the latest gnulib, and the third isalso to diffutils and makes diffutils compatible with the new gnulib inorder to use the fix. I'll CC: this to bug-gnulib. The fix works forme so I'm marking this bug as fixed, but if it still doesn't work foryou I can reopen it.

diff --git a/ChangeLog b/ChangeLog
index a46689f..fcb9afc 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,27 @@
 2014-02-23  Paul Eggert  <egg...@cs.ucla.edu>
 
+       diffseq: remove TOO_EXPENSIVE heuristic
+       Problem with diffutils reported by Vincent Lefevre in
+       <http://bugs.gnu.org/16848>.  The simplest solution is to remove
+       the TOO_EXPENSIVE heuristic that I added to GNU diff in 1993.
+       Although appropriate for circa-1993 hardware, these days the heuristic
+       seems to be more trouble than it's worth.
+       * lib/diffseq.h: Modernize citations.
+       (struct context): Remove member too_expensive.
+       All uses changed.
+       (struct partition): Remove members lo_minimal, hi_minimal.
+       All uses changed.
+       (diag, compareseq): Remove arg find_minimal.  All uses changed.
+       (diag): Remove the TOO_EXPENSIVE heuristic that I added back in
+       1993 to make 'diff' run faster (but not as well) on large inputs.
+       These days, computers are fast enough that it's typically better
+       to run slower but more accurately.
+       * lib/fstrcmp.c: Remove duplicate comment.
+       * lib/fstrcmp.c (strcmp_bounded):
+       * lib/git-merge-changelog.c (compute_differences):
+       Adjust to diffseq.h changes.
+       * NEWS: Document the change.
+
        savedir: simplify by using stpcpy
        * lib/savedir.c (direntry_t): Remove size member.  All uses removed.
        (streamsavedir): Use stpcpy instead.
diff --git a/NEWS b/NEWS
index 4b02a5a..11cb2ff 100644
--- a/NEWS
+++ b/NEWS
@@ -3,6 +3,11 @@ Important notes
 
 Date        Modules         Changes
 
+2014-02-23  diffseq         The members too_expensive, lo_minimal and 
hi_minimal
+                            were removed from public structureas, and the
+                            find_minimal argument was removed from diag
+                            and compareseq.
+
 2014-02-11  savedir         The savedir and streamsavedir functions have a
                             new argument specifying how to sort the result.
                             The fdsavedir function is removed.
diff --git a/lib/diffseq.h b/lib/diffseq.h
index 73e5600..1974c36 100644
--- a/lib/diffseq.h
+++ b/lib/diffseq.h
@@ -26,18 +26,15 @@
    distance" in Wikipedia.
 
    The basic algorithm is described in:
-   "An O(ND) Difference Algorithm and its Variations", Eugene Myers,
-   Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;
-   see especially section 4.2, which describes the variation used below.
+   "An O(ND) Difference Algorithm and its Variations", Eugene W. Myers,
+   Algorithmica Vol. 1, 1986, pp. 251-266,
+   <http://dx.doi.org/10.1007/BF01840446>.
+   See especially section 4.2, which describes the variation used below.
 
    The basic algorithm was independently discovered as described in:
-   "Algorithms for Approximate String Matching", E. Ukkonen,
-   Information and Control Vol. 64, 1985, pp. 100-118.
-
-   Unless the 'find_minimal' flag is set, this code uses the TOO_EXPENSIVE
-   heuristic, by Paul Eggert, to limit the cost to O(N**1.5 log N)
-   at the price of producing suboptimal output for large inputs with
-   many differences.  */
+   "Algorithms for Approximate String Matching", Esko Ukkonen,
+   Information and Control Vol. 64, 1985, pp. 100-118,
+   <http://dx.doi.org/10.1016/S0019-9958(85)80046-2>.  */
 
 /* Before including this file, you need to define:
      ELEMENT                 The element type of the vectors being compared.
@@ -120,15 +117,12 @@ struct context
   OFFSET *bdiag;
 
   #ifdef USE_HEURISTIC
-  /* This corresponds to the diff -H flag.  With this heuristic, for
-     vectors with a constant small density of changes, the algorithm is
-     linear in the vectors size.  */
+  /* This corresponds to the diff --speed-large-files flag.  With this
+     heuristic, for vectors with a constant small density of changes,
+     the algorithm is linear in the vector size.  */
   bool heuristic;
   #endif
 
-  /* Edit scripts longer than this are too expensive to compute.  */
-  OFFSET too_expensive;
-
   /* Snakes bigger than this are considered "big".  */
   #define SNAKE_LIMIT 20
 };
@@ -138,12 +132,6 @@ struct partition
   /* Midpoints of this partition.  */
   OFFSET xmid;
   OFFSET ymid;
-
-  /* True if low half will be analyzed minimally.  */
-  bool lo_minimal;
-
-  /* Likewise for high half.  */
-  bool hi_minimal;
 };
 
 
@@ -155,17 +143,10 @@ struct partition
    When the two searches meet, we have found the midpoint of the shortest
    edit sequence.
 
-   If FIND_MINIMAL is true, find the minimal edit script regardless of
-   expense.  Otherwise, if the search is too expensive, use heuristics to
-   stop the search and report a suboptimal answer.
-
-   Set PART->(xmid,ymid) to the midpoint (XMID,YMID).  The diagonal number
+   Set *PART to the midpoint (XMID,YMID).  The diagonal number
    XMID - YMID equals the number of inserted elements minus the number
    of deleted elements (counting only elements before the midpoint).
 
-   Set PART->lo_minimal to true iff the minimal edit script for the
-   left half of the partition is known; similarly for PART->hi_minimal.
-
    This function assumes that the first elements of the specified portions
    of the two vectors do not match, and likewise that the last elements do not
    match.  The caller must trim matching elements from the beginning and end
@@ -175,7 +156,7 @@ struct partition
    suboptimal diff output.  It cannot cause incorrect diff output.  */
 
 static void
-diag (OFFSET xoff, OFFSET xlim, OFFSET yoff, OFFSET ylim, bool find_minimal,
+diag (OFFSET xoff, OFFSET xlim, OFFSET yoff, OFFSET ylim,
       struct partition *part, struct context *ctxt)
 {
   OFFSET *const fd = ctxt->fdiag;       /* Give the compiler a chance. */
@@ -235,7 +216,6 @@ diag (OFFSET xoff, OFFSET xlim, OFFSET yoff, OFFSET ylim, 
bool find_minimal,
             {
               part->xmid = x;
               part->ymid = y;
-              part->lo_minimal = part->hi_minimal = true;
               return;
             }
         }
@@ -268,14 +248,10 @@ diag (OFFSET xoff, OFFSET xlim, OFFSET yoff, OFFSET ylim, 
bool find_minimal,
             {
               part->xmid = x;
               part->ymid = y;
-              part->lo_minimal = part->hi_minimal = true;
               return;
             }
         }
 
-      if (find_minimal)
-        continue;
-
 #ifdef USE_HEURISTIC
       /* Heuristic: check occasionally for a diagonal that has made lots
          of progress compared with the edit distance.  If we have any
@@ -319,11 +295,7 @@ diag (OFFSET xoff, OFFSET xlim, OFFSET yoff, OFFSET ylim, 
bool find_minimal,
                   }
               }
             if (best > 0)
-              {
-                part->lo_minimal = true;
-                part->hi_minimal = false;
-                return;
-              }
+             return;
           }
 
           {
@@ -358,77 +330,10 @@ diag (OFFSET xoff, OFFSET xlim, OFFSET yoff, OFFSET ylim, 
bool find_minimal,
                   }
               }
             if (best > 0)
-              {
-                part->lo_minimal = false;
-                part->hi_minimal = true;
-                return;
-              }
+             return;
           }
         }
 #endif /* USE_HEURISTIC */
-
-      /* Heuristic: if we've gone well beyond the call of duty, give up
-         and report halfway between our best results so far.  */
-      if (c >= ctxt->too_expensive)
-        {
-          OFFSET fxybest;
-          OFFSET fxbest IF_LINT (= 0);
-          OFFSET bxybest;
-          OFFSET bxbest IF_LINT (= 0);
-
-          /* Find forward diagonal that maximizes X + Y.  */
-          fxybest = -1;
-          for (d = fmax; d >= fmin; d -= 2)
-            {
-              OFFSET x = MIN (fd[d], xlim);
-              OFFSET y = x - d;
-              if (ylim < y)
-                {
-                  x = ylim + d;
-                  y = ylim;
-                }
-              if (fxybest < x + y)
-                {
-                  fxybest = x + y;
-                  fxbest = x;
-                }
-            }
-
-          /* Find backward diagonal that minimizes X + Y.  */
-          bxybest = OFFSET_MAX;
-          for (d = bmax; d >= bmin; d -= 2)
-            {
-              OFFSET x = MAX (xoff, bd[d]);
-              OFFSET y = x - d;
-              if (y < yoff)
-                {
-                  x = yoff + d;
-                  y = yoff;
-                }
-              if (x + y < bxybest)
-                {
-                  bxybest = x + y;
-                  bxbest = x;
-                }
-            }
-
-          /* Use the better of the two diagonals.  */
-          if ((xlim + ylim) - bxybest < fxybest - (xoff + yoff))
-            {
-              part->xmid = fxbest;
-              part->ymid = fxybest - fxbest;
-              part->lo_minimal = true;
-              part->hi_minimal = false;
-            }
-          else
-            {
-              part->xmid = bxbest;
-              part->ymid = bxybest - bxbest;
-              part->lo_minimal = false;
-              part->hi_minimal = true;
-            }
-          return;
-        }
     }
   #undef XREF_YREF_EQUAL
 }
@@ -442,9 +347,6 @@ diag (OFFSET xoff, OFFSET xlim, OFFSET yoff, OFFSET ylim, 
bool find_minimal,
    Note that XLIM, YLIM are exclusive bounds.  All indices into the vectors
    are origin-0.
 
-   If FIND_MINIMAL, find a minimal difference no matter how
-   expensive it is.
-
    The results are recorded by invoking NOTE_DELETE and NOTE_INSERT.
 
    Return false if terminated normally, or true if terminated through early
@@ -452,7 +354,7 @@ diag (OFFSET xoff, OFFSET xlim, OFFSET yoff, OFFSET ylim, 
bool find_minimal,
 
 static bool
 compareseq (OFFSET xoff, OFFSET xlim, OFFSET yoff, OFFSET ylim,
-            bool find_minimal, struct context *ctxt)
+            struct context *ctxt)
 {
 #ifdef ELEMENT
   ELEMENT const *xv = ctxt->xvec; /* Help the compiler.  */
@@ -498,12 +400,12 @@ compareseq (OFFSET xoff, OFFSET xlim, OFFSET yoff, OFFSET 
ylim,
       struct partition part IF_LINT2 (= { .xmid = 0, .ymid = 0 });
 
       /* Find a point of correspondence in the middle of the vectors.  */
-      diag (xoff, xlim, yoff, ylim, find_minimal, &part, ctxt);
+      diag (xoff, xlim, yoff, ylim, &part, ctxt);
 
       /* Use the partitions to split this problem into subproblems.  */
-      if (compareseq (xoff, part.xmid, yoff, part.ymid, part.lo_minimal, ctxt))
+      if (compareseq (xoff, part.xmid, yoff, part.ymid, ctxt))
         return true;
-      if (compareseq (part.xmid, xlim, part.ymid, ylim, part.hi_minimal, ctxt))
+      if (compareseq (part.xmid, xlim, part.ymid, ylim, ctxt))
         return true;
     }
 
diff --git a/lib/fstrcmp.c b/lib/fstrcmp.c
index 7bb9eef..9c73904 100644
--- a/lib/fstrcmp.c
+++ b/lib/fstrcmp.c
@@ -13,33 +13,9 @@
    GNU General Public License for more details.
 
    You should have received a copy of the GNU General Public License
-   along with this program.  If not, see <http://www.gnu.org/licenses/>.
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
 
 
-   Derived from GNU diff 2.7, analyze.c et al.
-
-   The basic idea is to consider two vectors as similar if, when
-   transforming the first vector into the second vector through a
-   sequence of edits (inserts and deletes of one element each),
-   this sequence is short - or equivalently, if the ordered list
-   of elements that are untouched by these edits is long.  For a
-   good introduction to the subject, read about the "Levenshtein
-   distance" in Wikipedia.
-
-   The basic algorithm is described in:
-   "An O(ND) Difference Algorithm and its Variations", Eugene Myers,
-   Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;
-   see especially section 4.2, which describes the variation used below.
-
-   The basic algorithm was independently discovered as described in:
-   "Algorithms for Approximate String Matching", E. Ukkonen,
-   Information and Control Vol. 64, 1985, pp. 100-118.
-
-   Unless the 'find_minimal' flag is set, this code uses the TOO_EXPENSIVE
-   heuristic, by Paul Eggert, to limit the cost to O(N**1.5 log N)
-   at the price of producing suboptimal output for large inputs with
-   many differences.  */
-
 #include <config.h>
 
 /* Specification.  */
@@ -203,16 +179,6 @@ fstrcmp_bounded (const char *string1, const char *string2, 
double lower_bound)
   ctxt.xvec = string1;
   ctxt.yvec = string2;
 
-  /* Set TOO_EXPENSIVE to be approximate square root of input size,
-     bounded below by 256.  */
-  ctxt.too_expensive = 1;
-  for (i = xvec_length + yvec_length;
-       i != 0;
-       i >>= 2)
-    ctxt.too_expensive <<= 1;
-  if (ctxt.too_expensive < 256)
-    ctxt.too_expensive = 256;
-
   /* Allocate memory for fdiag and bdiag from a thread-local pool.  */
   fdiag_len = xvec_length + yvec_length + 3;
   gl_once (keys_init_once, keys_init);
@@ -252,7 +218,7 @@ fstrcmp_bounded (const char *string1, const char *string2, 
double lower_bound)
 
   /* Now do the main comparison algorithm */
   ctxt.edit_count = - ctxt.edit_count_limit;
-  if (compareseq (0, xvec_length, 0, yvec_length, 0, &ctxt)) /* Prob: 98% */
+  if (compareseq (0, xvec_length, 0, yvec_length, &ctxt)) /* Prob: 98% */
     /* The edit_count passed the limit.  Hence the result would be
        < lower_bound.  We can return any value < lower_bound instead.  */
     return 0.0;
diff --git a/lib/git-merge-changelog.c b/lib/git-merge-changelog.c
index 60dd972..9d4bd5c 100644
--- a/lib/git-merge-changelog.c
+++ b/lib/git-merge-changelog.c
@@ -678,11 +678,10 @@ compute_differences (struct changelog_file *file1, struct 
changelog_file *file2,
     ctxt.index_mapping_reverse[j] = 0;
   ctxt.fdiag = XNMALLOC (2 * (n1 + n2 + 3), ssize_t) + n2 + 1;
   ctxt.bdiag = ctxt.fdiag + n1 + n2 + 3;
-  ctxt.too_expensive = n1 + n2;
 
   /* Store in ctxt.index_mapping and ctxt.index_mapping_reverse a -1 for
      each removed or added entry.  */
-  compareseq (0, n1, 0, n2, 0, &ctxt);
+  compareseq (0, n1, 0, n2, &ctxt);
 
   /* Complete the index_mapping and index_mapping_reverse arrays.  */
   i = 0;

From 777a3185abb5551f0923d44edfb3ba519657b569 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Sun, 23 Feb 2014 16:23:17 -0800
Subject: [PATCH 1/2] build: update gnulib submodule to latest


---
 gnulib | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gnulib b/gnulib
index c96bab3..1aed559 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit c96bab3fee48a9df55e7366344f838e1fc785c28
+Subproject commit 1aed559952395ff1eff24263701b349eb77a4e06
-- 
1.8.5.3

From 5e303b170cd8f6a0a34a30d724e4c7d7a444ef32 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Sun, 23 Feb 2014 22:49:27 -0800
Subject: [PATCH 2/2] diff: remove TOO_EXPENSIVE heuristic

Problem reported by Vincent Lefevre in <http://bugs.gnu.org/16848>.
The simplest solution is to remove the TOO_EXPENSIVE heuristic
that I added to GNU diff in 1993.  Although appropriate for
circa-1993 hardware, these days the heuristic seems to be more
trouble than it's worth.
* NEWS: Document this.
* doc/diffutils.texi (Overview): Modernize citations.
Remove mention of TOO_EXPENSIVE heuristic.
* src/analyze.c (diff_2_files): Adjust to TOO_EXPENSIVE-related
API changes in gnulib's diffseq module.
---
 NEWS               |  7 +++++++
 doc/diffutils.texi | 18 +++++++++---------
 src/analyze.c      | 10 +---------
 3 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/NEWS b/NEWS
index aff7a9d..58a7cbb 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,13 @@ GNU diffutils NEWS                                    -*- 
outline -*-
   consider two Asian file names to be the same merely because they
   contain no English characters.
 
+** Performance changes
+
+  diff's default algorithm has been adjusted to output higher-quality
+  results at somewhat greater computational cost, as CPUs have gotten
+  faster since the algorithm was last tweaked in diffutils-2.6 (1993).
+
+
 * Noteworthy changes in release 3.3 (2013-03-24) [stable]
 
 ** New features
diff --git a/doc/diffutils.texi b/doc/diffutils.texi
index 1d48f54..7b08cd4 100644
--- a/doc/diffutils.texi
+++ b/doc/diffutils.texi
@@ -142,26 +142,26 @@ use diffs to update files.
 David Hayes, Richard Stallman, and Len Tower.  Wayne Davison designed and
 implemented the unified output format.  The basic algorithm is described
 by Eugene W. Myers in ``An O(ND) Difference Algorithm and its Variations'',
-@cite{Algorithmica} Vol.@: 1 No.@: 2, 1986, pp.@: 251--266; and in ``A File
+@cite{Algorithmica} Vol.@: 1, 1986, pp.@: 251--266,
+@url{http://dx.doi.org/10.1007/BF01840446}; and in ``A File
 Comparison Program'', Webb Miller and Eugene W. Myers,
-@cite{Software---Practice and Experience} Vol.@: 15 No.@: 11, 1985,
-pp.@: 1025--1040.
+@cite{Software---Practice and Experience} Vol.@: 15, 1985,
+pp.@: 1025--1040,
+@url{http://dx.doi.org/10.1002/spe.4380151102}.
 @c From: "Gene Myers" <g...@cs.arizona.edu>
 @c They are about the same basic algorithm; the Algorithmica
 @c paper gives a rigorous treatment and the sub-algorithm for
 @c delivering scripts and should be the primary reference, but
 @c both should be mentioned.
-The algorithm was independently discovered as described by E. Ukkonen in
+The algorithm was independently discovered as described by Esko Ukkonen in
 ``Algorithms for Approximate String Matching'',
-@cite{Information and Control} Vol.@: 64, 1985, pp.@: 100--118.
+@cite{Information and Control} Vol.@: 64, 1985, pp.@: 100--118,
+@url{http://dx.doi.org/10.1016/S0019-9958(85)80046-2}.
 @c From: "Gene Myers" <g...@cs.arizona.edu>
 @c Date: Wed, 29 Sep 1993 08:27:55 MST
 @c Ukkonen should be given credit for also discovering the algorithm used
 @c in GNU diff.
-Unless the @option{--minimal} option is used, @command{diff} uses a
-heuristic by Paul Eggert that limits the cost to @math{O(N^1.5 log N)}
-at the price of producing suboptimal output for large inputs with many
-differences.  Related algorithms are surveyed by Alfred V. Aho in
+Related algorithms are surveyed by Alfred V. Aho in
 section 6.3 of ``Algorithms for Finding Patterns in Strings'',
 @cite{Handbook of Theoretical Computer Science} (Jan Van Leeuwen,
 ed.), Vol.@: A, @cite{Algorithms and Complexity}, Elsevier/MIT Press,
diff --git a/src/analyze.c b/src/analyze.c
index 1886e7e..f062fd3 100644
--- a/src/analyze.c
+++ b/src/analyze.c
@@ -532,7 +532,6 @@ diff_2_files (struct comparison *cmp)
     {
       struct context ctxt;
       lin diags;
-      lin too_expensive;
 
       /* Allocate vectors for the results of comparison:
         a flag for each line of each file, saying whether that line
@@ -564,18 +563,11 @@ diff_2_files (struct comparison *cmp)
 
       ctxt.heuristic = speed_large_files;
 
-      /* Set TOO_EXPENSIVE to be approximate square root of input size,
-        bounded below by 256.  */
-      too_expensive = 1;
-      for (;  diags != 0;  diags >>= 2)
-       too_expensive <<= 1;
-      ctxt.too_expensive = MAX (256, too_expensive);
-
       files[0] = cmp->file[0];
       files[1] = cmp->file[1];
 
       compareseq (0, cmp->file[0].nondiscarded_lines,
-                 0, cmp->file[1].nondiscarded_lines, minimal, &ctxt);
+                 0, cmp->file[1].nondiscarded_lines, &ctxt);
 
       free (ctxt.fdiag - (cmp->file[1].nondiscarded_lines + 1));
 
-- 
1.8.5.3

Re: [bug-diffutils] bug#16848: diff reports a diff on identical lines

Reply via email to