Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-14 Thread Terry Reedy
On 7/14/2010 7:32 PM, Tim Peters wrote: [Nick Coghlan] You're right, I was misremembering how SequenceMatcher works. Terry's summary of the situation seems correct to me - adding a new flag to the constructor signature would mean we're taking a silent failure ("the heuristic makes my code give

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-14 Thread Tim Peters
[Nick Coghlan] > You're right, I was misremembering how SequenceMatcher works. > > Terry's summary of the situation seems correct to me - adding a new > flag to the constructor signature would mean we're taking a silent > failure ("the heuristic makes my code give the wrong answer on 2.7.0") > and

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-14 Thread Nick Coghlan
On Thu, Jul 15, 2010 at 6:40 AM, Tim Peters wrote: > The call in question here is the constructor (__init__), so there's no > real difference between "on the object" and "per call" in this case. You're right, I was misremembering how SequenceMatcher works. Terry's summary of the situation seems
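A constructor flag can be probed for at run time. The following is a minimal sketch, assuming the flag is spelled `autojunk` (the keyword later releases of `difflib` accept) and that a constructor without it raises `TypeError` on an unknown keyword. The helper name `make_matcher` is made up for illustration.

```python
from difflib import SequenceMatcher


def make_matcher(a, b):
    """Build a matcher with the popularity heuristic off where possible.

    Hypothetical helper: falls back to the default constructor when the
    running Python does not know the ``autojunk`` keyword.
    """
    try:
        return SequenceMatcher(None, a, b, autojunk=False)
    except TypeError:
        # Older constructor: the heuristic cannot be turned off here.
        return SequenceMatcher(None, a, b)


matcher = make_matcher("abc", "abd")
```

On a version without the keyword the fallback still runs with the heuristic on, so the caller gets a matcher either way but not necessarily the same results.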

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-14 Thread Tim Peters
[Steven D'Aprano] >> 4. I normally dislike global flags, but this is one time it might be >> less-worse than the alternatives. >> >> Modify SequenceMatcher to test for the value of a global flag, >> defaulting to False if it doesn't exist. >> ... >> The flag will only exist if the caller explicitly

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-14 Thread Terry Reedy
On 7/14/2010 9:45 AM, Nick Coghlan wrote: Code that sets the flag would behave the same on both 2.7.1+ and on 2.7.0, it would just fail to turn the heuristic off in 2.7.0. Antoine Pitrou pointed out on the tracker http://bugs.python.org/issue2986 that such code would *not* 'behave the same'. I

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-14 Thread Nick Coghlan
On Wed, Jul 14, 2010 at 10:38 PM, Steven D'Aprano wrote: > 4. I normally dislike global flags, but this is one time it might be > less-worse than the alternatives. > > Modify SequenceMatcher to test for the value of a global flag, > defaulting to False if it doesn't exist. > > try: >    disable =

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-14 Thread Steven D'Aprano
On Wed, 14 Jul 2010 11:45:25 am Terry Reedy wrote: > Summary: adding an autojunk heuristic to difflib without also adding > a way to turn it off was a bug because it disabled running code. > > 2.6 and 3.1 each have, most likely, one final version each. Don't fix > for these but add something to the

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-13 Thread Terry Reedy
Summary: adding an autojunk heuristic to difflib without also adding a way to turn it off was a bug because it disabled running code. 2.6 and 3.1 each have, most likely, one final version each. Don't fix for these but add something to the docs explaining the problem and future fix. 2.7 will

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-13 Thread Tim Peters
[Tim] >> ... >> BTW, it's not clear whether ratio() computes a _useful_ value in the >> presence of junk, however that may be defined. [Terry Reedy] > I agree, which is one reason why one should be able to disable auto-junking. Yup. > There are a number of statistical methods for analyzing similarity

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-12 Thread Terry Reedy
On 7/11/2010 11:02 PM, Tim Peters wrote: The heuristic lowered the reported match ratio from .96 to .88, which would be bad when one wanted the unaltered value. BTW, it's not clear whether ratio() computes a _useful_ value in the presence of junk, however that may be defined. I agree, which

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-11 Thread Tim Peters
[Terry Reedy] > I had considered the possibility of option A for 2.7 and A & C for 3.2. But > see below. > > Since posting, I did an experiment with a 700 char paragraph of text (the > summary from the post) compared to an 'edited' version. I did the > comparison with and without the current heuri

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-08 Thread Tim Peters
[Antoine Pitrou] > I don't think 2.7 should get any change at all here. Only 3.2 should be > modified. As Tim said, difflib works ok for its intended use (regular > text diffs). That was the use case that drove the implementation, but it's going too far to say that was the only "intended" case. I

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-08 Thread Antoine Pitrou
On Wed, 07 Jul 2010 21:04:17 -0400 Terry Reedy wrote: > > In other words, I see three options for 2.7.1+: [...] I don't think 2.7 should get any change at all here. Only 3.2 should be modified. As Tim said, difflib works ok for its intended use (regular text diffs). Making it work for other uses

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-07 Thread Terry Reedy
I had considered the possibility of option A for 2.7 and A & C for 3.2. But see below. Since posting, I did an experiment with a 700 char paragraph of text (the summary from the post) compared to an 'edited' version. I did the comparison with and without the current heuristic. I did not not

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-07 Thread Terry Reedy
On 7/7/2010 4:11 PM, Tres Seaver wrote: Antoine Pitrou wrote: On Wed, 7 Jul 2010 19:44:31 +0200 Eli Bendersky wrote: For what it's worth, my benchmarking showed that modifying the heuristic to only kick in when there are more than 100 kinds of elements (Terry's option A) didn't affect the run

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-07 Thread Tres Seaver
Antoine Pitrou wrote: > On Wed, 7 Jul 2010 19:44:31 +0200 > Eli Bendersky wrote: >> For what it's worth, my benchmarking showed that modifying the >> heuristic to only kick in when there are more than 100 kinds of >> elements (Terry's option A) didn't

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-07 Thread Antoine Pitrou
On Wed, 7 Jul 2010 19:44:31 +0200 Eli Bendersky wrote: > > For what it's worth, my benchmarking showed that modifying the > heuristic to only kick in when there are more than 100 kinds of > elements (Terry's option A) didn't affect the runtime of matching > whatsoever, even when the heuristic *do

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-07 Thread Eli Bendersky
> Rather than reverting to Tim's > undocumented vision, perhaps we should better articulate it by > separating the general purpose matcher from an optimised text matcher. > For what it's worth, my benchmarking showed that modifying the heuristic to only kick in when there are more than 100 kinds o

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-07 Thread Tim Peters
[Nick Coghlan] > ... > Hmm, I've been using difflib.SequenceMatcher for years in a serial bit > error rate tester (with typical message sizes ranging from tens of > bytes to tens of thousands of bytes) that occasionally gives > unexpected results. I'd been blaming hardware glitches (and, to be > fa

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-07 Thread Nick Coghlan
On Wed, Jul 7, 2010 at 9:18 AM, Terry Reedy wrote: > In the commit message for revision 26661, which added the heuristic, Tim > Peters wrote "While I like what I've seen of the effects so far, I still > consider this experimental.  Please give it a try!" Several people who have > tried it discover

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-07 Thread Antoine Pitrou
On Tue, 06 Jul 2010 19:18:09 -0400 Terry Reedy wrote: > > Version A: Modify the heuristic to only eliminate common items when > there are more than, say, 100 items (when len(b2j)> 100 where b2j is > first calculated without popularity deletions). [...] > > Version B: add a parameter to .__init
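Version A can be sketched outside of difflib as a standalone function. This is only an illustration, not the patch discussed in the thread: the name `popular_items` and the parameter `min_distinct` are made up, the figure of 100 is the "say, 100" from the proposal, and the 200-item and roughly-1% thresholds are assumed to mirror the heuristic already in the standard library.

```python
from collections import Counter


def popular_items(b, min_distinct=100):
    """Return the items of b that a Version-A style heuristic would drop.

    Hypothetical sketch: an item counts as popular when it occurs more than
    about 1% of the time in a sequence of at least 200 items, and (the
    Version A gate) only when b has more than min_distinct distinct items.
    """
    n = len(b)
    counts = Counter(b)
    if n < 200 or len(counts) <= min_distinct:
        return set()          # too short or too few kinds of items: keep all
    cutoff = n // 100 + 1
    return {item for item, count in counts.items() if count > cutoff}


# Two distinct items only: the gate keeps the heuristic from firing.
small_alphabet = popular_items("ab" * 150)

# 150 distinct items, one of them repeated often: only that one is dropped.
large_alphabet = popular_items(list(range(150)) + [0] * 50)
```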

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-07 Thread Eli Bendersky
[snip] > Yes, that was the intent.  I was corresponding with a user at the time > who had odd notions (well, by my standards) of how to format C code, > which left him with many hundreds of lines containing only an open > brace, or a close brace, or just a semicolon (etc).  difflib spun its > wheel

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-06 Thread Tim Peters
[Terry Reedy] > [Also posted to http://bugs.python.org/issue2986 > Developed with input from Eli Bendersky, who will write patchfile(s) for > whichever change option is chosen.] Thanks for paying attention to this, Terry (and Ed)! I somehow managed to miss the whole discussion over the intervenin

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-06 Thread Kevin Jacobs
On Tue, Jul 6, 2010 at 7:18 PM, Terry Reedy wrote: > [Also posted to http://bugs.python.org/issue2986 > A much faster way to find the first mismatch would be > i = 0 > while first[i] == second[i]: > i+=1 > The match ratio, based on the initial matching prefix only, is spuriously > low. >
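The quoted loop runs off the end when one sequence is a prefix of the other. A bounded variant, as a minimal sketch (the function name `first_mismatch` is chosen here for illustration):

```python
def first_mismatch(first, second):
    """Return the index of the first position where the sequences differ.

    If one sequence is a prefix of the other, the length of the shorter
    one is returned.
    """
    limit = min(len(first), len(second))
    i = 0
    while i < limit and first[i] == second[i]:
        i += 1
    return i
```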

[Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

2010-07-06 Thread Terry Reedy
[Also posted to http://bugs.python.org/issue2986 Developed with input from Eli Bendersky, who will write patchfile(s) for whichever change option is chosen.] Summary: difflib.SequenceMatcher was developed, documented, and originally operated as "a flexible class for comparing pairs of sequenc
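The behaviour at issue can be reproduced with a small experiment. This is a minimal sketch, assuming a Python version whose `SequenceMatcher` constructor accepts the `autojunk` keyword (not available in 2.7.0). The input strings are invented for illustration and are not taken from the report.

```python
from difflib import SequenceMatcher

# Illustrative data: b has 300 items but only two distinct values, so each
# value occurs far more often than the heuristic's popularity cut-off.
b = "ab" * 150
a = "x" + b  # identical to b apart from one leading character

# Default constructor: the popularity heuristic is active.
with_heuristic = SequenceMatcher(None, a, b).ratio()

# Same comparison with the heuristic switched off.
without_heuristic = SequenceMatcher(None, a, b, autojunk=False).ratio()

print(with_heuristic, without_heuristic)
```

With the heuristic off the two strings are reported as nearly identical. With it on the ratio is much lower, because every item of `b` is treated as too popular to start a match.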