On 7/14/2010 7:32 PM, Tim Peters wrote:
[Nick Coghlan]
You're right, I was misremembering how SequenceMatcher works.
Terry's summary of the situation seems correct to me - adding a new
flag to the constructor signature would mean we're taking a silent
failure ("the heuristic makes my code give
[Nick Coghlan]
> You're right, I was misremembering how SequenceMatcher works.
>
> Terry's summary of the situation seems correct to me - adding a new
> flag to the constructor signature would mean we're taking a silent
> failure ("the heuristic makes my code give the wrong answer on 2.7.0")
> and
On Thu, Jul 15, 2010 at 6:40 AM, Tim Peters wrote:
> The call in question here is the constructor (__init__), so there's no
> real difference between "on the object" and "per call" in this case.
You're right, I was misremembering how SequenceMatcher works.
Terry's summary of the situation seems
[Steven D'Aprano]
>> 4. I normally dislike global flags, but this is one time it might be
>> less-worse than the alternatives.
>>
>> Modify SequenceMatcher to test for the value of a global flag,
>> defaulting to False if it doesn't exist.
>> ...
>> The flag will only exist if the caller explicitly
On 7/14/2010 9:45 AM, Nick Coghlan wrote:
Code that sets the flag would behave the same on both 2.7.1+ and on
2.7.0, it would just fail to turn the heuristic off in 2.7.0.
Antoine Pitrou pointed out on the tracker
http://bugs.python.org/issue2986
that such code would *not* 'behave the same'. I
On Wed, Jul 14, 2010 at 10:38 PM, Steven D'Aprano wrote:
> 4. I normally dislike global flags, but this is one time it might be
> less-worse than the alternatives.
>
> Modify SequenceMatcher to test for the value of a global flag,
> defaulting to False if it doesn't exist.
>
> try:
> disable =
On Wed, 14 Jul 2010 11:45:25 am Terry Reedy wrote:
> Summary: adding an autojunk heuristic to difflib without also adding
> a way to turn it off was a bug because it disabled running code.
>
> 2.6 and 3.1 each have, most likely, one final version each. Don't fix
> for these but add something to the
Summary: adding an autojunk heuristic to difflib without also adding a
way to turn it off was a bug because it disabled running code.
2.6 and 3.1 each have, most likely, one final version each. Don't fix
for these but add something to the docs explaining the problem and
future fix.
2.7 will
[Tim]
>> ...
>> BTW, it's not clear whether ratio() computes a _useful_ value in the
>> presence of junk, however that may be defined.
[Terry Reedy]
> I agree, which is one reason why one should be able to disable auto-junking.
Yup.
> There are a number of statistical methods for analyzing similarity
On 7/11/2010 11:02 PM, Tim Peters wrote:
The heuristic lowered the reported match ratio from .96 to .88, which
would be bad when one wanted the unaltered value.
BTW, it's not clear whether ratio() computes a _useful_ value in the
presence of junk, however that may be defined.
I agree, which
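
The effect on ratio() described here can be reproduced on Python versions whose difflib.SequenceMatcher constructor accepts an autojunk keyword (2.7.1 and 3.2 onward, so after this thread). The sketch below uses made-up inputs, not the 700-character paragraph from the experiment; they are chosen so that every element of the second sequence is frequent enough to be treated as popular by the heuristic.

```python
from difflib import SequenceMatcher

# Hypothetical inputs: b is 300 characters long and uses only two distinct
# characters, so each occurs far more often than the heuristic tolerates.
b = "ab" * 150
a = "x" + b  # the same text with one character inserted at the front

# Default behaviour: the popularity heuristic is active for long sequences.
ratio_with_heuristic = SequenceMatcher(None, a, b).ratio()

# Same comparison with the heuristic switched off.
ratio_without_heuristic = SequenceMatcher(None, a, b, autojunk=False).ratio()

print(ratio_with_heuristic, ratio_without_heuristic)
```

With the heuristic off the ratio should be close to 1, since only one character differs. With it on, the popular characters are dropped from the matching index and the reported ratio comes out lower.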
[Terry Reedy]
> I had considered the possibility of option A for 2.7 and A & C for 3.2. But
> see below.
>
> Since posting, I did an experiment with a 700 char paragraph of text (the
> summary from the post) compared to an 'edited' version. I did the
> comparison with and without the current heuri
[Antoine Pitrou]
> I don't think 2.7 should get any change at all here. Only 3.2 should be
> modified. As Tim said, difflib works ok for its intended use (regular
> text diffs).
That was the use case that drove the implementation, but it's going
too far to say that was the only "intended" case. I
On Wed, 07 Jul 2010 21:04:17 -0400
Terry Reedy wrote:
>
> In other words, I see three options for 2.7.1+:
[...]
I don't think 2.7 should get any change at all here. Only 3.2 should be
modified. As Tim said, difflib works ok for its intended use (regular
text diffs). Making it work for other uses
I had considered the possibility of option A for 2.7 and A & C for 3.2.
But see below.
Since posting, I did an experiment with a 700 char paragraph of text
(the summary from the post) compared to an 'edited' version. I did the
comparison with and without the current heuristic. I did not
On 7/7/2010 4:11 PM, Tres Seaver wrote:
Antoine Pitrou wrote:
On Wed, 7 Jul 2010 19:44:31 +0200
Eli Bendersky wrote:
For what it's worth, my benchmarking showed that modifying the
heuristic to only kick in when there are more than 100 kinds of
elements (Terry's option A) didn't affect the run
Antoine Pitrou wrote:
> On Wed, 7 Jul 2010 19:44:31 +0200
> Eli Bendersky wrote:
>> For what it's worth, my benchmarking showed that modifying the
>> heuristic to only kick in when there are more than 100 kinds of
>> elements (Terry's option A) didn't
On Wed, 7 Jul 2010 19:44:31 +0200
Eli Bendersky wrote:
>
> For what it's worth, my benchmarking showed that modifying the
> heuristic to only kick in when there are more than 100 kinds of
> elements (Terry's option A) didn't affect the runtime of matching
> whatsoever, even when the heuristic *do
> Rather than reverting to Tim's
> undocumented vision, perhaps we should better articulate it by
> separating the general purpose matcher from an optimised text matcher.
>
For what it's worth, my benchmarking showed that modifying the
heuristic to only kick in when there are more than 100 kinds o
[Nick Coghlan]
> ...
> Hmm, I've been using difflib.SequenceMatcher for years in a serial bit
> error rate tester (with typical message sizes ranging from tens of
> bytes to tens of thousands of bytes) that occasionally gives
> unexpected results. I'd been blaming hardware glitches (and, to be
> fa
On Wed, Jul 7, 2010 at 9:18 AM, Terry Reedy wrote:
> In the commit message for revision 26661, which added the heuristic, Tim
> Peters wrote "While I like what I've seen of the effects so far, I still
> consider this experimental. Please give it a try!" Several people who have
> tried it discover
On Tue, 06 Jul 2010 19:18:09 -0400
Terry Reedy wrote:
>
> Version A: Modify the heuristic to only eliminate common items when
> there are more than, say, 100 items (when len(b2j) > 100 where b2j is
> first calculated without popularity deletions).
[...]
>
> Version B: add a parameter to .__init
[snip]
> Yes, that was the intent. I was corresponding with a user at the time
> who had odd notions (well, by my standards) of how to format C code,
> which left him with many hundreds of lines containing only an open
> brace, or a close brace, or just a semicolon (etc). difflib spun its
> wheel
[Terry Reedy]
> [Also posted to http://bugs.python.org/issue2986
> Developed with input from Eli Bendersky, who will write patchfile(s) for
> whichever change option is chosen.]
Thanks for paying attention to this, Terry (and Ed)! I somehow
managed to miss the whole discussion over the intervenin
On Tue, Jul 6, 2010 at 7:18 PM, Terry Reedy wrote:
> [Also posted to http://bugs.python.org/issue2986
> A much faster way to find the first mismatch would be
> i = 0
> while first[i] == second[i]:
>     i += 1
> The match ratio, based on the initial matching prefix only, is spuriously
> low.
>
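
The quoted loop assumes a mismatch occurs before either sequence runs out; on identical inputs, or when one is a prefix of the other, it would index past the end. A minimal bounds-checked sketch of the same idea follows, where the function name first_mismatch is hypothetical and not from the thread:

```python
def first_mismatch(first, second):
    """Return the index of the first position where the sequences differ.

    If one sequence is a prefix of the other (or they are equal),
    return the length of the shorter one.
    """
    limit = min(len(first), len(second))
    i = 0
    # Stop at the first differing element or at the end of the shorter input.
    while i < limit and first[i] == second[i]:
        i += 1
    return i
```

The only change from the quoted version is the extra `i < limit` check, so the loop ends cleanly when no mismatch is found.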
[Also posted to http://bugs.python.org/issue2986
Developed with input from Eli Bendersky, who will write patchfile(s) for
whichever change option is chosen.]
Summary: difflib.SequenceMatcher was developed, documented, and
originally operated as "a flexible class for comparing pairs of
sequenc