[issue2650] re.escape should not escape underscore

2008-04-17 Thread Russ Cox

New submission from Russ Cox <[EMAIL PROTECTED]>:

import re
print re.escape("_")

Prints \_ but should be _.

This behavior differs from Perl and other systems: _ is an identifier
character and as such does not need to be escaped.
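
A quick way to see what a given interpreter does is sketched below. It is written for Python 3 rather than the Python 2.5 named in this report, and the helper name `underscore_is_escaped` is an illustrative assumption, not part of `re`. The result depends on the interpreter version, so no output is claimed here.

```python
import re

def underscore_is_escaped():
    """Return True if this interpreter's re.escape backslashes '_'."""
    return re.escape("_") != "_"

# Either way, the escaped pattern must still match a literal underscore.
assert re.fullmatch(re.escape("_"), "_") is not None

print("re.escape('_') ->", repr(re.escape("_")))
print("underscore escaped:", underscore_is_escaped())
```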

--
messages: 65585
nosy: rsc
severity: normal
status: open
title: re.escape should not escape underscore
type: behavior
versions: Python 2.5

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2650>
__
___
Python-bugs-list mailing list 
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue2650] re.escape should not escape underscore

2008-04-17 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
components: +Regular Expressions




[issue2650] re.escape should not escape underscore

2008-04-17 Thread Russ Cox

Russ Cox <[EMAIL PROTECTED]> added the comment:

> It seems that escape is pretty dumb. The documentation says that
> re.escape escapes all non-alphanumeric characters, and it does that
> faithfully. It would seem more useful to have a list of meta-characters
> and just escape those. This is more true in Py3k when str can have
> thousands of possible characters that could be considered alphanumeric.

The usual convention is to escape everything that is
ASCII and not A-Za-z0-9_, in case other punctuation
becomes special in the future.  But I agree -- escaping
just the actual special characters makes the most sense.
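
The two strategies mentioned above can be sketched side by side. This is only an illustration: the function names `escape_nonword` and `escape_special`, and the contents of `_SPECIAL`, are assumptions made for the example and are not taken from the patch or from the `re` module.

```python
import re
import string

# Assumption: ASCII word characters, i.e. A-Za-z0-9_.
_WORD = frozenset(string.ascii_letters + string.digits + "_")
# Assumption: an illustrative list of metacharacters, not an official one.
_SPECIAL = frozenset(".^$*+?{}[]\\|()")

def escape_nonword(text):
    """Backslash every ASCII character that is not a word character."""
    return "".join(
        "\\" + c if ord(c) < 128 and c not in _WORD else c
        for c in text
    )

def escape_special(text):
    """Backslash only the characters listed in _SPECIAL."""
    return "".join("\\" + c if c in _SPECIAL else c for c in text)

sample = "a_b.c-d"
print(escape_nonword(sample))  # a_b\.c\-d
print(escape_special(sample))  # a_b\.c-d
```

Both results match the original string; the difference is only in how much punctuation gets a backslash.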

Russ




[issue2650] re.escape should not escape underscore

2008-04-23 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
keywords: +patch
Added file: http://bugs.python.org/file10080/re.patch




[issue2650] re.escape should not escape underscore

2008-04-24 Thread Russ Cox

Russ Cox <[EMAIL PROTECTED]> added the comment:

> The loop in escape should really use enumerate 
> instead of "for i in range(len(pattern))".

It needs i to edit s[i].

> Instead of using a loop, can't the test just
> use "self.assertEqual(re.escape(same), same)"?

Done.

> Also, please add tests for what re.escape should escape.

That's handled in the existing test over all bytes 0-255.

Added file: http://bugs.python.org/file10084/re.patch
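
The two kinds of test discussed in this message can be sketched roughly as follows. This is not the test from the attached patch; the class name `EscapeSketchTests` is hypothetical, and the sketch targets Python 3 (`re.fullmatch`, `chr`) rather than Python 2.5.

```python
import re
import string
import unittest

class EscapeSketchTests(unittest.TestCase):
    def test_alphanumerics_unchanged(self):
        # Letters and digits are never escaped; whether "_" can be added
        # to this string depends on the interpreter's re.escape.
        same = string.ascii_letters + string.digits
        self.assertEqual(re.escape(same), same)

    def test_every_char_matches_itself(self):
        # For each code point 0-255, the escaped form must match the
        # original character exactly.
        for i in range(256):
            c = chr(i)
            self.assertIsNotNone(re.fullmatch(re.escape(c), c), repr(c))
```

Saved in a module, the class can be run with `python -m unittest`.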




[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-04-24 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
nosy: +rsc

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
__



[issue2537] re.compile(r'((x|y+)*)*') should fail

2008-04-24 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
nosy: +rsc

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2537>
__



[issue1160] Medium size regexp crashes python

2008-04-24 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
nosy: +rsc

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1160>
__



[issue1662581] the re module can perform poorly: O(2**n) versus O(n**2)

2008-04-24 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
nosy: +rsc

_
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1662581>
_



[issue433030] SRE: Atomic Grouping (?>...) is not supported

2008-04-24 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
nosy: +rsc


Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433030>




[issue1693050] \w not helpful for non-Roman scripts

2008-04-24 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
nosy: +rsc

_
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1693050>
_



[issue1647489] zero-length match confuses re.finditer()

2008-04-24 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
nosy: +rsc

_
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1647489>
_



[issue1297193] Search is to long with regex like ^(.+|dontmatch)*$

2008-04-24 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
nosy: +rsc

_
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1297193>
_



[issue1721518] Small case which hangs

2008-04-24 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
nosy: +rsc

_
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1721518>
_



[issue433024] SRE: (?flag) isn't properly scoped

2008-04-24 Thread Russ Cox

Changes by Russ Cox <[EMAIL PROTECTED]>:


--
nosy: +rsc


Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433024>




[issue2650] re.escape should not escape underscore

2008-05-08 Thread Russ Cox

Russ Cox <[EMAIL PROTECTED]> added the comment:

> Lorenz's patch uses a set, not a list for special characters.  Set 
> lookup is as fast as dict lookup, but a set takes less memory because it 
> does not have to store dummy values.  More importantly, use of frozenset 
> instead of dict makes the code clearer.  On the other hand, I would 
> simply use a string.  For a dozen entries, hash lookup does not buy you 
> much.
> 
> Another nit: why use "\\%c" % (c) instead of the obvious "\\" + c?
> 
> Finally, you can eliminate use of index and a temporary list altogether 
> by using a generator expression:
> 
> ''.join("\\" + c if c in _special else "\\000" if c == "\000" else c
>         for c in pattern)
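
For reference, a self-contained version of the quoted suggestion might look like the sketch below. The contents of `_special` are an assumption standing in for whatever the patch defines, and the function is named `escape` only for illustration.

```python
import re

# Hypothetical metacharacter set standing in for the patch's _special.
_special = frozenset(".^$*+?{}[]\\|()")

def escape(pattern):
    """Escape metacharacters and NUL using a single generator expression."""
    return "".join(
        "\\" + c if c in _special else "\\000" if c == "\000" else c
        for c in pattern
    )

print(escape("a.b*c"))  # a\.b\*c
```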

The title of this issue (#2650) is "re.escape should not escape underscore",
not "re.escape is too slow and too easy to read".

If you have an actual, measured performance problem with re.escape,
please open a new issue with numbers to back it up. 
That's not what this one is about.

Thanks.
Russ




[issue2650] re.escape should not escape underscore

2008-05-08 Thread Russ Cox

Russ Cox <[EMAIL PROTECTED]> added the comment:

> You don't need to get so defensive.  I did not raise a performance
> problem, I was simply responding to Rafael's "AFAIK the lookup on
> dictionaries is faster than on lists" comment.  I did not say that you
> *should* rewrite your patch the way I suggested, only that you *can*
> use new language features to simplify the code.

I was responding to the entire thread more than your mail.
I'm frustrated because the only substantial discussion has
focused on details of how to implement set lookup the fastest
in a function that likely doesn't matter for speed.

> In any case, I am -0 on the patch.  The current documentation says:

Now these are the kinds of comments I was hoping for.
Thank you.

> Return *string* with all non-alphanumerics backslashed; this is useful if
> you want to match an arbitrary literal string that may have regular
> expression metacharacters in it.

Sure; the documentation is wrong too.

> I did not see a compelling use case presented for the change.  

The usual convention in regular expressions is that escaping
a word character means you intend a special meaning, and
underscore is a word character.  Even though the current re
module does accept \_ as synonymous with _ (just as it accepts
\q as synonymous with q), it is no more correct to escape _ than
to escape q.
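
A small probe for how a particular interpreter treats these two escapes is sketched here. The helper `try_match` is a hypothetical name; the behavior described above is that of the re module at the time of writing, and newer interpreters may reject `\q` outright, which the helper reports as `None`.

```python
import re

def try_match(pattern, text):
    """Return True/False for a full match, or None if pattern is rejected."""
    try:
        return re.fullmatch(pattern, text) is not None
    except re.error:
        return None

# Escaped underscore versus an escaped ordinary letter.
for pattern, text in [(r"\_", "_"), (r"\q", "q")]:
    print(pattern, "->", try_match(pattern, text))
```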

I think it is fine to escape all non-word characters, but someone
else suggested that it would be easier when moving to larger
character sets to escape just the special ones.  I'm happy with
either version.

My argument is only that Python should behave the same in 
this respect as other systems that use substantially the same
regular expressions.

> since there is no mechanism to assure that _special indeed
> contains all re metacharacters, it may present a maintenance problem
> if additional metacharacters are added in the future.

The test suite will catch these easily, since it checks that 
re.escape(c) matches c for all characters c.  But again, I'm happy
with escaping all ASCII non-word characters.
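
As a rough illustration of how such a check reacts to an incomplete list, the sketch below builds a hypothetical escape function whose set deliberately omits "(" and then looks for characters whose escaped form does not match them. All names here are assumptions for the example, not code from the patch or the test suite.

```python
import re

# Deliberately incomplete: "(" is missing, to show how the check reacts.
_INCOMPLETE_SPECIAL = frozenset(".^$*+?{}[]\\|)")

def escape_incomplete(text):
    """Hypothetical escape() built on the incomplete set above."""
    return "".join("\\" + c if c in _INCOMPLETE_SPECIAL else c for c in text)

def failing_chars(escape, limit=256):
    """Return the characters c for which escape(c) does not match c."""
    bad = []
    for i in range(limit):
        c = chr(i)
        try:
            ok = re.fullmatch(escape(c), c) is not None
        except re.error:
            ok = False  # escape(c) is not even a valid pattern
        if not ok:
            bad.append(c)
    return bad

print(failing_chars(escape_incomplete))  # ['(']
print(failing_chars(re.escape))          # []
```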

Russ




[issue2650] re.escape should not escape underscore

2008-05-08 Thread Russ Cox

Russ Cox <[EMAIL PROTECTED]> added the comment:

On Thu, May 8, 2008 at 12:12 PM, Alexander Belopolsky
<[EMAIL PROTECTED]> wrote:
>
> Alexander Belopolsky <[EMAIL PROTECTED]> added the comment:
>
> On Thu, May 8, 2008 at 11:45 AM, Russ Cox <[EMAIL PROTECTED]> wrote:
> ..
>>  My argument is only that Python should behave the same in
>>  this respect as other systems that use substantially the same
>>  regular expressions.
>>
>
> This is not enough to justify the change in my view.  After all, "A
> Foolish Consistency is the Hobgoblin of Little Minds"
> <http://www.python.org/dev/peps/pep-0008/>.
>
> I don't know if there is much code out there that relies on the
> current behavior, but technically speaking, this is an incompatible
> change.  A backward compatible way to add your desired functionality
> would be to add the "escape_special" function, but not every useful
> 3-line function belongs to stdlib.

In my mind, arguing that re.escape can't possibly be changed
due to imagined backward incompatibilities is the foolish consistency.

> This said, I would prefer simply adding '_' to _alphanum over _special
> approach, but still -1 on the whole idea.

I don't use Python enough to care one way or the other.
I noticed a bug, I reported it.  Y'all are welcome to do
as you see fit.

Russ




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-02-05 Thread Russ Cox

Russ Cox  added the comment:

> Named Unicode characters eg \N{LATIN CAPITAL LETTER A}

These descriptions are not as stable as, say, Unicode code
point values or language names.  Are you sure it is a good idea
to depend on them not being adjusted in the future?
It's certainly nice and self-documenting, but it doesn't seem
better from a future-proofing point of view than \u0041.
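
For comparison, Python string literals already accept both spellings. The sketch below uses only string literals and the standard `unicodedata` module; it says nothing about how the proposed regex syntax would be implemented.

```python
import unicodedata

by_name = "\N{LATIN CAPITAL LETTER A}"  # resolved via the Unicode name
by_code_point = "\u0041"                # resolved via the code point

print(by_name == by_code_point == "A")               # True
print(unicodedata.name("A"))                         # LATIN CAPITAL LETTER A
print(unicodedata.lookup("LATIN CAPITAL LETTER A"))  # A
```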

Do other languages implement this?

Russ

___
Python tracker 
<http://bugs.python.org/issue2636>
___