from:"Jeffrey C. Jacobs"

[issue2636] Adding a new regex module (compatible with re)

2011-09-01 Thread Jeffrey C. Jacobs

Jeffrey C. Jacobs  added the comment:

On 1 September 2011 16:12, Matthew Barnett  wrote:
>
> Matthew Barnett  added the comment:
>
> I think I need a show of hands.

For my part, I recommend literal flags, i.e. re.VERSION222,
re.VERSION300, etc.  Then you know exactly what you're getting and
although it may be confusing, we can then slowly deprecate
re.VERSION222 so that people can get used to the new syntax.

Returning to lurking on my own issue.  :)

--

___
Python tracker 
<http://bugs.python.org/issue2636>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue2636] Adding a new regex module (compatible with re)

2011-09-03 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs  added the comment:

Although V1, V2 is less wordy, technically the current behavior is version 
2.2.2, so logically this should be re.VERSION222 vs. re.VERSION3 vs. 
re.VERSIONn, with corresponding "(?V222)", "(?V3)" and future "(?Vn)".  But 
that said, I think 2.2.2 can be shorthanded to 2, so basically start counting 
from there.
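As a rough illustration of the inline form, the sketch below peels a leading "(?Vn)" marker off a pattern string and reports the requested version number.  The helper name split_version_prefix and the default of 2 (the shorthand for 2.2.2 suggested above) are assumptions made for this example; nothing like this exists in the re module.

```python
import re

# Hypothetical helper: not part of re; it only illustrates the "(?Vn)" idea.
_VERSION_PREFIX = re.compile(r'\(\?V(\d+)\)')

def split_version_prefix(pattern, default=2):
    """Return (version, remaining_pattern) for a pattern string."""
    m = _VERSION_PREFIX.match(pattern)
    if m is None:
        # No marker: fall back to the assumed default version.
        return default, pattern
    return int(m.group(1)), pattern[m.end():]
```

A flag such as re.VERSION3 would carry the same information outside the pattern; the inline form simply keeps it next to the expression it affects.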

--

___
Python tracker 
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-23 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs  added the comment:

+1 on VC

--

___
Python tracker 
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-24 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Added file: http://bugs.python.org/file10428/issue2636-05-only.diff

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
__

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-24 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

I am finally making progress again, after a month of changing my 
patches from my local svn repository to bazaar hosted on launchpad.net, 
as stated in my last update.  I also have more or less finished the 
probably easiest item, #5, so I have a full patch for that available 
now.  First, though, I want to update my "No matter what" patch, which 
is to say these are the changes I want to make if any changes are made 
to the Regexp code.

Added file: http://bugs.python.org/file10427/issue2636.diff

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
__

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-24 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Added file: http://bugs.python.org/file10429/issue2636-05.diff

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
__

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-24 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file10056/issue2636-05.patch

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
__

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-28 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Mark scribbled:
> One possible solution would be a grouptuples() function that returned
> a tuple of 3-tuples (index, name, captured_text) with the name being
> None for unnamed groups.

Hmm.  Well, that's not a bad idea at all IMHO and would, AFAICT, probably
be easier to do than (2).  I would still do (2), but I will try to add
grouptuples() to one of the existing items or spawn another item for it,
since it is kind of a distinct feature.
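A minimal sketch of what Mark describes, written as a free function on top of the existing re API rather than as a Match method.  The function name follows the proposal; the implementation is only an assumption about how it could behave.

```python
import re

def grouptuples(match):
    """Return a tuple of (index, name, captured_text) for groups 1..n."""
    # groupindex maps name -> index; invert it to look names up by index.
    names = {index: name for name, index in match.re.groupindex.items()}
    return tuple(
        (index, names.get(index), match.group(index))
        for index in range(1, match.re.groups + 1)
    )

m = re.match(r'(?P<year>\d{4})-(\d{2})', '2008-05')
print(grouptuples(m))  # ((1, 'year', '2008'), (2, None, '05'))
```

Unnamed groups get None as their name, as in the proposal, and group 0 is left out.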

My preference right now is to finish off the test cases for (7) because
it is already coded, then finish the work on (1), as that was the
original reason for modification, then move on to (2) and (3), as they are
related.  After that I don't mind tackling (8), because I think that one
shouldn't be too hard.  Interestingly, the existing engine code
(sre_parse.py) has a place-holder, commented out, for character classes,
but it was never properly implemented.  I will warn that with
Unicode, I THINK all the character classes exist as unicode functions or
can be implemented as multiple unicode functions, but I'm not 100% sure,
so if I run into that problem, some character classes may initially be
left out while I work on another item.

Anyway, thanks for the input, Mark!

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-29 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Added file: http://bugs.python.org/file10467/issue2636.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-29 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file10427/issue2636.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-29 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Added file: http://bugs.python.org/file10468/issue2636-05.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-29 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file10429/issue2636-05.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-29 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Added file: http://bugs.python.org/file10469/issue2636-07.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-29 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Added file: http://bugs.python.org/file10470/issue2636-07-only.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-05-29 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file10053/issue2636-07.patch

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-06-17 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Well, it's time for another update on my progress...

Some good news first: Atomic Grouping is now completed, tested and 
documented, and as stated above, is classified as issue2636-01 and 
related patches.  Secondly, with caveats listed below, Named Match Group 
Attributes on a match object (item 2) is also more or less complete at 
issue2636-02 -- it only lacks documentation.

Now, I want to also update my list of items.  We left off at 11: Other 
Perl-specific modifications.  Since that time, I have spawned a number 
of other branches, the first of which (issue2636-12) I am happy to 
announce is also complete!

12) Implement the changes to the documentation of re as per Jim J.
Jewett's suggestion from 2008-04-24 14:09.  Again, this has been done.

13) Implement a grouptuples(...) method as per Mark Summerfield's
suggestion on 2008-05-28 09:38.  grouptuples would take the same filtering
parameters as the other group* functions, and would return a list of
3-tuples (unless only 1 group was requested).  It should default to all
match groups (1..n, not group 0, the matching string).

14) As per PEP-3131 and the move to Python 3.0, python will begin to 
allow full UNICODE-compliant identifier names.  Correspondingly, it 
would be the responsibility of this item to allow UNICODE names for 
match groups.  This would allow retrieval of UNICODE names via the 
group* functions or when combined with Item 3, the getitem handler 
(m[u'...']) (03+14) and the attribute name itself (e.g. getattr(m, 
u'...')) when combined with item 2 (02+14).

15) Change the Pattern_Type, Match_Type and Scanner_Type (experimental) 
to become richer Python Types.  Specifically, add __doc__ strings to 
each of these types' methods and members.

16) Implement various FIXMEs.

16-1) Implement the FIXME such that if m is a MatchObject, del m.string 
will disassociate the original matched string from the match object; 
string would be the only member that would allow modification or 
deletion and you will not be able to modify the m.string value, only 
delete it.
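To give a feel for Item 14: recent Python 3 releases accept non-ASCII identifiers as group names in str patterns, which is roughly the behaviour the item asks for.  The pattern and the group name below are made up for the example.

```python
import re

# 'año' is a non-ASCII identifier used as the group name.
m = re.match(r'(?P<año>\d{4})', '2008-06-17')
year = m.group('año')
print(year)  # 2008
```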

-

Finally, I want to make a couple of notes about Item 2:

Firstly, as noted in Item 14, I wish to add support for UNICODE match 
group names, and the current version of the C-code would not allow that; 
it would only make sense to add UNICODE support if 14 is implemented, so 
adding support for UNICODE match object attributes would depend on both 
items 2 and 14.  Thus, that would be implemented in issue2636-02+14.

Secondly, there is a FIXME which I discussed in Item 16; I gave that
problem its own item and branch.  Also, as stated in Item 15, I would
like to add more robust help code to the Match object and bind __doc__
strings to the fixed attributes.  Although this would not directly
affect the Item 2 implementation, it would probably involve moving some
code around in its vicinity.

Finally, I would like suggestions on how to handle name collisions when
match group names are provided as attributes.  For instance, an
expression like '(?P<pos>.*)' would match more or less any string and
assign it to the name "pos".  But "pos" is already an attribute of the
Match object, and therefore pos cannot be exposed as a named match group
attribute, since match.pos will return the usual meaning of pos for a
match object, not the value of the capture group named "pos".
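The collision is easy to reproduce with the stock re module; the snippet below only uses existing API and shows why a group named "pos" cannot simply become match.pos.

```python
import re

m = re.match(r'(?P<pos>.*)', 'hello')
print(m.pos)           # 0, the start position passed to the search
print(m.group('pos'))  # hello, the text captured by the group named "pos"
```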

I have 3 proposals as to how to handle this:

a) Simply disallow the exposure of match group name attributes if the 
names collide with an existing member of the basic Match Object 
interface.

b) Expose the reserved names through a special prefix notation, and for 
forward compatibility, expose all names via this prefix notation.  In 
other words, if the prefix was 'k', match.kpos could be used to access 
pos; if it was '_', match._pos would be used.  If Item 3 is implemented, 
it may be sufficient to allow access via match['pos'] as the canonical 
way of handling match group names using reserved words.

c) Don't expose the names directly; only expose them through a prefixed 
name, e.g. match._pos or match.kpos.

Personally, I like a) because if Item 3 is implemented, it makes a fairly
useful shorthand for retrieving keyword names when a keyword is used for
a name.  Also, we could put a deprecation warning in to inform users
that match group names that are keywords in the Match Object
will eventually be disallowed.  However, I don't support restricting the
match group names any more than they already are (they must be a valid
python identifier only) so again I would go with a) and nothing more and
that's what's implemented in issue2636-02.patch.

-

Now, rather than posting umpteen patch files I am posting one
bz2-compressed tar of ALL patch files for all threads, where each file is of
the form:

i

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-06-17 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file10052/issue2636-09.patch

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-06-17 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file10467/issue2636.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-06-17 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file10428/issue2636-05-only.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-06-17 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file10468/issue2636-05.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-06-17 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file10469/issue2636-07.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-06-17 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file10470/issue2636-07-only.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue433030] SRE: Atomic Grouping (?>...) is not supported

2008-06-17 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

I have finished work on the Atomic Grouping / Possessive Quantifiers
support and am including a patch to achieve this; however, 
http://bugs.python.org/issue2636 should be consulted for the complete list 
of changes in the works for the Regexp engine.
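For readers unfamiliar with the feature: an atomic group (?>...) keeps whatever it matched and never gives characters back on backtracking.  A well-known way to approximate this on an unpatched engine is a capturing look-ahead followed by a back-reference.  The sketch below uses that emulation for illustration only; it is not the code in the attached patch.

```python
import re

text = 'aaab'

# Ordinary greedy group: a+ gives back one 'a' so that 'ab' can match.
backtracking = re.search(r'(a+)ab', text)

# Emulated atomic group: the look-ahead captures the longest run of 'a',
# the back-reference consumes exactly that run, and the engine cannot
# re-enter the look-ahead to try a shorter run.
atomic = re.search(r'(?=(a+))\1ab', text)

print(backtracking.group(0))  # aaab
print(atomic)                 # None
```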

--
keywords: +patch
Added file: http://bugs.python.org/file10646/issue2636-01.patch

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433030>
___

[issue433030] SRE: Atomic Grouping (?>...) is not supported

2008-06-17 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


Removed file: http://bugs.python.org/file9897/PyLibDiffs.txt

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433030>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-06-17 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Sorry, as I stated in the last post, I generated the patches then realized 
that I was missing the documentation for Item 2, so I have updated the 
issue2636-02.patch file and am attaching that separately until the next 
release of the patch tarball.  issue2636-02-only.patch should be ignored 
and I will only regenerate it with the correct documentation in the next 
tarball release so I can move on to either Character Classes or Relative 
Back-references.  I wanna pause Item 3 for the moment because 2, 3, 13, 
14, 15 and 16 all seem closely related and I need a break to allow my mind 
to wrap around the big picture before I try and tackle each one.

Added file: http://bugs.python.org/file10647/issue2636-02.patch

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-06-19 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Thanks for weighing in, Mark!  Actually, your point is valid and quite
fair, though I would not assume that Item 3 would be included just 
because Item 2 isn't.  I will do my best to develop both, but I do not 
make the final decision as to what python includes.  That having been 
said, 3 seems very likely at this point so we may be okay, but let me 
give this one more try as I think I have a better solution to make Item 
2 more palatable.  Let's say we have 5 choices here:

> a) Simply disallow the exposure of match group name attributes if the 
> names collide with an existing member of the basic Match Object 
> interface.
>
> b) Expose the reserved names through a special prefix notation, and
> for forward compatibility, expose all names via this prefix notation. 
> In other words, if the prefix was 'k', match.kpos could be used to
> access pos; if it was '_', match._pos would be used.  If Item 3 is
> implemented, it may be sufficient to allow access via match['pos'] as
> the canonical way of handling match group names using reserved words.
>
> c) Don't expose the names directly; only expose them through a
> prefixed name, e.g. match._pos or match.kpos.

d) (As Mark suggested) we drop Item 2 completely.  I have not invested 
much work in this so that would not bother me, but IMHO I actually 
prefer Item 2 to 3 so I would really like to see it preserved in some 
form.

e) Add an option, re.MATCH_ATTRIBUTES, that is used as a Match Creation 
flag.  When the re.MATCH_ATTRIBUTES or re.A flag is included in the 
compile, or (?a) is included in the pattern, it will do 2 things.  
First, it will raise an exception if either a) there exists an unnamed 
capture group or b) the capture group name is a reserved keyword.  In 
addition to this, I would put in a hook to support a from __future__ so 
that any post 2.6 changes to the match object type can be smoothly 
integrated a version early to allow programmers to change when any 
future changes come.  Secondly, I would *conditionally* allow arbitrary
capture group names via the __getattr__ handler IFF that flag was
present; otherwise you could not access Capture Groups by name via
match.foo.
I really like the idea of e) so I'm taking Item 2 out of the "ready for 
merge" category and going to put it in the queue for the modifications 
spelled out above.  I'm not too worried about our flags differing from 
Perl too much as we did base our first 4 on Perl (x, s, m, i), but 
subsequently added Unicode and Locale, which Perl does not have, and 
never implemented o (since our caching semantic already pretty much 
gives every expression that), e (which is specific to Perl syntax 
AFAICT) and g (which can be simulated via re.split).  So I propose we 
take A and implement it as I've specified and that is the current goal 
of Item 2.  Once this is done and working, we can decide whether it 
should be included in the python trunk.

How does that sound to you, Mark and anyone else who wishes to weigh in?

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue3825] Major reworking of Python 2.5.2 re module

2008-09-15 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Well, I implemented this months ago, but have been busy with other
things so I haven't updated in a while.  I noticed that the current
version is missing my patches for Atomic Grouping / Possessive
Quantifiers and a number of other patches I added in #2636, but I do
have working test cases and documentation updates for all the features
I've so far implemented.  I have also split my work into separate
sub-issues to make individual selection easier, though with a number
of my modifications I found that there are SO MANY co-dependencies
between, say, an engine modification (item 9), adding Atomic Grouping
/ Possessive Quantifiers (item 1) and using shared Engine Constants (item
10) that I need a branch for Atomic, a branch for Atomic + Engine Mod 1,
Atomic + Engine Mod 2, Atomic + Shared Constants, Atomic + Engine Mod 1
+ Shared Constants AND Atomic + Engine Mod 2 + Shared Constants, and
those were just THREE item co-dependencies.  My code is all off of the
trunk line for 2.6 and is currently available via my Bazaar repository
under https://code.launchpad.net/~timehorse, where you can access any
source tree via the bazaar version control client.  The main reason I
got stumped in my development, which might otherwise have implemented ALL
the issues I intended by now, is the very situation I just described,
where development of new features is NOT mutually independent.  You can
see by all my branches that the multiplicity of A or B or C is just
nightmarish, but it is what had to be done to keep issues independent.

Anyway, I'm looking forward to having a look at your suggestions and
think we may take best advantage by combining our work vis-à-vis these
things; after all, there's no point re-inventing the wheel.

Thanks again for your contribution, Matthew!

--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3825>
___

[issue3825] Major reworking of Python 2.5.2 re module

2008-09-15 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

I have uploaded my test cases for Atomic Grouping / Possessive
Quantifiers, which is the common code we seem to have developed, as this
may be of use to you.  I also have documentation, but for now, would you
mind running these tests against your code to see what the test outputs?
Also, how did you come up with the 2x result?  Was that running the
test suite?  Usually, the regexp module is benchmarked against its test
suite and there are timings built into that, so it may be useful if you
could run the unmodified Lib/test/test_re.py you got from the trunk
against both the original code and your modification, and
do so a few times to get a good average result on multi-tasking systems,
and post the results here so we can get a good statistical feel for how
your new engine improves efficiency.  Certainly, I support any Engine
that works faster, as I myself have tried to make it faster but ended up
with something 8% slower instead, alas.
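The "run it a few times and average" advice can be sketched with timeit.  This is a simplified stand-in, assuming a single pattern is timed instead of the whole Lib/test/test_re.py suite; the function name and the sample pattern are invented for the example.

```python
import re
import timeit

def average_seconds(pattern, text, repeats=5, number=1000):
    """Average time of `number` searches, taken over `repeats` runs."""
    compiled = re.compile(pattern)
    runs = timeit.repeat(lambda: compiled.search(text),
                         repeat=repeats, number=number)
    return sum(runs) / len(runs)

baseline = average_seconds(r'\d+-\d+', 'order 2008-09 shipped')
print('%.6f seconds per 1000 searches' % baseline)
```

Running the same function against the original and the modified engine, on the same machine, gives two averages that can be compared directly.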

Also, good thinking on fixing the Negative Look-behind variable-width
issue; I wish I'd thought of that, but I am curious about something: did
you remove the optimization for fixed-width look-behind?  The old code
only allowed fixed width because that test can be done quickly; I noticed
your code adds a lot of new REV opcodes to handle back-tracking and, I
assume, the logic for variable-width look-behind.  It would be
handy if the compiler and engine were able to differentiate between
fixed-width look-behind (optimized as it was originally) and variable-width
(using your advanced code).
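The fixed-width restriction mentioned here is visible from Python code.  The sketch below only probes the stock re module; the helper name is made up, and it says nothing about how the patched engine behaves.

```python
import re

def accepts_lookbehind(pattern):
    """Return True if the stock re module compiles the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(accepts_lookbehind(r'(?<=ab)c'))  # fixed width: True
print(accepts_lookbehind(r'(?<=a+)c'))  # variable width: False
```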

Thanks to AMK for some of these suggestions.  Your changes are quite
radical, though, so I am still trying to wade through them all and I still
don't have a full picture of how you've changed things, but there are
some good ideas here, IMHO, especially if you do indeed get 2x speedup.

Added file: http://bugs.python.org/file11499/test_re.py

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3825>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-16 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Update 16 Sep 2008:

Based on the work for issue #3825, I would like to simply update the
item list as follows:

1) Atomic Grouping / Possessive Quantifiers (See also Issue #433030)
[Complete]

2) Match group names as attributes (e.g. match.foo) [Complete save
issues outlined above]

3) Match group indexing (e.g. match['foo'], match[3])

4) Perl-style back-references (e.g. compile(r'(a)\g{-1}')), and possibly
adding the r'\k' escape sequence for keywords.

5) Parenthesis-Aware Python Comment (e.g. r'(?P#...)') [Complete]

6) Expose support for Template expressions (expressions without repeat
operators), adding test cases and documentation for existing code.

7) Larger compiled Regexp cache (256 vs. 100) and reduced thrashing
risk. [Complete]

8) Character Classes (e.g. r'[:alphanum:]')

9) Proposed Engine redesigns and cleanups (core item only contains
cleanups and comments to the current design but does not modify the design).

9-1) Single-loop Engine redesign that runs 8% slower than current.
[Complete]

9-1-1) 3-loop Engine redesign that runs 10% slower than current. [Complete]

9-2) Matthew Barnett's Engine redesign as per issue #3825

10) Have all C-Python shared constants stored in 1 place
(sre_constants.py) and generated by that into C constants
(sre_constants.h). [Complete AFAICT]

11) Scan Perl 5.10.0 for other potential additions that could be
implemented for Python.

12) Documentation suggestions by Jim J. Jewett [Complete]

13) Add grouptuples method to the Match object (i.e. match.grouptuples()
returns (index, name, captured_text) tuples) suitable for iteration.

14) UNICODE match group names, as per PEP-3131.

15) Add __doc__ strings and other Python niceties to the Pattern_Type,
Match_Type and Scanner_Type (experimental).

16) Implement any remaining TODOs and FIXMEs in the Regexp modules.

16-1) Allow for the disassociation of a source string from a Match_Type,
assuming this will still leave the object in a "reasonable" state.

17) Variable-length [Positive and Negative] Look-behind assertions, as
described and implemented in Issue #3825.
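As a concrete picture of Item 3 in the list above: subscript access on match objects is what later Python 3 releases (3.6 onward) provide through Match.__getitem__.  The snippet shows that later behaviour, not the patch discussed in this thread.

```python
import re

m = re.match(r'(?P<foo>\w+) (\w+) (\w+)', 'one two three')
print(m['foo'])  # one
print(m[3])      # three
print(m[0])      # one two three
```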

---

Now, we have a combination of Items 1, 9-2 and 17 available in issue
#3825, so for now, refer to that issue for the 01+09-02+17 combined
solution.  Eventually, I hope to merge the work between this and that issue.

I sadly admit I have made no progress on this since June because I have been
managing some 30 lines of development, some of which have complex
diamond branching, e.g.:

01 is the child of Issue2636
09 is the child of Issue2636
10 is the child of Issue2636
09-01 is the child of 09
09-01-01 is the child of 09-01
01+09 is the child of 01 and 09
01+10 is the child of 01 and 10
09+10 is the child of 09 and 10
01+09-01 is the child of 01 and 09-01
01+09-01-01 is the child of 01 and 09-01-01
09-01+10 is the child of 09-01 and 10
09-01-01+10 is the child of 09-01-01 and 10

Which all seems rather simple until you wrap your head around:

01+09+10 is the child of 01, 09, 10, 01+09, 01+10 AND 09+10!

Keep in mind the reason for all this complex numbering is because many
issues cannot be implemented in a vacuum: If you want Atomic Grouping,
that's 1 implementation; if you want Shared Constants, that's a
different implementation; but if you want BOTH Atomic Grouping and
Shared Constants, that is a wholly other implementation because each
implementation affects the other.  Thus, I end up with a plethora of
branches and a nightmare when it comes to merging, which is why I've been
so slow in making progress.  Bazaar seems to be very confused when it
comes to a merge in 6 parts between, for example, 01, 09, 10, 01+09,
01+10 and 09+10, as above.  It gets confused when it sees the same
changes applied in a previous merge applied again, instead of simply
realizing that the change in one since last merge is EXACTLY the same
change in the other since last merge, so effectively there is nothing to
do.  Instead, Bazaar starts treating code that did NOT
change since last merge as if it was changed and thus tries to roll back
the 01+09+10-specific changes rather than doing nothing, and generates a
conflict.  Oh, that I could only have a version control system that
understood the kind of complex branching that I require!
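The growth in branches can be made concrete with a few lines of Python.  The sketch assumes each combined branch is named by joining item numbers with '+' and has every smaller combination of its items as a parent, which matches the 01+09+10 example above; the function name is invented for the illustration.

```python
from itertools import combinations

def combined_branches(items):
    """Map each multi-item branch to the branches it is merged from."""
    parents = {}
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            # Every proper, non-empty subset of the combination is a parent.
            parents['+'.join(combo)] = [
                '+'.join(subset)
                for k in range(1, size)
                for subset in combinations(combo, k)
            ]
    return parents

tree = combined_branches(['01', '09', '10'])
print(tree['01+09+10'])
# ['01', '09', '10', '01+09', '01+10', '09+10']
```

With n interdependent items a full combination has 2**n - 2 parents, so each added item roughly doubles the merge work.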

Anyway, that's the state of things; this is me, signing out!

--
title: Regexp 2.6 (modifications to current re 2.2.2) -> Regexp 2.7 
(modifications to current re 2.2.2)
versions: +Python 2.7 -Python 2.6

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue3262] re.split doesn't split with zero-width regex

2008-09-21 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3262>
___

[issue3654] Duplicated test name in regex test script

2008-09-21 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3654>
___

[issue516762] have a way to search backwards for re

2008-09-21 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue516762>
___

[issue3262] re.split doesn't split with zero-width regex

2008-09-22 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

I think Mike Coleman's proposal of enabling this behaviour via a flag is
probably best and IMHO we should consider it under these circumstances.
 Intuitively, I think your interpretation of what re.split should do
under zero-width conditions is logical, and I almost think this should
be a 2-minor number transition à la from __future__ import
zeroWidthRegexpSplit if we are to consider it as the long-term 'right
thing to do'.  3000 (3.0) seems a good place to also consider it for
true overhaul / reexamination, especially as we are writing 'upgrade'
scripts for many of the other Python features.  However, I would say
this, Guido has spoken and it may be too late for the pebbles to vote.

I would like to add this patch as a new item to the general Regexp
Enhancements thread of issue 2636 though, as I think it is an idea worth
considering when overhauling Regexp.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3262>
___

[issue433024] SRE: (?flag) isn't properly scoped

2008-09-24 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433024>
___

[issue433027] SRE: (?-flag) is not supported.

2008-09-24 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433027>
___

[issue433028] SRE: (?flag:...) is not supported

2008-09-24 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433028>
___

[issue3825] Major reworking of Python 2.5.2 re module

2008-09-24 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Matthew,

I am really happy that you are making such progress on your engine, but
can I PLEASE ask you to slow down for a moment?  We have a lot of issues
already listed in issue 2636 that is a catch-all for any Python 2.7
Regexp improvements, including your new engine, and I have been working
frantically to try and document all the changes YOU are making here into
the general Regexp 2.7 modification thread and setting up development
trees in my Bazaar VCS repository for your work.

There is also a recommended process for patching which makes it easier
for the moderators to accept your patches, which is to provide
dis-entangled functionality and let each improvement stand on its
own two feet.  In other words, let your engine stand ONLY on its 2x
speed improvements.  We already have an implementation of Atomic
Grouping / Possessive Quantifiers in issue 2636 but you have a version of
your engine with both.  We have no such 'feature-only' implementation
for Variable-Length Look-Behind, for a Reverse flag, for Positionally
Dependent modifier flags or modifier negation flags, as well as the
zero-width Regular Expression split feature, though you and I completely
agree these would all be great things to have!  The more features you
add to your engine as an all-or-nothing proposition, the less likely the
moderators are going to be to adopt it because it's harder for them to
examine the merits of each individual piece.  That is why issue 2636 is
broken up into items (currently 1 - 18, with your proposals bringing
that up toward 22) and where alternate, combined features are provided
if implementing one feature would affect the implementation of another.

Please understand that I personally have no problem with you redesigning
large swaths of the Python Regular Expression engine.  I would
personally like to see one of the design goals of any new engine not
only be speed but better source comments because my main beef with the
current engine is that it took me a month to understand and part of my
redesign in issue 2636 9-1 was to add copious comments to the engine so
that future developers would understand what was going on and be able to
pick up from my work.  I am not proposing we use my 9-1 engine because
it is 8% slower than the current engine and I don't intend to propose
anything slower.  But it would be nice if you could add lots of comments
to your engine so that others could help develop features against it. 
Nonetheless, I will fully support your engine if it does indeed
perform substantially and measurably faster and am happy to see all the
Regexp issues you are finding are finally being implemented, albeit
entangled with your engine.  But let's return to the fundamentals of
what you propose IN THIS THREAD, which is simply to propose a new Regexp
Engine which is 2x faster than the existing engine (which I have
allocated item 9-2 in the issue 2636 thread).  I am not trying to put
more work on your hands -- in fact, what I am trying to do is get us to
co-operate on a better Python Regexp Engine so that I can help you to
achieve your goals.  Please read issue 2636 and join the discussion
there; feel free to add any new items you feel are missing from my
existing list.  And remember, each new feature needs tests and
documentation changes.  I have been doing each for any feature I
undertake and would be happy to share those skills with you.  

Let's work together to see your engine be the new model, okay?

Thanks.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3825>
___

[issue3511] Incorrect charset range handling with ignore case flag?

2008-09-24 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3511>
___

[issue3511] Incorrect charset range handling with ignore case flag?

2008-09-24 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

I think this is even more complicated when you consider that
localization may be an issue.  Consider "Á": is this grammatically before
 "A" or after "a"?  From a character set point of view, it is typically
after "a" but when Locale is taken into account, all that is done is
there is a change to relative ordering, so Á appears somewhere before A
and B.  But when this is done, does that mean that [9-Á] is going to
cover ALL uppercase and ALL lowercase and ALL characters with ord from
91 to 96 and 123 to 127 and all kinds of other UNICODE symbols?  And how
will this affect case-insensitivity?

In a sense, I think it may only be safe to say that character class
ranges are ONLY appropriate over Alphabetic character ranges or numeric
character ranges, since the order of the ASCII symbols between 0 and 47,
58 and 64, 91 and 96 and 123 and 127, though well-defined, is nonetheless
implementation dependent.  When we bring UNICODE into this, things
get even more befuddled with some Latin characters in Latin-1, some in
Latin-2, Cyrillic, Hebrew, Arabic, Chinese, Japanese and Korean
character sets just to name a few of the most common!  And how does a
total ordering of characters apply to them?

In the end, I think it's just dangerous to define character group ranges
that span the gap BETWEEN numbers and alphabetics.  Instead, I think a
better solution is simply to implement Emacs / Perl style named
character classes as in issue 2636 sub-item 8.

I do agree this is a problem, but as I see it, the solution may not be
that simple, especially in a UNICODE world.
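
To make the concern concrete, here is a small sketch (not taken from any
patch in this thread) comparing two ways an engine might apply
case-insensitivity to the range [9-A].  The helper names are hypothetical.
Lowering only the two end points turns the range into 9-a, which sweeps in
every uppercase letter and '_'; folding each member individually does not.

```python
def naive_endpoint_fold(lo, hi):
    """Lowercase only the two end points, then expand the range."""
    lo, hi = lo.lower(), hi.lower()
    return {chr(c) for c in range(ord(lo), ord(hi) + 1)}


def per_character_fold(lo, hi):
    """Expand the range first, then add both cases of every member."""
    members = {chr(c) for c in range(ord(lo), ord(hi) + 1)}
    lowered = {c.lower() for c in members}
    uppered = {c.upper() for c in members}
    return members | lowered | uppered


naive = naive_endpoint_fold("9", "A")
careful = per_character_fold("9", "A")

# '_' (ord 95) lies between 'A' (65) and 'a' (97), so only the naive
# end-point approach lets it slip into the class.
print("_" in naive, "_" in careful)
```

Which of the two a given engine actually does is exactly the question
raised above; the sketch only shows why the answer matters.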

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3511>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-24 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Thanks for weighing in Matthew!

Yeah, I do get some flak for item 2 because originally item 3 wasn't
supposed to cover named groups but on investigation it made sense that
it should.  I still prefer 2 over-all but the nice thing about them
being separate items is that we can accept 2 or 3 or both or neither,
and for the most part development for the first phase of 2 is complete
though there is still IMHO the issue of UNICODE name groups (vis-à-vis
item 14) and the name collision problem which I propose fixing with an
Attribute / re.A flag.  So, I think it may end up that we could support
both 3 by default and 2 via a flag or maybe 3 and 2 both but with 2 as
is, with name collisions hidden (i.e. if you have r'(?P<string>...)' as
your capture group, typing m.string will still give you the original
comparison string, as per the current python documentation) but have
collision-checking via the Attribute flag so that
r'(?A)(?P<string>...)' would not compile because string is a reserved word.
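
For reference, this is how the existing (?P<name>...) syntax behaves in
the standard re module when a group happens to be named after a match
object attribute.  The sample text is made up; the proposed Attribute
flag does not exist in re, so only the current behaviour is shown.

```python
import re

# A group called "string" is legal today; it is reached through
# group() / groupdict(), never through attribute access.
m = re.match(r'(?P<string>\w+) (?P=string)', 'spam spam eggs')

whole_input = m.string            # the original string passed to match()
captured = m.group('string')      # what the named group captured
by_name = m.groupdict()           # mapping of group names to captures

print(repr(whole_input), repr(captured), by_name)
```

Because the capture is only available via group('string'), the collision
with the m.string attribute stays hidden, which is the "as is" behaviour
described above.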

Your interpretation of 4 matches mine, though, and I would definitely
suggest using Perl's \g<-n> notation for relative back-references, but
further, I was thinking, if not part of 4, part of the catch-all item 11
to add support for Perl's (?<name>...) as a synonym for Python's
(?P<name>...) and Perl's \k<name> for Python's (?P=name) notation.  The
evolution of Perl's name group is actually interesting.  Years ago,
Guido had a conversation with Larry Wall about using the (?P...) capture
sequence for python-specific Regular Expression blocks.  So Python went
ahead and implemented named capture groups.  Years later, the Perl folks
thought named capture groups were a neat idea and adopted them in the
(?<...>...) form because Python had restricted the (?P...) notation to
themselves so they couldn't use ours even if they wanted to.  Now,
though, with Perl adopting (?<...>...), I think it inevitable that Java
and even C++ may see this as the de facto standard.  So I 100% agree, we
should consider supporting (?<name>...) in the parser.

Oh, and as I suggested in Issue 3825, I have these new item proposals:

Item 18: Add a re.REVERSE, re.R (?r) flag for reversing the direction of
the String Evaluation against a given Regular Expression pattern. See
issue 516762, as implemented in Issue 3825.

Item 19: Make various in-line flags positionally dependent, for example
(?i) makes the pattern before this case-sensitive but after it
case-insensitive. See Issue 433024, as implemented in Issue 3825.

Item 20: Allow the negation of in-line flags to cancel their effect in
conditionally flagged expressions, for example (?-i). See Issue 433027,
as implemented in Issue 3825.

Item 21: Allow for scoped flagged expressions, i.e. (?i:...), where the
flag(s) is applied to the expression within the parenthesis. See Issue
433028, as implemented in Issue 3825.

Item 22: Zero-width regular expression split: when splitting via a
regular expression of Zero-length, this should return an expression
equivalent to splitting at each character boundary, with a null string
at the beginning and end representing the space before the first and
after the last character. See issue 3262.

Item 23: Character class ranges over case-insensitive matches, i.e. does
"(?i)[9-A]" contain '_' , whose ord is greater than the ord of 'A' and
less than the ord of 'a'. See issue 3511.
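
As a point of comparison, later versions of the standard re module accept
syntax close to what Items 20 to 22 ask for.  The sketch below assumes
Python 3.7 or newer (as far as I know, scoped flags arrived in 3.6 and
splitting on empty matches in 3.7); it says nothing about how the patches
discussed here implemented these features.

```python
import re

# Item 21: a flag scoped to one group, (?i:...).
scoped = re.compile(r'(?i:abc)def')
scoped_hit = scoped.fullmatch('ABCdef') is not None
scoped_miss = scoped.fullmatch('ABCDEF') is None

# Item 20: switching a flag off again inside a group, (?-i:...).
negated = re.compile(r'(?i)abc(?-i:def)')
negated_hit = negated.fullmatch('ABCdef') is not None
negated_miss = negated.fullmatch('abcDEF') is None

# Item 22: splitting on a zero-width pattern yields one piece per
# character boundary, with an empty string at each end.
pieces = re.split(r'', 'abc')

print(scoped_hit, scoped_miss, negated_hit, negated_miss, pieces)
```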

And I shall create a bazaar repository for your current development line
with the unfortunately unwieldy name of
lp:~timehorse/python/issue2636-01+09-02+17+18+19+20+21 as that would,
AFAICT, cover all the items you've fixed in your latest patch.

Anyway, great work Matthew and I look forward to working with you on
Regexp 2.7 as you do great work!

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue1647489] zero-length match confuses re.finditer()

2008-09-24 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1647489>
___

[issue1647489] zero-length match confuses re.finditer()

2008-09-24 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Hmmm.  This strikes me as a bug, beyond the realm of Issue 3262.  The
two items may be related, but the dropping of the 'a' seems like
unexpected behaviour that I doubt any current code is expecting to
occur.  Clearly, what is going on is that the Engine starts scanning at
the 'a', finds the Zero-Width match and, having found a match,
increments its pointer within the input string, thus skipping the 'a'
when it matches 'bc'.

If it is indeed a bug, I think this should be considered for inclusion
in Python 2.6 rather than being part of the new Engine Design in Issue
2636.  I think the solution would simply be to not increment the ptr
(which points to the input string) when findall / finditer encounters a
Zero-Width match.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1647489>
___

[issue1647489] zero-length match confuses re.finditer()

2008-09-24 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Never mind inclusion in 2.6 as no-one has repeated this bug in real-world
examples yet so it's going to have to wait for the Regexp 2.7 engine in
issue 2636.

--
versions: +Python 2.7 -Python 2.5

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1647489>
___

[issue1647489] zero-length match confuses re.finditer()

2008-09-24 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Ah, I see the problem, if ptr is not incremented, then it will keep
matching the first expression, (^z*), so it would have to both 'skip'
the 'a' and NOT skip the 'a'.  Hmm.  You're right, Matthew, this is
pretty complicated.  Now, for your expression, Matthew,
r'(z*)|(^q*)|(\w+)', Perl gives:

"",undef,undef
undef,undef,"abc"
"",undef,undef

Meaning it doesn't even bother matching the ^q* since the ^z* matches
first.  This seems the logical behaviour and fits with the idea that a
Zero-Width match would both only match once and NOT consume any
characters.  An internal flag would just have to be created to tell the
2 find functions whether the current value of ptr would allow for a "No
Zero-Width Match" option on second go-around.
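
The internal-flag idea can be sketched in pure Python.  This is only a toy
model under my own assumptions, not the engine's C code: the alternation
is passed as a list of separate patterns tried in order, scan() is a
hypothetical helper, and an alternative that matches empty at a position
is never retried for a longer match there.

```python
import re


def scan(alternatives, text):
    """Return (alternative_index, matched_text) pairs, leftmost first."""
    compiled = [re.compile(a) for a in alternatives]
    results = []
    pos = 0
    empty_used_at = None  # position whose zero-width match is already spent
    while True:
        hit = None
        for start in range(pos, len(text) + 1):
            for index, pattern in enumerate(compiled):
                m = pattern.match(text, start)
                if m is None:
                    continue
                if m.end() == start and start == empty_used_at:
                    continue  # flag set: no second zero-width match here
                hit = (index, m)
                break
            if hit:
                break
        if hit is None:
            return results
        index, m = hit
        results.append((index, m.group()))
        if m.end() == m.start():
            empty_used_at = m.end()
        pos = m.end()  # a zero-width match leaves pos where it was


print(scan([r'z*', r'^q*', r'\w+'], 'abc'))
```

With the three alternatives from the expression above, the toy scanner
reports an empty match for the first alternative, then 'abc' for the
third, then another empty match for the first, which lines up with the
three rows Perl printed.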

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1647489>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-24 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Good catch on issue 1647489 Matthew; it looks like this is where that
bug fix will end up going.  But, I am unsure if the solution for this
issue is going to be the same as for 3262.  I think the solution here is
to add an internal flag that will keep track of whether the current
character had previously participated in a Zero-Width match and thus not
allow any further zero-width matches at that position beyond the first,
while at the same time not consuming any characters in a Zero-width match.

Thus, I have allocated this fix as Item 24, but it may be later merged
with 22 if the solutions turn out to be more or less the same, likely
via a 22+24 thread.  The main difference, though, as I see it, is that
the change in 24 may be considered a bug where the general consensus of
22 is that it is more of a feature request and given Guido's acceptance
of a flag-based approach, I suggest we allocate re.ZEROWIDTH, re.Z and
(?z) flags to turn on the behaviour you and I expect, but I still think
that would be best as a 2.7 / 3.1 solution.  I would also like to add a
from __future__ import ZeroWidthRegularExpressions or some such to make this
the default behaviour so that by version 3.2 it may indeed be considered
the default.

Anyway, I've allocated all the new items in the launchpad repository so
feel free to go to http://www.bazaar-vcs.org/ and install Bazaar for
Windows so you can download any of the individual item development
threads and try them out for yourself.  Also, please consider setting up
a free launchpad account of your very own so that I can perhaps create a
group that would allow us to better share development.

Thanks again Matthew for all your greatly appreciated contributions!

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-24 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

I've moved all the development branches to the ~pythonregexp2.7 team so 
that we can work collaboratively.  You just need to install Bazaar, join 
www.launchpad.net, upload your public SSH key and then request to be added 
to the pythonregexp2.7 team.  At that point, you can check out any code 
via:

bzr co lp:~pythonregexp2.7/python/issue2636-*

This should make co-operative development easier.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-25 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

I've enumerated the current list of Item Numbers at the official
Launchpad page for this issue:

https://launchpad.net/~pythonregexp2.7

There you will find links to each development branch associated with
each item, where a broader description of each issue may be found.

I will no longer enumerate the entire list here as it has grown too long
to keep repeating; please consult that web page for the most up-to-date
list of items we will try to tackle in the Python Regexp 2.7 update.

Also, anyone wanting to join the development team who already has a
Launchpad account can just go to the Python Regexp 2.7 web site above
and request to join.  You will need Bazaar to check out, pull or branch
code from the repository, which is available at www.bazaar-vcs.org.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue1160] Medium size regexp crashes python

2008-09-25 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1160>
___

[issue1160] Medium size regexp crashes python

2008-09-25 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

It seems that changing the size type of the Regular Expression Byte-code
is a nice quick-fix, even though it doubles the size of a pattern.  It
may have the added benefit that most machine architectures available
today are at least partially, if not fully, 32-bit oriented so that
retrieving op codes may in fact be faster if we make this change.  OTOH,
it implies something interesting IMHO with the repeat count limits we
currently have.  Repeat counts can be explicitly set up to 65534 times
because 65535, being the largest number you can express in a 16-bit
unsigned integer, is currently reserved to mean Infinite.  It seems to
me this is a great opportunity to set that limit to (unsigned long)-1,
since that repeat count is incredibly large.

OTOH, if size is an issue, we could change the way sizes are expressed
in the Regexp Op Codes (typically in skip counts) to be 15-bit, with the
Most Significant Bit being reserved for 'extended' expressions.  In this
way, a value of 0x could be expressed as:

0x 0x 0x0003

Of course, parsing numbers in this form is a pain, to say the least, and
unlike in Python, the C-library would not play nicely if someone tried
to express a number that could not fit into what the architecture
defined an int to be.  Plus, there is the problem of how you express
Infinite with this scheme.  The advantage though would be we don't have
to change the op-code size and these 'extended' counts would be very
rare indeed.
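
A rough sketch of such a 15-bit scheme, written in Python for brevity
rather than in the engine's C.  The function names, the word order (least
significant group first) and the sample value are my own assumptions for
illustration; nothing here comes from an actual patch.

```python
def encode15(value):
    """Split a non-negative int into 16-bit words carrying 15 payload bits.

    The most significant bit of a word is set when another word follows.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    words = []
    while True:
        chunk = value & 0x7FFF
        value >>= 15
        if value:
            words.append(chunk | 0x8000)  # 'extended': more words follow
        else:
            words.append(chunk)
            return words


def decode15(words):
    """Inverse of encode15; stops at the first word without the MSB set."""
    value = 0
    shift = 0
    for word in words:
        value |= (word & 0x7FFF) << shift
        shift += 15
        if not word & 0x8000:
            return value
    raise ValueError("truncated sequence: last word still had MSB set")


words = encode15(0xFFFFFFFF)
print([hex(w) for w in words])   # ['0xffff', '0xffff', '0x3']
```

Small counts still fit in a single word, which is the size advantage
mentioned above; the cost is the extra decode loop, and 'Infinite' would
still need a reserved representation of its own.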

Overall, I'm more of an Occam's Razor fan in that the simplest solution
is probably the best: just change the op-code size to unsigned long
(which, on SOME architectures would actually make it 64-bits!) and
define the 'Infinite' constant as (unsigned long)-1.  Mind you, I prefer
defining the constant in Python, not C, and it would be hard for Python
to determine that particular value being that Python is meant to be 'the
same' regardless of the underlying architecture, but that's another issue.
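
One way Python code could discover the platform's (unsigned long)-1, as a
sketch only: the standard ctypes module wraps integers to the width of
the underlying C type.  The constant names below are hypothetical and are
not the ones used by the re sources.

```python
import ctypes

# ctypes integer types do no overflow checking, so -1 wraps around to the
# largest value the C type can hold on this platform.
INFINITE_16BIT = ctypes.c_uint16(-1).value     # the current 16-bit sentinel
INFINITE_ULONG = ctypes.c_ulong(-1).value      # (unsigned long)-1

ulong_bits = 8 * ctypes.sizeof(ctypes.c_ulong)
print(INFINITE_16BIT, ulong_bits, INFINITE_ULONG)
```

The second value differs between platforms where unsigned long is 32 bits
and those where it is 64 bits, which is the portability wrinkle noted
above.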

Anyway, as 2.6 is in Beta, this will have to wait for Python 2.7 / 3.1,
and so I will add an item to Issue 2636 with respect to it.

--
versions: +Python 2.7 -Python 2.5

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1160>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-25 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Good catch, Matthew, and if you spot any other outstanding Regular
Expression issues feel free to mention them here.

I'll give issue 1160 an item number of 25 and think all we need to do
here is change SRE_CODE to be typedefed to an unsigned long and change
the repeat count constants (which would be easier if we assume item 10:
shared constants).

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue1647489] zero-length match confuses re.finditer()

2008-09-25 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Perl gives this result for your new expression:

"",undef,undef
undef,undef,"abc"
undef,"",undef

I think it has to do with not thinking of a string as a sequence of
characters, but as a sequence of characters separated by null-space. 
Null-space can be captured, but ONLY if it is part of a zero-width
match, and once captured, it can no longer be captured by another
zero-width expression.  This is in keeping with what I see as Perl's
behaviour, namely that the (q*) group never participates in the first
match because initially the (^z*) captures it.  OTOH, when it gets to
the null-space AFTER the 'abc' capture, the (^z*) cannot participate
because it has a "at-beginning" restriction.  The evaluator then moves
on to the (q*), which has no such restriction and this time it matches,
consuming the final null-space.
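
For anyone who wants to compare against Python itself, the check below
can be run on a recent interpreter.  It assumes Python 3.7 or later,
where, to the best of my knowledge, finditer no longer skips a character
after an empty match; older versions give different rows.  The groups()
helper is just a convenience defined here.

```python
import re


def groups(pattern, text):
    """Collect the capture groups of every match found by finditer."""
    return [m.groups() for m in re.finditer(pattern, text)]


# The null-space before 'a' is taken by the first alternative; the one
# after 'c' is still free, so a zero-width alternative may claim it.
anchored_z = groups(r'(^z*)|(q*)|(\w+)', 'abc')
unanchored_z = groups(r'(z*)|(^q*)|(\w+)', 'abc')

for row in anchored_z:
    print(row)
```

In the first pattern the final empty match goes to (q*) because (^z*) is
anchored to the start; in the second it goes to (z*) again.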

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1647489>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-25 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Hmmm.  Well, some of those are already covered:

#2636    : self
#1160    : Item 25
#1647489 : Item 24
#3511    : Item 23
#3825    : Item 9-2
#433028  : Item 21
#433027  : Item 20
#433024  : Item 19
#3262    : Item 22
#3299    : TBD
#3665    : TBD
#3482    : TBD
#1519638 : TBD
#1662581 : TBD
#3255    : TBD
#2650    : TBD
#433030  : Item 1
#1721518 : TBD
#1693050 : TBD
#2537    : TBD
#1633953 : TBD
#1282    : TBD
#814253  : TBD (but I think you implemented this, didn't you Matthew?)
#214033  : TBD
#1708652 : TBD
#694374  : TBD
#433029  : Item 8

I'll have to get nosy and go over the rest of these to see if any of
them have already been solved, like the duplicate test case issue from a
while ago, but someone forgot to close them.  I'm thinking specifically
the '\u' escape sequence one.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-25 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Yes, I see in your rc2+2 diff it was added into that.  I will have to
allocate a new number for that fix though, as technically it's a
different feature than variable-length look-behind.

For now I'm having a hard time merging your diffs in with my code base.
 Lots and lots of conflicts, alas.

BTW, what UID did you try to register under at Launchpad?  Maybe I can
see if it's registered but just forgetting to send you e-mail.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-25 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Thanks Matthew.  You are now part of the pythonregexp2.7 team.  I want
to handle integrating Branch 01+09-02+17 myself for now and the other
branches will need to be renamed because I need to add Item 26: Capture
Groups in Look-Behind expressions, which would mean the order of your
patches is:

01+09-02+17:

regex_2.6rc2.diff
regex_2.6rc2+1.diff

01+09-02+17+26:

regex_2.6rc2+2.diff

01+09-02+17+18+26:

regex_2.6rc2+3.diff
regex_2.6rc2+4.diff

01+09-02+17+18+19+20+21+26:

regex_2.6rc2+5
regex_2.6rc2+6

It is my intention, therefore, to check a version of each of these
patches in to their corresponding repository, sequentially, starting
with 0, which is what I am working on now.

I am worried about a straight copy to each thread though, as there are
some basic cleanups provided through the core issue2636 patch, the item
1 patch and the item 9 patch.  The best way to see what these changes
are is to download
http://bugs.python.org/file10645/issue2636-patches.tar.bz2 and look at
the issue2636-01+09.patch file or, by typing the following into bazaar:

bzr diff --old lp:~pythonregexp2.7/python/base --new
lp:~pythonregexp2.7/python/issue2636+01+09

Which is more up-to-date than my June patches -- I really need to
regenerate those!

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue1647489] zero-length match confuses re.finditer()

2008-09-25 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Matthew, I'll try to merge all your diffs with the current repository
over the weekend.  Having done the first, I know where code differs
between your implementation, mine and the base, so I can apply your
patch, and then a patch that restores my changes so the rest of the
merges should be easy!  :)

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1647489>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-26 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Matthew,

Did you upload a public SSH key to your Launchpad account?

You're on MS Windows, right?  I can try and do an install on an MS
Windows XP box or 2 I have lying around and see how that works, but we
should try and solve this vexing thing I've noticed about Windows
development, which is that Windows cannot understand Unix-style file
permissions, and so when I check out Python on Windows and then check it
back in, I've noticed that EVERY python and C file is "changed" by
virtue of its permissions having changed.  I would hope there's some way
to tell Bazaar to ignore 'permissions' changes because I know our edits
really have nothing to do with that.

Anyway, I'll try a few things vis-à-vis Windows to see if I get a similar
problem; there's also the https://answers.launchpad.net/bazaar forum
where you can post your Bazaar issues and see if the community can help.
 Search previous questions or click the "Ask a question" button and type
your subject.  Launchpad's UI is even smart enough to scan your question
title for similar ones so you may be able to find a solution right away
that way.  I use the Launchpad Answers section all the time and have
found it usually is a great way of getting help.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-26 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Great, Matthew!!

Now, I'm still in the process of setting up branches related to your
work; generally, each should be created from a core plus a set of
implemented features.  For example:

To get from Version 2 to Version 3 of your Engine, I had to first check
out lp:~pythonregexp2.7/python/issue2636-01+09-02+17 and then "push" it
back onto launchpad as
lp:~pythonregexp2.7/python/issue2636-01+09-02+17+26.  This way the
check-in logs become coherent.

So, please hold off on checking your code in until I have your current
patch-set checked in, which I should finish by today; I also need to
rename some of the projects based on the fact that you also implemented
item 26 in most of your patches.  Actually, I keep a general To-Do list
of what I am up to on the
https://code.launchpad.net/~pythonregexp2.7/python/issue2636 whiteboard,
which you can also edit, if you want to see what I'm up to.  But I'll
try to have that list complete by today, fingers crossed!  In the mean
time, would you mind seeing if you are getting the file permissions
issue by doing a checkout or pull or branch and then calling "bzr stat"
to see if this caused Bazaar to add your entire project for checkin
because the permissions changed?  Thanks and congratulations again!

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-26 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Thanks, Matthew.  My reading of that Answer is that you should be okay
because you, I assume, installed the Windows-Native package rather than
the Cygwin one that I first tested.  I think the problem is specific to
Cygwin as well as the circumstances described in the article.  Still, it
should be quite easy to verify if you just check out python and then do
a stat, as this will show all files whose permissions have changed as
well as general changes.  Unfortunately, I am still working on setting
up those branches, but once I finish documenting each of the branches, I
should proceed more rapidly.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-26 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Phew!  Okay, all your patches have been applied as I said in a previous
message, and you should now be able to check out
lp:~pythonregexp2.7/python/issue2636+01+09-02+17+18+19+20+21+24+26 where
you can then apply your latest known patch (rc2+7) to add a fix for the
findall / finditer bug.

However, please review my changes to:

a) lp:~pythonregexp2.7/python/issue2636-01+09-02+17
b) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+26
c) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+18+26
d) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+18+19+20+21+26

This is to make sure my merges are what your code snapshots should be.
I did get one conflict, with patch 5 IIRC, where a reverse attribute was
added to the SRE_STATE struct.  I also get a weird grouping error when
running the tests for (a) and (b), which I think is a typo; a compile
error in (c) regarding the aforementioned reverse attribute, missing
from patch 3 or 4; and the SRE_FLAG_REVERSE seems to have been lost in
(d) for some reason.

Also, if you feel like tackling any other issues, whether they have
numbers or not, and implementing them in your current development line,
please let me know so I can get all the documentation and development
branches set up.  Thanks and good luck!

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue433029] SRE: posix classes aren't supported

2008-09-26 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

To clarify, you mean named character sets as found in Perl and Emacs,
which are normally written, for example, like '[:ALPHANUM:]', right?  We
are working on that as Item 8 of Issue 2636: Regexp 2.7.  If not, please
clarify so I know what needs to be added.  Thanks!

--
nosy: +timehorse
versions: +Python 2.7

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433029>
___

[issue3299] invalid object destruction in re.finditer()

2008-09-26 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3299>
___

[issue3665] Support \u and \U escapes in regexes

2008-09-27 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3665>
___

[issue3482] re.split, re.sub and re.subn should support flags

2008-09-27 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3482>
___

[issue3482] re.split, re.sub and re.subn should support flags

2008-09-27 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
versions: +Python 2.7, Python 3.1 -Python 2.6, Python 3.0

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3482>
___

[issue3299] invalid object destruction in re.finditer()

2008-09-27 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
versions: +Python 2.7 -Python 2.6

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3299>
___

[issue3665] Support \u and \U escapes in regexes

2008-09-27 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
versions: +Python 2.7, Python 3.1 -Python 3.0

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3665>
___

[issue1519638] Unmatched Group issue - workaround

2008-09-27 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1519638>
___

[issue1519638] Unmatched Group issue - workaround

2008-09-27 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
versions: +Python 2.7 -Python 2.5

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1519638>
___

[issue1662581] the re module can perform poorly: O(2**n) versus O(n**2)

2008-09-27 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse
versions: +Python 2.7

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1662581>
___

[issue3255] [proposal] alternative for re.sub

2008-09-28 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Implementing Issue 3482 should solve this problem, and I will try to add 
it to issue 2636 so that it is captured in the general Regexp 2.7 
redesign.

--
nosy: +timehorse
versions: +Python 2.7

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3255>
___

[issue2650] re.escape should not escape underscore

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
versions: +Python 2.7, Python 3.1 -Python 2.6, Python 3.0

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2650>
___

[issue2650] re.escape should not escape underscore

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2650>
___

[issue1721518] Small case which hangs

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse
versions: +Python 2.7 -Python 2.4

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1721518>
___

[issue1721518] Small case which hangs

2008-09-28 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Tested on 2.6rc2: slow but successful.  Issue 1662581 may be related.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1721518>
___

[issue1693050] \w not helpful for non-Roman scripts

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse
versions: +Python 2.7 -Python 2.4

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1693050>
___

[issue2537] re.compile(r'((x|y+))') should fail

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse
versions: +Python 2.7 -Python 2.6

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2537>
___

[issue1633953] re.compile("(.*$){1,4}", re.MULTILINE) fails

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse
versions: +Python 2.7 -Python 2.5

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1633953>
___

[issue1282] re module needs to support bytes / memoryview well

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1282>
___

[issue214033] re incompatibility in sre

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue214033>
___

[issue1708652] Exact matching

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse
versions: +Python 2.7

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1708652>
___

[issue694374] Recursive regular expressions

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse
versions: +Python 2.7

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue694374>
___

[issue1456280] Traceback error when compiling Regex

2008-09-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1456280>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-29 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Good work, Matthew.  Now, another bazaar hint, IMHO, is one of my
favourite commands: switch.  I generally develop all in one directory,
rather than getting a new directory for each branch.  One does have to
be VERY careful to type "bzr info" to make sure the branch you're
editing is the one you think it is!  But with "bzr switch", you do a
differential branch switch that allows you to change your development
branch quickly and painlessly.  This assumes you did a "bzr checkout"
and not a "bzr pull".  If you did a pull, you can still turn this into a
"checkout", where all VCS actions are mirrored on the server, by using
the 'bind' command.  Make sure you push your branch first.  You don't
need to worry about all this "bind"ing, "push"ing and "pull"ing if you
choose checkout, but OTOH, if your connection is overall very slow, you
may still be better off with a "pull"ed branch rather than a
"checkout"ed one.

Anyway, good catch on those 4 lines and I'll see if I can get your
earlier branches up to date.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2008-09-29 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Matthew, I've traced down the patch failures in my merges and now each
of the 4 versions of code on Launchpad should compile.  The first 2 do
not pass all the negative look-behind tests, though your later 2 do.
Any chance you could back-port that fix to the
lp:~pythonregexp2.7/python/issue2636-01+09-02+17 branch?  If you can, I
can propagate that fix to the higher levels pretty quickly.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
___

[issue1633953] re.compile("(.*$){1,4}", re.MULTILINE) fails

2008-10-11 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

On first blush, this issue sounds quite similar to issue 2537, but I 
have been looking at different scenarios and found that there is a 
subtle difference because, grammatically:

(?m)(?:.*$)(.*$)

is the same as:

(?m)(.*$){2}

Yet the former compiles while the latter raises the exception you list 
below.  Thus, I think the issue YOU raise is indeed related to the 
redundant repeat operator issue numbered 2537, BUT, when I match an 
expression with the alternate form, I get an empty string in my capture 
group, since in a range repeat over a capture group, only the last group 
is captured, while the entire expression matches only the first line, 
without the end-line character.  Thus, the other thing to remember is 
that ^ and $ are zero-width matches, so when you write .*$, you are 
saying match up to, but not including, the end of the line.  If you 
immediately follow that with another .*$, you will start from the point 
"up to, but not including, the end of the line", which means the next 
character is an end of line.  Thus, when you reach the second .*$, you 
capture nothing because the .* is allowed to be zero-length and you 
still haven't advanced PAST the end of the line.

As a working alternative, you could write r'(?m)(?:(.*$)[\r\n]*){1,4}' , 
since this would give you your 1-4 lines, but also consume the carriage 
return and line feed characters to get you to the next line.
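
The zero-width behaviour of $ described above can be checked with a
minimal sketch.  It assumes the re module of a current Python 3 (the
2.6-era engine discussed in this thread may behave differently), and
the sample strings and variable names are illustrative only:

```python
import re

text = "ab\ncd"

# $ is zero-width: the match stops before the newline and does not consume it.
first = re.match(r"(?m).*$", text)
stopped_at = first.end()  # index of the "\n" that was left unconsumed

# A second .*$ starts at that same position, so its group captures nothing.
stalled = re.match(r"(?m)(?:.*$)(.*$)", text)

# Consuming the line terminators inside the repeat lets each pass
# start on a fresh line, as in the working alternative above.
lines = "a\nb\nc\nd"
walked = re.match(r"(?m)(?:(.*$)[\r\n]*){1,4}", lines)
```

Here stalled.group(1) is the empty string, while walked covers all four
lines and, as noted above, only the last repetition stays in the group.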

Since we don't want to change the meaning of $ and ^ to make them 
capturing (custom POSIX character classes may make 'capturing' a new 
line character easier), and the 'redundant repeat operator' is already 
listed as a bug (your expression is essentially saying (.*){1,4}$ 
because it does not capture the new-line character(s) and thus has a 
redundant repeat operation in the range repeat expression), I'm willing 
to call this a duplicate (technically "duplicated by", as this issue is older) 
of issue 2537.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1633953>
___

[issue1708652] Exact matching

2008-10-13 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Binary format searches should be supported once issue 1282 is implemented, 
likely as part of issue 2636 Item 32.  That said, I'm not clear what you 
mean by exact search; wouldn't you want match instead?  If your main issue 
is you want something that automatically binds to the beginning and ending 
of input, then I suppose we could add an 'exact' method where 'search' 
searches anywhere, 'match' matches from the start of input and 'exact' 
matches from beginning to ending.  I'd call that a separate issue, though.  
In other words: byte-oriented matching is covered by 1282 and adding an 
'exact' method is the only new issue here.  Does that sound right?
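
For comparison, here is a rough sketch of the three anchoring
behaviours discussed above.  Later Python versions (3.4 and up) provide
re.fullmatch, which plays the role of the 'exact' method; the exact()
helper below is hypothetical and assumes a pattern with no inline
global flags and no trailing verbose-mode comment:

```python
import re

pattern = re.compile(r"\d+")
text = "123abc"

found_anywhere = pattern.search("abc123")  # search: match anywhere in the input
found_at_start = pattern.match(text)       # match: anchored at the start only
found_exact = pattern.fullmatch(text)      # fullmatch: must cover the whole input


def exact(pattern_source, string, flags=0):
    """Hypothetical stand-in for an 'exact' method on older engines.

    Wrapping the pattern in a non-capturing group followed by \\Z lets the
    engine backtrack until the whole input is covered.
    """
    return re.match(r"(?:%s)\Z" % pattern_source, string, flags)
```

With these inputs, found_exact is None because "abc" is left over,
whereas exact(r"a|ab", "ab") still finds the full-length alternative.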

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1708652>
___

[issue694374] Recursive regular expressions

2008-10-13 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

PCRE has some interesting suggestions on how the grammar for recursive 
regular expressions might work.  I am concerned about the use of 
(?P>name) to call a regexp subexpression as an atomic subroutine.  The 
(?P>name) format has never before been supported by Python and the 
(?P...) notation is exclusive to Python, so it is strange for PCRE to 
assign us a use for (?P>name) without the Python community actually 
agreeing to it.  Other than that, though, I think this is a possible 
feature we could add in issue 2636 as item 35.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue694374>
___

[issue214033] re incompatibility in sre

2008-10-13 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

The duplicate zero-or-one repeat operator bug originally described in 
this issue no longer exists in python 2.6.

However, Trent Mick brings up a fair point in that expressions of the 
form (x*)? generate an error (issue 1456280) when internally the '?' 
should be passively stripped from the expression by the Python Regular 
Expression Compiler because it is redundant.  The same goes for 
expressions of the form (x*)* (issue 2537).  Also, there is a problem 
with expressions of the form (x*){n,m} (issue 1633953), since the x* 
matches as much as it can, and thus it sees the range repeat operation 
as redundant -- in this case I think the range repeat should have the 
effect of matching (x*)(x*)(x*)... n to m times, but since the first 
time matches everything, the subsequent matches all match zero-width 
expressions following the first one.  I am tracking all of these issues 
under Item 33 of Issue 2636.

These are the 3 known redundant repeat issues, but this one, the 
zero-or-one followed by zero-or-one, is AFAICT fixed in python 2.6 as the 
expression originally listed now passes compile.
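
To see how a given interpreter treats the three redundant-repeat
shapes listed above, a small probe along these lines may help.  It only
records whether each pattern compiles; the labels and the probe()
helper are made up for illustration, and results will depend on the
Python version in use:

```python
import re

# The three redundant-repeat shapes tracked in the message above.
shapes = {
    "zero-or-one over zero-or-more": r"(x*)?",
    "zero-or-more over zero-or-more": r"(x*)*",
    "range over zero-or-more": r"(x*){1,4}",
}


def probe(patterns):
    """Map each label to None if its pattern compiles, else to the error text."""
    report = {}
    for label, source in patterns.items():
        try:
            re.compile(source)
        except re.error as exc:
            report[label] = str(exc)
        else:
            report[label] = None
    return report


results = probe(shapes)
```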

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue214033>
___

[issue1456280] Traceback error when compiling Regex

2008-10-13 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

This is another version of the redundant repeat issue defined in issues 
2537 and 1633953 and although not described by the original report for 
issue 214033, the comments further down that issue also describe a 
similar situation.

In this case, the problem arises from the '[(](?P[^)]*)?[)]' 
portion of your regexp code because you have a zero-or-more repeat 
repeated zero-or-one times, which in the current version of python 
causes this error.  Technically, the outer zero-or-one operator ('?') is 
redundant and you can eliminate it, but this IMHO should not cause the 
error listed below and I will look into a solution in issue 2636.

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1456280>
___

[issue1619060] bisect on presorted list

2008-03-13 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Thanks Raymond for checking on the status of this issue and Guido for 
thinking about it too.  I still support the basic concept for the 
reasons I specified in the original description, namely maintaining a 
sorted list (or random-access iterable).  Thus, although in some ways it 
seems like syntactic sugar to make bisect and heap match the parameter 
list of sort, there is good reason to do so.

That having been said, the re-computation of the /key/ function for 
typically the same values at each invocation of an insort_* method can 
potentially add a lot of otherwise unnecessary calls.

So, having thought about this further, it seems to me that, somehow, the 
bisect/insort functions need to cache the results of each 
evaluation of key in the same way that sort can maintain that 
information in a local variable.  Ideally, IMHO, the keys computed in 
sort could be cached and stored with the list upon which it was 
operated, but this is cumbersome, would not work for generic types and 
would be invalid if for some reason the sort key was different from the 
bisect key.  Typically the latter case would not occur but since it might 
lead to unexpected behavior it should definitely be avoided.

Instead, I am now thinking that the only way we can safely maintain 
caching information is to implement a persistent object to contain it 
that can be shared between each invocation of bisect/insort so as to not 
force re-evaluation of the same key argument.  This clearly could be 
done with a class.

So I would propose that a bisect class be created that would hold the 
list, the current key function and the cached key values.  Of course, 
special care would need to be taken to take into account mutable 
sequences that could be changed between bisect calls.

The class could be implemented in a number of ways.  For instance, it 
could create a callable instance that accepts a list and other 
parameters.  It would then look at its state, including an /is/ check 
of the input against the last list operated upon, and if that test 
fails, the 'key cache' is invalidated.  Next, if a key function is 
provided, it is checked (/is/) against the last key function used and 
if it is not the same, the 'key cache' is invalidated.  Then, to 
prevent mutability problems, 
the 'key cache' would actually be implemented as a dictionary mapping 
item to key-value and each time a key needed to be computed, the input 
would first be checked against the cache and only if it was not there 
would the key function be called.  This should allow for preventing 
undefined behavior when handling invalid cases.  Of course, if the 
client wishes to use bisect on 2 different lists or 2 different keys, 
sie could always just create 2 different bisect objects.  There are of 
course other ways to accomplish this, such as binding the bisect object 
to a list and key at construction that cannot be reset (at least, should 
not be) once invoked.  I am also wary of implementing the 'key cache' as 
a dictionary as it adds a hash-table lookup which is already potentially 
expensive.  Ideally, if the bisect object could force its associated 
list to be immutable for the life of the bisect, this would get around 
the problem of external inserts and deletes that would invalidate the 
'key cache'.
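
As a rough sketch of the variant mentioned above, where the object is
bound to its list and key at construction, something like the following
could work.  The class name, its methods and the parallel list of keys
are all assumptions made for illustration; this is not an existing
bisect API and it sidesteps the dictionary cache by owning the list:

```python
import bisect


class KeyedSortedList:
    """Hypothetical helper: keeps items sorted by key(item) and caches the keys."""

    def __init__(self, iterable=(), key=lambda item: item):
        self._key = key
        # Compute each key exactly once, then sort on the key alone.
        pairs = sorted(((key(item), item) for item in iterable),
                       key=lambda pair: pair[0])
        self._keys = [k for k, _ in pairs]      # the cached keys
        self._items = [item for _, item in pairs]

    def insort_right(self, item):
        """Insert item after any existing items with an equal key."""
        k = self._key(item)                     # one key call per insertion
        index = bisect.bisect_right(self._keys, k)
        self._keys.insert(index, k)
        self._items.insert(index, item)
        return index

    def bisect_left(self, key_value):
        """Locate the leftmost position for an already-computed key value."""
        return bisect.bisect_left(self._keys, key_value)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)
```

Because the helper owns both lists, external inserts and deletes cannot
silently invalidate the cached keys, which is the concern raised above.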

Obviously, there may be another way to define such a class but I think 
the only way you can safely make sure the bisect functions don't unnecessarily 
compute key without modifying the list is to allow for the 
caching of values between invocations, and the only way you can do that 
safely is by putting the cache in some helper class.  Of course, a class 
is only needed for when key is used.  comp and reverse, AFAIK, are not 
cached when sorting.  So the existing, generic bisect algorithms would 
still be useful and in fact, you'd want the bisect class to be optional, 
IMHO.

Of course, Guido's suggestion of just mapping your list into a list of 
(key, value) tuples is a good workaround for just the key case (assuming 
key, comp and reverse flags are not all used at the same time for 
sorting).  It probably should be a cookbook item that could be 
documented with the bisect (and heap) libraries.  It can't completely 
solve the problem, but it will solve some conditions in the short term.  
Long term, I still think we should consider bisect, potentially with a 
class definition.
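
The (key, value) workaround mentioned above might look like this in a
cookbook entry.  The sample records are invented for illustration; note
that tuples with equal keys fall back to comparing the values, which is
one of the conditions the workaround cannot fully cover:

```python
import bisect

# Illustrative records: (name, quantity); we want them ordered by quantity.
records = [("pear", 3), ("fig", 1), ("banana", 2)]

# Decorate once: the key is computed a single time per record.
decorated = sorted((quantity, name) for name, quantity in records)

# Later insertions reuse the stored keys instead of recomputing them.
bisect.insort(decorated, (2, "kiwi"))

# Undecorate when the plain values are needed again.
names_in_order = [name for _, name in decorated]
```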

_
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1619060>
_

[issue433030] SRE: (?>...) is not supported

2008-03-27 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
nosy: +timehorse


Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433030>


[issue433030] SRE: (?>...) is not supported

2008-03-28 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Fredrik,

If you're still listening, I am gonna try and tackle this one but I 
would like to know why you or the famous Jeffrey of the Regexp world 
claims that there is already code in the Regexp Engine for Atomic 
Grouping?  Adding a hook for (?>...) should be trivial but I don't wanna 
re-invent the wheel if the proper stack-unwind code already exists.  
Thanks.  Of course, Andrew (a.k.a. A.M. Kuchling) already asked this 
question and you did not answer it, so I guess you're not reading this, 
but if you are, please respond.  Thanks!


Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433030>


[issue433030] SRE: Atomic Grouping (?>...) is not supported

2008-03-28 Thread Jeffrey C. Jacobs


Changes by Jeffrey C. Jacobs <[EMAIL PROTECTED]>:


--
title: SRE: (?>...) is not supported -> SRE: Atomic Grouping (?>...) is not 
supported


Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433030>


[issue433030] SRE: Atomic Grouping (?>...) is not supported

2008-03-29 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

I'm digging into sre_parse.py at the moment and I think I have all the 
changes I need for that now.  The rest of the changes I believe are in 
either sre_compile.py or more likely directly in _sre.c, so I will 
examine those files next.  I am attaching a single diff for 
expedience.  This is not an official patch, just a sample to see the 
progress I am making.  I forgot the correct format for patch files but 
I promise to get it right when I have made more progress.

--
components: +Library (Lib)
versions: +Python 2.6
Added file: http://bugs.python.org/file9897/PyLibDiffs.txt


Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue433030>


[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-04-15 Thread Jeffrey C. Jacobs


New submission from Jeffrey C. Jacobs <[EMAIL PROTECTED]>:

I am working on adding features to the current Regexp implementation,
which is now set to 2.2.2.  These features are to bring the Regexp code
closer in line with Perl 5.10 as well as add a few python-specific
niceties and potential speed-ups and clean-ups.

I will be posting regular patch updates to this thread when major
milestones have been reached, with a description of the feature(s) added.
Currently, the proposed changes are (in no particular order):

1) Fix issue 433030 (http://bugs.python.org/issue433030) by
adding support for Atomic Grouping and Possessive Quantifiers

2) Make named matches direct attributes of the match object; i.e.
instead of m.group('foo'), one will be able to write simply m.foo.

3) (maybe) make Match objects subscriptable, such that m[n] is
equivalent to m.group(n) and allow slicing.

4) Implement Perl-style back-references including relative back-references.

5) Add a well-formed, python-specific comment modifier, e.g. (?P#...);
the difference between (?P#...) and Perl/Python's (?#...) is that the
former will allow nested parentheses as well as parenthetical escaping,
so that patterns of the form '(?P# Evaluate (the following) expression,
3\) using some other technique)' become possible.  The (?P#...) will
interpret this entire expression as a comment, whereas with (?#...)
everything following ' expression...' would be considered part of the
match.  (?P#...) will necessarily be slower than (?#...) and so should
only be used if a richer commenting style is required but the verbose
mode is not desired.

6) Add official support for fast, non-repeating capture groups with the
Template option.  Template is unofficially supported and disables all
repeat operators (*, + and ?).  This would mainly consist of documenting
its behavior.

7) Modify the re compiled expression cache to better handle the
thrashing condition.  Currently, when regular expressions are compiled,
the result is cached so that if the same expression is compiled again,
it is retrieved from the cache and no extra work has to be done.  This
cache supports up to 100 entries.  Once the 100th entry is reached, the
cache is cleared and a new compile must occur.  The danger, albeit
rare, is that one may compile the 100th expression only to find that one
recompiles it and has to do the same work all over again when it may
have been done 3 expressions ago.  By modifying this logic slightly, it
is possible to establish an arbitrary counter that gives a time stamp to
each compiled entry and instead of clearing the entire cache when it
reaches capacity, only eliminate the oldest half of the cache, keeping
the half that is more recent.  This should limit the possibility of
thrashing to cases where a very large number of Regular Expressions are
continually recompiled.  In addition to this, I will update the limit to
256 entries, meaning that the 128 most recent are kept.

8) Emacs/Perl style character classes, e.g. [:alphanum:].  For instance,
:alphanum: would not include the '_' in the character class.

9) C-Engine speed-ups.  I am commenting and cleaning up the _sre.c
Regexp engine to make it flow more linearly, rather than with all the
current gotos, and replacing the switch-case statements with lookup
tables, which in tests have been shown to be faster.  This will also include adding many
more comments to the C code in order to make it easier for future
developers to follow.  These changes are subject to testing and some
modifications may not be included in the final release if they are shown
to be slower than the existing code.  Also, a number of Macros are being
eliminated where appropriate.

10) Export any (not already) shared value between the Python Code and
the C code, e.g. the default Maximum Repeat count (65536); this will
allow those constants to be changed in 1 central place.

11) Various other Perl 5.10 conformance modifications, TBD.


More items may come and suggestions are welcome.
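
To give a feel for items 2) and 3) above, here is a pure-Python sketch
built as a wrapper around an ordinary match object.  The AttrMatch
class is hypothetical and only approximates the proposal, which would
change the match type itself at the C level:

```python
import re


class AttrMatch:
    """Hypothetical wrapper; not part of the re module."""

    def __init__(self, match):
        self._match = match

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._match.group(name)   # m.foo -> m.group('foo')
        except IndexError:
            raise AttributeError(name) from None

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slices run over group 0 followed by the numbered groups.
            groups = (self._match.group(0),) + self._match.groups()
            return groups[index]
        return self._match.group(index)      # m[n] -> m.group(n)


m = AttrMatch(re.match(r"(?P<foo>\d+)-(?P<bar>\w+)", "42-xyz"))
```

One open design question this exposes is name clashes: a group called
"group" or "start" would collide with existing match methods.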

-

Currently, I have code which implements 5) and 7), have done some work
on 10) and am almost done with 9).  When 9) is complete, I will work on 1), some
of which, such as parsing, is already done, then probably 8) and 4)
because they should not require too much work -- 4) is parser-only
AFAICT.  Then, I will attempt 2) and 3), though those will require
changes at the C-Code level.  Then I will investigate what additional
elements of 11) I can easily implement.  Finally, I will write
documentation for all of these features, including 6).

In a few days, I will provide a patch with my interim results and will
update the patches with regular updates when Milestones are reached.

--
components: Library (Lib)
messages: 65513
nosy: timehorse
severity: normal
status: open
title: Regexp 2.6 (modifications to current re 2.2.2)
type: feature request
versions: Python 2.6


[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-04-17 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

I am very sorry to report (at least for me) that as of this moment, item 
9), although not yet complete, is stable and able to pass all the 
existing python regexp tests.  Because these tests are timed, I am using 
the timings from the first suite of tests to perform a benchmark of 
performance between old and new code.  Based on discussion with Andrew 
Kuchling, I have decided that, for the sake of simplicity, the "timing" 
of each version is to be taken as the absolute minimum execution time 
observed, because that run is believed to have had the most 
uninterrupted CPU cycles and thus most closely represents the true 
execution time.
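
As a rough illustration of the "take the minimum" approach, here is a
small sketch using the standard timeit module.  It is not the harness
used for the numbers in this message; the helper name min_time, the
statement being timed and the repeat counts are all made up for the
example.

```python
import timeit


def min_time(stmt, setup="pass", repeat=5, number=1000):
    """Return the fastest of `repeat` timing runs, in seconds."""
    # Each run executes `stmt` `number` times; the minimum is the run
    # that was least disturbed by other activity on the machine.
    return min(timeit.repeat(stmt, setup=setup, repeat=repeat, number=number))


best = min_time(
    "pat.match('Hello world')",
    setup="import re; pat = re.compile(r'He(l+)o')",
)
print("best of 5 runs: %.6fs" % best)
```

Taking the minimum rather than the mean discards runs that were slowed
down by scheduling noise, which is the reasoning given above.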

It is this current conclusion that greatly saddens me, not that the 
effort has not been valuable in understanding the current engine.  
Indeed, I understand the current engine now well enough that I could 
proceed with the other modifications as-is rather than implementing them 
with the new engine.  Mind you, I will likely not bring over the copious  
comments that the new engine received when I translated it to a form 
without C_Macros and gotos, as that would require too much effort IMHO.

Anyway, all that being said, and keeping in mind that I am not 100% 
satisfied with the new engine and may still be able to wring some timing 
out of it -- not that I will spend much more time on this -- here is 
where we currently stand:

Old Engine: 6.574s
New Engine: 7.239s

This makes the old Engine 665ms faster over the entire first test_re.py 
suite; it takes about 9% less time than the New Engine (equivalently, 
the New Engine is about 10% slower).
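
For reference, the arithmetic behind the comparison can be checked with
a few lines of Python.  The variable names are arbitrary; the two
timings are the ones quoted above.

```python
old, new = 6.574, 7.239            # seconds, from the timings above
diff = new - old                   # absolute gap between the engines

pct_less = 100 * diff / new        # old engine relative to the new one
pct_more = 100 * diff / old        # new engine relative to the old one

print("difference: %.3fs" % diff)
print("old takes %.1f%% less time than new" % pct_less)
print("new takes %.1f%% more time than old" % pct_more)
```

The two percentages differ only in which engine is used as the baseline.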

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
__
___
Python-bugs-list mailing list 
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-04-18 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Here are the modifications so far for item 9) in _sre.c plus some small
modifications to sre_constants.h which are only to get _sre.c to
compile; normally sre_constants.h is generated by sre_constants.py, so
this is not the final version of that file.  I also would have intended
to make SRE_CHARSET and SRE_COUNT use lookup tables, as well as maybe
others, but not likely any other lookup tables.  I also want to remove
alloc_pos out of the self object and make it a parameter to the ALLOC
parameter and probably get rid of the op_code attribute since it is only
used in 1 place to save one subtract in a very rare case.  But I want to
resolve the 10% problem first, so would appreciate it if people could
look at the REMOVE_SRE_MATCH_MACROS section of code and compare it to
the non-REMOVE_SRE_MATCH_MACROS version of SRE_MATCH and see if you can
suggest anything to make the former (new code) faster to get me that
elusive 10%.

--
keywords: +patch
Added file: http://bugs.python.org/file10052/issue2636-09.patch


[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-04-18 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Here is a patch to implement item 7)

Added file: http://bugs.python.org/file10053/issue2636-07.patch


[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-04-18 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

This simple patch adds (?P#...)-style comment support.

Added file: http://bugs.python.org/file10056/issue2636-05.patch


[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

2008-04-24 Thread Jeffrey C. Jacobs


Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Thanks Jim for your thoughts!

Amaury has already explained about Perl 5.10.0.  I suppose it's like
Macintosh version numbering, since Mac Tiger went from version 10.4.9 to
10.4.10 and 10.4.11 a few years ago.  Maybe we should call Python 2.6
Python 2.06 just in case.  But 2.6 is the known last in the 2 series so
it's not a problem for us!  :)

>> as well as add a few python-specific
>
> because this also adds to the scope.

At this point the only python-specific changes I am proposing would be
items 2, 3 (discussed below), 5 (discussed below), 6 and 7.  6 is only a
documentation change, the code is already implemented.  7 is just a
better behavior.  I think it is RARE one compiles more than 100 unique
regular expressions, but you never know as projects tend to grow over
time, and in the old code the 101st would be recompiled even if it was
just compiled 2 minutes ago.  The patch is available so I leave it to
the community to judge for themselves whether it is worth it, but as you
can see, it's not a very large change.

>> 2) Make named matches direct attributes 
>> of the match object; i.e. instead of m.group('foo'), 
>> one will be able to write simply m.foo.
>
>> 3) (maybe) make Match objects subscriptable, such 
>> that m[n] is equivalent to m.group(n) and allow slicing.
>
> (2) and (3) would both be nice, but I'm not sure it makes sense to do 
> *both* instead of picking one.

Well, I think named matches are better than numbered ones, so I'd
definitely go with 2.  The problem with 2, though, is that it still
leaves the rather typographically intense m.group(n), since I cannot
write m.3.  However, since capture groups are always numbered
sequentially, it models a list very nicely.  So I think for indexing by
group number, the subscripting operator makes sense.  I was not
originally suggesting m['foo'] be supported, but I can see how that may
come out of 3.  But there is a restriction on python named matches that
the names have to be valid python identifiers, and that strikes me as
fitting 2 more than 3, because 3 would not require such a restriction
but 2 would.  So at least I want 2, but IMHO m[1] is better than
m.group(1) and is not in the least a hard or confusing way of retrieving
the given group.  Mind
you, the Match object is a C-struct with python binding and I'm not
exactly sure how to add either feature to it, but I'm sure the C-API
manual will help with that.
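
To make the two proposals concrete, here is a pure-Python sketch of how
they might look from the user's side.  It is only a wrapper around an
ordinary match object, not the C-level change discussed above; the class
name MatchView and the slicing behaviour (group 0 followed by the
numbered groups) are assumptions made for the example.

```python
import re


class MatchView:
    """Hypothetical wrapper illustrating proposals 2) and 3)."""

    def __init__(self, match):
        self._match = match

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._match.group(name)      # proposal 2): m.foo
        except IndexError:
            raise AttributeError(name)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Assumption: slices run over group 0 plus numbered groups.
            groups = (self._match.group(0),) + self._match.groups()
            return groups[index]
        return self._match.group(index)         # proposal 3): m[n]


m = MatchView(re.match(r"(?P<first>\w+) (?P<last>\w+)", "Jeffrey Jacobs"))
print(m.first)    # named group as an attribute
print(m[2])       # numbered group by subscript
print(m[1:3])     # slice of groups
```

The attribute form only works for names that are valid identifiers,
which is the restriction mentioned above; the subscript form has no such
limit.  As far as I know, later Python releases (3.6 onward) accept m[1]
and m['foo'] on match objects directly, though not slices.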

>> 5) Add a well-formed, python-specific comment modifier, 
>> e.g. (?P#...);  
>
> [handles parens in comments without turning on verbose, but is slower]
>
> Why?  It adds another incompatibility, so it has to be very useful or 
> clear.  What exactly is the advantage over just turning on verbose?

Well, Larry Wall and Guido agreed long ago that we, the python
community, own all expressions of the form (?P...) and although it would
be my preference to make (?#...) understand parenthesis nesting,
changing the logic behind THAT would make python non-standard.  So as
far as any conflicting design, we needn't worry.

As for speed, this all occurs in the parser and does not affect the
compiler or engine.  It occurs only after a (?P has been read and then
only as the last check before failure, so it should not be much slower
except when the expression is invalid.  The actual execution time to
find the closing parenthesis of (?P#...) is a bit slower than that for
(?#...) but not by much.

Verbose is generally a good idea for anything more than a trivial
Regular Expression.  However, it can have overhead if not included as
the first flag: an expression is always checked for verbose
post-compilation and if it is encountered, the expression is compiled a
second time, which is somewhat wasteful.  But the reason I like
(?P#...) over (?#...) is that I think people would tend to assume:

r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't.

That expression only matches "He ls)llo", so I created the (?P#...) to
make the comment match type more intuitive:

r'He(?P# 2 (TWO) ls)llo' matches "Hello".
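
The behaviour of the existing (?#...) comment can be checked with the
stock re module.  The sketch below uses a made-up helper, try_match,
that reports a rejected pattern as None instead of raising.  Note that
(?P#...) is only a proposal in this thread and is not tried here.

```python
import re


def try_match(pattern, text):
    """Return True/False for a full match, or None if the pattern is rejected."""
    try:
        return re.fullmatch(pattern, text) is not None
    except re.error:
        return None


# A comment without nested parentheses is skipped as one would expect.
simple = try_match(r"He(?# two ls)llo", "Hello")

# With nested parentheses the comment ends at the first ")", so the rest
# of the intended comment is treated as part of the pattern.
nested = try_match(r"He(?# 2 (TWO) ls)llo", "Hello")

print(simple, nested)
```

Either way the second pattern does not match "Hello".  On the CPython
versions I am aware of, the leftover ")" is reported as an unbalanced
parenthesis rather than matched literally, so the helper returns None
for it.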

>> 9) C-Engine speed-ups. ...
>> a number of Macros are being eliminated where appropriate.
>
> Be careful on those, particular on str/unicode and different
> compile options.

Will do; thanks for the advice!  I have only observed the UNICODE flag
controlling whether certain code is used (besides the ones I've added)
and have tried to stay true to that when I encounter it.  Mind you,
unless I can get my extra 10% it's unlikely I'd actually go with item 9
here, even if it is easier to read IMHO.  However, I want to run the new
engine proposal through gprof to see if I can track down some bottlenecks.

At some point, I hope to ge
