[issue10139] regex A|B : both A and B match, but B is wrongly preferred

2010-10-18 Thread Christos Georgiou

New submission from Χρήστος Γεωργίου (Christos Georgiou):

This is based on this StackOverflow answer:
http://stackoverflow.com/questions/3957164/3963443#3963443 . It also applies to
Python 2.6.

Searching for a regular expression that satisfies the mentioned SO question (a 
regular expression that matches strings with an initial A and/or final Z and 
returns everything except said initial A and final Z), I discovered something 
that I consider a bug. I've tried to thoroughly verify that this is not a 
PEBCAK before reporting the issue here.

Given:

>>> import re
>>> text= 'A***Z'

then:

>>> re.compile('(?<=^A).*(?=Z$)').search(text).group(0) # regex_1
'***'
>>> re.compile('(?<=^A).*').search(text).group(0) # regex_2
'***Z'
>>> re.compile('.*(?=Z$)').search(text).group(0) # regex_3
'A***'
>>> re.compile('(?<=^A).*(?=Z$)|(?<=^A).*').search(text).group(0) # regex_1|regex_2
'***'
>>> re.compile('(?<=^A).*(?=Z$)|.*(?=Z$)').search(text).group(0) # regex_1|regex_3
'A***'
>>> re.compile('(?<=^A).*|.*(?=Z$)').search(text).group(0) # regex_2|regex_3
'A***'
>>> re.compile('(?<=^A).*(?=Z$)|(?<=^A).*|.*(?=Z$)').search(text).group(0) # regex_1|regex_2|regex_3
'A***'

regex_1 returns '***'. Based on the documentation 
(http://docs.python.org/py3k/library/re.html#regular-expression-syntax), I 
assert that, likewise, '***' should be returned by:

regex_1|regex_2
regex_1|regex_3
regex_1|regex_2|regex_3

And yet, regex_3 ( ".*(?=Z$)" ) seems to take precedence over both regex_1 and 
regex_2, even though it's the last alternative.

This happens even if I substitute "(?:regex_n)" for every "regex_n", so it's
not a matter of operator precedence.

I really hope that this is a PEBCAK; if it is, I apologize for any time lost
on the issue by anyone; but I really don't think it is.

--
components: Regular Expressions
messages: 119088
nosy: tzot
priority: normal
severity: normal
status: open
title: regex A|B : both A and B match, but B is wrongly preferred
type: behavior
versions: Python 3.1

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10139] regex A|B : both A and B match, but B is wrongly preferred

2010-10-18 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

For completeness' sake, I also provide the "(?:regex_n)" results:

>>> text= 'A***Z'
>>> re.compile('(?:(?<=^A).*(?=Z$))').search(text).group(0) # regex_1
'***'
>>> re.compile('(?:(?<=^A).*)').search(text).group(0) # regex_2
'***Z'
>>> re.compile('(?:.*(?=Z$))').search(text).group(0) # regex_3
'A***'
>>> re.compile('(?:(?<=^A).*(?=Z$))|(?:(?<=^A).*)').search(text).group(0) # regex_1|regex_2
'***'
>>> re.compile('(?:(?<=^A).*(?=Z$))|(?:.*(?=Z$))').search(text).group(0) # regex_1|regex_3
'A***'
>>> re.compile('(?:(?<=^A).*)|(?:.*(?=Z$))').search(text).group(0) # regex_2|regex_3
'A***'
>>> re.compile('(?:(?<=^A).*(?=Z$))|(?:(?<=^A).*)|(?:.*(?=Z$))').search(text).group(0) # regex_1|regex_2|regex_3
'A***'




[issue10139] regex A|B : both A and B match, but B is wrongly preferred

2010-10-19 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

As I see it, it's more like:

>>> re.search('a.*c|a.*|.*c', 'abc').group()

producing 'bc' instead of 'abc'. Substitute "(?<=^A)" for "a" and "(?=Z$)" for 
"c" in the pattern above.

In your example, the first part ('bc') does not match the whole string ('abc'). 
In my example, the first part ('(?<=^A).*(?=Z$)') matches the whole string 
('A***Z').




[issue10139] regex A|B : both A and B match, but B is wrongly preferred

2010-10-19 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

Georg, please re-open it. Focus on the difference between example 
regex_1|regex_2 (both matching; regex_1 is used as it should be), and 
regex_1|regex_3 (both matching; regex_3 is used incorrectly).




[issue10139] regex A|B : both A and B match, but B is wrongly preferred

2010-10-19 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

No, my mistake; you were right to close it.

The more explicit version of the explanation: both regex_1 and regex_2
actually start matching at index 1, while regex_3 starts matching at index 0.
Since search() returns the leftmost match, regex_3 wins whenever it is one of
the alternatives.
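
To make the explanation concrete, here is a small sketch that prints where
each pattern starts matching. The dictionary and variable names are mine, for
illustration only; the patterns and the text are the ones from the report.

```python
import re

text = 'A***Z'
patterns = {
    'regex_1': r'(?<=^A).*(?=Z$)',
    'regex_2': r'(?<=^A).*',
    'regex_3': r'.*(?=Z$)',
}

# span() gives (start, end) of the match inside text.
spans = {name: re.search(p, text).span() for name, p in patterns.items()}
for name, span in spans.items():
    print(name, span)

# The lookbehind cannot succeed at index 0, so at that position only
# regex_3 matches; the alternation never gets to try index 1.
combined = re.search('|'.join(patterns.values()), text)
print(combined.span(), combined.group(0))
```

The lookbehind consumes no characters, so a match for regex_1 or regex_2 can
only begin after the 'A', one position later than regex_3.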




[issue10160] operator.attrgetter slower than lambda after adding dotted names ability

2010-10-20 Thread Christos Georgiou

New submission from Χρήστος Γεωργίου (Christos Georgiou):

(Discovered in this StackOverflow answer:
http://stackoverflow.com/questions/3940518/3942509#3942509 ; check the comments
too)

operator.attrgetter in its simplest form (i.e. with a single non-dotted name) 
needs more time to execute than an equivalent lambda expression.

Attached file so3940518.py runs a simple benchmark comparing: a list 
comprehension of plain attribute access; attrgetter; and lambda. I will append 
sample benchmark times at the end of the comment.

Browsing Modules/operator.c, I noticed that the dotted_getattr function was 
using PyUnicode_Check and (possibly) splitting on dots on *every* call of the 
attrgetter, which I thought to be most inefficient.

I changed the py3k-daily-snapshot source to make the PyUnicode_Check calls in
the attrgetter_new function; also, I modified the algorithm to pre-parse the
operator.attrgetter arguments for possible dots in the names, in order for the
dotted_getattr function to become speedier.

The only “drawback” is that now operator.attrgetter raises a TypeError on 
creation, not on subsequent calls of the attrgetter object; this shouldn't be a 
compatibility problem. However, I obviously had to update both 
Doc/library/operator.rst and Lib/test/test_operator.py .
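
A minimal sketch of the behaviour change, assuming an interpreter that checks
argument types at creation as described here (current CPython 3 releases do).
The argument 42 is just an arbitrary non-string used for illustration.

```python
import operator

try:
    # No object is involved yet; only the getter is being created.
    getter = operator.attrgetter(42)
except TypeError as exc:
    raised_at = 'creation'
    print('TypeError at creation:', exc)
else:
    # Older behaviour: the error would only show up on a later call.
    raised_at = 'not at creation'
    print('attrgetter was created without an error')
```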

I am not sure whether I should attach a zip/tar file with both the attachments 
(the sample benchmark and the diff); so I'll attach the diff in a further 
comment.

On the Ubuntu server 9.10 where I made the changes, I ran the so3940518.py 
sample benchmark before and after the changes.

Run before the changes (third column is seconds, less is better):

list comp 0.40925 100
map attrgetter 1.3897 100
map lambda 1.0098 100

Run after the changes:

list comp 0.40036 100
map attrgetter 0.5196 100
map lambda 0.96 100

--
assignee: d...@python
components: Documentation, Library (Lib), Tests
files: so3940518.py
messages: 119247
nosy: d...@python, tzot
priority: normal
severity: normal
status: open
title: operator.attrgetter slower than lambda after adding dotted names ability
versions: Python 3.2
Added file: http://bugs.python.org/file19310/so3940518.py




[issue10160] operator.attrgetter slower than lambda after adding dotted names ability

2010-10-20 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

Here comes the diff to Modules/operator.c, Doc/library/operator.rst and
Lib/test/test_operator.py . As far as I could check, there are no leaks, but a
more experienced eye in core development could not hurt. Also, obviously
test_operator.py passes all tests.

Should this be accepted, I believe it should be backported to 2.7 (at least). I 
can do that, just let me know.

--
keywords: +patch
Added file: http://bugs.python.org/file19312/issue10160.diff




[issue10160] operator.attrgetter slower than lambda after adding dotted names ability

2010-10-20 Thread Christos Georgiou

Changes by Χρήστος Γεωργίου (Christos Georgiou):


Removed file: http://bugs.python.org/file19312/issue10160.diff




[issue10160] operator.attrgetter slower than lambda after adding dotted names ability

2010-10-20 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

Newer version of the diff, since I forgot some "if(0) fprintf" debug calls that 
shouldn't be there.

--
Added file: http://bugs.python.org/file19313/issue10160.diff




[issue10160] operator.attrgetter slower than lambda after adding dotted names ability

2010-10-20 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

An explanation of the changes.

The old code kept the operator.attrgetter arguments in the ag->attr member. If
the argument count (ag->nattrs) was 1, the single argument was kept; if more
than 1, a tuple of the original arguments was kept.

On every attrgetter_call call, if ag->nattrs was 1, dotted_getattr was called
with the plain ag->attr as attribute name; if more than 1, dotted_getattr was
called for every one of the original arguments.

Now, ag->attr is always a tuple, containing either dotless strings or tuples of 
dotless strings:

operator.attrgetter("name1", "name2.name3", "name4")

stores ("name1", ("name2", "name3"), "name4") in ag->attr.

dotted_getattr accordingly chooses based on type (either str or tuple, ensured 
by attrgetter_new) whether to do a single access or a recursive one.
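
The storage scheme can be modelled in pure Python roughly as below. This is
only an illustration of the idea, not the C code from the patch: the class
name PreparsedAttrGetter and its error messages are made up, and the dotted
lookup uses a loop where the patch is described as recursive.

```python
from types import SimpleNamespace


class PreparsedAttrGetter:
    """Pure-Python model of the pre-parsed storage (hypothetical name)."""

    def __init__(self, *names):
        if not names:
            raise TypeError('expected at least one attribute name')
        parsed = []
        for name in names:
            if not isinstance(name, str):
                # The type check runs once, at creation time.
                raise TypeError('attribute name must be a string')
            parts = tuple(name.split('.'))
            # Dotless names stay plain strings; dotted ones become tuples.
            parsed.append(parts[0] if len(parts) == 1 else parts)
        self.attr = tuple(parsed)

    @staticmethod
    def _dotted_getattr(obj, name):
        if isinstance(name, str):
            return getattr(obj, name)  # single access
        for part in name:              # chain of accesses, one per part
            obj = getattr(obj, part)
        return obj

    def __call__(self, obj):
        values = tuple(self._dotted_getattr(obj, n) for n in self.attr)
        return values[0] if len(values) == 1 else values


getter = PreparsedAttrGetter("name1", "name2.name3", "name4")
print(getter.attr)

obj = SimpleNamespace(name1=1, name2=SimpleNamespace(name3=2), name4=3)
result = getter(obj)
print(result)
```

The point of the design is that every call only does getattr lookups; the
string checks and the splitting on dots have already happened in the
constructor.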




[issue10160] operator.attrgetter slower than lambda after adding dotted names ability

2010-10-20 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

Modules/operator.c grows by ~70 lines, most of it the setup code for ag->attr;
also, I loop twice over the args of attrgetter_new, preferring fast code that
runs once per attrgetter creation over allocating temporary data.

Alex's suggestion to use Python-level functions to shorten the code of
attrgetter_new could obviously work to reduce the source lines. I don't know
how quickly I could produce such a version if requested, though.

However attrgetter_new sets up the data, I would suggest keeping the general
logic change, i.e. do the set-up in attrgetter_new and keep a thinner
dotted_getattr, since it avoids running the same checks and splitting over and
over again for every attrgetter_call invocation.




[issue10160] operator.attrgetter slower than lambda after adding dotted names ability

2010-10-22 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

A newer version of the patch with the following changes:

- single loop in the ag->attr setup phase of attrgetter_new; interning of the 
stored attribute names
- added two more tests of invalid attrgetter parameters (".attr", "attr.")

--
Added file: http://bugs.python.org/file19339/issue10160-2.diff




[issue10160] operator.attrgetter slower than lambda after adding dotted names ability

2010-10-30 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

Thank you very much, Antoine, for your review. My comments in reply:

- the dead code: it's not dead, IIRC it ensures that at least one argument is 
given, otherwise it raises an exception.

- PyUnicode_GET_SIZE: you're right. The previous patch didn't have this 
problem, because there were two loops: the first one made sure in advance that 
all arguments are PyUnicode.

- the false comment: right again. A remnant of the first patch.

- dotted_getattr and references: right! I should have paid closer attention to
what Raymond's initial loop did.

Attached a corrected version of the patch according to Antoine's comments.

--
Added file: http://bugs.python.org/file19440/issue10160.diff




[issue1602] windows console doesn't print utf8 (Py30a2)

2010-11-04 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

If you want any kind of Unicode output in the console, the font must be an 
“official” MS console TTF (“official” as defined by the Windows version); I 
believe only Lucida Console and Consolas are the ones with all MS private 
settings turned on inside the font file.




[issue1602] windows console doesn't print utf8 (Py30a2)

2009-09-18 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

Another note:
if one creates a dummy Stream object (having a softspace attribute and a
write method that writes using os.write, as in
http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/1432462#1432462
) to replace sys.stdout and sys.stderr, then writes occur correctly,
without issues. Prerequisites:
chcp 65001, the Lucida Console font, and cp65001 as an alias for UTF-8 in
encodings/aliases.py.
This is Python 2.5.4 on Windows.
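
A rough sketch of such a stream object, written for Python 3 rather than the
Python 2.5.4 mentioned above, and not the code from the linked answer. The
class name RawUTF8Stream is made up; the demo writes into a pipe so that it
can run anywhere, whereas replacing sys.stdout would use file descriptor 1.

```python
import os


class RawUTF8Stream:
    """File-like object that writes UTF-8 bytes straight to a descriptor."""

    def __init__(self, fd):
        self.fd = fd
        self.softspace = 0  # only consulted by Python 2's print statement

    def write(self, text):
        data = text.encode('utf-8')
        while data:
            # os.write may write fewer bytes than requested.
            written = os.write(self.fd, data)
            data = data[written:]

    def flush(self):
        pass  # nothing is buffered


# Demo on a pipe; sys.stdout = RawUTF8Stream(1) would target the console.
read_fd, write_fd = os.pipe()
stream = RawUTF8Stream(write_fd)
stream.write('\u043a\u043e\u0448\u043a\u0430 \u65e5\u672c\u56fd\n')
os.close(write_fd)
received = os.read(read_fd, 1024)
os.close(read_fd)
print(received)
```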




[issue6058] Add cp65001 to encodings/aliases.py

2009-12-22 Thread Christos Georgiou

Χρήστος Γεωργίου (Christos Georgiou) added the comment:

Re Martin's question, I can offer the indirect wisdom of Michael Kaplan
in this blog post:

http://blogs.msdn.com/michkap/archive/2008/03/18/8306597.aspx

where he mentions that the easiest way to output Unicode text in the
Windows console is:

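#include <fcntl.h>  /* _O_U16TEXT */
#include <io.h>     /* _setmode */
#include <stdio.h>  /* _fileno, stdout, wprintf */

int main(void) {
    /* Switch stdout to UTF-16 text mode (Microsoft C runtime only). */
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
    return 0;
}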

_setmode being the special call needed.

I haven't tested with any _O_U8TEXT (if such a thing exists), and I don't do
Windows anymore, so I can't provide a patch.

It also seems that Python, when stdin/stdout/stderr is under the control of
a Windows console, doesn't use plain *printf functions. The example code
I offered in one of the other issues (a dummy stdout doing a plain .write as
UTF-8) runs and displays fine.
