[issue1693050] \w not helpful for non-Roman scripts

2018-03-14 Thread Terry J. Reedy
Terry J. Reedy added the comment: Whatever I may have said before, I favor supporting the Unicode standard for \w, which is related to the standard for identifiers. This is one of 2 issues about \w being defined too narrowly. I am somewhat arbitrarily closing this as a duplicate of #12731 (f

[issue1693050] \w not helpful for non-Roman scripts

2014-02-03 Thread Mark Lawrence
Changes by Mark Lawrence: -- nosy: -BreamoreBoy

[issue1693050] \w not helpful for non-Roman scripts

2013-05-29 Thread STINNER Victor
STINNER Victor added the comment: Let's see Modules/_sre.c:

    #define SRE_UNI_IS_ALNUM(ch) Py_UNICODE_ISALNUM(ch)
    #define SRE_UNI_IS_WORD(ch) (SRE_UNI_IS_ALNUM(ch) || (ch) == '_')

    >>> [ch.isalpha() for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
    [True, False, True, False, True, False]
    >>> import
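
The two macros quoted above can be approximated in pure Python to see where a `\w+` match would stop. This is only a sketch: it assumes `str.isalnum()` is an adequate stand-in for `Py_UNICODE_ISALNUM`, and the helper names `sre_uni_is_word` and `leading_word_prefix` are hypothetical, not part of the re module.

```python
# Pure-Python approximation of the two C macros quoted above.
# Assumption: str.isalnum() stands in for Py_UNICODE_ISALNUM.

def sre_uni_is_word(ch: str) -> bool:
    """Mimic SRE_UNI_IS_WORD: alphanumeric or underscore."""
    return ch.isalnum() or ch == "_"


def leading_word_prefix(text: str) -> str:
    """Return the longest prefix made only of 'word' characters (hypothetical helper)."""
    end = 0
    for ch in text:
        if not sre_uni_is_word(ch):
            break
        end += 1
    return text[:end]


word = "\u0939\u093f\u0928\u094d\u0926\u0940"
flags = [sre_uni_is_word(ch) for ch in word]
prefix = leading_word_prefix(word)

print(flags)
print(ascii(prefix))
```

Under this definition the vowel signs and the virama are not word characters, so the prefix ends after the first code point.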

[issue1693050] \w not helpful for non-Roman scripts

2013-05-29 Thread Matthew Barnett
Matthew Barnett added the comment: UTF-16 has nothing to do with it; that's just an encoding (a pair of them actually, UTF-16LE and UTF-16BE). And I don't know why you thought I was using findall in msg190100 when the examples were using match! :-)

[issue1693050] \w not helpful for non-Roman scripts

2013-05-29 Thread Jeffrey C. Jacobs
Jeffrey C. Jacobs added the comment: Thanks Matthew, and sorry to put you through more work; I just wanted to verify exactly which Unicode code points (UTF-16, I take it) were being used, to check whether the Unicode standard expects them to be treated as unique words or as single letters within a word. Sanskrit

[issue1693050] \w not helpful for non-Roman scripts

2013-05-29 Thread Matthew Barnett
Matthew Barnett added the comment: You could've obtained it from msg76556 or msg190100:

    >>> print(ascii('हिन्दी'))
    '\u0939\u093f\u0928\u094d\u0926\u0940'
    >>> import re, regex
    >>> print(ascii(re.match(r"\w+", '\u0939\u093f\u0928\u094d\u0926\u0940').group()))
    '\u0939'
    >>> print(ascii(regex.ma

[issue1693050] \w not helpful for non-Roman scripts

2013-05-28 Thread Jeffrey C. Jacobs
Jeffrey C. Jacobs added the comment: Maybe you could show us the byte-for-byte hex of the string you're testing, so we can examine whether it really contains a code point intended as a word boundary or just a code point that begins a new character.

[issue1693050] \w not helpful for non-Roman scripts

2013-05-28 Thread Matthew Barnett
Matthew Barnett added the comment: I'm not sure what you're saying. The re module in Python 3.3 matches only the first codepoint, treating the second codepoint as not part of a word, whereas the regex module matches all 6 codepoints, treating them all as part of a single word.
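
To see why the two modules can disagree, it may help to list the Unicode general category of each of the six code points. The sketch below uses only the standard `unicodedata` module; the string is the one given elsewhere in this thread.

```python
import unicodedata

# The six code points of the word discussed in this thread.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# General category of each code point (Lo = letter, Mc/Mn = combining marks).
categories = [unicodedata.category(ch) for ch in word]

for ch, cat in zip(word, categories):
    print(f"U+{ord(ch):04X}  {cat}  {unicodedata.name(ch)}")
```

Letters (Lo) alternate with combining marks (Mc, Mn), so a `\w` that accepts only alphanumerics stops at the second code point, while one that also accepts marks covers the whole string.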

[issue1693050] \w not helpful for non-Roman scripts

2013-05-28 Thread Jeffrey C. Jacobs
Jeffrey C. Jacobs added the comment: Matthew, I think that is considered a single word in Sanskrit or Thai so Python 3.x is correct. In this case you've written the Sanskrit word for Hindi.

[issue1693050] \w not helpful for non-Roman scripts

2013-05-26 Thread Terry J. Reedy
Changes by Terry J. Reedy: -- versions: +Python 3.3, Python 3.4 -Python 3.1

[issue1693050] \w not helpful for non-Roman scripts

2013-05-26 Thread Matthew Barnett
Matthew Barnett added the comment: I had to check what re does in Python 3.3:

    >>> print(len(re.match(r'\w+', 'हिन्दी').group()))
    1

Regex does this:

    >>> print(len(regex.match(r'\w+', 'हिन्दी').group()))
    6

[issue1693050] \w not helpful for non-Roman scripts

2013-05-26 Thread Mark Lawrence
Mark Lawrence added the comment: Am I correct in saying that this must stay open because it targets the re module, even though, as noted in msg81221, it is fixed in the new regex module? -- nosy: +BreamoreBoy

[issue1693050] \w not helpful for non-Roman scripts

2010-03-30 Thread Shashwat Anand
Changes by Shashwat Anand: -- nosy: +l0nwlf

[issue1693050] \w not helpful for non-Roman scripts

2010-03-05 Thread STINNER Victor
Changes by STINNER Victor: -- nosy: +haypo

[issue1693050] \w not helpful for non-Roman scripts

2009-05-12 Thread Ezio Melotti
Changes by Ezio Melotti: -- nosy: +ezio.melotti

[issue1693050] \w not helpful for non-Roman scripts

2009-02-05 Thread Matthew Barnett
Matthew Barnett added the comment: In issue #2636 I'm using the following: Alpha is Ll, Lo, Lt, Lu. Digit is Nd. Word is Ll, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc. These are what are specified at http://www.regular-expressions.info/posixbrackets.html -- nosy: +mrabarnett
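
The category lists in this comment translate directly into set-membership tests. The following is a minimal sketch built on `unicodedata.category`, not the code from issue #2636; the names `is_word_char` and `match_word_prefix` are hypothetical.

```python
import unicodedata
from itertools import takewhile

# General-category sets exactly as listed in the comment above.
ALPHA = frozenset({"Ll", "Lo", "Lt", "Lu"})
DIGIT = frozenset({"Nd"})
WORD = ALPHA | DIGIT | {"Mc", "Me", "Mn", "Nl", "No", "Pc"}


def is_word_char(ch: str) -> bool:
    """True if the code point's general category is in the Word set."""
    return unicodedata.category(ch) in WORD


def match_word_prefix(text: str) -> str:
    """Longest leading run of Word characters (hypothetical helper)."""
    return "".join(takewhile(is_word_char, text))


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
matched = match_word_prefix(hindi + " rest")
print(len(matched))
```

Because Mc and Mn are in the Word set, the prefix covers all six code points of the example word and stops at the space.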

[issue1693050] \w not helpful for non-Roman scripts

2008-11-28 Thread Martin v. Löwis
Martin v. Löwis added the comment: Unicode TR#18 defines \w as a shorthand for \p{alpha}, \p{gc=Mark}, \p{digit}, \p{gc=Connector_Punctuation}, which would include all marks. We should recursively check whether we follow the recommendation (e.g. \p{alpha} refers to all character
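
The TR#18 shorthand can be roughly approximated with the standard library to split text into word runs. This is a sketch under a stated assumption: `unicodedata` does not expose the Unicode Alphabetic property, so `str.isalpha()` is used as a narrower stand-in for `\p{alpha}`. The names `is_tr18_word_char` and `word_runs` are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_tr18_word_char(ch: str) -> bool:
    """Approximate the TR#18 \\w shorthand for a single code point."""
    cat = unicodedata.category(ch)
    return (
        ch.isalpha()             # assumption: stand-in for \p{alpha}
        or cat.startswith("M")   # \p{gc=Mark}: Mn, Mc, Me
        or cat == "Nd"           # \p{digit}
        or cat == "Pc"           # \p{gc=Connector_Punctuation}
    )


def word_runs(text: str) -> list[str]:
    """Split text into maximal runs of word characters."""
    return [
        "".join(group)
        for is_word, group in groupby(text, key=is_tr18_word_char)
        if is_word
    ]


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
runs = word_runs(hindi + " and snake_case, x2")
print([ascii(run) for run in runs])
```

Including the Mark categories keeps the Devanagari word in one run instead of breaking it at each vowel sign.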

[issue1693050] \w not helpful for non-Roman scripts

2008-11-28 Thread Terry J. Reedy
Terry J. Reedy added the comment: Vowel 'marks' are condensed vowel characters and are very much part of words and do not separate words. Python3 properly includes Mn and Mc as identifier characters. http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-
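
The point about identifiers can be checked with `str.isidentifier()`, which applies Python 3's identifier rules to a string. A small sketch, using the example word from elsewhere in this thread:

```python
# The example word: letters interleaved with combining marks (Mc, Mn).
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# The whole word is accepted as an identifier, marks included.
whole_word_ok = word.isidentifier()

# A combining mark may continue an identifier but cannot start one.
mark_alone_ok = "\u093f".isidentifier()
letter_plus_mark_ok = "\u0939\u093f".isidentifier()

print(whole_word_ok, mark_alone_ok, letter_plus_mark_ok)
```

So the identifier rules already treat these marks as part of a name, which is the behaviour being requested for `\w`.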

[issue1693050] \w not helpful for non-Roman scripts

2008-09-28 Thread Jeffrey C. Jacobs
Changes by Jeffrey C. Jacobs: -- nosy: +timehorse versions: +Python 2.7 -Python 2.4

[issue1693050] \w not helpful for non-Roman scripts

2008-04-24 Thread Russ Cox
Changes by Russ Cox: -- nosy: +rsc