"Andreas Pfrengle" <[email protected]> wrote in message
news:[email protected]...
On 12 Apr., 02:31, "Mark Tolonen" <[email protected]> wrote:
"Andreas" <[email protected]> wrote in message
news:[email protected]...
> Hello,
> I'd like to create a regex that captures any unicode character, but
> not the underscore and the digits 0-9. "^(?u)\w$" captures them also.
> Is there a possibility to restrict an expression like "\w" to "\w
> without [0-9_]"?
'(?u)[^\W0-9_]' removes 0-9_ from \w.
-Mark
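
To see what that pattern accepts, here is a minimal sketch, assuming Python 3 (where Unicode matching is already the default for str patterns, so the (?u) is redundant but harmless). The name is_letter_like is made up for illustration.

```python
import re

# Double negation: "not (a non-word character, 0-9, or underscore)",
# i.e. everything \w matches except ASCII digits and "_".
LETTER_LIKE = re.compile(r'(?u)\A[^\W0-9_]\Z')

def is_letter_like(ch):
    """Return True if the single character ch passes the pattern (hypothetical helper)."""
    return LETTER_LIKE.match(ch) is not None

print(is_letter_like('a'), is_letter_like('\u00e9'))   # ASCII and accented letters
print(is_letter_like('5'), is_letter_like('_'))        # excluded explicitly
print(is_letter_like('\u0663'))                        # ARABIC-INDIC DIGIT THREE
```

Only the ASCII digits are excluded, so a non-ASCII digit such as U+0663 still gets through.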
Hello Mark,
haven't tried it yet, but it looks good!
@John: Sorry for being imprecise, I meant *letters*, not *characters*,
so requirement 2 fits my needs.
Note that \w matches alphanumeric Unicode characters, not just letters. If you
only want letters, keep in mind that superscripts (¹²³), fractions (¼½¾), and
other characters are also numbers to Unicode. See the unicodedata.category
function and
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values.
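
As a quick illustration of those categories, a small sketch (assuming Python 3; the sample characters are my own choice) that looks up the general category of a few characters:

```python
import unicodedata

# Sample characters: two letters, an ASCII digit, a superscript,
# a fraction, and the underscore.
samples = ['a', '\u00e9', '5', '\u00b2', '\u00bd', '_']

# Map each character to its two-letter general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch in samples:
    print(repr(ch), categories[ch])
```

Categories starting with 'L' are letters and those starting with 'N' are numbers, which is why the superscript and the fraction count as numbers even though they are not 0-9.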
If you only want letters as considered by the Unicode standard, something
like this would give you only Unicode letters (it could be optimized to list
ranges of characters):
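import unicodedata as ud
# Character class listing every BMP code point whose category starts with 'L'
u'(?u)[' + u''.join(unichr(n) for n in xrange(65536) if
ud.category(unichr(n))[0]=='L') + u']'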
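
The range optimization mentioned above could look roughly like this. It is only a sketch, assuming Python 3 (chr and range instead of unichr and xrange); letter_class and its limit parameter are names invented for the example.

```python
import re
import unicodedata

def letter_class(limit=0x10000):
    """Build a regex character class covering category-L code points below limit."""
    ranges = []
    start = None
    # Run one step past the end so a run of letters ending at limit - 1 is flushed.
    for n in range(limit + 1):
        is_letter = n < limit and unicodedata.category(chr(n)).startswith('L')
        if is_letter and start is None:
            start = n                       # a new run of letters begins
        elif not is_letter and start is not None:
            ranges.append((start, n - 1))   # the run ended at the previous code point
            start = None
    # No escaping is needed: ']', '\\', '^' and '-' are not letters,
    # so they never appear as a range endpoint.
    parts = [chr(lo) if lo == hi else chr(lo) + '-' + chr(hi) for lo, hi in ranges]
    return '[' + ''.join(parts) + ']'

UNICODE_LETTER = re.compile(letter_class())
print(bool(UNICODE_LETTER.fullmatch('\u00e9')))   # a letter
print(bool(UNICODE_LETTER.fullmatch('\u00b2')))   # superscript two, a number
```

The default limit only covers the Basic Multilingual Plane, like the one-liner above; passing 0x110000 would cover every code point at the cost of a longer scan.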
Hmm, maybe Python 3.0 with its default Unicode strings needs a regex
extension to specify the Unicode category to match.
-Mark