Re: [Tutor] regex: matching unicode

2012-12-24 Thread eryksun
On Mon, Dec 24, 2012 at 2:51 AM, Albert-Jan Roskam wrote:
> First, check if the first character is a (unicode) letter
You can use unicode.isalpha, with a caveat. On a narrow build isalpha fails for supplementary planes. That's about 50% of all alphabetic characters, +/- depending on the version
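A minimal sketch of the "is the first character a letter" check discussed above, assuming Python 3.3 or later, where strings are indexed by code point and the narrow-build caveat no longer applies. The helper name starts_with_letter is hypothetical and is not taken from the thread; it tests the Unicode general category rather than calling isalpha directly.

```python
import unicodedata


def starts_with_letter(text):
    """Return True if the first code point of text is a Unicode letter.

    Hypothetical helper for illustration: a character counts as a letter
    when its general category starts with "L" (Lu, Ll, Lt, Lm, Lo).
    """
    if not text:
        return False
    return unicodedata.category(text[0]).startswith("L")
```

On a Python 2 narrow build, text[0] of a string that starts with a supplementary-plane character would be a lone surrogate, which is why the check can fail there.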

Re: [Tutor] regex: matching unicode

2012-12-23 Thread Albert-Jan Roskam
>> Is the code below the only/shortest way to match unicode characters? I would like to match whatever is defined as a character in the unicode reference database. So letters in the broadest sense of the word, but not digits, underscore or whitespace. Until just now, I was convinced that th

Re: [Tutor] regex: matching unicode

2012-12-23 Thread eryksun
On Sat, Dec 22, 2012 at 11:12 PM, Steven D'Aprano wrote:
> No. You could install a more Unicode-aware regex engine, and use it instead of Python's re module, where Unicode support is at best only partial.
>
> Try this one:
>
> http://pypi.python.org/pypi/regex
Looking over the old docs, I cou

Re: [Tutor] regex: matching unicode

2012-12-22 Thread Steven D'Aprano
On 23/12/12 07:53, Albert-Jan Roskam wrote:
> Hi, Is the code below the only/shortest way to match unicode characters?
No. You could install a more Unicode-aware regex engine, and use it instead of Python's re module, where Unicode support is at best only partial. Try this one: http://pypi.py

Re: [Tutor] regex: matching unicode

2012-12-22 Thread Hugo Arts
On Sat, Dec 22, 2012 at 9:53 PM, Albert-Jan Roskam wrote:
> Hi,
>
> Is the code below the only/shortest way to match unicode characters? I would like to match whatever is defined as a character in the unicode reference database. So letters in the broadest sense of the word, but not digits,

[Tutor] regex: matching unicode

2012-12-22 Thread Albert-Jan Roskam
Hi, Is the code below the only/shortest way to match unicode characters? I would like to match whatever is defined as a character in the unicode reference database. So letters in the broadest sense of the word, but not digits, underscore or whitespace. Until just now, I was convinced that the r
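One common way to approximate "letters, but not digits, underscore or whitespace" with the standard re module is a negated character class. The sketch below assumes Python 3, where str patterns are Unicode-aware by default; it is not the code from the original post, and the names LETTER_RUN and find_letter_runs are hypothetical.

```python
import re

# "[^\W\d_]" reads as: a word character (\w) that is neither a
# decimal digit (\d) nor an underscore. Whitespace is already
# excluded because it is not a word character.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def find_letter_runs(text):
    """Return all maximal runs of letter-like characters in text."""
    return LETTER_RUN.findall(text)
```

This is an approximation: \w may also admit some numeric characters that \d does not cover, so where exact Unicode letter categories matter, a per-character unicodedata.category check or the regex module suggested in the replies is the stricter route.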