Josep M. Fontana wrote:
[...]
I guess this is because the character encoding was not specified, but
accented characters in the languages I'm dealing with should be
treated as a-z or A-Z, shouldn't they?
No. a-z means a-z. If you want the localized set of alphanumeric
characters, you need \w.
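
To make the difference concrete, here is a minimal sketch. It assumes
Python 3, where str patterns are Unicode-aware by default (on Python 2
you would need a unicode string plus the re.UNICODE flag, or re.LOCALE
for locale-dependent matching). The sample sentence is made up for
illustration. Note that \w also matches digits and the underscore, not
only letters.

```python
import re

# Made-up sample text with accented letters and an inverted question mark.
text = "¿de dónde són els àngels?"

# [a-zA-Z] is a literal ASCII range, so accented letters break the words apart.
ascii_only = re.findall(r"[a-zA-Z]+", text)

# \w matches Unicode word characters (letters, digits, underscore).
# re.UNICODE is already the default for str patterns in Python 3;
# it is passed here only to make the intent explicit.
unicode_words = re.findall(r"\w+", text, re.UNICODE)

print(ascii_only)
print(unicode_words)
```

The first list contains fragments such as 'd' and 'nde' instead of
'dónde', while the second keeps each accented word whole.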
On Sun, Nov 28, 2010 at 6:14 PM, Steven D'Aprano wrote:
> Have you considered just using the isalnum() method?
>
> >>> '¿de'.isalnum()
> False
Mmm. No, I didn't consider it because I didn't even know such a method
existed. This can turn out to be very handy but I don't think it would
help me at t
Sorry, something went wrong and my message got sent before I could
finish it. I'll try again.
On Tue, Nov 30, 2010 at 2:19 PM, Josep M. Fontana
wrote:
> On Sun, Nov 28, 2010 at 6:03 PM, Evert Rol wrote:
>
>> -
>> with open('output_tokens.txt', 'a') as out_tokens:
>>with open('text
Josep M. Fontana wrote:
I'm trying to use regular expressions to extract strings that match
certain patterns in a collection of texts. Basically these texts are
edited versions of medieval manuscripts that use certain symbols to
mark information that is useful for philologists.
I'm interested in isolating words that have
> Here's what I do. This was just a first attempt to get strings
> starting with a non-alphanumeric symbol. If this had worked, I would
> have continued to build the regular expression to get words with
> non-alphanumeric symbols in the middle and at the end. Alas, even this
> first attempt did
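
One possible way to write that first step, as a sketch only: the
original pattern is not shown in the thread, so the regular expression
and the sample text below are assumptions, not the poster's code. It
assumes Python 3, where \w is Unicode-aware for str patterns.

```python
import re

# Hypothetical sample; the real manuscripts and their symbols are not
# shown in the thread.
text = "¿de dónde vino *aquel +señor?"

# (?<!\S)  the token starts at the beginning of the text or after whitespace
# [^\w\s]  one character that is neither a word character nor whitespace
# \w+      one or more word characters (accented letters included)
pattern = re.compile(r"(?<!\S)[^\w\s]\w+")

marked = pattern.findall(text)
print(marked)
```

Words that merely end in a symbol (like the trailing '?') are not
reported as separate matches, because the lookbehind requires the
symbol to open the token.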