Re: [Tutor] A regular expression problem

2010-12-01 Thread Steven D'Aprano
Josep M. Fontana wrote:
> [...] I guess this is because the character encoding was not specified but accented characters in the languages I'm dealing with should be treated as a-z or A-Z, shouldn't they?

No. a-z means a-z. If you want the localized set of alphanumeric characters, you need \w.
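A minimal sketch of the difference between the two character classes, assuming Python 3, where \w in a str pattern matches Unicode word characters by default. Under Python 2, which was current when this thread was written, \w needs the re.UNICODE or re.LOCALE flag to match accented letters. Note that \w also matches digits and the underscore. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text; \u00f1 is the precomposed letter "ñ".
text = "se\u00f1or a\u00f1os cafe"

# [a-zA-Z] covers only the 26 unaccented letters, so "ñ" splits the words.
ascii_only = re.findall(r"[a-zA-Z]+", text)

# \w also matches accented letters (plus digits and the underscore).
word_chars = re.findall(r"\w+", text)

print(ascii_only)  # ['se', 'or', 'a', 'os', 'cafe']
print(word_chars)  # ['señor', 'años', 'cafe']
```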

Re: [Tutor] A regular expression problem

2010-11-30 Thread Josep M. Fontana
On Sun, Nov 28, 2010 at 6:14 PM, Steven D'Aprano wrote:
> Have you considered just using the isalnum() method?
> '¿de'.isalnum()
> False

Mmm. No, I didn't consider it because I didn't even know such a method existed. This can turn out to be very handy but I don't think it would help me at t
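For reference, a small sketch of how the isalnum() suggestion could be used to flag tokens, assuming Python 3 strings. The token list is hypothetical and not taken from the original texts.

```python
# Hypothetical tokens; only the first one comes from the quoted example.
tokens = ["¿de", "casa", "rey", "1234"]

# str.isalnum() is True only if every character is a letter or a digit,
# so any token carrying a symbol such as "¿" is flagged.
flagged = [tok for tok in tokens if not tok.isalnum()]

print(flagged)  # ['¿de']
```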

Re: [Tutor] A regular expression problem

2010-11-30 Thread Josep M. Fontana
Sorry, something went wrong and my message got sent before I could finish it. I'll try again. On Tue, Nov 30, 2010 at 2:19 PM, Josep M. Fontana wrote: > On Sun, Nov 28, 2010 at 6:03 PM, Evert Rol wrote: > >> - >> with open('output_tokens.txt', 'a') as out_tokens: >>with open('text

Re: [Tutor] A regular expression problem

2010-11-28 Thread Steven D'Aprano
Josep M. Fontana wrote: I'm trying to use regular expressions to extract strings that match certain patterns in a collection of texts. Basically these texts are edited versions of medieval manuscripts that use certain symbols to mark information that is useful for philologists. I'm interested in

Re: [Tutor] A regular expression problem

2010-11-28 Thread Evert Rol
> Here's what I do. This was just a first attempt to get strings starting with a non-alphanumeric symbol. If this had worked, I would have continued to build the regular expression to get words with non-alphanumeric symbols in the middle and at the end. Alas, even this first attempt did
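One possible way to approach the task described in the quoted message, sketched under assumptions: Python 3, whitespace-separated tokens, and a made-up sample line whose symbols stand in for the editorial marks. This is not the code from the thread.

```python
import re

# Hypothetical sample line; the symbols are placeholders for editorial marks.
line = "¿de la casa *del rey e`l"
tokens = line.split()

# Tokens whose first character is neither a word character nor whitespace.
starts_with_symbol = [tok for tok in tokens if re.match(r"[^\w\s]", tok)]

# Tokens with such a character anywhere: start, middle, or end.
contains_symbol = [tok for tok in tokens if re.search(r"[^\w\s]", tok)]

print(starts_with_symbol)  # ['¿de', '*del']
print(contains_symbol)     # ['¿de', '*del', 'e`l']
```

re.match() only tests the start of each token, while re.search() scans the whole token, which is why the second list also picks up the symbol in the middle of a word.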

[Tutor] A regular expression problem

2010-11-28 Thread Josep M. Fontana
I'm trying to use regular expressions to extract strings that match certain patterns in a collection of texts. Basically these texts are edited versions of medieval manuscripts that use certain symbols to mark information that is useful for philologists. I'm interested in isolating words that have