Sorry! Sorry! Sorry! I just found out that this question had already been answered by Steven D'Aprano in another thread! The trick was to add '\w' to the character class, alongside [a-zA-Z].
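For anyone who finds this thread later, here is a minimal sketch of that fix. It assumes Python 3, where '\w' is Unicode-aware for str patterns; on Python 2 the text would have to be decoded to unicode and the pattern compiled with re.UNICODE. The names ASCII_ONLY and WITH_WORD_CHARS and the sample text are made up for illustration. Note that the coding line at the top of a script only declares the encoding of the source file itself; it does not change what [a-zA-Z] matches. Also note that '\w' covers digits and the underscore as well as letters.

```python
import re

# Original pattern: anything outside ASCII letters and the listed
# whitespace characters counts as a "non-letter", so 'á' is matched.
ASCII_ONLY = re.compile(r"[^a-zA-Z\t\n\r\f\v]+")

# Same pattern with \w added to the negated class. Accented letters are
# word characters, so they are no longer matched. (With \w present the
# a-zA-Z range is redundant; it is kept to mirror the original regex.)
WITH_WORD_CHARS = re.compile(r"[^a-zA-Z\w\t\n\r\f\v]+")

# Bytes as they would be read from an ISO-8859-1 file, decoded to str
# before any matching is done.
raw = b"m\xe1s caf\xe9, \xf6l!"
text = raw.decode("iso-8859-1")   # 'más café, öl!'

print(ASCII_ONLY.findall(text))       # ['á', ' ', 'é, ö', '!']
print(WITH_WORD_CHARS.findall(text))  # [' ', ', ', '!']
```

The space still shows up in both results because the original character class lists '\t\n\r\f\v' but not the plain space.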
Please accept my apologies. I devote time to this project whenever I have some free time. At some point I got very busy with other things and stopped working on it. When I started again today, I had not noticed that the question I posted a while ago already had an answer, one that actually solved my problem.

Thanks again, Steven. You can consider the problem solved and this thread closed.

Josep M.

On Thu, Dec 23, 2010 at 10:25 AM, Josep M. Fontana <josep.m.font...@gmail.com> wrote:
> I am working with texts that are encoded as ISO-8859-1. I have
> included the following two lines at the beginning of my python script:
>
> #!/usr/bin/env python
> # -*- coding: iso-8859-1 -*-
>
> If I'm not mistaken, this should tell Python that accented characters
> such as 'á', 'Á', 'ö' or 'è' should be considered as alpha-numeric
> characters and therefore matched with a regular expression of the form
> [a-zA-Z]. However, when I process my texts, all of the accented
> characters are matched as non alpha-numeric symbols. What am I doing
> wrong?
>
> I'm not including the whole script because I think the rest of the
> code is irrelevant. All that's relevant (I think) is that I'm using
> the regular expression '[^a-zA-Z\t\n\r\f\v]+' to match any string that
> includes non alpha-numeric characters and that returns 'á', 'Á', 'ö'
> or 'è' as well as other real non alpha-numeric characters.
>
> Has anybody else experienced this problem when working with texts
> encoded as ISO-8859-1 or UTF-8? Is there any additional flag or
> parameter that I should add to make the processing of these characters
> as regular word characters possible?
>
> Thanks in advance for your help.
>
> Josep M.