I am working with texts that are encoded as ISO-8859-1. I have included the following two lines at the beginning of my Python script:

#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-

If I'm not mistaken, this should tell Python that accented characters such as 'á', 'Á', 'ö' or 'è' should be treated as alphanumeric characters and therefore be matched by a regular expression of the form [a-zA-Z]. However, when I process my texts, all of the accented characters are matched as non-alphanumeric symbols. What am I doing wrong?

I'm not including the whole script because I think the rest of the code is irrelevant. All that's relevant (I think) is that I'm using the regular expression '[^a-zA-Z\t\n\r\f\v]+' to match any string that includes non-alphanumeric characters, and that it returns 'á', 'Á', 'ö' or 'è' as well as other, genuinely non-alphanumeric characters.

Has anybody else experienced this problem when working with texts encoded as ISO-8859-1 or UTF-8? Is there any additional flag or parameter that I should add so that these characters are processed as regular word characters?

Thanks in advance for your help.

Josep M.
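
For reference, the behaviour described above can be reproduced with a minimal sketch like the one below. It assumes Python 3 and a made-up sample string; the names NON_LETTERS and sample are illustrative and do not come from the original script. The accented characters are written as escapes so that the result does not depend on how the file itself is saved.

```python
import re

# The pattern from the message: one or more characters that are neither
# ASCII letters (a-z, A-Z) nor one of the listed whitespace characters.
NON_LETTERS = re.compile(r'[^a-zA-Z\t\n\r\f\v]+')

# Hypothetical sample text: "más<TAB>cafè", with the accents as escapes
# (\xe1 is 'á', \xe8 is 'è').
sample = 'm\xe1s\tcaf\xe8'

# Every run of characters outside the ASCII ranges is reported as a match.
matches = NON_LETTERS.findall(sample)
print(matches)
```

As far as this sketch shows, the ranges a-z and A-Z inside a character class cover only the 52 unaccented ASCII letters, so 'á' and 'è' fall outside them whatever coding line the script declares.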