Kent,

Many thanks again, and thanks too to Paul at http://tinyurl.com/yrl8cy.

That's very effective, thanks very much for the detailed explanation; however, I'm a little surprised that it's necessary. I would have thought that there would be some standard module that included a Unicode equivalent of the builtin method isupper().

On Fri, 3 Aug 2007, Kent Johnson wrote:

> > What sort of re test can I do to catch lines whose defining
> > characteristic is that they begin with two or more adjacent utf-8
> > encoded capital letters?
>
> First you have to decode the file to a Unicode string.
> Then build the set of matching characters and build a regex. For example,
> something like this:
>
> data = open('data.txt').read().decode('utf-8').splitlines()
>
> uppers = u''.join(unichr(i) for i in xrange(sys.maxunicode)
>                     if unichr(i).isupper())

I modified uppers to include only the Latin characters, and added the apostrophe to catch placenames like L'ISLE.

> upperRe = u'^[%s]{2,}' % uppers
>
> for line in data:
>     if re.match(upperRe, line):
>
>
> With a tip of the hat to
> http://tinyurl.com/yrl8cy
>
> Kent
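
For anyone reading this thread later, here is a minimal sketch of the modified approach described above (Latin capitals only, plus the apostrophe). It is written for Python 3, where str is already Unicode, rather than the Python 2 of the quoted code. The helper names latin_uppers and starts_with_capitals, and the idea of picking out Latin letters by their Unicode character name, are my own assumptions and not something taken from Kent's or Paul's posts.

```python
import re
import sys
import unicodedata


def latin_uppers():
    """Return a string of every uppercase character whose Unicode name
    starts with 'LATIN' (assumption: this is one way to restrict the set
    to Latin letters; the thread does not say how it was done)."""
    chars = []
    for i in range(sys.maxunicode + 1):
        c = chr(i)
        # unicodedata.name() returns the default '' for unnamed code points.
        if c.isupper() and unicodedata.name(c, '').startswith('LATIN'):
            chars.append(c)
    return ''.join(chars)


# The apostrophe is added to the character class so that placenames
# such as L'ISLE still match.
UPPER_RE = re.compile("^[%s']{2,}" % re.escape(latin_uppers()))


def starts_with_capitals(line):
    """True if the line begins with two or more characters from the class."""
    return UPPER_RE.match(line) is not None


if __name__ == '__main__':
    for sample in ["L'ISLE sur la Sorgue", "ÉVREUX (Eure)", "Paris"]:
        print(sample, starts_with_capitals(sample))
```

One caveat with this sketch: because the apostrophe is part of the character class, a line beginning with a single capital followed by an apostrophe (for example L'isle) also satisfies the {2,} count.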