Wijaya Edward wrote:
> Since there are separators, I need to treat them as delimiters,
> especially in a case like this:
>
>>>> str = '\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
>>>> field = list(str)
>>>> print field
> ['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', '-', '-', 'F', 'O', 'O', '-', '-',
> 'B', 'A', 'R']
>
> What we want as the output is this instead:
> ['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

>>> import re
>>> s = '\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
>>> re.findall("(?i)[a-z]+|[\xA0-\xFF]", s)
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

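the RE matches either a sequence of latin letters, *or* a single
non-ASCII character (here, one byte in the range \xA0-\xFF).  the "--"
separators match neither alternative, so findall simply skips them.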
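
for reference, here is a minimal sketch of the same idea for python 3, which is an assumption on my part (the session above is python 2).  the sample is kept as a bytes object, so the pattern has to be a bytes pattern as well; the names data, TOKEN and fields are made up for the example.

```python
import re

# python 3 sketch: the sample data stays as raw bytes, so the pattern is a
# bytes pattern too (note the rb"" prefix).
data = b'\xc5\xeb\xc7\xd5\xbc--FOO--BAR'

# either a run of ASCII letters (case-insensitive), or one single byte in
# the range 0xA0-0xFF.  the "--" separators match neither alternative, so
# findall() skips over them.
TOKEN = re.compile(rb"(?i)[a-z]+|[\xA0-\xFF]")

fields = TOKEN.findall(data)
print(fields)
```

each high byte comes back as its own one-byte item, while FOO and BAR come back whole.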
you may want to adjust the character ranges to match the encoding you're
using, and your definition of non-chinese words.
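
as one possible adjustment, and only as a sketch: if the data has already been decoded to a unicode string, the second range can name code points instead of bytes.  the sample text and the choice of the CJK Unified Ideographs block (U+4E00-U+9FFF) as "chinese" are assumptions for illustration, not something taken from the original data.

```python
import re

# hypothetical sample: text that has already been decoded to a str
text = "中文字--FOO--BAR"

# assumption: "chinese" means the CJK Unified Ideographs block,
# U+4E00-U+9FFF.  widen or narrow the range to suit your own data.
WORD_OR_CJK = re.compile(r"[A-Za-z]+|[\u4e00-\u9fff]")

tokens = WORD_OR_CJK.findall(text)
print(tokens)
```

working on decoded text avoids having to know which byte values the encoding uses for lead and trail bytes.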
</F>
--
http://mail.python.org/mailman/listinfo/python-list