I don't know the cause of the error here, but I will say that
parsing HTML with regular expressions is fraught with difficulty
unless you know in advance that the HTML will be suitably
formatted.
You may be better off using one of the HTML parsing
modules such as HTMLParser or even the more powerfu
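
As a rough illustration of the parser-based approach, here is a minimal sketch that collects link targets without a regular expression. It assumes Python 3, where the module is html.parser (in the Python 2 of this thread it was HTMLParser). The names LinkCollector and extract_links are made up for the example and are not part of the original code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(src):
    """Return the href of every <a> tag found in the HTML text src."""
    parser = LinkCollector()
    parser.feed(src)
    parser.close()
    return parser.links
```

Calling extract_links(src) on the page source would then stand in for the re.findall() call, under the assumption that the goal is to list the links on the page.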
Kent was right,
>>> print u'\xae'.encode('utf-8')
> (R)
>
but I think you are using the wrong source file. I mean, don't copy and
paste it from your browser's 'View Source' button; use Python's native
urllib to get the file.
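
A minimal sketch of fetching the page with urllib might look like the following. It assumes Python 3 (urllib.request rather than the Python 2 urllib of this thread), and the helper name fetch_html and its default_charset parameter are hypothetical. It decodes the bytes using the charset declared by the server, falling back to UTF-8 if none is given.

```python
from urllib.request import urlopen


def fetch_html(url, default_charset="utf-8"):
    """Hypothetical helper: download a page and decode it to text."""
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_charset
    # Replace undecodable bytes rather than raising, so a mislabeled
    # page still produces text to inspect.
    return raw.decode(charset, errors="replace")
```

Decoding once at the point of download means the rest of the program works with text only, which tends to make encoding problems easier to locate.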
___
Tutor maillist - Tutor@pyth
Oleg Oltar wrote:
> I am trying to parse an HTML page. I get the following error while doing that:
>
>
> src = sel.get_html_source()
> links = re.findall(r'', src)
> for link in links:
>     print link
Presumably get_html_source() is returning unicode? So link is a unicode
st
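
Kent's point can be illustrated with a small sketch. It assumes, which the thread does not confirm, that the failure is a UnicodeEncodeError raised when a non-ASCII character such as u'\xae' is written to an ASCII-only output. It is written for Python 3, where str is already unicode, and the helper name encode_for_output is hypothetical.

```python
def encode_for_output(text, encoding="utf-8"):
    """Hypothetical helper: turn a unicode string into bytes explicitly."""
    return text.encode(encoding)


registered = "\u00ae"  # the registered-sign character from the thread

# Encoding to ASCII fails because the character is outside the ASCII range.
try:
    encode_for_output(registered, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and yields two bytes.
utf8_bytes = encode_for_output(registered)
```

Choosing the encoding explicitly, as in the .encode('utf-8') call quoted earlier in the thread, avoids depending on whatever default the output stream happens to use.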
I am trying to parse an HTML page. I get the following error while doing that:
src = sel.get_html_source()
links = re.findall(r'', src)
for link in links:
    print link
==
ERROR: test_new (__main__.NewTest)