Re: Character encoding
I would suggest using string.replace. Simply replace ' ' with ' ' for each time it occurs. It doesn't take too much code.

On Nov 7, 1:34 pm, "mp" <[EMAIL PROTECTED]> wrote:
> I have html document titles with characters like >, , and
> ‡. How do I decode a string with these values in Python?
>
> Thanks
--
http://mail.python.org/mailman/listinfo/python-list
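A minimal sketch of the replace approach suggested above, assuming the titles contain HTML entities such as `&gt;` or `&nbsp;` (the exact entities in the original post did not survive, so the table below is illustrative). It is written for Python 3, where the standard library also offers `html.unescape` for the general case; the function names `decode_with_replace` and `decode_entities` are hypothetical.

```python
import html

# Illustrative entity table (an assumption, not taken from the original post).
# "&amp;" is handled last so that text like "&amp;gt;" is decoded only once.
ENTITY_TABLE = [
    ("&gt;", ">"),
    ("&lt;", "<"),
    ("&nbsp;", " "),
    ("&amp;", "&"),
]


def decode_with_replace(text):
    """Decode a fixed set of entities with plain str.replace calls."""
    for entity, char in ENTITY_TABLE:
        text = text.replace(entity, char)
    return text


def decode_entities(text):
    """Decode any named or numeric entity using the standard library."""
    return html.unescape(text)


if __name__ == "__main__":
    title = "a &gt; b&nbsp;&amp;&nbsp;c"
    print(decode_with_replace(title))
    print(repr(decode_entities(title)))
```

The replace version only knows the entities listed in its table and maps `&nbsp;` to an ordinary space, while `html.unescape` handles every entity and yields a real non-breaking space (`\xa0`).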
Having problems with urlparser concatenation
I'm working on a basic web spider, and I'm having problems with the
urlparser.
This is the affected function:
--
def FindLinks(Website):
    WebsiteLen = len(Website)+1
    CurrentLink = ''
    i = 0
    SpliceStart = 0
    SpliceEnd = 0
    LinksString = ""
    LinkQueue = open('C:/LinkQueue.txt', 'a')
    while (i < WebsiteLen) and (i != -1):
        #Debugging info
        #print '-'
        #print 'Length = ' + str(WebsiteLen)
        #print 'SpliceStart = ' + str(SpliceStart)
        #print 'SpliceEnd = ' + str(SpliceEnd)
        #print 'i = ' + str(i)
        SpliceStart = Website.find('', SpliceStart))
        ParsedURL = urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
        robotparser.set_url(ParsedURL.hostname + '/' + 'robots.txt')
        robotparser.read()
        if (robotparser.can_fetch("*", (Website[SpliceStart+9:(SpliceEnd+1)])) == False):
            i = i - 1
        else:
            LinksString = LinksString + "\n" + (Website[SpliceStart+9:(SpliceEnd+1)])
            LinksString = LinksString[:(len(LinksString) - 1)]
            #print 'found ' + LinksString
        i = SpliceEnd
    LinkQueue.write(LinksString)
    LinkQueue.close()
--
Sorry if it's uncommented. When I run my program, I get this error:
-
Traceback (most recent call last):
  File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 120, in <module>
    FindLinks(Website)
  File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 84, in FindLinks
    robotparser.read()
  File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
    f = opener.open(self.url)
  File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "C:\Program Files\Python25\lib\urllib.py", line 451, in open_file
    return self.open_local_file(url)
  File "C:\Program Files\Python25\lib\urllib.py", line 465, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified: 'en.wikipedia.org\\robots.txt'
Note the last line 'en.wikipedia.org\\robots.txt'. I want
'en.wikipedia.org/robots.txt'! What am I doing wrong?
If this has been answered before, please just give me a link to the
proper thread. If you need more contextual code, I can post more.
Re: Having problems with urlparser concatenation
Thank you! Fixed my problem perfectly!
Gabriel Genellina wrote:
> At Thursday 9/11/2006 20:23, i80and wrote:
>
> >I'm working on a basic web spider, and I'm having problems with the
> >urlparser.
> >[...]
> > SpliceStart = Website.find('', SpliceStart))
> >
> > ParsedURL =
> >urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
> > robotparser.set_url(ParsedURL.hostname + '/' +
> >'robots.txt')
> >-
> >Traceback (most recent call last):
> > File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py",
> >line 120, in
> > FindLinks(Website)
> > File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py",
> >line 84, in FindLinks
> > robotparser.read()
> > File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
> > f = opener.open(self.url)
> > File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
> > return getattr(self, name)(url)
> > File "C:\Program Files\Python25\lib\urllib.py", line 451, in
> >open_file
> > return self.open_local_file(url)
> > File "C:\Program Files\Python25\lib\urllib.py", line 465, in
> >open_local_file
> > raise IOError(e.errno, e.strerror, e.filename)
> >IOError: [Errno 2] The system cannot find the path specified:
> >'en.wikipedia.org\\robots.txt'
> >
> >Note the last line 'en.wikipedia.org\\robots.txt'. I want
> >'en.wikipedia.org/robots.txt'! What am I doing wrong?
>
> No, you don't want 'en.wikipedia.org/robots.txt'; you want
> 'http://en.wikipedia.org/robots.txt'
> urllib treats the former as a file: request, hence the \\ in the
> normalized path.
> You are parsing the link and then building a new URI using ONLY the
> hostname part; that's wrong. Use urljoin on the original link string
> instead: urljoin(Website[SpliceStart+9:(SpliceEnd+1)], '/robots.txt').
>
> You may try Beautiful Soup for a better HTML parsing.
>
> --
> Gabriel Genellina
> Softlab SRL
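The advice in the reply can be sketched as follows. This is a minimal illustration for Python 3, where the `urlparse` and `robotparser` modules from the thread live under `urllib.parse` and `urllib.robotparser`; the helper names `robots_url` and `make_robot_parser` are hypothetical. `urljoin` takes the base as a URL string, so the sketch passes the link text itself rather than a parsed result.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url(link):
    """Return the robots.txt URL for the site that `link` points to.

    Joining with an absolute path keeps the scheme and host of `link`
    and replaces its path, query, and fragment.
    """
    return urljoin(link, "/robots.txt")


def make_robot_parser(link):
    """Build a RobotFileParser aimed at the right robots.txt.

    Nothing is fetched here; the caller still has to call read()
    before asking can_fetch().
    """
    parser = RobotFileParser()
    parser.set_url(robots_url(link))
    return parser


if __name__ == "__main__":
    print(robots_url("http://en.wikipedia.org/wiki/Python"))
```

Because the scheme is preserved, the resulting URL is fetched over HTTP instead of being treated as a local file path.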
re pattern for matching JS/CSS
I'm working on a program to remove tags from an HTML document, leaving just the content, but I want to do it simply. I've finished a system to remove simple tags, but I want all CSS and JS to be removed. What re pattern could I use to do that? I've tried '' but that didn't work properly. I'm fairly basic in my knowledge of Python, so I'm still trying to learn re. What pattern would work?
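One possible pattern for this, offered as a sketch rather than a definitive answer: match each `<script>` or `<style>` element together with its contents, using `re.DOTALL` so that `.` spans newlines and `re.IGNORECASE` so that `<SCRIPT>` is caught too. The names `SCRIPT_STYLE_RE` and `strip_script_and_style` are hypothetical, and the pattern assumes reasonably well-formed markup; an HTML parser such as Beautiful Soup is more robust on messy pages.

```python
import re

# Match <script ...>...</script> and <style ...>...</style>, contents included.
# The lazy ".*?" stops at the first matching close tag, and the backreference
# \1 makes sure a <script> is closed by </script>, not by </style>.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_script_and_style(html_text):
    """Remove script and style elements, leaving the rest untouched."""
    return SCRIPT_STYLE_RE.sub("", html_text)


if __name__ == "__main__":
    page = (
        "<p>Hi</p>"
        "<script type='text/javascript'>var a = '<b>';\n</script>"
        "<STYLE>p { color: red; }</STYLE>"
        "<p>Bye</p>"
    )
    print(strip_script_and_style(page))
```

Running this step before the simple tag remover keeps JavaScript and CSS source from leaking into the extracted content.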
Re: cxfrozen linux binaries run on FreeBSD?
I haven't personally used freeze (Kubuntu doesn't seem to install it with the python debs), but based on what I know of it, it generates makefiles. I'm not a make expert, but if FreeBSD has GNU tools, freeze's output _should_ compile on FreeBSD.

On Dec 15, 5:52 am, robert <[EMAIL PROTECTED]> wrote:
> When i freeze a python app (simple - no strange sys calls) for x86 Linux,
> does this stuff run well also on x86 FreeBSD?
>
> Robert
