from:"i80and"

Re: Character encoding

2006-11-07 Thread i80and

I would suggest using string.replace.  Simply replace '&nbsp;' with ' '
each time it occurs.  It doesn't take too much code.
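
For anyone reading this later: a whole title can also be decoded in one
pass rather than entity by entity.  This is only a minimal sketch, and it
assumes a modern Python 3, where the standard library provides
html.unescape (the 2006-era Python discussed in this thread does not have
it).  The decode_title helper and the sample title are made up for
illustration.

```python
import html


def decode_title(raw_title):
    """Turn HTML entities into the characters they stand for."""
    decoded = html.unescape(raw_title)
    # &nbsp; decodes to U+00A0 (no-break space); swap it for a plain space
    return decoded.replace("\u00a0", " ")


# Hypothetical title, invented for this example
print(decode_title("Fish &gt; Chips&nbsp;&amp; Peas"))
```

Numeric references such as &#8225; go through the same call, so no
per-entity replace table is needed.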

On Nov 7, 1:34 pm, "mp" <[EMAIL PROTECTED]> wrote:
> I have html document titles with characters like &gt;, &nbsp;, and
> ‡. How do I decode a string with these values in Python?
> 
> Thanks

-- 
http://mail.python.org/mailman/listinfo/python-list

Having problems with urlparser concatenation

2006-11-09 Thread i80and

I'm working on a basic web spider, and I'm having problems with the
urlparser.
This is the affected function:
--
def FindLinks(Website):
    WebsiteLen = len(Website)+1
    CurrentLink = ''
    i = 0
    SpliceStart = 0
    SpliceEnd = 0

    LinksString = ""
    LinkQueue = open('C:/LinkQueue.txt', 'a')

    while (i < WebsiteLen) and (i != -1):

        #Debugging info
        #print '-'
        #print 'Length = ' + str(WebsiteLen)
        #print 'SpliceStart = ' + str(SpliceStart)
        #print 'SpliceEnd = ' + str(SpliceEnd)
        #print 'i = ' + str(i)

        SpliceStart = Website.find('', SpliceStart))

        ParsedURL = urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
        robotparser.set_url(ParsedURL.hostname + '/' + 'robots.txt')
        robotparser.read()
        if (robotparser.can_fetch("*", (Website[SpliceStart+9:(SpliceEnd+1)])) == False):
            i = i - 1
        else:
            LinksString = LinksString + "\n" + (Website[SpliceStart+9:(SpliceEnd+1)])
            LinksString = LinksString[:(len(LinksString) - 1)]
            #print 'found ' + LinksString
            i = SpliceEnd

    LinkQueue.write(LinksString)
    LinkQueue.close()
--
Sorry if it's uncommented.  When I run my program, I get this error:
-
Traceback (most recent call last):
  File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py",
line 120, in <module>
FindLinks(Website)
  File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py",
line 84, in FindLinks
robotparser.read()
  File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
f = opener.open(self.url)
  File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
return getattr(self, name)(url)
  File "C:\Program Files\Python25\lib\urllib.py", line 451, in
open_file
return self.open_local_file(url)
  File "C:\Program Files\Python25\lib\urllib.py", line 465, in
open_local_file
raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified:
'en.wikipedia.org\\robots.txt'

Note the last line 'en.wikipedia.org\\robots.txt'.  I want
'en.wikipedia.org/robots.txt'!  What am I doing wrong?

If this has been answered before, please just give me a link to the
proper thread.  If you need more contextual code, I can post more.


Re: Having problems with urlparser concatenation

2006-11-09 Thread i80and

Thank you!  Fixed my problem perfectly!
Gabriel Genellina wrote:
> At Thursday 9/11/2006 20:23, i80and wrote:
>
> >I'm working on a basic web spider, and I'm having problems with the
> >urlparser.
> >[...]
> > SpliceStart = Website.find('', SpliceStart))
> >
> > ParsedURL =
> >urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
> > robotparser.set_url(ParsedURL.hostname + '/' +
> >'robots.txt')
> >-
> >Traceback (most recent call last):
> >   File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py",
> >line 120, in <module>
> > FindLinks(Website)
> >   File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py",
> >line 84, in FindLinks
> > robotparser.read()
> >   File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
> > f = opener.open(self.url)
> >   File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
> > return getattr(self, name)(url)
> >   File "C:\Program Files\Python25\lib\urllib.py", line 451, in
> >open_file
> > return self.open_local_file(url)
> >   File "C:\Program Files\Python25\lib\urllib.py", line 465, in
> >open_local_file
> > raise IOError(e.errno, e.strerror, e.filename)
> >IOError: [Errno 2] The system cannot find the path specified:
> >'en.wikipedia.org\\robots.txt'
> >
> >Note the last line 'en.wikipedia.org\\robots.txt'.  I want
> >'en.wikipedia.org/robots.txt'!  What am I doing wrong?
>
> No, you don't want 'en.wikipedia.org/robots.txt'; you want
> 'http://en.wikipedia.org/robots.txt'
> urllib treats the former as a file: request, hence the \\ in the
> normalized path.
> You are parsing the link and then building a new URI using ONLY the
> hostname part; that's wrong. Use urljoin(ParsedURL, '/robots.txt') instead.
>
> You may try Beautiful Soup for better HTML parsing.
>
> --
> Gabriel Genellina
> Softlab SRL
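
For the archive, the fix Gabriel describes can be sketched roughly as
follows.  This assumes Python 3, where urljoin and urlparse live in
urllib.parse and the robots parser in urllib.robotparser.  urljoin takes
the base as a URL string, so the sketch hands it the link text itself
rather than the parsed result.  robots_url is a hypothetical helper, the
sample link and rules are made up, and nothing here touches the network.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def robots_url(link):
    """Build the robots.txt URL for the site that serves `link`."""
    # The leading slash makes the join replace the whole path while
    # keeping the scheme and host of the base URL.
    return urljoin(link, "/robots.txt")


# Hypothetical link, standing in for one found by the spider
link = "http://en.wikipedia.org/wiki/Web_crawler"

# What the original code built: a host and path with no scheme,
# which urllib then treated as a local file name.
broken = urlparse(link).hostname + "/robots.txt"
fixed = robots_url(link)
print(broken)
print(fixed)

# Offline check: feed rules in by hand with parse() instead of read()
parser = RobotFileParser()
parser.set_url(fixed)
parser.parse(["User-agent: *", "Disallow: /private/"])
print(parser.can_fetch("*", "http://en.wikipedia.org/private/page"))
print(parser.can_fetch("*", link))
```

In the spider itself, calling parser.read() after set_url(fixed) would
download the real robots.txt instead of using hand-written rules.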


re pattern for matching JS/CSS

2006-12-15 Thread i80and

I'm working on a program to remove tags from an HTML document, leaving
just the content, but I want to do it simply.  I've finished a system
to remove simple tags, but I want all CSS and JS to be removed.  What
re pattern could I use to do that?

I've tried
''
but that didn't work properly.  I'm fairly basic in my knowledge of
Python, so I'm still trying to learn re.
What pattern would work?
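
One pattern that may do the job, offered as a sketch rather than a robust
HTML parser: match an opening script or style tag, then lazily consume
everything up to the matching closing tag, with re.DOTALL so that "."
also crosses newlines.  It is written for Python 3, and the names and the
sample document are invented for illustration.

```python
import re

# <script ...> or <style ...>, then the shortest possible run of text,
# then the closing tag with the same name (\1 refers back to group 1).
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)

# Simple tags: anything between < and the next >
TAG_RE = re.compile(r"<[^>]+>")


def strip_script_and_style(html_text):
    """Remove script and style elements together with their contents."""
    return SCRIPT_STYLE_RE.sub("", html_text)


def strip_tags(html_text):
    """Remove JS/CSS blocks first, then whatever tags are left."""
    return TAG_RE.sub("", strip_script_and_style(html_text))


# Hypothetical document, invented for this example
sample = (
    '<html><head><style type="text/css">p { color: red; }</style>'
    '<script>var x = "<b>";</script></head>'
    "<body><p>Hello</p></body></html>"
)
print(strip_tags(sample))
```

The order matters: if the simple-tag pass ran first, the CSS and JS text
would be left behind as content.  A regex like this can still be fooled,
for instance by the text "</script>" inside a JavaScript string, so a
real HTML parser is the sturdier option for messy pages.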


Re: cxfrozen linux binaries run on FreeBSD?

2006-12-15 Thread i80and

I haven't personally used freeze (Kubuntu doesn't seem to install it
with the python debs), but based on what I know of it, it generates
makefiles.  I'm not a make expert, but if FreeBSD has the GNU tools,
freeze's output _should_ compile on FreeBSD.

On Dec 15, 5:52 am, robert <[EMAIL PROTECTED]> wrote:
> When i freeze a python app (simple - no strange sys calls) for x86 Linux, 
> does this stuff run well also on x86 FreeBSD?
> 
> Robert

