On 29.12.2010 10:54, Tommy Kaas wrote:
Hi,

I’m trying to learn basic web scraping and am starting from scratch. I’m
using ActivePython 2.6.6.

I have uploaded a simple table to my web page and am trying to scrape it
and save the result in a text file, with the columns separated by #.

It works fine, but besides the # I also get spaces between the columns in
the text file. How do I avoid that?

This is the script:

import urllib2
from BeautifulSoup import BeautifulSoup

f = open('tabeltest.txt', 'w')

soup = BeautifulSoup(urllib2.urlopen('http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read())

rows = soup.findAll('tr')

for tr in rows:
    cols = tr.findAll('td')
    print >> f, cols[0].string,'#',cols[1].string,'#',cols[2].string,'#',cols[3].string

You can strip the whitespace from the strings. I assume the "string" attribute returns a string (I don't know the Beautiful Soup API). E.g.:
cols[0].string.strip()
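
A caveat, offered as an assumption rather than a checked fact about Beautiful Soup: if the "string" attribute can be None for some cell (for example an empty one), calling strip() on it fails with an AttributeError. A minimal sketch of a guard, where clean_cell is a hypothetical helper name and its argument stands in for col.string:

```python
def clean_cell(value):
    # value stands in for col.string; treat a missing value (None) as empty
    if value is None:
        return ""
    return value.strip()
```

With such a helper, clean_cell(cols[0].string) could be used in place of cols[0].string.strip().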

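Also, you can use join() to build the complete string. Unlike print with comma-separated items, which puts a space between the items, join() inserts only the separator: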

resulting_string = "#".join([col.string.strip() for col in cols])

The long version without a list comprehension (just for illustration; the list comprehension is preferable):

resulting_string = "#".join([cols[0].string.strip(), cols[1].string.strip(), cols[2].string.strip(), cols[3].string.strip()])
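
To see the effect without fetching the page, here is a small sketch that formats rows from hard-coded stand-in values. The name format_row and the sample lists are illustrative and not part of the original script; the padding in the sample cells is made up to show what strip() removes.

```python
def format_row(cells, sep="#"):
    # Strip each cell and join with the separator only (no extra spaces)
    return sep.join([cell.strip() for cell in cells])

# Stand-ins for the values col.string would give for two table rows
rows = [
    ["Kommunenr", "Kommune", "Region", "Regionsnr"],
    [" 151", "Ballerup ", "Hovedstaden", "1084"],
]

lines = [format_row(cells) for cells in rows]
```

In the loop from the script, print >> f, format_row([col.string for col in cols]) would then write one such line per table row; because print receives a single item, it adds no spaces of its own.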

HTH,

Jan





f.close()

And the text file looks like this:

Kommunenr # Kommune # Region # Regionsnr

101 # København # Hovedstaden # 1084

147 # Frederiksberg # Hovedstaden # 1084

151 # Ballerup # Hovedstaden # 1084

153 # Brøndby # Hovedstaden # 1084

155 # Dragør # Hovedstaden # 1084

Thanks in advance

Tommy Kaas

Kaas & Mulvad

Lykkesholms Alle 2A, 3.

1902 Frederiksberg C

Mobil: 27268818

Mail: tommy.k...@kaasogmulvad.dk

Web: www.kaasogmulvad.dk



_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
