Re: Web Scraping - Output File

MRAB Thu, 26 Apr 2012 11:21:50 -0700

On 26/04/2012 18:54, [email protected] wrote:

Hello,


I am having some difficulty generating the output I want from web
scraping. Specifically, the script I wrote, while it runs without any
errors, is not writing to the output file correctly. It runs, and
creates the output .txt file; however, the file is blank (ideally it
should be populated with a list of names).

I took the base of a program that I had before for a different data
gathering task, which worked beautifully, and edited it for my
purposes here. Any insight as to what I might be doing wrong would be
highly appreciated. Code is included below. Thanks!

import os
import re
import urllib2

outfile = open("Skadden.txt","w")

A = 1
Z = 26

for letter in range(A,Z):

     for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+str(letter)):

             x = line
             if '"><B>' in line:
                     start=x.find('"><B>"')
                     end= x.find('</B></A></nobr></td>',start)
                     name=x[start:end]
                     outfile.write(name+"\n")
                     print name


Firstly, 'letter' goes from 1 (inclusive) to 26 (exclusive), so the
URLs are:

    http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=1
    http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=2
    ...
    http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=25

What you need is:

    http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=A
    http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=B
    ...
    http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=Z
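
One way to build those URLs is to loop over the letters themselves rather
than over numbers. This is only a sketch: BASE_URL and urls are names chosen
for illustration, and nothing is fetched here.

```python
import string

# The base URL is taken from the original script; the letter is appended to it.
BASE_URL = "http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="

# string.ascii_uppercase is "ABC...XYZ", so this yields one URL per letter,
# 26 in total. range(1, 26) would have stopped at 25 and produced digits.
urls = [BASE_URL + letter for letter in string.ascii_uppercase]

print(urls[0])
print(urls[-1])
```

Each entry in urls could then be passed to urllib2.urlopen in place of the
"...alphaSearch=" + str(letter) expression.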

Secondly, the names in the HTML source aren't enclosed by '"><B>' and
'</B></A></nobr></td>'.
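
Whatever the real markers turn out to be, the extraction step itself could be
sketched as below. The helper name extract_between, the sample line and the
'<b>' / '</b>' markers are all made up for illustration; the actual markers
have to be read from the page source. Note also that the script tests for
'"><B>' but then searches for '"><B>"' (with an extra trailing quote), and
that x[start:end] would include the start marker in the result even when the
search succeeds.

```python
def extract_between(line, start_marker, end_marker):
    """Return the text between the two markers, or None if either is missing."""
    start = line.find(start_marker)
    if start == -1:
        # str.find returns -1 when the marker is absent; do not slice with it.
        return None
    start += len(start_marker)  # skip past the marker itself
    end = line.find(end_marker, start)
    if end == -1:
        return None
    return line[start:end].strip()


# Hypothetical sample line, NOT copied from the real page.
sample = '<td><a href="#"><b>Jane Doe</b></a></td>'
print(extract_between(sample, '<b>', '</b>'))
```

Checking for -1 before slicing avoids silently writing empty strings to the
output file when a marker is not found.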
--
http://mail.python.org/mailman/listinfo/python-list
