Hello all, I have two questions I'm hoping someone will have the patience to answer as an act of mercy.
I. How to get past a Terms of Service page?

I've just started learning Python (I have never done any programming before) and am trying to figure out how to open or download a website so I can scrape data from it. The only problem is that whenever I try to open the link I'm after (via urllib2, for example), I end up with the HTML of a Terms of Service page (where one has to click an "I Agree" button) rather than the actual target page. I've seen examples on the web of providing data for forms (typically by finding the name of the form and supplying a dictionary to fill in its fields), but this simple act of getting past "I Agree" is stumping me. Can anyone save my sanity?

As a workaround, I've been using

    os.popen('curl ' + url + ' > ' + filename)

to save the HTML to a text file for later processing. I have no idea why curl works where urllib2, for example, doesn't (I'm on OS X). I even tried Yahoo Pipes to sidestep coding anything altogether, but ended up looking at the same Terms of Service page anyway.

Here's the code, though it's probably not that illuminating, since it's basically just opening a URL:

    import urllib2

    url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'  # the first of 23 tables
    html = urllib2.urlopen(url).read()

II. How to parse HTML tables with lxml or BeautifulSoup? (for dummies)

Assuming I get past the Terms of Service, I'm a bit overwhelmed by the apparent need to know XPath, CSS, XML, the DOM, etc. to scrape data from the web. I've tried looking at the documentation included with the different Python libraries, but just got more confused. The basic tutorials show something like the following:

    from lxml import html

    doc = html.parse("/path/to/test.txt")  # the file I downloaded via curl
    root = doc.getroot()                   # what is this root business?
    tables = root.cssselect('table')

I understand that selecting all the table tags will somehow target however many tables are on the page. The problem is that the tables have multiple headers, empty cells, etc. Most of the examples on the web deal with scraping search results or other pages that only use tables for layout, not for structured data. Are there any resources out there, appropriate for web/Python illiterati like myself, that deal with structured data like the tables at the URL above?

FYI, the data at the URL above goes up in smoke every week, so I'm trying to capture it automatically on a weekly basis. Getting all of it into a CSV file or a database would be a personal cause for celebration, as it would be the first really useful thing I've done with Python since I started learning it a few months ago.
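In case it helps to show what I've been fumbling with for question I: from the form examples I mentioned, my understanding is that you keep cookies between requests and then POST the "I Agree" field back to the site. Here's a rough sketch of what I mean -- note that the form's action URL and the 'agree' field name below are pure guesses on my part (they'd have to be read out of the <form> and <input> tags on the actual Terms of Service page), so this is not working code:

    import urllib
    import urllib2
    import cookielib

    # keep cookies between requests so the site can remember that I accepted the ToS
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

    # just a guess: maybe the site treats urllib2's default User-Agent differently from curl's
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

    url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'

    # the first request comes back as the Terms of Service page (and hopefully sets a session cookie)
    tos_page = opener.open(url).read()

    # submit the "I Agree" form; the field name 'agree' and posting back to the same URL
    # are assumptions -- the real names would come from the form markup in tos_page
    form_data = urllib.urlencode({'agree': 'I Agree'})
    opener.open(url, form_data)   # passing data makes this a POST instead of a GET

    # with the acceptance cookie stored, this should now return the actual table page
    html = opener.open(url).read()

Is that roughly the right shape, or do these ToS pages usually work some other way?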
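And for the parsing step in question II, here is the kind of thing I've pieced together from the lxml tutorial for turning the table rows into a CSV file -- the filenames are just examples taken from my curl script below, and I haven't dealt with the repeated header rows yet:

    import csv
    from lxml import html

    doc = html.parse('20100101_1_i.txt')   # one of the files saved by the curl script (example name)
    root = doc.getroot()                   # getroot() returns the top-level <html> element of the parsed tree

    rows = []
    for table in root.cssselect('table'):      # every <table> element on the page
        for tr in table.cssselect('tr'):       # every row in that table
            # text_content() flattens whatever is inside each cell;
            # empty cells just come through as empty strings
            rows.append([cell.text_content().strip()
                         for cell in tr.cssselect('td, th')])

    # header rows end up mixed in with the data rows, so they still need filtering out
    out = open('table1.csv', 'wb')             # 'wb' because Python 2's csv module wants binary mode
    csv.writer(out).writerows(rows)
    out.close()

If that's roughly right, the remaining piece is weeding out the header rows and empty cells before writing the CSV.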
For anyone who is interested, here is the code that uses curl to pull the web pages. It basically just builds the URL for each of the table pages and saves each one to a file with a timestamped filename:

    import os
    from time import strftime

    BASE_URL = 'http://www.dtcc.com/products/derivserv/data_table_'
    SECTIONS = {'section1': {'select': 'i.php?id=table',   'id': range(1, 9)},
                'section2': {'select': 'ii.php?id=table',  'id': range(9, 17)},
                'section3': {'select': 'iii.php?id=table', 'id': range(17, 24)}
                }

    def get_pages():
        filenames = []
        path = '~/Dev/Data/DTCC_DerivServ/'
        #os.popen('cd ' + path)
        for section in SECTIONS:
            for id in SECTIONS[section]['id']:
                #urlList.append(BASE_URL + SECTIONS[section]['select'] + str(id))
                url = BASE_URL + SECTIONS[section]['select'] + str(id)
                timestamp = strftime('%Y%m%d_')
                #sectionName = BASE_URL.split('/')[-1]
                sectionNumber = SECTIONS[section]['select'].split('.')[0]
                tableNumber = str(id) + '_'
                filename = timestamp + tableNumber + sectionNumber + '.txt'
                os.popen('curl ' + url + ' > ' + path + filename)
                filenames.append(filename)
        return filenames

    if __name__ == '__main__':
        get_pages()

--
morenotestoself.wordpress.com