Joe Farro wrote:

> The package implements a DSL that is intended to make web-scraping a bit
> more maintainable :)
>
> I generally find my scraping code ends up being rather chaotic with
> querying, regex manipulations, conditional processing, conversions, etc.,
> ending up being too close together and sometimes interwoven. It's
> stressful.
Everything is cleaner than a bunch of regular expressions. It's just that
sometimes regexes give results more quickly, and those results are as
reliable as you can get without adding a JavaScript engine to your script.

> The DSL attempts to mitigate this by doing only two things:
> finding stuff and saving it as a string. The post-processing is left to be
> done down the pipeline. It's almost just a configuration file.
>
> Here is an example that would get the text and URL for every link in a
> page:
>
>     $ a
>         save each: links
>         | [href]
>             save: url
>         | text
>             save: link_text
>
> The result would be something along these lines:
>
>     {
>         'links': [
>             {
>                 'url': 'http://www.something.com/hm',
>                 'link_text': 'The text in the link'
>             },
>             # etc... another dict for each <a> tag
>         ]
>     }

With Beautiful Soup you could write this:

    import bs4

    soup = bs4.BeautifulSoup(...)
    links = [
        {
            "url": a["href"],
            "link_text": a.text
        }
        for a in soup("a")
    ]

and for many applications you wouldn't even bother with the intermediate
data structure. Can you give a real-world example where your DSL is
significantly cleaner than the corresponding code using bs4, or lxml.xpath,
or lxml.objectify?

> The hope is that having all the selectors in one place will make them more
> manageable and possibly simplify the post-processing.
>
> This is my first go at something along these lines, so any feedback is
> welcomed.

Your code on github looks good to me (though there are too few docstrings),
but like Alan I'm not prepared to read it completely. Do you have specific
questions?
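
Coming back to the regex remark at the top: a quick-and-dirty version of the
same link extraction can indeed be very short. This is only a sketch -- it
assumes the raw page source is already in a string called html and that the
href attributes are double-quoted, which a real parser wouldn't care about:

    import re

    # Grab (href, text) pairs straight out of the raw markup.
    # Fragile by design: breaks on single-quoted or unquoted attributes,
    # nested tags inside the link text, entities, etc.
    links = [
        {"url": url, "link_text": text}
        for url, text in re.findall(
            r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>', html, re.DOTALL
        )
    ]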
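
And for completeness, the lxml.xpath variant mentioned above comes out about
the same size as the bs4 one -- again just a sketch, assuming the page source
is already in a string called html:

    import lxml.html

    doc = lxml.html.fromstring(html)
    links = [
        {"url": a.get("href"), "link_text": a.text_content()}
        for a in doc.xpath("//a[@href]")
    ]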