Hi!
I'm writing a Scrapy spider that uses CrawlSpider to crawl sites, follow
their internal links, and scrape the contents of any external links (links
whose domain differs from the original domain).
I managed to do that with two rules, but they are based on the domain of the
site being crawled. If I want to run this on multiple websites I run into a
problem, because I don't know which "start_url" I'm currently on, so I can't
change the rules appropriately.
Here's what I came up with so far; it works for a single website, but I'm
not sure how to apply it to a list of websites:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HomepagesSpider(CrawlSpider):
    name = 'homepages'
    homepage = 'http://www.somesite.com'
    start_urls = [homepage]

    # Strip the scheme, a leading 'www.', and any trailing slash
    domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain.rstrip('/')

    rules = (
        Rule(LinkExtractor(allow_domains=(domain,)),
             callback='parse_internal', follow=True),
        Rule(LinkExtractor(deny_domains=(domain,)),
             callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # log internal page...
        pass

    def parse_external(self, response):
        # parse external page...
        pass
This can probably be done by just passing the start_url as an argument when
calling the scraper, but I'm looking for a way to do that programmatically
within the scraper itself.
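For example, I imagine something like this (a rough sketch, not yet wired
into Scrapy): precompute the domains of all the start URLs once, then
classify each link as internal or external by comparing domains, instead of
hard-coding a single domain into the rules:

    from urllib.parse import urlparse

    def domain_of(url):
        # 'http://www.somesite.com/page' -> 'somesite.com'
        netloc = urlparse(url).netloc
        return netloc[4:] if netloc.startswith('www.') else netloc

    start_urls = ['http://www.somesite.com', 'http://www.othersite.org']
    crawled_domains = {domain_of(u) for u in start_urls}

    def is_external(link_url):
        # External = the link's domain is not one of the crawled sites
        return domain_of(link_url) not in crawled_domains

These domains could then be fed to allow_domains/deny_domains when building
the rules, e.g. in the spider's __init__, though I'm not sure that's the
idiomatic Scrapy way.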
Any ideas? Thanks!
Simon.
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.