Hi!
I'm writing a Scrapy spider that uses CrawlSpider to crawl sites, follow
their internal links, and scrape the contents of any external links (links
whose domain differs from the original domain).
I managed to do that with two rules, but they are based on the domain of the
site being crawled. If I want to run this on multiple websites I run into a
problem, because I don't know which "start_url" I'm currently on, so I can't
change the rules appropriately.
Here's what I came up with so far; it works for a single website, but I'm
not sure how to apply it to a list of websites:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HomepagesSpider(CrawlSpider):
    name = 'homepages'
    homepage = 'http://www.somesite.com'
    start_urls = [homepage]

    # Strip the scheme, a leading 'www.', and any trailing slash
    domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain.rstrip('/')

    rules = (
        Rule(LinkExtractor(allow_domains=(domain,)),
             callback='parse_internal', follow=True),
        Rule(LinkExtractor(deny_domains=(domain,)),
             callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # log internal page...
        pass

    def parse_external(self, response):
        # parse external page...
        pass
This can probably be done by just passing the start_url as an argument when
calling the scraper, but I'm looking for a way to do that programmatically
within the scraper itself.
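For example, I imagine something like this (a rough sketch, not yet wired
into Scrapy): precompute the domains of all the start URLs once, then
classify each link as internal or external by comparing domains, instead of
hard-coding a single domain into the rules:

    from urllib.parse import urlparse

    def domain_of(url):
        # 'http://www.somesite.com/page' -> 'somesite.com'
        netloc = urlparse(url).netloc
        return netloc[4:] if netloc.startswith('www.') else netloc

    start_urls = ['http://www.somesite.com', 'http://www.othersite.org']
    crawled_domains = {domain_of(u) for u in start_urls}

    def is_external(link_url):
        # External = the link's domain is not one of the crawled sites
        return domain_of(link_url) not in crawled_domains

These domains could then be fed to allow_domains/deny_domains when building
the rules, e.g. in the spider's __init__, though I'm not sure that's the
idiomatic Scrapy way.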
Any ideas? Thanks!
Simon.
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.