Hello,

You probably want to use Splash for Requests that CrawlSpider generates 
from the rules.
See the `process_request` argument when defining CrawlSpider rules:
http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules

Something like this:

    rules = [
        Rule(SgmlLinkExtractor(allow=r'https://detail.ju.taobao.com/.*'),
             follow=False,
             process_request="use_splash"),

        Rule(SgmlLinkExtractor(allow=r'https://detail.tmall.com/item.htm.*'),
             callback="parse_link",
             process_request="use_splash"),
    ]

    def use_splash(self, request):
        request.meta['splash'] = {
            'endpoint': 'render.html',
            'args': {'wait': 0.5},
        }
        return request
    ...


See https://github.com/scrapy/scrapy/blob/master/scrapy/spiders/crawl.py#L64 
for the implementation details
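To make the mechanism concrete, here is a simplified, pure-Python sketch of what that code does (a hypothetical model for illustration only, not Scrapy's actual implementation): for each link a rule extracts, CrawlSpider builds a request and, when the rule has a `process_request` hook, passes the request through it and uses whatever the hook returns.

```python
# Simplified model of CrawlSpider applying a Rule's process_request hook.
# Requests are modeled as plain dicts here purely for illustration.

def apply_rule(requests, process_request=None):
    """Yield each request, passed through process_request when one is set."""
    for request in requests:
        if process_request is not None:
            request = process_request(request)
        yield request

def use_splash(request):
    # Same idea as the spider method above: tag the request for Splash.
    request.setdefault('meta', {})['splash'] = {
        'endpoint': 'render.html',
        'args': {'wait': 0.5},
    }
    return request

reqs = list(apply_rule([{'url': 'https://detail.tmall.com/item.htm?id=1'}],
                       process_request=use_splash))
```

Because the hook returns the (possibly modified) request, returning `None` from it would drop the request entirely, which is why the `return request` line in the spider method matters.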


Also note that SgmlLinkExtractor is no longer the recommended link extractor; 
the current recommendation is LinkExtractor from scrapy.linkextractors:
http://doc.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors
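If you do switch, the same rules could look roughly like this (an untested sketch, swapping in LinkExtractor with the rest of the rule arguments unchanged):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class JhsSpider(CrawlSpider):
    name = "jhsspy"
    rules = [
        Rule(LinkExtractor(allow=r'https://detail.ju.taobao.com/.*'),
             follow=False,
             process_request="use_splash"),

        Rule(LinkExtractor(allow=r'https://detail.tmall.com/item.htm.*'),
             callback="parse_link",
             process_request="use_splash"),
    ]
```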

Hope this helps.

Paul.

On Monday, November 2, 2015 at 12:00:13 PM UTC+1, Raymond Guo wrote:
>
> Hi:
> sorry, I'm not really familiar with Scrapy, but I had to use scrapyJS 
> to get rendered content.
> I noticed that you have a Scrapy Spider example, but I want to use 
> CrawlSpider. So I wrote this:
>
>
> class JhsSpider(CrawlSpider):
>     name = "jhsspy"
>     allowed_domains = ["taobao.com"]
>     start_urls = ["https://ju.taobao.com/"]
>     rules = [
>         Rule(SgmlLinkExtractor(allow=r'https://detail.ju.taobao.com/.*'),
>              follow=False),
>
>         Rule(SgmlLinkExtractor(allow=r'https://detail.tmall.com/item.htm.*'),
>              callback="parse_link"),
>     ]
>     def parse_link(self, response):
>         le = SgmlLinkExtractor()
>         for link in le.extract_links(response):
>             yield scrapy.Request(link.url, self.parse_item, meta={
>                 'splash': {
>                     'endpoint': 'render.html',
>                     'args': {'wait': 0.5},
>                 },
>             })
>
>     def parse_item(self, response):
>         ...get items with response...
>
>  
>
> but I ran into some problems and I'm not sure what caused them. So I want 
> to know: is it the right way to yield requests, like I did above?
>

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.