Re: Can scrapy achieve crawl job with RDBMS structured website through multiple spiders?

lnxpgn lnxpgn Thu, 17 Dec 2015 21:16:11 -0800

If you use a single spider, let the "meta" attribute of scrapy.http.Request
carry the crawled items to the next request, then continue crawling.
If you use multiple spiders, serialize the crawled items and put them into
Redis or some other store; the next spider fetches and deserializes these
items and continues from there.
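
A rough sketch of the multi-spider handoff, under some assumptions: the
item spider pushes each owner path it finds onto a shared queue as JSON, and
an owner spider later pops and deserializes the entries. InMemoryStore is a
hypothetical stand-in for a Redis client that only mimics the sadd, rpush and
lpop calls used here; the key names, function names and the "first_seen_on"
field are made up for illustration.

```python
import json
from collections import defaultdict, deque


class InMemoryStore:
    """Hypothetical stand-in for a Redis client (only the calls used below)."""

    def __init__(self):
        self._lists = defaultdict(deque)
        self._sets = defaultdict(set)

    def sadd(self, key, value):
        """Add value to a set; return 1 if it was new, 0 if already present."""
        if value in self._sets[key]:
            return 0
        self._sets[key].add(value)
        return 1

    def rpush(self, key, value):
        """Append value to the tail of a list; return the new length."""
        self._lists[key].append(value)
        return len(self._lists[key])

    def lpop(self, key):
        """Remove and return the head of a list, or None if it is empty."""
        queue = self._lists[key]
        return queue.popleft() if queue else None


def publish_owner_links(store, item_id, owner_paths):
    """Item-spider side: queue each owner path once, serialized as JSON."""
    for path in owner_paths:
        # The set makes each owner unique even if many items link to it.
        if store.sadd("owner:seen", path):
            payload = json.dumps({"path": path, "first_seen_on": item_id})
            store.rpush("owner:queue", payload)


def drain_owner_queue(store):
    """Owner-spider side: pop and deserialize queued entries in order."""
    while True:
        raw = store.lpop("owner:queue")
        if raw is None:
            break
        yield json.loads(raw)


store = InMemoryStore()
publish_owner_links(store, 1, ["/owner/7", "/owner/9"])
publish_owner_links(store, 2, ["/owner/7"])  # already seen, not queued again
pending = list(drain_owner_queue(store))
```

In a real crawl the item spider would call something like
publish_owner_links from its parse callback, and the owner spider would turn
each popped entry into a request. The same pattern would apply to
collections, comments and users, each with its own pair of keys.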


2015-12-17 17:41 GMT+08:00 Peng Liu <[email protected]>:

> I've posted this problem
> <http://stackoverflow.com/questions/34330372/scrapy-different-spider-for-different-type-item>
> on stackoverflow.com; the content is below.
>
> I think the Scrapy framework <http://scrapy.org/> might be a little
> inflexible, and I can't find a good solution for my issue.
>
> Here's the issue I'm facing now.
>
> There's a website, let's say http://example.com/. I want to scrape
> some information from it.
>
> It has many items whose URLs have the form
> http://example.com/item/([0-9]+).
> For now I *have* the list of valid ([0-9]+) values, about *3 million*
> index IDs, so it might seem to be a simple mission to scrape all of
> those pages.
>
> *But*, the structure of this mission is like this:
>
>    - There is a lot of data about the item on the /item/ page. I want this
>    information, and it is simple to get.
>    - There are links to entities related to the item, for example the item
>    owner with link path /owner/, or the collections the item belongs to with
>    link path /collection/, and so on. I want all the *unique* information
>    about these entities, which is hard to achieve. They shouldn't be nested
>    inside the item or scraped by a single spider, for the reasons below:
>       - a *single* owner has [1-n] items.
>       - a *single* item has [1-n] owners.
>       - the same holds for collections and items.
>    - There are links to other entities related to the item, for
>    example comments with link path /comment/, or users who like it with
>    link path /user/. Obviously, it's wise to split commenter and user
>    information away from the item and use a *key or index* to refer to each
>    entity. This is hard to achieve with a single spider.
>
> So, I would prefer to start one spider to handle the list of
> http://example.com/item/([0-9]+),
> and use other types of spiders to handle item owner, collection,
> comment, and user respectively.
>
> *But*, the problem is I don't have the lists of item owners, collections,
> comments, and users. I can only reach these entities by
> iterating over the http://example.com/item/([0-9]+) pages.
>
> I have googled a lot but found no solution that fits my issue. Please feel
> free to share your opinion.
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>

