If you use a single spider, use the "meta" attribute of scrapy.http.Request to carry the crawled item to the next request and continue crawling. If you use multiple spiders, serialize the crawled items and put them into Redis or some other store; the next spider then fetches and deserializes these items and continues from there.
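A rough sketch of the multi-spider hand-off, in case it helps. To keep it runnable without a server, `FakeRedis` is an in-memory stand-in written for this example; its `sadd`, `lpush` and `rpop` methods only imitate the redis-py calls of the same names. The key names, the `hand_off`/`drain` helpers and the `/owner/<id>` URL shape are assumptions for illustration, not anything Scrapy provides.

```python
import json
from collections import defaultdict, deque


class FakeRedis:
    """In-memory stand-in (hypothetical) for the few Redis calls used below."""

    def __init__(self):
        self._lists = defaultdict(deque)
        self._sets = defaultdict(set)

    def sadd(self, name, value):
        # Returns 1 if the value was new, 0 if it was already in the set.
        if value in self._sets[name]:
            return 0
        self._sets[name].add(value)
        return 1

    def lpush(self, name, value):
        # Push onto the head of the list and return the new length.
        self._lists[name].appendleft(value)
        return len(self._lists[name])

    def rpop(self, name):
        # Pop from the tail; returns None when the list is empty.
        return self._lists[name].pop() if self._lists[name] else None


def hand_off(store, kind, entity):
    """Item-spider side: queue an entity only the first time its id is seen."""
    if store.sadd("seen:" + kind, entity["id"]):
        store.lpush("queue:" + kind, json.dumps(entity))
        return True
    return False


def drain(store, kind):
    """Next-spider side: pop and deserialize queued entities, oldest first."""
    while True:
        raw = store.rpop("queue:" + kind)
        if raw is None:
            return
        yield json.loads(raw)


store = FakeRedis()
# Three item pages, two of which link to owner 7: it is queued only once.
for owner_id in (7, 9, 7):
    hand_off(store, "owner", {
        "id": owner_id,
        "url": "http://example.com/owner/%d" % owner_id,  # assumed URL shape
    })

owners = list(drain(store, "owner"))
print([o["id"] for o in owners])  # [7, 9]
```

With a real Redis the item spider would call something like `hand_off` from its parse callback, and the owner spider would build its start requests from `drain`. The "seen" set is what keeps each owner, collection or user unique even though many items link to it.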
2015-12-17 17:41 GMT+08:00 Peng Liu <[email protected]>:
> I've posted this problem
> <http://stackoverflow.com/questions/34330372/scrapy-different-spider-for-different-type-item>
> onto stackoverflow.com, here's the content below.
>
> I think the framework of scrapy <http://scrapy.org/> might be a little
> inflexible. And I can't find good solution for my issue.
>
> Here's the issue I'm facing now.
>
> There's a website, let's say, to be, http://example.com/. I want to scrap
> some information from it.
>
> It has many items which are urls in form of
> http://example.com/item/([0-9]+),
> for now I *have* the list of the valid ([0-9]+) which has about *3 million*
> index ids, it might seems to be a simple mission to complete the whole
> webpage scrapping work.
>
> *But*, the structure of this mission is like this:
>
> - there are many data of the item on the page of /item/. I want these
>   information, this is simple to achieve.
> - there are links refer to the entity related to the item, for example item
>   owner with link path /owner/, or the collections the item belongs with
>   link path /collection/ and so on. I want all the *unique* information
>   of these entities, which is hard to achieve. They shouldn't be the nested
>   item of item or scrapped by single spider because of the reason below:
>   - *single* owner have [1-n] items.
>   - *single* item have [1-n] owners.
>   - same as collection with item.
> - there are links refer to other entity related to the item, for
>   example, comment with link path /comment/ or user who like it with
>   link path /user/. Obviously, it's wise to split comment or user information
>   away from item and use *key or index* to refer to entity. This is hard
>   to achieve by single spider.
>
> So, I prefer to start a spider to handle the list of
> http://example.com/item/([0-9]+),
> and use other type of spiders to handle with item owner, collection,
> comment, and user respectively.
>
> *But*, the problem is I don't have the list of item owner, collection,
> comment, and user. I could go through all of these entities only by
> iterate the webpage of http://example.com/item/([0-9]+).
>
> I have googled a lot but found no solution to fit my issue. Please feel
> free to give your opinion out.
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
