Below is a sample piece of HTML that I want to scrape with Scrapy.
<body><h2 class="post-title entry-title">Sample Header</h2>
<div class="entry clearfix">
<div class="sample1">
<p>Hello</p>
</div>
<!--start comment-->
<div class="sample2">
<p>World</p>
</div>
<!--end comment-->
</div><ul class="post-categories"><li><a
href="123.html">Category1</a></li><li><a
href="456.html">Category2</a></li><li><a
href="789.html">Category3</a></li></ul></body>
Right now I am using the following Scrapy code, which works:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem


class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://sample.com']
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost'),
    ]

    def parse_blogpost(self, response):
        hxs = HtmlXPathSelector(response)
        item = IsBullshitItem()
        item['title'] = hxs.select('//h2[@class="post-title entry-title"]/text()').extract()[0]
        item['tag'] = hxs.select('//ul[@class="post-categories"]/li[1]/a/text()').extract()[0]
        item['article_html'] = hxs.select("//div[@class='entry clearfix']").extract()[0]
        return item
It gives me the following XML output:
<?xml version="1.0" encoding="utf-8"?><items>
<item>
<article_html>
<div class="entry clearfix">
<div class="sample1">
<p>Hello</p>
</div>
<!--start comment-->
<div class="sample2">
<p>World</p>
</div>
<!--end comment-->
</div>
</article_html>
<tag>
Category1
</tag>
<title>
Sample Header
</title>
</item></items>
I want to know how to achieve the following output:
<?xml version="1.0" encoding="utf-8"?><items>
<item>
<article_html>
<div class="entry clearfix">
<div class="sample1">
<p>Hello</p>
</div>
<!--start comment-->
<!--end comment-->
</div>
</article_html>
<tag>
Category1,Category2,Category3
</tag>
<title>
Sample Header
</title>
</item></items>
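For the article_html part, one possible direction is to post-process the extracted fragment and drop the unwanted div before storing it. The following is only a rough sketch using the Python 3.8+ standard library, not something confirmed inside the spider: the helper name drop_by_class is made up for illustration, and it assumes the fragment is well-formed XML, which real pages often are not (lxml.html, which Scrapy already depends on, would probably be more forgiving).

```python
import xml.etree.ElementTree as ET

# The fragment as extracted by the "entry clearfix" XPath in the spider.
ARTICLE = """<div class="entry clearfix">
<div class="sample1">
<p>Hello</p>
</div>
<!--start comment-->
<div class="sample2">
<p>World</p>
</div>
<!--end comment-->
</div>"""


def drop_by_class(markup, class_name):
    """Hypothetical helper: remove every element whose class attribute
    equals class_name exactly, keeping comments in place."""
    # insert_comments=True (Python 3.8+) keeps <!-- ... --> nodes in the tree;
    # by default ElementTree silently discards them.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(markup, parser=parser)
    # Snapshot the nodes first so removal does not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("class") == class_name:
                # remove() also discards the element's tail text
                # (here just the newline that followed the div).
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


print(drop_by_class(ARTICLE, "sample2"))
```

In the spider this would be applied to the string returned by .extract()[0] before assigning it to item['article_html']. Note that the match is on the exact class value, so an element with class="sample2 other" would not be removed by this sketch.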
Note: The number of categories depends on the post. In the above example,
there are 3 categories, but a post could have more or fewer.
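Since the category count varies, the tag field probably needs every li rather than only li[1]. The sketch below shows the idea on the sample markup with the Python 3 standard library only; join_categories is a made-up name, and it again assumes the markup parses as XML.

```python
import xml.etree.ElementTree as ET

# Category list copied from the sample page above.
SAMPLE = """<body><ul class="post-categories"><li><a
href="123.html">Category1</a></li><li><a
href="456.html">Category2</a></li><li><a
href="789.html">Category3</a></li></ul></body>"""


def join_categories(markup):
    """Hypothetical helper: collect the text of every category link
    and join the values with commas, however many there are."""
    root = ET.fromstring(markup)
    # No [1] index on li, so all list items are matched.
    links = root.findall(".//ul[@class='post-categories']/li/a")
    return ",".join(a.text.strip() for a in links if a.text)


tags = join_categories(SAMPLE)
print(tags)  # Category1,Category2,Category3
```

Inside parse_blogpost the equivalent would presumably be to drop the [1] from the XPath and join the whole list that .extract() returns, instead of taking element [0].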
Help would be much appreciated. Cheers.