Hi - use the domain URL filter plugin (urlfilter-domain) and list the domains, hosts, or TLDs you want to restrict the crawl to.
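As a minimal sketch of that setup (the plugin list below is an example value, not necessarily your current one): make sure urlfilter-domain is included in plugin.includes in conf/nutch-site.xml:

```xml
<!-- conf/nutch-site.xml: load the domain URL filter alongside the other plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Then list the domains/hosts to keep, one per line, in conf/domain-urlfilter.txt (e.g. a line containing just techcrunch.com); URLs whose host does not match an entry are rejected before fetching.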

-----Original message-----
> From:Vivekanand Ittigi <vi...@biginfolabs.com>
> Sent: Tuesday 29th July 2014 7:17
> To: solr-user@lucene.apache.org
> Subject: crawling all links of same domain in nutch in solr
> 
> Hi,
> 
> Can anyone tell me how to crawl all other pages of the same domain.
> For example, I'm feeding a website http://www.techcrunch.com/ in seed.txt.
> 
> Following property is added in nutch-site.xml
> 
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored.  This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
> 
> And following is added in regex-urlfilter.txt
> 
> # accept anything else
> +.
> 
> Note: if i add http://www.tutorialspoint.com/ in seed.txt, I'm able to
> crawl all other pages but not techcrunch.com's pages though it has got many
> other pages too.
> 
> Please help..?
> 
> Thanks,
> Vivek
> 
