Hi - use the domain URL filter plugin (urlfilter-domain) and list the domains, hosts, or TLDs you want to restrict the crawl to.
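For example, something along these lines (a sketch against a stock Nutch install - the plugin list in plugin.includes below is illustrative, so merge urlfilter-domain into whatever value you already have rather than copying this verbatim):

```
<!-- nutch-site.xml: make sure urlfilter-domain is among the active plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

# conf/domain-urlfilter.txt: one domain, host, or TLD per line
techcrunch.com
```

With that in place, URLs outside the listed domains are rejected before they ever reach the fetch queue.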
-----Original message-----
> From: Vivekanand Ittigi <vi...@biginfolabs.com>
> Sent: Tuesday 29th July 2014 7:17
> To: solr-user@lucene.apache.org
> Subject: crawling all links of same domain in nutch in solr
>
> Hi,
>
> Can anyone tell me how to crawl all other pages of the same domain?
> For example, I'm feeding the website http://www.techcrunch.com/ in seed.txt.
>
> The following property is added in nutch-site.xml:
>
> <property>
> <name>db.ignore.internal.links</name>
> <value>false</value>
> <description>If true, when adding new links to a page, links from
> the same host are ignored. This is an effective way to limit the
> size of the link database, keeping only the highest quality
> links.
> </description>
> </property>
>
> And the following is added in regex-urlfilter.txt:
>
> # accept anything else
> +.
>
> Note: if I add http://www.tutorialspoint.com/ in seed.txt, I'm able to
> crawl all its other pages, but not techcrunch.com's pages, though it has
> many other pages too.
>
> Please help..?
>
> Thanks,
> Vivek
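Since regex-urlfilter.txt is already in play above, another option is to replace the catch-all `+.` rule with a pattern that accepts only the seed domain (a sketch - the exact pattern depends on which subdomains you want to allow):

```
# regex-urlfilter.txt: accept techcrunch.com and its subdomains only
+^https?://([a-z0-9.-]+\.)?techcrunch\.com/

# reject everything else
-.
```

Rules are applied top to bottom and the first match wins, so the accept rule must come before the final reject-all line.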