Hi Charan, Thanks for the clarifications.
The link I have been referring to
(http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) does not say
anything about using the crawl command. Do I have to run it after the last
step mentioned there?

Thanks,
Abi

On Thu, Feb 10, 2011 at 12:58 AM, charan kumar <charan.ku...@gmail.com> wrote:

> Hi Abishek,
>
> depth is a param of the crawl command, not the fetch command.
>
> If you are using a custom script calling the individual stages of a nutch
> crawl, then depth N means you are running that script N times. You can put
> a loop in the script.
>
> Thanks,
> Charan
>
> On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. <ab1s...@gmail.com> wrote:
>
> > Hi Erick,
> >
> > Thanks a bunch for the response.
> >
> > Could be a chance, but all I am wondering is where to specify the depth
> > in the whole process described at
> > http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/. I tried
> > specifying it during the fetcher phase but it was just ignored :(
> >
> > Thanks,
> > Abi
> >
> > On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > > WARNING: I don't do Nutch much, but could it be that your
> > > crawl depth is 1? See:
> > > http://wiki.apache.org/nutch/NutchTutorial
> > > and search for "depth".
> > >
> > > Best
> > > Erick
> > >
> > > On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. <ab1s...@gmail.com> wrote:
> > >
> > > > Hi Markus,
> > > >
> > > > I am sorry for not being clear, I meant to say that...
> > > >
> > > > Suppose a URL, namely www.somehost.com/gifts/greetingcard.html (which
> > > > in turn contains links to a.html, b.html, c.html, d.html), is injected
> > > > into the seed.txt. After the whole process I was expecting a bunch of
> > > > other pages crawled from this seed URL.
> > > > However, at the end of it all I see the contents of only this page,
> > > > namely www.somehost.com/gifts/greetingcard.html, and I do not see any
> > > > of the other pages (here a.html, b.html, c.html, d.html) crawled from
> > > > this one.
> > > >
> > > > The crawling happens only for the URLs mentioned in seed.txt and does
> > > > not proceed further from there, so I am a bit confused. Why is it not
> > > > crawling the linked pages (a.html, b.html, c.html and d.html)? I get a
> > > > feeling that I am missing something that the author of the blog
> > > > (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
> > > > everyone would know.
> > > >
> > > > Thanks,
> > > > Abi
> > > >
> > > > On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > >
> > > > > The parsed data is only sent to the Solr index if you tell a segment
> > > > > to be indexed: solrindex <crawldb> <linkdb> <segment>
> > > > >
> > > > > If you did this only once after injecting and then the consequent
> > > > > fetch, parse, update, index sequence, then you, of course, only see
> > > > > those URLs. If you don't index a segment after it has been parsed,
> > > > > you need to do it later on.
> > > > >
> > > > > On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > I am a newbie to nutch and solr. Well, relatively much newer to
> > > > > > Solr than Nutch :)
> > > > > >
> > > > > > I have been using nutch for the past two weeks, and I wanted to know
> > > > > > if I can query or search on my nutch crawls on the fly (before the
> > > > > > crawl completes). I am asking this because the websites I am
> > > > > > crawling are really huge and it takes around 3-4 days for a crawl
> > > > > > to complete.
> > > > > > I want to analyze some quick results while the nutch crawler is
> > > > > > still crawling the URLs. Someone suggested that Solr would make
> > > > > > this possible.
> > > > > >
> > > > > > I followed the steps in
> > > > > > http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for
> > > > > > this. By this process, I see only the injected URLs shown in the
> > > > > > Solr search. I know I did something really foolish and the crawl
> > > > > > never happened; I feel I am missing some information here. I think
> > > > > > somewhere in the process there should be a crawl happening and I
> > > > > > missed it.
> > > > > >
> > > > > > Just wanted to see if someone could help by pointing out where I
> > > > > > went wrong in the process. Forgive my foolishness and thanks for
> > > > > > your patience.
> > > > > >
> > > > > > Cheers,
> > > > > > Abi
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350
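
For reference, the "run the individual stages N times in a loop" idea from
Charan's reply, combined with Markus's point about indexing each segment,
could be sketched roughly as below. This is only a sketch under assumptions,
not the blog author's script: the bin/nutch path, the crawl/ directory layout
and the Solr URL are placeholders, and the argument order for solrindex
(Solr URL first) is what Nutch 1.x releases of that period accept as far as
I know, so check the usage message of your own bin/nutch before relying on
it. Whether a separate parse step is needed also depends on how the fetcher
is configured.

```python
import os
import subprocess

NUTCH = "bin/nutch"  # assumed location of the Nutch launcher script


def newest_segment(segments_dir):
    """Return the path of the most recently generated segment.

    Nutch 1.x names segment directories by timestamp, so the
    lexicographically largest directory name is the newest segment.
    """
    names = sorted(os.listdir(segments_dir))
    if not names:
        raise RuntimeError("no segments found in " + segments_dir)
    return os.path.join(segments_dir, names[-1])


def round_commands(crawldb, linkdb, segment, solr_url):
    """Build the commands for one round, run on a freshly generated segment."""
    return [
        [NUTCH, "fetch", segment],
        [NUTCH, "parse", segment],
        [NUTCH, "updatedb", crawldb, segment],   # adds newly found links
        [NUTCH, "invertlinks", linkdb, segment],
        # Argument order assumed from Nutch 1.x usage; verify locally.
        [NUTCH, "solrindex", solr_url, crawldb, linkdb, segment],
    ]


def crawl(depth, crawl_dir="crawl",
          solr_url="http://localhost:8983/solr/",
          run=subprocess.check_call):
    """Run `depth` rounds of generate/fetch/parse/updatedb/index.

    Assumes the seed URLs were already injected into the crawldb.
    `run` is injectable so the command plan can be inspected without
    actually invoking Nutch.
    """
    crawldb = os.path.join(crawl_dir, "crawldb")
    linkdb = os.path.join(crawl_dir, "linkdb")
    segments = os.path.join(crawl_dir, "segments")
    for _ in range(depth):
        run([NUTCH, "generate", crawldb, segments])
        segment = newest_segment(segments)
        for cmd in round_commands(crawldb, linkdb, segment, solr_url):
            run(cmd)
```

With the seeds injected, calling crawl(3) would correspond roughly to a
depth-3 crawl: round one fetches the seed pages, updatedb records their
outlinks, and rounds two and three fetch what was discovered. Because each
round ends with solrindex, the pages fetched so far become searchable in
Solr before the whole crawl finishes. The sketch does not handle the case
where generate finds nothing left to fetch and creates no new segment.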