Okay. I changed my solrindex to this:

bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/20140115143147

I got the same errors:

Indexer: org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
Input path does not exist: file:/.../crawl/linkdb/crawl_parse
Input path does not exist: file:/.../crawl/linkdb/parse_data
Input path does not exist: file:/.../crawl/linkdb/parse_text

Along with a Java stacktrace. Those linkdb folders are not being created.
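For reference, this is the full sequence as I understand Markus's suggestions so far (a sketch I have not verified yet; the paths and segment timestamp are from my setup):

# build the LinkDB from all segments (my guess at the right form)
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# index one segment, passing the LinkDB explicitly so index-anchor can use it
bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20140115143147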
-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, January 16, 2014 10:44 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Hi - you cannot use wildcards for segments. You need to give one segment or a -dir segments_dir. Check the usage of your indexer command.
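For example, with the layout from your mails that would be either of these (a sketch, untested here; use a segment directory that actually exists):

# one explicit segment
bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20140115143147

# or all segments via -dir
bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments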
-----Original message-----
> From:Teague James <teag...@insystechinc.com>
> Sent: Thursday 16th January 2014 16:43
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> Hello Markus,
>
> I do get a linkdb folder in the crawl folder - it is created automatically
> by Nutch at the time I execute the command. I just tried to use solrindex
> against yesterday's crawl and did not get any errors, but did not get the
> anchor field or any of the outlinks. I used this command:
>
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
>
> I then tried:
>
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
>
> This produced the following errors:
>
> Indexer: org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
> Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> Input path does not exist: file:/.../crawl/linkdb/parse_data
> Input path does not exist: file:/.../crawl/linkdb/parse_text
>
> Along with a Java stacktrace.
>
> So I tried invertlinks as you had previously suggested. No errors, but the
> above missing directories were not created. Using the same solrindex
> command above this one produced the same errors.
>
> When/How are the missing directories supposed to be created?
>
> I really appreciate the help! Thank you very much!
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Thursday, January 16, 2014 5:45 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> -----Original message-----
> > From:Teague James <teag...@insystechinc.com>
> > Sent: Wednesday 15th January 2014 22:01
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing URLs from websites
> >
> > I am still unsuccessful in getting this to work. My expectation is that
> > the index-anchor plugin should produce values for the anchor field.
> > However, this field is not showing up in my Solr index no matter what I
> > try.
> >
> > Here's what I have in my nutch-site.xml for plugins:
> >
> > <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >
> > I am using the schema-solr4.xml from the Nutch package and I added the
> > _version_ field.
> >
> > Here's the command I'm running:
> >
> > bin/nutch crawl urls -solr http://localhost/solr -depth 3 -topN 50
> >
> > The fields that Solr returns are:
> > content, title, segment, boost, digest, tstamp, id, url, and _version_
> >
> > Note that the url field is the url of the page being indexed and not the
> > url(s) of the documents that may be outlinks on that page. It is the
> > outlinks that I am trying to get into the index.
> >
> > What am I missing? I also tried using the invertlinks command that
> > Markus suggested, but that did not work either, though I do appreciate
> > the suggestion.
>
> That did get you a LinkDB, right? You need to call solrindex and pass the
> LinkDB's location as part of the arguments; only then does Nutch know
> about it and use the data contained in the LinkDB, together with the
> index-anchor plugin, to write the anchor field in your Solr index.
>
> > Any help is appreciated! Thanks!
> >
> > <Markus Jelsma> Wrote:
> > You need to use the invertlinks command to build a database of docs with
> > inlinks and anchors. Then use the index-anchor plugin when indexing.
> > Then you will have a multivalued field with the anchors pointing to your
> > document.
> >
> > <Teague James> Wrote:
> > I am trying to index a website that contains links to documents such as
> > PDF, Word, etc. The intent is to be able to store the URLs for the links
> > to the documents.
> >
> > For example, when indexing www.example.com, which has links on the page
> > like "Example Document" pointing to www.example.com/docs/example.pdf, I
> > want Solr to store the text of the link, "Example Document", and the URL
> > for the link, "www.example.com/docs/example.pdf", in separate fields.
> > I've tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed
> > the page content, but I am not getting the URLs from the links. There
> > are no document type restrictions in Nutch for PDF or Word. Any
> > suggestions on how I can accomplish this? Should I use a different
> > method than Nutch for crawling the site?
> >
> > I appreciate any help on this!