Hello Markus,

I do get a linkdb folder in the crawl folder - but it is created automatically by Nutch at the time I execute the command. I just tried to use solrindex against yesterday's crawl and did not get any errors, but I did not get the anchor field or any of the outlinks. I used this command:

bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
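One way to check whether the LinkDb actually contains any anchor data - a rough sketch, assuming the paths above and Nutch 1.x's readlinkdb tool, where linkdump is just a scratch output directory - is to dump it and inspect the result:

bin/nutch readlinkdb crawl/linkdb -dump linkdump
head linkdump/part-00000

If the dump is empty, note that db.ignore.internal.links defaults to true, so a crawl that stays on a single host may yield no inlinks or anchors at all.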
I then tried:

bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

This produced the following errors, along with a Java stacktrace:

Indexer: org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
Input path does not exist: file:/.../crawl/linkdb/crawl_parse
Input path does not exist: file:/.../crawl/linkdb/parse_data
Input path does not exist: file:/.../crawl/linkdb/parse_text

So I tried invertlinks as you had previously suggested. No errors, but the missing directories above were not created. Running the same solrindex command as above produced the same errors. When/how are the missing directories supposed to be created?

I really appreciate the help! Thank you very much!

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, January 16, 2014 5:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

-----Original message-----
> From: Teague James <teag...@insystechinc.com>
> Sent: Wednesday 15th January 2014 22:01
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing URLs from websites
>
> I am still unsuccessful in getting this to work. My expectation is that
> the index-anchor plugin should produce values for the anchor field.
> However, this field is not showing up in my Solr index no matter what I
> try.
>
> Here's what I have in my nutch-site.xml for plugins:
> <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-basic|site|optic|urlnormalizer-(pass|reges|basic)</value>
>
> I am using the schema-solr4.xml from the Nutch package and I added the
> _version_ field.
>
> Here's the command I'm running:
> bin/nutch crawl urls -solr http://localhost/solr -depth 3 -topN 50
>
> The fields that Solr returns are:
> content, title, segment, boost, digest, tstamp, id, url, and _version_
>
> Note that the url field is the URL of the page being indexed and not the
> URL(s) of the documents that may be outlinks on that page. It is the
> outlinks that I am trying to get into the index.
>
> What am I missing? I also tried using the invertlinks command that Markus
> suggested, but that did not work either, though I do appreciate the
> suggestion.

That did get you a LinkDb, right? You need to call solrindex and pass the linkdb's location as part of the arguments; only then does Nutch know about it and use the data contained in the LinkDb, together with the index-anchor plugin, to write the anchor field into your Solr index.
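Concretely, the linkdb has to be passed via the -linkdb flag rather than as a bare argument. A rough sketch of the Nutch 1.7 invocation (hedged from memory of its usage output):

bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] (<segment> ... | -dir <segments>)

Passed positionally, crawl/linkdb is treated as just another segment, which is why the indexer went looking for the segment subdirectories crawl_fetch, crawl_parse, parse_data, and parse_text inside it. Those directories live under each crawl/segments/* directory, not under the linkdb, so invertlinks was never supposed to create them there.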
> Any help is appreciated! Thanks!
>
> <Markus Jelsma> Wrote:
> You need to use the invertlinks command to build a database with docs
> with inlinks and anchors. Then use the index-anchor plugin when
> indexing. Then you will have a multivalued field with anchors pointing
> to your document.
>
> <Teague James> Wrote:
> I am trying to index a website that contains links to documents such as
> PDF, Word, etc. The intent is to be able to store the URLs for the links
> to the documents.
>
> For example, when indexing www.example.com, which has links on the page
> like "Example Document" pointing to www.example.com/docs/example.pdf, I
> want Solr to store the text of the link, "Example Document", and the URL
> for the link, "www.example.com/docs/example.pdf", in separate fields.
> I've tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed
> the page content, but I am not getting the URLs from the links. There
> are no document type restrictions in Nutch for PDF or Word. Any
> suggestions on how I can accomplish this? Should I use a different
> method than Nutch for crawling the site?
>
> I appreciate any help on this!
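A related diagnostic for the outlink question - a sketch assuming the Nutch 1.7 layout above, where <segment> stands for one timestamped directory under crawl/segments - is to dump a segment's parse data and confirm the outlinks were extracted at all; each record's ParseData block lists its Outlinks together with their anchor text:

bin/nutch readseg -dump crawl/segments/<segment> segdump -nocontent -nofetch -nogenerate -noparsetext
less segdump/dump

If the outlinks appear there but not in Solr, the problem is on the indexing side (linkdb plus index-anchor) rather than in the parse.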