Re: Indexing URLs from websites

Teague James Wed, 15 Jan 2014 13:01:48 -0800

I am still unable to get this to work. My expectation is that the
index-anchor plugin should produce values for the anchor field. However, this
field does not show up in my Solr index no matter what I try.


Here's what I have in my nutch-site.xml for plugins:
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(
basic|site|url)|indexer-solr|response-(json|xml)|summary-basic|scoring-opic|
urlnormalizer-(pass|regex|basic)</value>

I am using the schema-solr4.xml from the Nutch package, and I added the
_version_ field.

Here's the command I'm running:
bin/nutch crawl urls -solr http://localhost/solr -depth 3 -topN 50

The fields that Solr returns are:
content, title, segment, boost, digest, tstamp, id, url, and _version_

Note that the url field is the url of the page being indexed and not the
url(s) of the documents that may be outlinks on that page. It is the
outlinks that I am trying to get into the index.

What am I missing? I also tried using the invertlinks command that Markus
suggested, but that did not work either, though I do appreciate the
suggestion.

Any help is appreciated! Thanks!

<Markus Jelsma> Wrote:
You need to use the invertlinks command to build a link database that maps
each document to its inlinks and their anchors. Then use the index-anchor
plugin when indexing. You will then have a multivalued field containing the
anchors that point to your document.
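
To illustrate the idea behind link inversion, here is a minimal Python sketch.
It is not Nutch code: the function name invert_links and the sample data are
made up for illustration, and it only assumes that each crawled page is known
together with its outlinks as (target URL, anchor text) pairs.

```python
from collections import defaultdict


def invert_links(outlinks_by_page):
    """Map each target URL to the (source URL, anchor text) pairs pointing at it."""
    inlinks = defaultdict(list)
    for source_url, outlinks in outlinks_by_page.items():
        for target_url, anchor_text in outlinks:
            inlinks[target_url].append((source_url, anchor_text))
    return dict(inlinks)


# Hypothetical crawl data: one page with one outlink, as in the example
# given elsewhere in this thread.
pages = {
    "http://www.example.com/": [
        ("http://www.example.com/docs/example.pdf", "Example Document"),
    ],
}

inlinks = invert_links(pages)
for target, sources in inlinks.items():
    # Collect only the anchor texts, which is what an anchor field would hold.
    anchors = [anchor for _source, anchor in sources]
    print(target, anchors)
```

Note that in this sketch the anchor text ends up attached to the linked
document (example.pdf), not to the page that contains the link. If Nutch
behaves the same way, the anchor field would only appear on documents that
have inlinks and that were themselves fetched and indexed.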

<Teague James> Wrote:
I am trying to index a website that contains links to documents such as PDF,
Word, etc. The intent is to be able to store the URLs for the links to the
documents. 

For example, when indexing www.example.com which has links on the page like
"Example Document" which points to www.example.com/docs/example.pdf, I want
Solr to store the text of the link, "Example Document", and the URL for the
link, "www.example.com/docs/example.pdf" in separate fields. I've tried
using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page
content, but I am not getting the URLs from the links. There are no document
type restrictions in Nutch for PDF or Word. Any suggestions on how I can
accomplish this? Should I use a different method than Nutch for crawling the
site?

I appreciate any help on this!
