RE: Indexing URLs from websites

Markus Jelsma Thu, 16 Jan 2014 02:46:07 -0800

 
-----Original message-----
> From:Teague James <[email protected]>
> Sent: Wednesday 15th January 2014 22:01
> To: [email protected]
> Subject: Re: Indexing URLs from websites
> 
> I am still unsuccessful in getting this to work. My expectation is that the
> index-anchor plugin should produce values for the field anchor. However this
> field is not showing up in my Solr index no matter what I try.
> 
> Here's what I have in my nutch-site.xml for plugins:
> <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(
> basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|
> urlnormalizer-(pass|reges|basic)</value>
> 
> I am using the schema-solr4.xml from the Nutch package and I added the
> _version_ field
> 
> Here's the command I'm running:
> Bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50
> 
> The fields that Solr returns are:
> Content, title, segment, boost, digest, tstamp, id, url, and _version_
> 
> Note that the url field is the url of the page being indexed and not the
> url(s) of the documents that may be outlinks on that page. It is the
> outlinks that I am trying to get into the index.
> 
> What am I missing? I also tried using the invertlinks command that Markus
> suggested, but that did not work either, though I do appreciate the
> suggestion.


That did get you a LinkDB, right? You need to call solrindex and pass the 
LinkDB's location as one of the arguments. Only then does Nutch know about it 
and use the data in the LinkDB, together with the index-anchor plugin, to 
write the anchor field to your Solr index.
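
Roughly, the two steps look like the sketch below. It assumes your crawl 
data sits in one directory containing crawldb, linkdb and segments (the name 
"crawl" in the usage line is only a placeholder, use whatever directory your 
crawl actually wrote to). The helper name index_with_anchors and the NUTCH 
variable are made up for illustration. Run bin/nutch solrindex without 
arguments to check the exact usage for your version.

```shell
# Sketch only: the directory layout, the helper name and the NUTCH variable
# are assumptions for illustration, not part of Nutch itself.
NUTCH="${NUTCH:-bin/nutch}"

index_with_anchors() {
  crawl_dir="$1"   # directory holding crawldb, linkdb and segments
  solr_url="$2"    # base URL of the Solr instance

  # Step 1: build (or update) the LinkDB from all fetched segments.
  "$NUTCH" invertlinks "$crawl_dir/linkdb" -dir "$crawl_dir/segments" || return 1

  # Step 2: index to Solr and pass the LinkDB location, so that the
  # index-anchor plugin can read the inlink anchors for each document.
  "$NUTCH" solrindex "$solr_url" "$crawl_dir/crawldb" \
    -linkdb "$crawl_dir/linkdb" -dir "$crawl_dir/segments"
}
```

For example: index_with_anchors crawl http://localhost/solr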

> 
> Any help is appreciated! Thanks!
> 
> <Markus Jelsma> Wrote:
> You need to use the invertlinks command to build a database with docs with
> inlinks and anchors. Then use the index-anchor plugin when indexing. Then
> you will have a multivalued field with anchors pointing to your document. 
> 
> <Teague James> Wrote:
> I am trying to index a website that contains links to documents such as PDF,
> Word, etc. The intent is to be able to store the URLs for the links to the
> documents. 
> 
> For example, when indexing www.example.com which has links on the page like
> "Example Document" which points to www.example.com/docs/example.pdf, I want
> Solr to store the text of the link, "Example Document", and the URL for the
> link, "www.example.com/docs/example.pdf" in separate fields. I've tried
> using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page
> content, but I am not getting the URLs from the links. There are no document
> type restrictions in Nutch for PDF or Word. Any suggestions on how I can
> accomplish this? Should I use a different method than Nutch for crawling the
> site?
> 
> I appreciate any help on this!
> 
> 
>
