Re: Indexing URLs from websites

Markus Jelsma Tue, 07 Jan 2014 13:01:14 -0800

You need to run the invertlinks command to build a link database (linkdb)
that records each document's inlinks and their anchor text. Then enable the
index-anchor plugin when indexing. You will then have a multivalued field
containing the anchors that point to your document.

Teague James <teag...@insystechinc.com> wrote:

I am trying to index a website that contains links to documents such as PDF,
Word, etc. The intent is to be able to store the URLs for the links to the
documents.


For example, when indexing www.example.com which has links on the page like
"Example Document" which points to www.example.com/docs/example.pdf, I want
Solr to store the text of the link, "Example Document", and the URL for the
link, "www.example.com/docs/example.pdf" in separate fields. I've tried
using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page
content, but I am not getting the URLs from the links. There are no document
type restrictions in Nutch for PDF or Word. Any suggestions on how I can
accomplish this? Should I use a different method than Nutch for crawling the
site?

I appreciate any help on this!
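
The two steps suggested in the reply (invert the links, then index with the
linkdb) can be sketched as a small dry-run script. This is only an
illustration under assumptions: the crawl layout (crawl/crawldb,
crawl/linkdb, crawl/segments), the Solr URL, and the helper names
nutch_anchor_commands and run_all are hypothetical and not taken from the
thread. It also assumes index-anchor has been added to the plugin.includes
property in conf/nutch-site.xml.

```python
import shlex
import subprocess


def nutch_anchor_commands(crawl_dir="crawl",
                          solr_url="http://localhost:8983/solr",
                          nutch="bin/nutch"):
    """Build the Nutch 1.x command lines as argv lists.

    All default paths and the Solr URL are assumptions for illustration.
    """
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    segments = f"{crawl_dir}/segments"
    return [
        # Step 1: build the linkdb (inlinks and anchor text per document).
        [nutch, "invertlinks", linkdb, "-dir", segments],
        # Step 2: index into Solr, passing the linkdb so anchors are available.
        [nutch, "solrindex", solr_url, crawldb,
         "-linkdb", linkdb, "-dir", segments],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(nutch_anchor_commands())
```

Run as-is, the script only prints the commands so they can be checked
against the local installation before anything is executed; calling
run_all(..., dry_run=False) from the Nutch home directory would run them.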
