You need to run the invertlinks command to build the linkdb, which stores each document's inlinks and their anchor text. Then enable the index-anchor plugin when indexing. Each indexed document will then have a multivalued field containing the anchor text of the links that point to it.

Teague James <teag...@insystechinc.com> wrote:

I am trying to index a website that contains links to documents such as PDF, Word, etc. The intent is to be able to store the URLs for the links to the documents.
For example, when indexing www.example.com, which has a link on the page such as "Example Document" that points to www.example.com/docs/example.pdf, I want Solr to store the text of the link, "Example Document", and the URL for the link, "www.example.com/docs/example.pdf", in separate fields.

I've tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page content, but I am not getting the URLs from the links. There are no document type restrictions in Nutch for PDF or Word.

Any suggestions on how I can accomplish this? Should I use a different method than Nutch for crawling the site? I appreciate any help on this!
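
For reference, once anchors are indexed as described in the reply at the top, the link text ends up on the linked document (the PDF), not on the page that contains the link. Below is a minimal Python sketch of pairing each document URL with its anchor texts from a Solr JSON response. The field names "url" and "anchor" are assumptions (check your own Solr schema), the helper name anchors_by_url is made up for illustration, and the sample response is hand-written, not real crawl output.

```python
from collections import defaultdict


def anchors_by_url(solr_response):
    """Map each document URL to the anchor texts of links pointing at it.

    Assumes the fields are named "url" and "anchor"; adjust to your schema.
    """
    result = defaultdict(list)
    for doc in solr_response.get("response", {}).get("docs", []):
        url = doc.get("url")
        if url is None:
            continue
        anchors = doc.get("anchor", [])
        # A multivalued field comes back as a list; tolerate a bare string too.
        if isinstance(anchors, str):
            anchors = [anchors]
        result[url].extend(anchors)
    return dict(result)


# Hand-written sample shaped like a Solr JSON select response.
sample = {
    "response": {
        "numFound": 2,
        "docs": [
            {
                "url": "http://www.example.com/docs/example.pdf",
                "anchor": ["Example Document"],
            },
            # A page with no inlinks has no anchor field at all.
            {"url": "http://www.example.com/"},
        ],
    }
}

pairs = anchors_by_url(sample)
print(pairs)
```

Documents without any inlinks simply come back with an empty list, which may explain missing anchors if the linkdb was never built or was not passed to the indexing step.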