I am still unsuccessful in getting this to work. My expectation is that the index-anchor plugin should produce values for the anchor field. However, this field is not showing up in my Solr index no matter what I try.
Here's what I have in my nutch-site.xml for plugins:

<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-( basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic| urlnormalizer-(pass|reges|basic)</value>

I am using the schema-solr4.xml from the Nutch package, and I added the _version_ field.

Here's the command I'm running:

bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50

The fields that Solr returns are: content, title, segment, boost, digest, tstamp, id, url, and _version_

Note that the url field is the URL of the page being indexed, not the URL(s) of the documents that may be outlinks on that page. It is the outlinks that I am trying to get into the index. What am I missing? I also tried using the invertlinks command that Markus suggested, but that did not work either, though I do appreciate the suggestion.

Any help is appreciated! Thanks!

<Markus Jelsma> Wrote:
You need to use the invertlinks command to build a database with docs with inlinks and anchors. Then use the index-anchor plugin when indexing. Then you will have a multivalued field with anchors pointing to your document.

<Teague James> Wrote:
I am trying to index a website that contains links to documents such as PDF, Word, etc. The intent is to be able to store the URLs for the links to the documents. For example, when indexing www.example.com, which has links on the page like "Example Document" pointing to www.example.com/docs/example.pdf, I want Solr to store the text of the link, "Example Document", and the URL for the link, "www.example.com/docs/example.pdf", in separate fields.

I've tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page content, but I am not getting the URLs from the links. There are no document type restrictions in Nutch for PDF or Word.

Any suggestions on how I can accomplish this? Should I use a different method than Nutch for crawling the site? I appreciate any help on this!
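
To picture what Markus describes (inverting links so that anchors end up attached to the document they point to), here is a minimal Python sketch. It is not Nutch code and does not use any Nutch API: the `outlinks` data, the `about.html` page, and the function names `invert_links` and `anchors_for` are all hypothetical, made up for illustration around the www.example.com example from the original question.

```python
from collections import defaultdict

# Hypothetical crawl result: source page URL -> list of (target URL, anchor text).
# This stands in for the outlinks a parser would extract; it is not Nutch data.
outlinks = {
    "http://www.example.com/": [
        ("http://www.example.com/docs/example.pdf", "Example Document"),
        ("http://www.example.com/about.html", "About"),
    ],
    "http://www.example.com/about.html": [
        ("http://www.example.com/docs/example.pdf", "Download the PDF"),
    ],
}


def invert_links(outlinks):
    """Turn source -> outlinks into target -> [(source, anchor text)]."""
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            inlinks[target].append((source, anchor))
    return dict(inlinks)


def anchors_for(url, inlinks):
    """Collect the anchor texts of links pointing at url (a multivalued field)."""
    return [anchor for _source, anchor in inlinks.get(url, [])]


inlinks = invert_links(outlinks)
pdf_anchors = anchors_for("http://www.example.com/docs/example.pdf", inlinks)
home_anchors = anchors_for("http://www.example.com/", inlinks)

print(pdf_anchors)
print(home_anchors)
```

In this toy model the anchor text lands on the record for example.pdf, while the home page, which nothing links to, gets no anchors. That matches the description of a multivalued field with anchors pointing to your document: the values would appear on the linked document's own entry, not on the page that contains the links.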