You could use something like Apache Droids - http://incubator.apache.org/droids/
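
Whichever crawler you end up with, the extraction step you describe is small. Here is a rough sketch in plain Python (standard library only) that pulls the anchor text and the resolved URL for each link pointing at a document. It does not use the Droids or Nutch APIs. The field names `link_text` and `link_url` and the list of extensions are assumptions made up for illustration, and you would still need to map the results onto fields in your own Solr schema.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of document types; adjust to whatever the site links to.
DOC_EXTENSIONS = (".pdf", ".doc", ".docx")


class DocLinkParser(HTMLParser):
    """Collects the anchor text and absolute URL of links to documents."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # one dict per document link found
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = urljoin(self.base_url, self._href)
            if urlparse(url).path.lower().endswith(DOC_EXTENSIONS):
                # Collapse runs of whitespace in the anchor text.
                text = " ".join("".join(self._text).split())
                # Hypothetical field names, not part of any Solr default schema.
                self.links.append({"link_text": text, "link_url": url})
            self._href = None


def extract_doc_links(html, base_url):
    """Return a list of {"link_text", "link_url"} dicts for document links."""
    parser = DocLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_doc_links(page_html, "http://www.example.com/")` on a fetched page gives you one small record per document link, which you could then send to Solr as two separate fields.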

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 7, 2014 at 2:27 PM, Teague James <teag...@insystechinc.com> wrote:

> I am trying to index a website that contains links to documents such as
> PDF, Word, etc. The intent is to be able to store the URLs for the links
> to the documents.
>
> For example, when indexing www.example.com which has links on the page
> like "Example Document" which points to www.example.com/docs/example.pdf,
> I want Solr to store the text of the link, "Example Document", and the URL
> for the link, "www.example.com/docs/example.pdf" in separate fields. I've
> tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the
> page content, but I am not getting the URLs from the links. There are no
> document type restrictions in Nutch for PDF or Word. Any suggestions on
> how I can accomplish this? Should I use a different method than Nutch for
> crawling the site?
>
> I appreciate any help on this!