You could use something like Apache Droids -
http://incubator.apache.org/droids/
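
To illustrate the underlying step (pulling each link's text and its URL into separate fields), here is a minimal Python sketch using only the standard library. It is not Droids or Nutch code. The field names link_text and link_url, the extension list, and the helper extract_document_links are assumptions made up for this example; the Solr schema would need matching fields, and fetching the page and posting to Solr are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of "document" extensions; adjust to the site being crawled.
DOC_EXTENSIONS = (".pdf", ".doc", ".docx")


class LinkExtractor(HTMLParser):
    """Collects the anchor text and absolute URL of links to documents."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # one dict per document link found
        self._href = None    # URL of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self._href.lower().endswith(DOC_EXTENSIONS):
                # Collapse runs of whitespace in the anchor text.
                text = " ".join("".join(self._text).split())
                # Hypothetical field names, kept separate as in the question.
                self.links.append({"link_text": text, "link_url": self._href})
            self._href = None


def extract_document_links(html, base_url):
    """Return a list of {"link_text", "link_url"} dicts for document links."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Each returned dict could then be sent to Solr as its own document, or attached to the page's document as multi-valued fields, depending on how you want to search them.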

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 7, 2014 at 2:27 PM, Teague James <teag...@insystechinc.com> wrote:

> I am trying to index a website that contains links to documents such as
> PDF, Word, etc. The intent is to be able to store the URLs for the links
> to the documents.
>
> For example, when indexing www.example.com which has links on the page
> like "Example Document" which points to www.example.com/docs/example.pdf,
> I want Solr to store the text of the link, "Example Document", and the
> URL for the link, "www.example.com/docs/example.pdf" in separate fields.
> I've tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed
> the page content, but I am not getting the URLs from the links. There are
> no document type restrictions in Nutch for PDF or Word. Any suggestions
> on how I can accomplish this? Should I use a different method than Nutch
> for crawling the site?
>
> I appreciate any help on this!
>
>
