Check suffix-urlfilter.txt in your Nutch conf directory. It may be excluding those file types from the crawl.
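If it helps, here is a rough Python sketch for checking whether a suffix filter file lists the binary types you care about. It assumes a simplified reading of the file format: one suffix per line, `#` comments, and a lone `+` or `-` line (optionally followed by `I`) that sets the mode, where `-` rejects listed suffixes and `+` accepts only listed ones. The sample text and the function names are made up for illustration and are not the file that ships with Nutch. The file also only matters if the suffix filter plugin is actually enabled in your configuration.

```python
# Hypothetical excerpt for illustration only; read your own conf file instead.
SAMPLE = """\
# example suffix filter
-I
.gif
.pdf
.doc
"""


def parse_suffix_filter(text):
    """Return (mode, suffixes) from suffix-filter-style text.

    Simplified assumption: '+', '-', '+I', '-I' set the mode; '+P'/'-P'
    (path matching) is skipped; every other non-comment line is a suffix.
    """
    mode, suffixes = None, set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line in ("+", "-", "+I", "-I"):
            mode = line[0]
        elif line in ("+P", "-P"):
            continue
        else:
            # Compared case-insensitively here regardless of the 'I' flag.
            suffixes.add(line.lower())
    return mode, suffixes


def blocked(candidates, mode, suffixes):
    """List the candidate suffixes that this configuration would reject."""
    if mode == "+":
        # Allow-list mode: anything not listed is rejected.
        return [s for s in candidates if s.lower() not in suffixes]
    # Deny-list mode (also used when no mode line was found).
    return [s for s in candidates if s.lower() in suffixes]


if __name__ == "__main__":
    mode, suffixes = parse_suffix_filter(SAMPLE)
    print("mode:", mode)
    print("blocked:", blocked([".pdf", ".doc", ".docx"], mode, suffixes))
```

To run it against a real file, replace `SAMPLE` with the contents of your own suffix-urlfilter.txt, for example `open("conf/suffix-urlfilter.txt").read()` if that is where your conf directory lives.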
- Mark

On 1/3/14, 10:29 AM, "Teague James" <teag...@insystechinc.com> wrote:

>I am using Nutch 1.7 with Solr 4.6.0 to index websites that have links to
>binary files, such as Word, PDF, etc. The crawler crawls the site but I am
>not getting the URLs of the links for the binary files no matter how deep
>I set the settings for the site. I see the labels for the links in the
>content, but not the URLs. Any ideas on how I could get those URLs back in
>my crawl?