Thanks, Mark. I checked there, but PDF files are not listed. There are some file types in there that I might need in the future, so I appreciate the info. Any other ideas?
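
For anyone repeating the check of suffix-urlfilter.txt, here is a rough Python sketch of how it could be scripted. It assumes the file holds one suffix per line, with "#" comment lines and mode lines that start with "+" or "-" (such as "+I"); verify that against your own copy. The function name listed_suffixes and the sample text are made up for illustration and are not part of Nutch.

```python
def listed_suffixes(filter_text, wanted):
    """Return the wanted suffixes that appear as entries in the filter text."""
    entries = set()
    for raw in filter_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and mode lines like "+I" (assumed format).
        if not line or line[0] in "#+-":
            continue
        entries.add(line.lower())
    # Compare case-insensitively so ".PDF" and ".pdf" are treated alike.
    return sorted(s for s in wanted if s.lower() in entries)


# Hypothetical excerpt, not the real file shipped with Nutch.
sample = """\
# example suffix filter
+I
.gif
.jpg
.doc
"""

found = listed_suffixes(sample, [".pdf", ".doc", ".docx"])
print(found)
```

To run it against a real file, read conf/suffix-urlfilter.txt into a string and pass that in place of sample. An empty result only means none of the suffixes are listed in that file; it does not rule out other filters or settings.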

-----Original Message-----
From: Reyes, Mark
Sent: Friday, January 03, 2014 1:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing URLs for Binaries

Check suffix-urlfilter.txt in your conf directory for Nutch. You might be prohibiting those filetypes from the crawl.

- Mark

On 1/3/14, 10:29 AM, "Teague James" <teag...@insystechinc.com> wrote:

>I am using Nutch 1.7 with Solr 4.6.0 to index websites that have links
>to binary files, such as Word, PDF, etc. The crawler crawls the site
>but I am not getting the URLs of the links for the binary files no
>matter how deep I set the settings for the site. I see the labels for
>the links in the content, but not the URLs. Any ideas on how I could
>get those URLs back in my crawl?
>

IMPORTANT NOTICE: This e-mail message is intended to be received only by persons entitled to receive the confidential information it may contain. E-mail messages sent from Bridgepoint Education may contain information that is confidential and may be legally privileged. Please do not read, copy, forward or store this message unless you are an intended recipient of it. If you received this transmission in error, please notify the sender by reply e-mail and delete the message and any attachments.