Re: Indexing URLs for Binaries

Reyes, Mark Fri, 03 Jan 2014 10:40:06 -0800

Check suffix-urlfilter.txt in your Nutch conf directory. It may be
excluding those file types from the crawl.
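
If it helps, here is a rough Python sketch for spotting filter lines that mention the binary extensions you care about. It only lists candidate lines and does not interpret what each rule means. The extension list, the function name find_suffix_rules, and the conf/suffix-urlfilter.txt path are assumptions for illustration, not part of Nutch, so adjust them to your install.

```python
import re
from pathlib import Path

# Hypothetical set of extensions to look for; adjust to the binaries you expect.
BINARY_EXTENSIONS = frozenset({"pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx"})


def find_suffix_rules(filter_text, extensions=BINARY_EXTENSIONS):
    """Return (line_number, line) pairs for non-comment lines naming an extension.

    This is a plain text search: it does not decide whether a line allows
    or blocks the extension, it only points at lines worth reading.
    """
    hits = []
    for number, raw in enumerate(filter_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split the line into alphanumeric tokens, e.g. ".pdf" -> ["pdf"].
        tokens = set(re.findall(r"[a-z0-9]+", line.lower()))
        if tokens & extensions:
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    path = Path("conf/suffix-urlfilter.txt")  # assumed location, relative to the Nutch home
    if path.exists():
        for number, line in find_suffix_rules(path.read_text()):
            print(f"{path}:{number}: {line}")
    else:
        print(f"{path} not found; adjust the path for your install")
```

The same search can be pointed at any other URL filter file in the conf directory by changing the path.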

- Mark

On 1/3/14, 10:29 AM, "Teague James" <teag...@insystechinc.com> wrote:

>I am using Nutch 1.7 with Solr 4.6.0 to index websites that have links to
>binary files, such as Word, PDF, etc. The crawler crawls the site but I am
>not getting the URLs of the links for the binary files no matter how deep
>I set the settings for the site. I see the labels for the links in the
>content, but not the URLs. Any ideas on how I could get those URLs back in
>my crawl?
>

