I'd be happy to comment:

A simple shell script doesn't give you URL filtering or control over how you 
crawl those documents on the local file system. Nutch has several levels of URL 
filtering, based on regex, MIME type, and other criteria. Also, if any outlinks 
in those local files point to remote content, Nutch will go and crawl it for 
you, which a simple shell script doesn't take care of.
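
To make the regex filtering point concrete, here is a rough sketch of the idea 
in Python. It is not Nutch's code, just an illustration in the spirit of its 
regex URL filter, where each rule is a "+" (accept) or "-" (reject) followed by 
a pattern and the first matching rule wins. The rules, paths, and host name 
below are made up for the example.

```python
import re

# Hypothetical rules for illustration only: '+' accepts, '-' rejects,
# and the first rule whose pattern matches the URL decides the outcome.
RULES = [
    ("-", r"\.(gif|jpg|png|css|js)$"),   # skip images and page assets
    ("+", r"^file:///data/docs/"),       # accept the local hierarchy (assumed path)
    ("+", r"^https?://example\.org/"),   # accept outlinks to one remote host (assumed)
]

def accept(url, rules=RULES):
    """Return True if the first rule matching url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: drop the URL

def filter_urls(urls, rules=RULES):
    """Keep only the URLs that pass the rule list, preserving order."""
    return [u for u in urls if accept(u, rules)]

if __name__ == "__main__":
    candidates = [
        "file:///data/docs/report.pdf",
        "file:///data/docs/logo.png",
        "http://example.org/page.html",
        "http://other.example.com/",
    ]
    for url in filter_urls(candidates):
        print(url)
```

A shell script that POSTs files to Solr would need something like this bolted 
on by hand, plus link extraction, before it could follow outlinks selectively.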

Also, it would be great if you could elaborate on what the extra configuration 
and maintenance issues with Nutch are. If you have something specific in mind, 
patches or issue comments are welcome :)

Cheers,
Chris

On Jan 23, 2011, at 8:56 PM, Gora Mohanty wrote:

> On Mon, Jan 24, 2011 at 8:15 AM, Adam Estrada <estrada.a...@gmail.com> wrote:
>> +1 on Nutch!
> [...]
> 
> Would it be possible for Markus, and you to clarify on
> what the advantages of Nutch are in crawling a
> well-defined filesystem hierarchy? A simple shell script
> that POSTs to Solr works fine for this, so why would
> one choose the extra configuration, and maintenance
> issues required for Nutch.
> 
> Regards,
> Gora


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
