Hi,
I want to pick up this old thread from the summer (see below). I do
understand that Solr is intended for more structured data, and that
Nutch is a better basis for unstructured content, particularly content
fetched by crawlers.
However, Solr's ease of setup and flexible schemas make it a viable
alternative for enterprise solutions; indeed, the stated purpose of the
project itself is to create an enterprise search platform.
In that respect I agree with the original posting that Solr is missing
some desirable functionality here. One can argue that more or less
arbitrary data should be structured by the user writing a decent
application. However, an easier-to-use, configurable plugin
architecture for filtering and document parsing could make Solr more
attractive, and I think many potential users would welcome such
additions.
In other words, Solr *could* very well be the right tool for the job in
many cases, provided that there is a configurable "pre-Solr" step that
can be run on content before it is actually turned into the XML that
Solr accepts.
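
To make that concrete, below is a rough, untested sketch of what such a
pre-Solr step could look like: a small Java client that takes text
already extracted from a crawled file (by whatever parser one chooses)
and posts it to Solr's XML update handler. The URL and the field names
(path, title, content -- borrowed from the index mentioned further down
in the thread) are only assumptions for illustration and would have to
match the actual schema.xml.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/*
 * Sketch of a "pre-Solr" step: wrap already-extracted text in Solr's
 * XML update format and POST it to the update handler.  The parsing of
 * doc/ppt/pdf files is assumed to happen elsewhere.
 */
public class PreSolrPoster {

    // Escape characters that are significant in XML.
    static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void post(String path, String title, String content) throws Exception {
        String xml = "<add><doc>"
            + "<field name=\"path\">" + esc(path) + "</field>"
            + "<field name=\"title\">" + esc(title) + "</field>"
            + "<field name=\"content\">" + esc(content) + "</field>"
            + "</doc></add>";

        // Assumed Solr location; adjust to your installation.
        URL url = new URL("http://localhost:8983/solr/update");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(xml.getBytes("UTF-8"));
        out.close();
        System.out.println("Solr responded: " + conn.getResponseCode());
        // Note: a separate <commit/> still has to be posted before the
        // documents become visible to searches.
    }

    public static void main(String[] args) throws Exception {
        // In a real crawler this text would come from a doc/ppt/pdf parser.
        post("/share/docs/report.pdf", "Quarterly report", "Extracted plain text goes here.");
    }
}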
A related design question is to what extent this contract should live
between the XML documents themselves and the schema.xml, or whether
most of the work should be done in the parser/pre-processing step (i.e.
when producing the XML documents).
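
To illustrate the two ends of that spectrum, a schema.xml fragment
could either name the fields explicitly (forcing the pre-processor to
emit exactly those), or accept dynamic fields and leave the decisions
to the pre-processor. A rough sketch only; the field names are made up
and the types are the ones shipped in the example schema:

  <!-- Tight contract: the pre-processor must emit exactly these fields -->
  <field name="path"    type="string" indexed="true" stored="true"/>
  <field name="title"   type="text"   indexed="true" stored="true"/>
  <field name="content" type="text"   indexed="true" stored="true"/>

  <!-- Looser contract: any field ending in _t is accepted, so the
       pre-processor can invent fields per document type -->
  <dynamicField name="*_t" type="text" indexed="true" stored="true"/>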
Your thoughts and feedback are greatly appreciated.
Regards,
Eivind
>> browsing through the message thread I tried to find a trail
>> addressing file system crawls. I want to implement an enterprise
>> search over a networked filesystem, crawling all sorts of documents,
>> such as html, doc, ppt and pdf.
>> Nutch provides plugins enabling it to read proprietary formats.
>> Is there support for the same functionality in solr?
> the text out of these types of documents. You could borrow the
> document parsing pieces from Lucene's contrib and Nutch and glue them
> together into your client that speaks to Solr, or perhaps Solr isn't
> the right approach for your needs? It certainly is possible to add
> these capabilities into Solr, but it would be awkward to have to
> stream binary data into XML documents such that Solr could parse them
> on the server side.
Agreed. Solr's focus is on indexing "Structured Data". The support for
dynamic fields certainly allows you to deal with complex structured data,
and somewhat heterogeneous structured data -- but it's still structured
data. If your goal is to do a lot of crawling of disparate physical
documents, extract the text, and build a "path,title,content" index,
then Nutch is probably your best bet.