To be clear, the Data Import Handler is certainly capable of indexing directly from the file system that Solr is running on. DIH is not just for databases.

Sorry, but I haven't written the DIH section of my book yet! Maybe I'll try to do that by the end of the summer, if there is enough interest. Hmmm... come to think of it, I did some experiments with DIH file system crawling a few months ago, so I could at least add that example to the book intro.

The newer versions of the Solr Simple Post Tool (4.1? Or did that make it into 4.0?) support recursive indexing of directories as well.

And... LucidWorks Search has a built-in file system crawler.

Here’s my raw DIH solr-data-config.xml file that I was using, but no guarantees that it works anymore, since I was experimenting and don't recall what state I left it in:

<dataConfig>
  <dataSource name="dir" type="FileDataSource"/>
  <document>
    <entity name="file-list" processor="FileListEntityProcessor"
            fileName="^.*\.txt$"
            recursive="true"
            rootEntity="false"
            dataSource="null"
            baseDir="solr/collection1/data/dih-test">

      <entity name="one-file" processor="PlainTextEntityProcessor"
              url="${file-list.fileAbsolutePath}"
              transformer="TemplateTransformer,LogTransformer"
              logTemplate="DIH processing ${file-list.fileAbsolutePath}..."
              logLevel="info"
              rootEntity="true" dataSource="dir">
        <field template="${file-list.fileAbsolutePath}" column="id" />
        <field name="name" column="plainText" />
        <field name="features" column="plainText" />
      </entity>
    </entity>

    <entity name="file-list2" processor="FileListEntityProcessor"
            fileName="^.*\.txt$"
            recursive="true"
            rootEntity="false"
            dataSource="null"
            baseDir=".">

      <entity name="one-file2" processor="PlainTextEntityProcessor"
              url="${file-list2.fileAbsolutePath}"
              transformer="TemplateTransformer,LogTransformer"
              logTemplate="DIH Step #2 processing ${file-list2.fileAbsolutePath}..."
              logLevel="info"
              rootEntity="true" dataSource="dir">
        <field template="${file-list2.fileAbsolutePath}" column="id" />
        <field name="features" column="plainText" />
      </entity>
    </entity>
  </document>
</dataConfig>

(Seems like I had a duplicate copy of the processors for some reason. Sorry, no recollection why.)

And this was my request handler in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">solr-data-config.xml</str>
  </lst>
</requestHandler>
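With both files in place, an import is kicked off by hitting the /dataimport handler with a command parameter. Here's a minimal Python sketch that just builds the request URLs; the host, port, and core name (localhost:8983, collection1) are assumptions from a default single-core install, so adjust for your setup:

```python
# Sketch: build DIH request URLs for triggering and monitoring an import.
# SOLR_BASE is an assumption (default single-core install) -- adjust as needed.
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr/collection1"

def dih_url(command="full-import", clean=False):
    """Build a /dataimport request URL for the given DIH command."""
    params = urlencode({"command": command, "clean": str(clean).lower(), "wt": "json"})
    return f"{SOLR_BASE}/dataimport?{params}"

print(dih_url())          # URL that starts a full import
print(dih_url("status"))  # URL that polls import progress
# To actually fire a request: urllib.request.urlopen(dih_url())
```

Note that clean defaults to false here so a re-run doesn't wipe the index; pass clean=True to start fresh.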

-- Jack Krupansky

-----Original Message----- From: Gora Mohanty
Sent: Sunday, June 23, 2013 11:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr File System Search

On 23 June 2013 01:27, Sourabh107 <sourabh.jain....@gmail.com> wrote:
I want to create a search engine for my computer. My doubt is, can I crawl my
G:/ or any drive in my network to search for any string in any file (any type
of file, like XML, .log, .properties) using Solr? If yes, please guide me. I
went through the tutorials given on the Solr site but could not find them
useful for me; everywhere they take a database as an example, but I want
to crawl my file system.

Yes, use a FileDataSource along with an appropriate entity processor
such as the PlainTextEntityProcessor. Please see
http://wiki.apache.org/solr/DataImportHandler . This blog might
be of help, though it is somewhat outdated now:
http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/

You could also write a script to crawl the filesystem, or
use something like Apache ManifoldCF, and dump the
contents of found files into Solr. If you want structured
data indexing for log files and other types of files, you
will probably need to do more work to extract structure
from the text in the files.

Regards,
Gora
