To be clear, the Data Import Handler is certainly capable of indexing directly from the file system that Solr is running on. DIH is not just for databases.

Sorry, but I haven't written the DIH section of my book yet! Maybe I'll try to do that by the end of the summer, if there is enough interest. Hmmm... come to think of it, I did some experiments with DIH file system crawling a few months ago, so I could at least add that example to the book intro.

The newer versions of the Solr Simple Post Tool (4.1? Or did that make it into 4.0?) support recursive indexing of directories as well.

And... LucidWorks Search has a built-in file system crawler.

Here’s my raw DIH solr-data-config.xml file that I was using, but no guarantees that it works anymore, since I was experimenting and don't recall what state I left it in:

<dataConfig>
  <dataSource name="dir" type="FileDataSource"/>
  <document>
    <entity name="file-list" processor="FileListEntityProcessor"
            fileName="^.*\.txt$"
            recursive="true"
            rootEntity="false"
            dataSource="null"
            baseDir="solr/collection1/data/dih-test">

      <entity name="one-file" processor="PlainTextEntityProcessor"
              url="${file-list.fileAbsolutePath}"
              transformer="TemplateTransformer,LogTransformer"
              logTemplate="DIH processing ${file-list.fileAbsolutePath}..."
              logLevel="info"
              rootEntity="true" dataSource="dir">
        <field template="${file-list.fileAbsolutePath}" column="id" />
        <field name="name" column="plainText" />
        <field name="features" column="plainText" />
      </entity>
    </entity>

    <entity name="file-list2" processor="FileListEntityProcessor"
            fileName="^.*\.txt$"
            recursive="true"
            rootEntity="false"
            dataSource="null"
            baseDir=".">

      <entity name="one-file2" processor="PlainTextEntityProcessor"
              url="${file-list2.fileAbsolutePath}"
              transformer="TemplateTransformer,LogTransformer"
              logTemplate="DIH Step #2 processing ${file-list2.fileAbsolutePath}..."
              logLevel="info"
              rootEntity="true" dataSource="dir">
        <field template="${file-list2.fileAbsolutePath}" column="id" />
        <field name="features" column="plainText" />
      </entity>
    </entity>
  </document>
</dataConfig>

(Seems like I had a duplicate copy of the processors for some reason. Sorry, no recollection why.)

And this was my request handler in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">solr-data-config.xml</str>
  </lst>
</requestHandler>
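With both files in place, an import is kicked off by hitting the /dataimport handler with a command parameter. Here's a minimal Python sketch that just builds the request URLs; the host, port, and core name (localhost:8983, collection1) are assumptions from a default single-core install, so adjust for your setup:

```python
# Sketch: build DIH request URLs for triggering and monitoring an import.
# SOLR_BASE is an assumption (default single-core install) -- adjust as needed.
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr/collection1"

def dih_url(command="full-import", clean=False):
    """Build a /dataimport request URL for the given DIH command."""
    params = urlencode({"command": command, "clean": str(clean).lower(), "wt": "json"})
    return f"{SOLR_BASE}/dataimport?{params}"

print(dih_url())          # URL that starts a full import
print(dih_url("status"))  # URL that polls import progress
# To actually fire a request: urllib.request.urlopen(dih_url())
```

Note that clean defaults to false here so a re-run doesn't wipe the index; pass clean=True to start fresh.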

-- Jack Krupansky

-----Original Message----- From: Gora Mohanty
Sent: Sunday, June 23, 2013 11:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr File System Search

On 23 June 2013 01:27, Sourabh107 <sourabh.jain....@gmail.com> wrote:
I want to create a search engine for my computer. My doubt is, can I crawl my
G:/ or any drive in my network to search for any string in any file (any type
of file, like XML, .log, .properties) using Solr? If yes, please guide me. I
went through the tutorials given on the Solr site but could not find them
useful for me; everywhere they take a database as an example, but I want
to crawl my file system.

Yes, use a FileDataSource along with an appropriate entity processor
such as the PlainTextEntityProcessor. Please see
http://wiki.apache.org/solr/DataImportHandler . This blog might
be of help, though it is somewhat outdated now:
http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/

You could also write a script to crawl the filesystem, or
use something like Apache ManifoldCF, and dump the
contents of found files into Solr. If you want structured
data indexing for log files and other types of files, you
will probably need to do more work to extract structure
from the text in the files.

Regards,
Gora
