With Solr Cell (ExtractingRequestHandler), which is now built into trunk and will therefore be part of the eventual Solr 1.4 release, indexing a directory of text (or even Word, PDF, etc.) files works mostly 'out of the box'.

It still requires scripting an iteration over all the files and sending each one to Solr. Here's an example of that scripting using Ant and the ant-contrib <for> and <post> tasks:

  <target name="index-docs" description="Index documents">
    <for param="filename">
      <fileset dir="${docs.dir}"/>
      <sequential>
        <echo>Processing @{filename}</echo>

        <post to="${solr.url}/update/extract" verbose="false" failonerror="true">
          <prop name="stream.file" value="@{filename}"/>
          <prop name="ext.resource.name" value="@{filename}"/>
          <prop name="ext.idx.attr" value="false"/>
          <prop name="ext.ignore.und.fl" value="true"/>

          <prop name="ext.literal.id" value="@{filename}"/>
          <prop name="ext.def.fl" value="text"/>
          <prop name="ext.map.title" value="title"/>
          <prop name="wt" value="ruby"/>
        </post>
      </sequential>
    </for>
  </target>
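
For anyone not using Ant, the same loop can be sketched in a few lines of Python. This is only a rough equivalent of the target above, not something taken from the Solr distribution: the function names (extract_params, iter_files, index_docs) are made up for illustration, the parameter names are copied straight from the <post> task, and it assumes the files are readable by the Solr server itself and that remote streaming is enabled in solrconfig.xml, since stream.file is resolved on the server side.

```python
import os
import urllib.parse
import urllib.request


def extract_params(path):
    """Build the same request parameters the Ant <post> task sends."""
    return {
        "stream.file": path,          # file is read by the Solr server
        "ext.resource.name": path,
        "ext.idx.attr": "false",
        "ext.ignore.und.fl": "true",
        "ext.literal.id": path,       # use the file path as the document id
        "ext.def.fl": "text",
        "ext.map.title": "title",
        "wt": "ruby",
    }


def iter_files(docs_dir):
    """Yield the absolute path of every file under docs_dir."""
    for root, _dirs, files in os.walk(docs_dir):
        for name in sorted(files):
            yield os.path.abspath(os.path.join(root, name))


def index_docs(docs_dir, solr_url):
    """POST one extract request per file; an HTTP error aborts the run."""
    for path in iter_files(docs_dir):
        print("Processing", path)
        data = urllib.parse.urlencode(extract_params(path)).encode("utf-8")
        # urlopen raises HTTPError on a 4xx/5xx response, which roughly
        # mirrors failonerror="true" in the Ant version.
        with urllib.request.urlopen(solr_url + "/update/extract", data=data) as resp:
            resp.read()
```

Calling index_docs("/path/to/docs", "http://localhost:8983/solr") would then walk the directory and send one request per file; both arguments are placeholders for your own setup.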

It should also be possible, and perhaps slightly easier and more built-in, to do the entire iteration using DataImportHandler's ability to iterate over a list of files and read their contents into a field. [an example of this on the wiki would be handy, or a pointer to it if it doesn't already exist]

        Erik


On Mar 10, 2009, at 2:01 PM, KennyN wrote:


This functionality is possible 'out of the box', right? Or am I going to need to code up something that reads in the id named files and generates the xml file?
--
View this message in context: 
http://www.nabble.com/Solr-configuration-with-Text-files-tp22438201p22440095.html
Sent from the Solr - User mailing list archive at Nabble.com.
