Hi Noble,

Noble Paul wrote:
Isn't it possible to do this by having two data sources (one JDBC and
another File) and two entities? The outer entity can read from a DB
and the inner entity can read from a file.

Yes, it is. Here's my db-data-config.xml file:

<!-- definition of data sources -->
<dataSource name="ds.database"
            driver="..."
            url="..."
            user="..."
            password="..." />
<dataSource name="ds.filesystem"
            type="FileDataSource" />


<!-- building the document using both db and file content
     (files are stored in /tmp/<recordId>)
-->
<document name="doc">
  <entity name="t" query="select * from t" dataSource="ds.database">
    <field column="id" name="id" />
    <field column="title" name="title" />
    <entity name="dir"
            processor="FileListEntityProcessor"
            baseDir="/tmp/${t.id}"
            fileName=".*"
            dataSource="null"
            rootEntity="false" >
      <entity name="file"
              dataSource="ds.filesystem"
              processor="XPathEntityProcessor"
              forEach="/root"
              url="${dir.fileAbsolutePath}"
              stream="false" >
        <field column="text" xpath="/root" />
      </entity>
    </entity>
  </entity>
</document>


Only one additional adjustment has to be made: since I'm using Solr 1.3, which comes without PlainTextEntityProcessor, I have to transform my plain-text files into XML files by surrounding the content with a root element. That's all!
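
For anyone who wants to script that wrapping step, here is a minimal sketch in Python. It is not part of my setup, just an illustration: the function names and the ".xml" suffix are made up for this example, it assumes the extracted text files are UTF-8, and it drops characters that XML 1.0 does not allow (form feeds, for instance), since a parser would otherwise reject the file.

```python
import re
from pathlib import Path
from xml.sax.saxutils import escape

# Characters that are not legal in an XML 1.0 document (e.g. form feeds).
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def wrap_text_as_xml(text: str) -> str:
    """Wrap plain text in a single <root> element, escaping &, < and >."""
    cleaned = _INVALID_XML_CHARS.sub("", text)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<root>" + escape(cleaned) + "</root>\n"
    )


def wrap_directory(src_dir: str, dst_dir: str) -> int:
    """Write an XML-wrapped copy of each regular file in src_dir to dst_dir.

    Returns the number of files written. Assumes UTF-8 input; undecodable
    bytes are replaced rather than raising an error.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        target = dst / (path.name + ".xml")
        target.write_text(wrap_text_as_xml(text), encoding="utf-8")
        count += 1
    return count
```

The output directory would then be the per-record directory that the FileListEntityProcessor entity lists; the single root element matches the forEach="/root" and xpath="/root" settings in the config above.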

On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
Hello,

is it possible (and if it is, how can I accomplish it) to configure DIH to
build up index documents by using content that resides in different data
sources?

Here is an example scenario:
Let's assume we have a table T with two columns, ID (which is the primary
key of T) and TITLE. Furthermore, each record in T is assigned a directory
containing text files that were generated from PDF documents using
Tika. A directory name is built from the ID of the record in T
associated with that directory, e.g. all text files associated with a
record with id = 101 are stored in directory 101.

Is there a way to configure DIH such that it uses ID, TITLE and the content
of all related text files when building a document (the documents should
have three fields: id, title, and text)?

Furthermore, as you may have noticed, a second question arises naturally:
Will there be any integration of Solr Cell and DIH in an upcoming release,
so that it would be possible to directly use the pdf documents instead of
the extracted text files that were generated outside of Solr?

This is something I wish to see. But there has been no user request
yet. You can raise an issue and it can be looked into.

I've raised issue SOLR-1358.

Best,
Sascha
