Re: Building documents using content residing both in database tables and text files

Noble Paul നോബിള്‍ नोब्ळ् Wed, 12 Aug 2009 23:17:03 -0700

Isn't it better to make a jar of PlainTextEntityProcessor and drop it into
solr.home/lib?
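
With the processor on the classpath, the inner "file" entity from the config
quoted below could probably be reduced to something like the following. This
is only a sketch, not tested against Solr 1.3: it assumes the jar really is
picked up from solr.home/lib and that the processor's short name resolves
there, and it reuses the entity and data source names from the quoted config.

```xml
<!-- Sketch: replaces the XPathEntityProcessor entity; assumes
     PlainTextEntityProcessor is loadable (e.g. from a jar in solr.home/lib) -->
<entity name="file"
        dataSource="ds.filesystem"
        processor="PlainTextEntityProcessor"
        url="${dir.fileAbsolutePath}">
  <!-- the processor puts the whole file content into the implicit
       column "plainText"; map it to the "text" field of the document -->
  <field column="plainText" name="text" />
</entity>
```

That way the text files can stay as they are and do not need to be wrapped
in a root element.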


On Tue, Aug 11, 2009 at 11:05 PM, Sascha Szott<sz...@zib.de> wrote:
> Hi Noble,
>
> Noble Paul wrote:
>>
>> Isn't it possible to do this by having two data sources (one Jdbc and
>> another File) and two entities? The outer entity can read from a DB
>> and the inner entity can read from a file.
>
> Yes, it is. Here's my db-data-config.xml file:
>
> <!-- definition of data sources -->
> <dataSource name="ds.database"
>            driver="..."
>            url="..."
>            user="..."
>            password="..." />
> <dataSource name="ds.filesystem"
>            type="FileDataSource" />
>
>
> <!-- building the document using both db and file content
>     (files are stored in /tmp/<recordId>)
> -->
> <document name="doc">
>  <entity name="t" query="select * from t" dataSource="ds.database">
>    <field column="id" name="id" />
>    <field column="title" name="title" />
>    <entity name="dir"
>            processor="FileListEntityProcessor"
>            baseDir="/tmp/${id}"
>            fileName=".*"
>            dataSource="null"
>            rootEntity="false" >
>      <entity name="file"
>              dataSource="ds.filesystem"
>              processor="XPathEntityProcessor"
>              forEach="/root"
>              url="${dir.fileAbsolutePath}"
>              stream="false" >
>        <field column="text" xpath="/root" />
>      </entity>
>    </entity>
>  </entity>
> </document>
>
>
> Only one additional adjustment has to be made: Since I'm using Solr 1.3, which
> comes without PlainTextEntityProcessor, I have to transform my plain text
> files into XML files by surrounding the content with a root element. That's
> all!
>
>> On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
>>>
>>> Hello,
>>>
>>> is it possible (and if it is, how can I accomplish it) to configure DIH to
>>> build up index documents by using content that resides in different data
>>> sources?
>>>
>>> Here is an example scenario:
>>> Let's assume we have a table T with two columns, ID (which is the primary
>>> key of T) and TITLE. Furthermore, each record in T is assigned a directory
>>> containing text files that were generated out of pdf documents by using
>>> Tika. A directory name is built by using the ID of the record in T
>>> associated with that directory, e.g. all text files associated with a record
>>> with id = 101 are stored in directory 101.
>>>
>>> Is there a way to configure DIH such that it uses ID, TITLE and the content
>>> of all related text files when building a document (the documents should
>>> have three fields: id, title, and text)?
>>>
>>> Furthermore, as you may have noticed, a second question arises naturally:
>>> Will there be any integration of Solr Cell and DIH in an upcoming release,
>>> so that it would be possible to directly use the pdf documents instead of
>>> the extracted text files that were generated outside of Solr?
>>
>> This is something I wish to see. But there has been no user request
>> yet. You can raise an issue and it can be looked into.
>
> I've raised issue SOLR-1358.
>
> Best,
> Sascha
>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
