Isn't it better to make a jar of PlainTextEntityProcessor and drop it into solr.home/lib?
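
With such a jar in place, the inner entity could probably read the text files directly, without wrapping them in a root element first. A rough, untested sketch follows. It assumes the jar's PlainTextEntityProcessor loads under Solr 1.3 and behaves like the one documented for later releases, which puts the whole file content into an implicit column named plainText. It reuses the "dir" entity and the "ds.filesystem" data source from the config quoted below:

```xml
<!-- Would replace the inner "file" entity of the quoted config.
     Assumption: PlainTextEntityProcessor is available via a jar in solr.home/lib. -->
<entity name="file"
        dataSource="ds.filesystem"
        processor="PlainTextEntityProcessor"
        url="${dir.fileAbsolutePath}">
  <!-- plainText is the implicit column holding the entire file content -->
  <field column="plainText" name="text" />
</entity>
```

Since a directory may hold several files per record, the text field in schema.xml would presumably have to be multiValued.
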
On Tue, Aug 11, 2009 at 11:05 PM, Sascha Szott<sz...@zib.de> wrote:
> Hi Noble,
>
> Noble Paul wrote:
>>
>> isn't it possible to do this by having two datasources (one Jdbc and
>> another File) and two entities. The outer entity can read from a DB
>> and the inner entity can read from a file.
>
> Yes, it is. Here's my db-data-config.xml file:
>
> <!-- definition of data sources -->
> <dataSource name="ds.database"
>             driver="..."
>             url="..."
>             user="..."
>             password="..." />
> <dataSource name="ds.filesystem"
>             type="FileDataSource" />
>
>
> <!-- building the document using both db and file content
>      (files are stored in /tmp/<recordId>)
> -->
> <document name="doc">
>   <entity name="t" query="select * from t" dataSource="ds.database">
>     <field column="id" name="id" />
>     <field column="title" name="title" />
>     <entity name="dir"
>             processor="FileListEntityProcessor"
>             baseDir="/tmp/${id}"
>             fileName=".*"
>             dataSource="null"
>             rootEntity="false">
>       <entity name="file"
>               dataSource="ds.filesystem"
>               processor="XPathEntityProcessor"
>               forEach="/root"
>               url="${dir.fileAbsolutePath}"
>               stream="false">
>         <field column="text" xpath="/root" />
>       </entity>
>     </entity>
>   </entity>
> </document>
>
>
> Only one additional adjustment has to be made: Since I'm using Solr 1.3 and
> it comes without PlainTextEntityProcessor, I have to transform my plain text
> files into XML files by surrounding the content with a root element. That's
> all!
>
>> On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
>>>
>>> Hello,
>>>
>>> is it possible (and if it is, how can I accomplish it) to configure DIH
>>> to build up index documents by using content that resides in different
>>> data sources?
>>>
>>> Here is an example scenario:
>>> Let's assume we have a table T with two columns, ID (which is the primary
>>> key of T) and TITLE. Furthermore, each record in T is assigned a
>>> directory containing text files that were generated out of pdf documents
>>> by using Tika. A directory name is built by using the ID of the record
>>> in T associated with that directory, e.g. all text files associated with
>>> a record with id = 101 are stored in directory 101.
>>>
>>> Is there a way to configure DIH such that it uses ID, TITLE and the
>>> content of all related text files when building a document (the
>>> documents should have three fields: id, title, and text)?
>>>
>>> Furthermore, as you may have noticed, a second question arises naturally:
>>> Will there be any integration of Solr Cell and DIH in an upcoming
>>> release, so that it would be possible to directly use the pdf documents
>>> instead of the extracted text files that were generated outside of Solr?
>>
>> This is something I wish to see. But there has been no user request
>> yet. You can raise an issue and it can be looked into.
>
> I've raised issue SOLR-1358.
>
> Best,
> Sascha
>
>

--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com