On 6/28/2010 8:28 AM, Alexey Serba wrote:
Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using
Solr Version: 1.4.0 and getting the following error:

java.lang.ClassNotFoundException: Unable to load BinURLDataSource or
org.apache.solr.handler.dataimport.BinURLDataSource
It seems that DIH-Tika integration is not a part of Solr 1.4.0/1.4.1
release. You should use trunk / nightly builds.
https://issues.apache.org/jira/browse/SOLR-1583


Thanks, that would explain things - I'm using a stock 1.4.0 download.


My data-config.xml looks like this:

<dataConfig>
 <dataSource type="JdbcDataSource"
   driver="oracle.jdbc.driver.OracleDriver"
   url="jdbc:oracle:thin:@whatever:12345:whatever"
   user="me"
   name="ds-db"
   password="secret"/>

 <dataSource type="BinURLDataSource"
   name="ds-url"/>

 <document>
   <entity name="my_database"
    dataSource="ds-db"
    query="select * from my_database where rownum &lt;=2">
     <field column="CONTENT_ID"                name="content_id"/>
     <field column="CMS_TITLE"                 name="cms_title"/>
     <field column="FORM_TITLE"                name="form_title"/>
     <field column="FILE_SIZE"                 name="file_size"/>
     <field column="KEYWORDS"                  name="keywords"/>
     <field column="DESCRIPTION"               name="description"/>
     <field column="CONTENT_URL"               name="content_url"/>
   </entity>

   <entity name="my_database_url"
    dataSource="ds-url"
    query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
    <entity processor="TikaEntityProcessor"
     dataSource="ds-url"
     format="text"
     url="http://www.mysite.com/${my_database.content_url}">
     <field column="text"/>
    </entity>
   </entity>

 </document>
</dataConfig>

I added the entity name="my_database_url" section to an existing (working)
database entity to be able to have Tika index the content pointed to by the
content_url.

Is there anything obviously wrong with what I've tried so far?

I think you should move the Tika entity into the my_database entity and
simplify the whole configuration:

<entity name="my_database" dataSource="ds-db" query="select * from
my_database where rownum &lt;=2">
    ...
    <field column="CONTENT_URL"               name="content_url"/>

    <entity processor="TikaEntityProcessor" dataSource="ds-url"
            format="text" url="http://www.mysite.com/${my_database.content_url}">
        <field column="text"/>
    </entity>
</entity>
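
For reference, here is a rough sketch of how the whole data-config.xml might look with the Tika entity nested as suggested. It only rearranges pieces already shown in this thread and assumes a build that includes the SOLR-1583 classes (BinURLDataSource, TikaEntityProcessor). The entity name "tika_content" is made up for illustration, and the uppercase ${my_database.CONTENT_URL} is an assumption that the variable has to match the column name as the Oracle driver returns it, so treat it as something to verify rather than a confirmed fix.

```xml
<dataConfig>
  <!-- Database source, copied from the original config in this thread -->
  <dataSource type="JdbcDataSource"
              driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@whatever:12345:whatever"
              user="me"
              name="ds-db"
              password="secret"/>

  <!-- Binary stream source read by Tika; not available in stock 1.4.0/1.4.1 -->
  <dataSource type="BinURLDataSource" name="ds-url"/>

  <document>
    <entity name="my_database" dataSource="ds-db"
            query="select * from my_database where rownum &lt;=2">
      <field column="CONTENT_ID"  name="content_id"/>
      <field column="CMS_TITLE"   name="cms_title"/>
      <field column="FORM_TITLE"  name="form_title"/>
      <field column="FILE_SIZE"   name="file_size"/>
      <field column="KEYWORDS"    name="keywords"/>
      <field column="DESCRIPTION" name="description"/>
      <field column="CONTENT_URL" name="content_url"/>

      <!-- Nested entity: runs once per database row and fetches that row's document.
           "tika_content" is a hypothetical name; the uppercase CONTENT_URL in the
           variable is an assumption about how the driver reports the column. -->
      <entity name="tika_content" processor="TikaEntityProcessor"
              dataSource="ds-url" format="text"
              url="http://www.mysite.com/${my_database.CONTENT_URL}">
        <!-- "text" is the column in which Tika returns the extracted body -->
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With this layout the separate my_database_url entity and its second query are no longer needed, since the URL comes straight from the parent row.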


This, I guess, would be after I checked out and built from trunk?


Thanks - Tod
