On 6/28/2010 8:28 AM, Alexey Serba wrote:
Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using
Solr Version: 1.4.0 and getting the following error:
java.lang.ClassNotFoundException: Unable to load BinURLDataSource or
org.apache.solr.handler.dataimport.BinURLDataSource
It seems that the DIH-Tika integration is not part of the Solr 1.4.0/1.4.1
releases. You should use trunk / nightly builds.
https://issues.apache.org/jira/browse/SOLR-1583
Thanks, that would explain things - I'm using a stock 1.4.0 download.
My data-config.xml looks like this:
<dataConfig>
 <dataSource type="JdbcDataSource"
   driver="oracle.jdbc.driver.OracleDriver"
   url="jdbc:oracle:thin:@whatever:12345:whatever"
   user="me"
   name="ds-db"
   password="secret"/>
 <dataSource type="BinURLDataSource"
   name="ds-url"/>
 <document>
   <entity name="my_database"
    dataSource="ds-db"
    query="select * from my_database where rownum <=2">
     <field column="CONTENT_ID"  name="content_id"/>
     <field column="CMS_TITLE"   name="cms_title"/>
     <field column="FORM_TITLE"  name="form_title"/>
     <field column="FILE_SIZE"   name="file_size"/>
     <field column="KEYWORDS"    name="keywords"/>
     <field column="DESCRIPTION" name="description"/>
     <field column="CONTENT_URL" name="content_url"/>
   </entity>
   <entity name="my_database_url"
    dataSource="ds-url"
    query="select CONTENT_URL from my_database where
content_id='${my_database.CONTENT_ID}'">
    <entity processor="TikaEntityProcessor"
     dataSource="ds-url"
     format="text">
     url="http://www.mysite.com/${my_database.content_url}"
     <field column="text"/>
    </entity>
   </entity>
 </document>
</dataConfig>
I added the entity name="my_database_url" section to an existing (working)
database entity to be able to have Tika index the content pointed to by the
content_url.
Is there anything obviously wrong with what I've tried so far?
I think you should move the Tika entity into the my_database entity and
simplify the whole configuration:
<entity name="my_database" dataSource="ds-db"
    query="select * from my_database where rownum <=2">
  ...
  <field column="CONTENT_URL" name="content_url"/>
  <entity processor="TikaEntityProcessor" dataSource="ds-url"
      format="text" url="http://www.mysite.com/${my_database.content_url}">
    <field column="text"/>
  </entity>
</entity>
This, I guess, would be after I checked out and built from trunk?
Thanks - Tod
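
For reference, here is a rough sketch of what the complete data-config.xml might look like with the suggestion above applied. It is untested and assumes a trunk / nightly build that includes the DIH-Tika integration (SOLR-1583). A few things in it are assumptions rather than anything confirmed in this thread: the nested entity name "tika_content" is made up for illustration, the placeholder uses the upper-case column name on the guess that it has to match what the JDBC driver returns, and the schema is assumed to have a field called "text". The url attribute sits inside the opening tag of the Tika entity, and the "<" in the query is escaped as &lt; so that the file is well-formed XML.

```xml
<dataConfig>
  <!-- Database source, copied from the config quoted above -->
  <dataSource type="JdbcDataSource"
      driver="oracle.jdbc.driver.OracleDriver"
      url="jdbc:oracle:thin:@whatever:12345:whatever"
      user="me"
      name="ds-db"
      password="secret"/>
  <!-- Binary source used to fetch each document over HTTP for Tika -->
  <dataSource type="BinURLDataSource" name="ds-url"/>
  <document>
    <!-- The less-than sign is escaped so the attribute value is well-formed -->
    <entity name="my_database"
        dataSource="ds-db"
        query="select * from my_database where rownum &lt;= 2">
      <field column="CONTENT_ID" name="content_id"/>
      <field column="CONTENT_URL" name="content_url"/>
      <!-- Remaining field mappings as in the original config -->
      <!-- Nested entity runs once per database row; its name is hypothetical -->
      <entity name="tika_content"
          processor="TikaEntityProcessor"
          dataSource="ds-url"
          format="text"
          url="http://www.mysite.com/${my_database.CONTENT_URL}">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Nesting the Tika entity under my_database means the second SQL lookup of CONTENT_URL is no longer needed, since the parent row already carries that column.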