Please refer to this thread for history:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201006.mbox/%3c4c1b6bb6.7010...@gmail.com%3e


I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr 1.4.0 and getting the following error:

java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource

curl -s http://test.html|curl http://localhost:9080/solr/update/extract?extractOnly=true --data-binary @- -H 'Content-type:text/html'

... works fine, so presumably Tika extraction itself is working.


My data-config.xml looks like this:

<dataConfig>
  <dataSource type="JdbcDataSource"
    driver="oracle.jdbc.driver.OracleDriver"
    url="jdbc:oracle:thin:@whatever:12345:whatever"
    user="me"
    name="ds-db"
    password="secret"/>

  <dataSource type="BinURLDataSource"
    name="ds-url"/>

  <document>
    <entity name="my_database"
     dataSource="ds-db"
     query="select * from my_database where rownum &lt;=2">
      <field column="CONTENT_ID"                name="content_id"/>
      <field column="CMS_TITLE"                 name="cms_title"/>
      <field column="FORM_TITLE"                name="form_title"/>
      <field column="FILE_SIZE"                 name="file_size"/>
      <field column="KEYWORDS"                  name="keywords"/>
      <field column="DESCRIPTION"               name="description"/>
      <field column="CONTENT_URL"               name="content_url"/>
    </entity>

    <entity name="my_database_url"
     dataSource="ds-url"
     query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
     <entity processor="TikaEntityProcessor"
      dataSource="ds-url"
      format="text">
      url="http://www.mysite.com/${my_database.content_url}"
      <field column="text"/>
     </entity>
    </entity>

  </document>
</dataConfig>

I added the entity name="my_database_url" section to an existing (working) database entity so that Tika can index the content pointed to by content_url.

Is there anything obviously wrong with what I've tried so far? It is not working: the import keeps rolling back with the error above.


Thanks - Tod
