Hi all,

I am using Solr 4.9.0 to index a database with the DataImportHandler
(DIH). The database has a URL field, and in the DIH config a
TikaEntityProcessor uses that field to fetch and parse each document. The
URLs are valid: pasting one into a browser downloads the document just
fine. But when Solr fetches the same URL for Tika, it gets HTTP response
code 400. Any ideas why?

ERROR
BinURLDataSource
java.io.IOException: Server returned HTTP response code: 400 for URL:

EntityProcessorWrapper
Exception in entity : tika_content:
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url

DIH
<dataConfig>
        <dataSource type="JdbcDataSource"
                    name="ds-1"
                    driver="net.sourceforge.jtds.jdbc.Driver"
                    url="jdbc:jtds:sqlserver://1.2.3.4/database;instance=INSTANCE;user=USER;password=PASSWORD" />

        <dataSource type="BinURLDataSource" name="ds-2" />

        <document>
                <entity name="db_content" dataSource="ds-1"
                        transformer="ClobTransformer, RegexTransformer"
                        query="SELECT ContentID,
                                      DownloadURL
                               FROM DATABASE.VIEW">
                        <field column="ContentID" name="id" />
                        <field column="DownloadURL" clob="true" name="DownloadURL" />

                        <entity name="tika_content"
                                processor="TikaEntityProcessor"
                                url="${db_content.DownloadURL}"
                                onError="continue" dataSource="ds-2">
                                <field column="TikaParsedContent" />
                        </entity>
                </entity>
        </document>
</dataConfig>

SCHEMA - Fields
<field name="DownloadURL" type="string" indexed="true" stored="true" />
<field name="TikaParsedContent" type="text_general" indexed="true" stored="true" multiValued="true"/>
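
A minimal sketch for reproducing this outside Solr. It assumes (this is
not confirmed) that the DownloadURL values contain characters such as
spaces, which a browser percent-encodes silently but java.net.URL sends
unchanged. The class name UrlEncodeCheck and its helper methods are
hypothetical and not part of Solr or Tika. It requests the URL once as
stored and once percent-encoded, and prints both status codes.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

// Hypothetical helper, not part of Solr or Tika.
class UrlEncodeCheck {

    // Percent-encode illegal characters (e.g. spaces) in a raw URL string.
    // Caveat: the multi-argument URI constructor always quotes '%', so a URL
    // that is already percent-encoded would be encoded a second time.
    static String encode(String raw) throws Exception {
        URL u = new URL(raw);
        URI uri = new URI(u.getProtocol(), u.getUserInfo(), u.getHost(),
                u.getPort(), u.getPath(), u.getQuery(), u.getRef());
        return uri.toASCIIString();
    }

    // Issue a plain GET through java.net.URL and return the HTTP status code.
    static int status(String url) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        String raw = args[0]; // a DownloadURL value copied from the database
        System.out.println("as stored: " + status(raw));
        System.out.println("encoded:   " + status(encode(raw)));
    }
}
```

If the stored URL returns 400 and the encoded one returns 200, the values
in the DB probably need encoding before they reach the tika_content
entity. If both return 400, the cause is something else, for example
headers or cookies that the browser sends and Solr does not.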


