Fwd: Tika Integration problem with DIH and JDBC

Alexandre Rafalovitch Fri, 10 Oct 2014 12:18:03 -0700

I would concentrate on the stack traces and try reading them. They
often provide a lot of clues. For example, your original stack trace
had


org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
   at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:283)
   at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:240)
2) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:44)
   at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:188)
1) at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:112)
   at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)

I added 1) and 2) to mark the important lines. You can see in 1)
that your TikaEntityProcessor is calling 2) JdbcDataSource, which is
not what you wanted, since you specified BinURLDataSource. So focus on
that until it is resolved.

Sometimes this happens when the XML file says 'datasource' instead of
'dataSource' (DIH is case-sensitive), but that does not seem to be the
case in your situation.
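
For reference, here is a minimal sketch of what I would expect such a
DIH config to look like. It is not your actual file: the data source
name "db", the SQLite driver and JDBC URL, the table "pages" and its
columns "id" and "url" are all made up for illustration. The parts
that matter are the two dataSource definitions above the document and
the 'dataSource' attribute (capital S) on each entity.

```xml
<dataConfig>
  <!-- Hypothetical JDBC source; driver, url and name are placeholders -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.sqlite.JDBC"
              url="jdbc:sqlite:/path/to/example.db"/>
  <!-- Binary source used by Tika to fetch each URL -->
  <dataSource name="bin" type="BinURLDataSource"/>

  <document>
    <!-- Outer entity reads rows from the database via "db" -->
    <entity name="root" dataSource="db"
            query="SELECT id, url FROM pages">
      <field column="id" name="id"/>
      <!-- Inner entity must name "bin" explicitly via dataSource;
           the thread suggests it otherwise ends up on JdbcDataSource -->
      <entity name="extract" processor="TikaEntityProcessor"
              dataSource="bin" url="${root.url}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Note that the variable in url="${root.url}" has to match the column
name as the JDBC driver returns it; Oracle typically returns
upper-case column names, so that may need to be ${root.URL} there.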

Regards,
    Alex.
P.S. If you still haven't figured it out, mention the Solr version in
the next email. Sometimes it makes a difference, though DIH has been
largely unchanged for a while.

---------- Forwarded message ----------
From: Dan Davis <d...@danizen.net>
Date: 10 October 2014 15:00
Subject: Re: Tika Integration problem with DIH and JDBC
To: Alexandre Rafalovitch <arafa...@gmail.com>


The definition of dataSource name="bin" type="BinURLDataSource" is in
each of the dih-*.xml files.
But only the xml version has the definition at the top, above the document.

Moving the dataSource definition to the top does change the behavior;
now I get the following error for that entity:

Exception in entity :
extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
JDBC URL or JNDI name has to be specified Processing Document # 30

When I changed it to specify url="", it reverted to the earlier form:

Exception in entity :
extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query: http://www.cdc.gov/flu/swineflu/ Processing
Document # 1
   at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)

It does seem to be a problem resolving the dataSource in some way. I
therefore double-checked another part of solrconfig.xml. Since the
XML example still works, I assume the required libraries are loaded:

  <lib dir="${solr.solr.home:}/dist/" regex="solr-dataimporthandler-.*\.jar" />

  <lib dir="${solr.solr.home:}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.solr.home:}/dist/" regex="solr-cell-\d.*\.jar" />

  <lib dir="${solr.solr.home:}/contrib/clustering/lib/" regex=".*\.jar" />
  <lib dir="${solr.solr.home:}/dist/" regex="solr-clustering-\d.*\.jar" />

  <lib dir="${solr.solr.home:}/contrib/langid/lib/" regex=".*\.jar" />
  <lib dir="${solr.solr.home:}/dist/" regex="solr-langid-\d.*\.jar" />

  <lib dir="${solr.solr.home:}/contrib/velocity/lib" regex=".*\.jar" />
  <lib dir="${solr.solr.home:}/dist/" regex="solr-velocity-\d.*\.jar" />


On Fri, Oct 10, 2014 at 2:37 PM, Alexandre Rafalovitch
<arafa...@gmail.com> wrote:
>
> You say "dataSource='bin'" but I don't see you defining that datasource. E.g.:
>
> <dataSource type="BinURLDataSource" name="bin"/>
>
> So, there might be some weird default fallback that just causes
> strange problems.
>
> Regards,
>     Alex.
>
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 10 October 2014 14:17, Dan Davis <dansm...@gmail.com> wrote:
> >
> > What I want to do is to pull a URL out of an Oracle database, and then use
> > TikaEntityProcessor and BinURLDataSource to go fetch and process that URL.
> > I'm having a problem with this that seems general to JDBC with Tika - I get
> > an exception as follows:
> >
> > Exception in entity :
> > extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
> > Unable to execute query: http://www.cdc.gov/healthypets/pets/wildlife.html
> > Processing Document # 14
> >       at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
> > ...
> >
> > Steps to reproduce any problem should be:
> >
> > Try it with the XML and verify you get two documents and they contain text
> > (schema browser with the text field)
> > Try it with a JDBC sqlite3 dataSource and verify that you get an exception,
> > and advise me what may be the problem in my configuration ...
> >
> > Now, I've tried this 3 ways:
> >
> > My Oracle database - fails as above
> > An SQLite3 database to see if it is Oracle specific - fails with "Unable to
> > execute query", but doesn't have the URL as part of the message.
> > An XML file listing two URLs - succeeds without error.
> >
> > For the SQL attempts, setting onError="skip" leads the data from the
> > database to be indexed, but the exception is logged for each root entity.
> > I can tell that nothing is indexed from the text extraction by browsing the
> > "text" field from the schema browser and seeing how few terms there are.
> > The exceptions also sort of give it away, but it is good to be careful :)
> >
> > This is using:
> >
> > Tomcat 7.0.55
> > Solr 4.10.1
> > and JDBC drivers
> >
> > ojdbc7.jar
> > sqlite-jdbc-3.7.2.jar
> >
> > Excerpt of solrconfig.xml:
> >
> >   <!-- Data Import Handler for Health Topics -->
> >   <requestHandler name="/dih-healthtopics" class="solr.DataImportHandler">
> >     <lst name="defaults">
> >       <str name="config">dih-healthtopics.xml</str>
> >     </lst>
> >   </requestHandler>
> >
> >   <!-- Data Import Handler that imports a single URL via Tika -->
> >   <requestHandler name="/dih-smallxml" class="solr.DataImportHandler">
> >     <lst name="defaults">
> >       <str name="config">dih-smallxml.xml</str>
> >     </lst>
> >   </requestHandler>
> >
> >     <!-- Data Import Handler that imports a single URL via Tika -->
> >   <requestHandler name="/dih-smallsqlite" class="solr.DataImportHandler">
> >     <lst name="defaults">
> >       <str name="config">dih-smallsqlite.xml</str>
> >     </lst>
> >   </requestHandler>
> >
> >
> > The data import handlers and a copy-paste from Solr logging are attached.
