Hi Jorg,

This is working now. If you look at SOLR-1583 (http://issues.apache.org/jira/browse/SOLR-1583), you can see that an InputStream was needed from the DataSource for the file and URL data sources. The same is true for the FieldReaderDataSource. I created a class, BinFieldReaderDataSource, that returns an InputStream over the BLOB rather than a Reader.
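To show why a Reader does not work for BLOB content, here is a small standalone sketch. It is not part of the class I wrote or of Solr; the class name ReaderVsStreamDemo, its two helper methods, and the sample bytes are made up for illustration. It pushes a few PDF-like bytes through an InputStream and through a Reader (assuming UTF-8 decoding) and compares each result with the original.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
public class ReaderVsStreamDemo {

  // Copies the bytes unchanged, the way a binary parser would consume them.
  static byte[] viaInputStream(byte[] data) throws IOException {
    InputStream in = new ByteArrayInputStream(data);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    return out.toByteArray();
  }

  // Decodes the bytes to chars through a Reader, then encodes them again.
  // Byte sequences that are not valid UTF-8 are replaced during decoding.
  static byte[] viaReader(byte[] data) throws IOException {
    Reader reader =
        new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[4096];
    int n;
    while ((n = reader.read(buf)) != -1) {
      sb.append(buf, 0, n);
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // "%PDF" followed by four high bytes, similar to the start of many PDF files.
    byte[] pdfLike = {0x25, 0x50, 0x44, 0x46,
        (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};

    System.out.println("stream intact: "
        + Arrays.equals(pdfLike, viaInputStream(pdfLike)));
    System.out.println("reader intact: "
        + Arrays.equals(pdfLike, viaReader(pdfLike)));
  }
}
```

The stream copy should match the original bytes, while the Reader round trip should not, because the high bytes are not valid UTF-8 and get replaced during decoding. That is the reason a binary parser such as Tika needs the raw InputStream.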
I am working off the trunk code from a few days ago, which I checked out using TortoiseSVN and compiled using the Ant that was in my Eclipse plugin directory, a fairly painless process. I am somewhat new to open source development, so for now I have just copied the text of the Java file and my XML config below.

##### BinFieldReaderDataSource.java

// Package is an assumption: the config below refers to the class by its
// short name, which DIH resolves against this package.
package org.apache.solr.handler.dataimport;

import static org.apache.solr.handler.dataimport.DataImportHandlerException.SEVERE;
import static org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow;

import java.io.InputStream;
import java.io.Reader;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.sql.Blob;
import java.sql.Clob;
import java.util.Properties;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BinFieldReaderDataSource extends DataSource<InputStream> {
  private static final Logger LOG =
      LoggerFactory.getLogger(BinFieldReaderDataSource.class);

  protected VariableResolver vr;
  protected String dataField;
  private String encoding;
  private EntityProcessorWrapper entityProcessor;

  public void init(Context context, Properties initProps) {
    dataField = context.getEntityAttribute("dataField");
    encoding = context.getEntityAttribute("encoding");
    entityProcessor = (EntityProcessorWrapper) context.getEntityProcessor();
  }

  // Resolves the configured dataField and returns the BLOB content as a stream.
  // The query argument is ignored.
  public InputStream getData(String query) {
    Object o = entityProcessor.getVariableResolver().resolve(dataField);
    if (o == null) {
      throw new DataImportHandlerException(SEVERE,
          "No field available for name : " + dataField);
    }
    if (o instanceof String) {
      throw new DataImportHandlerException(SEVERE,
          "Unsupported field type: String");
    } else if (o instanceof Clob) {
      throw new DataImportHandlerException(SEVERE,
          "Unsupported field type: CLOB");
    } else if (o instanceof Blob) {
      Blob blob = (Blob) o;
      try {
        // Most of the JDBC drivers have getBinaryStream defined as public,
        // so check first and only force access when it is not.
        Method m = blob.getClass().getDeclaredMethod("getBinaryStream");
        if (!Modifier.isPublic(m.getModifiers())) {
          m.setAccessible(true); // force invoke
        }
        return getInputStream(m, blob);
      } catch (Exception e) {
        LOG.info("Unable to get data from BLOB", e);
        return null;
      }
    } else {
      return null;
    }
  }

  // Carried over from FieldReaderDataSource; not used here because CLOBs
  // are rejected in getData.
  static Reader readCharStream(Clob clob) {
    try {
      Method m = clob.getClass().getDeclaredMethod("getCharacterStream");
      if (!Modifier.isPublic(m.getModifiers())) {
        m.setAccessible(true); // force invoke
      }
      return (Reader) m.invoke(clob);
    } catch (Exception e) {
      wrapAndThrow(SEVERE, e, "Unable to get reader from clob");
      return null; // unreachable
    }
  }

  private InputStream getInputStream(Method m, Blob blob)
      throws IllegalAccessException, InvocationTargetException {
    return (InputStream) m.invoke(blob);
  }

  public void close() {
  }
}

##### Tika-data-config.xml

<dataConfig>
  <dataSource name="f1" type="BinFieldReaderDataSource" />
  <dataSource name="orcle" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:user/p...@host:1521:sid" />
  <document>
    <entity dataSource="orcle" name="attach" query="select attachment from testtable2">
      <entity dataSource="f1" processor="TikaEntityProcessor" url="attachment"
              dataField="attach.ATTACHMENT" format="text">
        <field column="text" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>

Nirmal Shah

-----Original Message-----
From: Jorg Heymans [mailto:jorg.heym...@gmail.com]
Sent: Tuesday, January 26, 2010 3:43 AM
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler TikaEntityProcessor FieldReaderDataSource

Hi Shah,

I am assuming you are talking about the integration of SOLR-1358, I am very interested in this feature as well. Did you get it to work? Is there a snapshot build available for this somewhere or do I have to build Solr from source myself?

Thanks,
Jorg

On Mon, Jan 25, 2010 at 6:27 PM, Shah, Nirmal <ns...@columnit.com> wrote:
> Hi,
>
> I am fairly new to Solr and would like to use the DIH to pull rich text
> files (pdfs, etc) from BLOB fields in my database.
>
> There was a suggestion made to use the FieldReaderDataSource with the
> recently committed TikaEntityProcessor. Has anyone accomplished this?
>
> This is my configuration, and the resulting error - I'm not sure if I'm
> using the FieldReaderDataSource correctly.
> If anyone could shed light on whether I am going the right direction
> or not, it would be appreciated.
>
> ---------------Data-config.xml:
>
> <dataConfig>
>   <datasource name="f1" type="FieldReaderDataSource" />
>   <dataSource name="orcle" driver="oracle.jdbc.driver.OracleDriver"
>               url="jdbc:oracle:thin:un/p...@host:1521:sid" />
>   <document>
>     <entity dataSource="orcle" name="attach" query="select id as name, attachment from testtable2">
>       <entity dataSource="f1" processor="TikaEntityProcessor" dataField="attach.attachment" format="text">
>         <field column="text" name="NAME" />
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> -------------Debug error:
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">203</int>
>   </lst>
>   <lst name="initArgs">
>     <lst name="defaults">
>       <str name="config">testdb-data-config.xml</str>
>     </lst>
>   </lst>
>   <str name="command">full-import</str>
>   <str name="mode">debug</str>
>   <null name="documents"/>
>   <lst name="verbose-output">
>     <lst name="entity:attach">
>       <lst name="document#1">
>         <str name="query">select id as name, attachment from testtable2</str>
>         <str name="time-taken">0:0:0.32</str>
>         <str>----------- row #1-------------</str>
>         <str name="NAME">java.math.BigDecimal:2</str>
>         <str name="ATTACHMENT">oracle.sql.BLOB:oracle.sql.b...@1c8e807</str>
>         <str>---------------------------------------------</str>
>         <lst name="entity:253433571801723">
>           <str name="EXCEPTION">
> org.apache.solr.handler.dataimport.DataImportHandlerException: No dataSource :f1 available for entity :253433571801723 Processing Document # 1
>   at org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:279)
>   at org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:93)
>   at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:97)
>   at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
>   at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
>   at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
>   at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
>   at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
>   at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
>   at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
>   at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:203)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>   at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>   at org.mortbay.jetty.Server.handle(Server.java:285)
>   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>   at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>   at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>   at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>
> Thanks,
>
> Nirmal