A few things: 1) If your database column is a BLOB, you should not use ClobTransformer; FieldStreamDataSource should be sufficient.
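For reference, here is a rough sketch of the kind of configuration I have in mind, reusing the table/column names from your config quoted below. I have not been able to test this against your schema, so treat it as a starting point rather than a known-good config:

  <dataConfig>
    <dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
                url="jdbc:oracle:thin:@//host:port/service" user="..." password="..."/>
    <dataSource name="dastream" type="FieldStreamDataSource"/>
    <document>
      <entity name="messages" pk="X_MSG_PK"
              query="select X_MSG_PK, MESSAGE from table1"
              dataSource="db">
        <field column="X_MSG_PK" name="id"/>
        <entity name="message"
                processor="TikaEntityProcessor"
                dataSource="dastream"
                dataField="messages.MESSAGE"
                format="text">
          <field column="text" name="mxMsg"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

The only differences from your file are dropping transformer="ClobTransformer" and clob="true", and listing the columns explicitly instead of "select *".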
2) A previous message showed that the converted/extracted document was empty (except for an HTML boilerplate wrapper). This was using the configuration I suggested. I'm guessing that TikaEntityProcessor is either receiving empty strings as its source, or failing to extract the content of certain file formats. To test the latter, you could export one of the blobs to a file and run the stand-alone Tika app on it (there is a sample command at the bottom of this mail). As to the possibility that TikaEntityProcessor is receiving empty strings as input: I had a similar issue, but with varchars. In my case, the reason was that I was running a really old version of Oracle, which would not work with recent versions of the Oracle support libraries. Another thing that might be worth checking: your main query uses "select * ...". Have you tried explicitly listing the columns you're interested in? Something like "select X_MSG_PK, MESSAGE from table1".

On Tue, Feb 25, 2014 at 1:11 PM, Chandan khatua <chand...@nrifintech.com> wrote:

> Okey.
>
> Here is my data-config file:
>
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
> url="jdbc:oracle:thin:@//1.2.3.4:1/d11gr21" user="aaaa" password="aaaa" />
> <dataSource name="dastream" type="FieldStreamDataSource"/>
> <document>
> <entity
> name="messages" pk="X_MSG_PK"
> query="select * from table1"
> dataSource="db">
> <field column ="X_MSG_PK" name ="id" />
> <entity name="message"
> transformer="ClobTransformer"
> dataSource="dastream"
> processor="TikaEntityProcessor"
> dataField="messages.MESSAGE"
> format="text">
> <field column="text" name="mxMsg" clob="true"/>
> </entity>
> </entity>
> </document>
> </dataConfig>
>
> ----------------------------------------------------------------------------
>
> Solr.log file :
>
> INFO - 2014-02-25 17:33:40.023; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/admin/mbeans params={cat=QUERYHANDLER&_=1393329819994&wt=json} status=0 QTime=1
> INFO - 2014-02-25 17:33:40.094; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/admin/mbeans params={cat=QUERYHANDLER&_=1393329820083&wt=json} status=0 QTime=0
> INFO - 2014-02-25 17:33:40.117; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/dataimport params={indent=true&command=status&_=1393329820089&wt=json} status=0 QTime=16
> INFO - 2014-02-25 17:33:40.131; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/dataimport params={indent=true&command=show-config&_=1393329820084} status=0 QTime=29
> INFO - 2014-02-25 17:33:42.026; org.apache.solr.handler.dataimport.DataImporter; Loading DIH Configuration: /dataconfig/data-config.xml
> INFO - 2014-02-25 17:33:42.031; org.apache.solr.handler.dataimport.DataImporter; Data Configuration loaded successfully
> INFO - 2014-02-25 17:33:42.033; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/dataimport params={optimize=false&indent=true&clean=true&commit=true&verbose=false&command=full-import&debug=false&wt=json} status=0 QTime=8
> INFO - 2014-02-25 17:33:42.035; org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
> INFO - 2014-02-25 17:33:42.043; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/dataimport params={indent=true&command=status&_=1393329822040&wt=json} status=0 QTime=0
>
> INFO - 2014-02-25 17:33:42.064; org.apache.solr.handler.dataimport.SimplePropertiesWriter; Read dataimport.properties
> INFO - 2014-02-25 17:33:42.092; org.apache.solr.search.SolrIndexSearcher; Opening Searcher@2a858a73 realtime
> INFO - 2014-02-25 17:33:42.093; org.apache.solr.handler.dataimport.JdbcDataSource$1; Creating a connection for entity messages with URL: jdbc:oracle:thin:@//172.16.29.92:1521/d11gr21
> INFO - 2014-02-25 17:33:42.113; org.apache.solr.handler.dataimport.JdbcDataSource$1; Time taken for getConnection(): 19
> INFO - 2014-02-25 17:33:42.564; org.apache.solr.handler.dataimport.DocBuilder; Import completed successfully
> INFO - 2014-02-25 17:33:42.564; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> INFO - 2014-02-25 17:33:42.867; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
> commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@C:\solr-4.5.1\example\multicore\CHESS_CORE\data\index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2c6d8073; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_l,generation=21}
> commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@C:\solr-4.5.1\example\multicore\CHESS_CORE\data\index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2c6d8073; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_m,generation=22}
> INFO - 2014-02-25 17:33:42.868; org.apache.solr.core.SolrDeletionPolicy; newest commit generation = 22
> INFO - 2014-02-25 17:33:42.882; org.apache.solr.search.SolrIndexSearcher; Opening Searcher@558ea0cc main
> INFO - 2014-02-25 17:33:42.886; org.apache.solr.core.QuerySenderListener; QuerySenderListener sending requests to Searcher@558ea0cc main{StandardDirectoryReader(segments_m:55:nrt _d(4.5.1):C80)}
> INFO - 2014-02-25 17:33:42.889; org.apache.solr.core.QuerySenderListener; QuerySenderListener done.
> INFO - 2014-02-25 17:33:42.889; org.apache.solr.core.SolrCore; [CHESS_CORE] Registered new searcher Searcher@558ea0cc main{StandardDirectoryReader(segments_m:55:nrt _d(4.5.1):C80)}
> INFO - 2014-02-25 17:33:42.893; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> INFO - 2014-02-25 17:33:42.899; org.apache.solr.handler.dataimport.SimplePropertiesWriter; Read dataimport.properties
> INFO - 2014-02-25 17:33:42.901; org.apache.solr.handler.dataimport.SimplePropertiesWriter; Wrote last indexed time to dataimport.properties
> INFO - 2014-02-25 17:33:42.905; org.apache.solr.handler.dataimport.DocBuilder; Time taken = 0:0:0.839
> INFO - 2014-02-25 17:33:42.905; org.apache.solr.update.processor.LogUpdateProcessor; [CHESS_CORE] webapp=/solr path=/dataimport params={optimize=false&indent=true&clean=true&commit=true&verbose=false&command=full-import&debug=false&wt=json} status=0 QTime=8 {deleteByQuery=*:* (-1461012211508969472),add=[2158 (1461012211583418368), 2265 (1461012211591806976), 2225 (1461012211597049856), 2241 (1461012211602292736), 2276 (1461012211607535616), 2277 (1461012211612778496), 2302 (1461012211619069952), 4558 (1461012211624312832), 2144 (1461012211629555712), 2145 (1461012211635847168), ...
> (80 adds)],commit=} 0 8
> INFO - 2014-02-25 17:33:47.623; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/dataimport params={indent=true&command=status&_=1393329827620&wt=json} status=0 QTime=1
>
> ----------------------------------------------------------------------------
>
> Part of Query result screen :
>
> "docs": [
>   {
>     "id": "2158",
>     "mxMsg": [
>       ""
>     ],
>     "_version_": 1461012211583418400
>   },
>   {
>     "id": "2265",
>     "mxMsg": [
>       ""
>     ],
>     "_version_": 1461012211591807000
>   },
>
> ----------------------------------------------------------------------------
>
> As you see,
>
> 'id' is indexed properly, but 'mxMsg' is empty.
>
> ----------------------------------------------------------------------------
>
> Now, please suggest me so that I can get data in 'mxMsg' field. The binary data is stored in DB as BLOB type.
>
> Please note: The same configuration is working fine ('mxMsg' displays data if XML data are in DB as BLOB type).
>
> Please help,
>
> Looking forward,
>
> Chandan
>
>
> -----Original Message-----
> From: Gora Mohanty [mailto:g...@mimirtech.com]
> Sent: Tuesday, February 25, 2014 4:35 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Can not index raw binary data stored in Database in BLOB format.
>
> On 25 February 2014 14:54, Chandan khatua <chand...@nrifintech.com> wrote:
> > Hi Gora,
> >
> > The column type in DB is BLOB. It only stores binary data.
> >
> > If I do not use TikaEntityProcessor, then the following exception occurs:
> [...]
>
> It is difficult to follow what you are doing when you say one thing, and seem to do another. You say above that you are not using TikaEntityProcessor but your DIH data configuration file shows that you are. Please start with one configuration, and show us the *exact* files in use, and the error from the Solr logs.
>
> Regards,
> Gora
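P.S. The stand-alone Tika test I mentioned above would look something like this, assuming you have already exported one of the blobs to a file (the exact jar name depends on which Tika version you download; double-check the options with --help):

  java -jar tika-app-1.4.jar --text exported_blob.bin

If that prints the expected text, extraction itself works and the problem is more likely in what DIH hands to TikaEntityProcessor; if it prints nothing, the file format is the problem.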