Re: Trying to use FieldReaderDataSource in DIH

Jeff Schmidt Mon, 07 Mar 2011 09:58:24 -0800

I can see that XPathEntityProcessor.init() is using the no-arg version of 
Context.getDataSource(). Since fields are hierarchical, should that not be a 
request for the current innermost data source (i.e. "fieldSource", which is 
a FieldReaderDataSource)?   Or should init() be looking at the dataSource 
attribute value of the field in order to effectively invoke 
Context.getDataSource("fieldSource")?
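To make the two alternatives concrete, here is a toy model of the lookup in question. It is not DIH source code: ToyContext, register() and the fallback rule are all hypothetical, and whether DIH really resolves the no-arg call this way is exactly the open question. The sketch assumes the no-arg call resolves through the entity's own dataSource attribute, and that an unregistered name silently falls back to the first registered source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for a DIH-style context; every name here is hypothetical.
final class ToyContext {
    // Registered data sources, name -> type, in declaration order.
    private final Map<String, String> registry = new LinkedHashMap<>();
    // Value of the current entity's dataSource attribute.
    private final String entityDataSourceName;

    ToyContext(String entityDataSourceName) {
        this.entityDataSourceName = entityDataSourceName;
    }

    void register(String name, String type) {
        registry.put(name, type);
    }

    // Named lookup. ASSUMPTION: an unknown name falls back to the first
    // registered source instead of failing.
    String getDataSource(String name) {
        String type = registry.get(name);
        if (type != null) {
            return type;
        }
        return registry.isEmpty() ? null : registry.values().iterator().next();
    }

    // No-arg lookup. ASSUMPTION: it uses the entity's own dataSource attribute.
    String getDataSource() {
        return getDataSource(entityDataSourceName);
    }

    public static void main(String[] args) {
        ToyContext both = new ToyContext("fieldSource");
        both.register("ipsDb", "JdbcDataSource");
        both.register("fieldSource", "FieldReaderDataSource");
        System.out.println(both.getDataSource()); // prints FieldReaderDataSource

        ToyContext onlyJdbc = new ToyContext("fieldSource");
        onlyJdbc.register("ipsDb", "JdbcDataSource");
        System.out.println(onlyJdbc.getDataSource()); // prints JdbcDataSource
    }
}
```

Under those assumptions the no-arg and named calls are equivalent whenever "fieldSource" is actually registered, so the thing to check would be whether that registration ever happens.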


It seems I'm obsessing over this "bug" when it's probably some bigger picture 
thing I'm missing.  Given the other examples of using this technique, it's hard 
to believe I'm the first to encounter this issue. :)

Thanks,

Jeff

On Mar 4, 2011, at 10:00 AM, Jeff Schmidt wrote:

> Hello:
> 
> I'm trying to make use of FieldReaderDataSource so that I can read an (Oracle) 
> database CLOB, and then use XPathEntityProcessor to derive Solr field values 
> via xpath notation.
> 
> For an extra bit of fun, the CLOB itself is base 64 encoded and gzip'd.  I 
> created a transformer of my own to take care of the encoding and compression 
> and that seems to work.  I patterned the new transformer after the existing 
> ones (Solr 3.1 trunk).  Anyway, I can see in catalina.out, my own debug 
> output:
> 
> ------------- Processing field: {toWrite=false, clob=true, column=SUMMARY_XML, boost=1.0, gzip64=true}
> ------------- Updated field: SUMMARY_XML to type: java.lang.String value: '<node id="ING:2ylbg" name="LOC677213" type="gene"><synonym-list><synonym name="LOC677213"/></synonym-list><macromolecule-list><macromolecule id="677213" source="EG" species="MM" name="similar to U2AF homology motif (UHM) kinase 1" summary=""/></macromolecule-list><member-of></member-of><molecular-function></molecular-function><biological-process></biological-process><cellular-component></cellular-component><pathway-list></pathway-list><protein-family><term name="unknown"/></protein-family><subcellular-location></subcellular-location><top-findings></top-findings><additional-findings></additional-findings><reference-list finding-count="0"></reference-list><copyright>&#169;2000-2010  Ingenuity Systems, Inc. All rights reserved.</copyright></node>'
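For readers following along, the decode step described above (Base64 first, then gunzip) can be sketched with the JDK alone. This is a minimal sketch, not the SolrDihGzip64Transformer from this thread: the Gzip64 class and its method names are made up, UTF-8 is an assumed charset, and readAllBytes() needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; not the transformer used in the thread.
final class Gzip64 {
    private Gzip64() {}

    // Base64-decode, gunzip, then read the bytes as UTF-8 (assumed charset).
    static String decode(String base64) {
        // The MIME decoder tolerates line breaks inside the encoded text.
        byte[] compressed = Base64.getMimeDecoder().decode(base64);
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Inverse operation, included only so the round trip can be checked.
    static String encode(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip trailer is written when the stream closes, so read buf afterwards.
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }

    public static void main(String[] args) {
        String xml = "<node id=\"ING:2ylbg\" name=\"LOC677213\" type=\"gene\"/>";
        System.out.println(decode(encode(xml)).equals(xml)); // prints true
    }
}
```

In a DIH transformer, something like decode() would be applied to the String that ClobTransformer has already produced for the column.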
> 
> So, the transformer replaces the original CLOB extracted by ClobTransformer 
> with a String representing the decoded result. I then want to feed this XML 
> string to XPathEntityProcessor.  So, in my DIH data config file:
> 
> <dataConfig>
>    <dataSource
>        name="ipsDb"
>        type="JdbcDataSource" 
>        driver="oracle.jdbc.driver.OracleDriver"
>        url="jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=ueipa1rac1-vip)(PORT=1537))(ADDRESS=(PROTOCOL=TCP)(HOST=ueipa1rac2-vip)(PORT=1537))(sdu=8760)(LOAD_BALANCE=yes)(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=EIPS1R)))"
>        user="user"
>        password="password"
>    />
> 
>    <datasource
>       name="fieldSource"
>       type="FieldReaderDataSource"
>    />
> 
>    <document>
>        <entity
>               rootEntity="false"
>               name="ipsNode"
>            dataSource="ipsDb"            
>               query="select SUMMARY_XML from IPS_NODE where ROWNUM &lt; 10"
>               transformer="ClobTransformer,com.ingenuity.isec.util.SolrDihGzip64Transformer">
> 
>            <field column="SUMMARY_XML" clob="true" gzip64="true"/>
> 
>               <entity
>                       name="node"
>                       dataSource="fieldSource"
>                       dataField="ipsNode.SUMMARY_XML"              
>                   processor="XPathEntityProcessor"            
>                   forEach="/node">
>       
>                   <field column="n_id" xpath="/node/@id"/>
>                   <field column="n_name" xpath="/node/@name"/>
>                   ...
>               </entity>
>        </entity>
>    </document>
> </dataConfig>
> 
> Basically, I'm trying to specify the (former CLOB, now String) SUMMARY_XML 
> field as the data field for the FieldReaderDataSource. I can see it has the 
> ability to simply return a StringReader() for String fields, rather than have 
> to deal with a Clob itself. So, I figured FieldReaderDataSource would be 
> happy with that and it would supply XPathEntityProcessor with XML contained 
> in the field's value.
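What the inner entity is expected to do with that field can be illustrated outside Solr. The sketch below wraps a String in a StringReader and evaluates the two xpath expressions from the config against a trimmed copy of the sample XML. It uses the JDK's javax.xml.xpath rather than DIH's own XPath handling, and FieldXPathSketch is a made-up name, so it only shows the intended result, not how DIH gets there.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical illustration; DIH does not use this class.
final class FieldXPathSketch {
    private FieldXPathSketch() {}

    // Evaluate one XPath expression against XML held in a String field.
    static String eval(String xml, String expression) {
        // A String field can be exposed to an XML parser as a Reader.
        Reader reader = new StringReader(xml);
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(reader));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Trimmed version of the SUMMARY_XML value from the debug output.
        String summaryXml =
            "<node id=\"ING:2ylbg\" name=\"LOC677213\" type=\"gene\">"
            + "<synonym-list><synonym name=\"LOC677213\"/></synonym-list>"
            + "</node>";
        System.out.println(eval(summaryXml, "/node/@id"));   // prints ING:2ylbg
        System.out.println(eval(summaryXml, "/node/@name")); // prints LOC677213
    }
}
```

Those two values are what the n_id and n_name columns should end up holding if the "node" entity reads from the field.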
> 
> But, when I do a full import, I see this:
> 
> Mar 4, 2011 9:10:26 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
> INFO: Starting Full Import
> Mar 4, 2011 9:10:26 AM org.apache.solr.core.SolrCore execute
> INFO: [ing-nodes] webapp=/solr path=/select params={clean=false&commit=true&command=full-import&qt=/dataimport-ips} status=0 QTime=31 
> Mar 4, 2011 9:10:26 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
> WARNING: Unable to read: dataimport-ips.properties
> Mar 4, 2011 9:10:26 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
> INFO: Creating a connection for entity ipsNode with URL: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=ueipa1rac1-vip)(PORT=1537))(ADDRESS=(PROTOCOL=TCP)(HOST=ueipa1rac2-vip)(PORT=1537))(sdu=8760)(LOAD_BALANCE=yes)(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=EIPS1R)))
> Mar 4, 2011 9:10:28 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
> INFO: Time taken for getConnection(): 1838
> Mar 4, 2011 9:10:28 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
> INFO: Creating a connection for entity node with URL: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=ueipa1rac1-vip)(PORT=1537))(ADDRESS=(PROTOCOL=TCP)(HOST=ueipa1rac2-vip)(PORT=1537))(sdu=8760)(LOAD_BALANCE=yes)(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=EIPS1R)))
> Mar 4, 2011 9:10:29 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
> INFO: Time taken for getConnection(): 1110
> Mar 4, 2011 9:10:29 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument
> SEVERE: Exception while processing: ipsNode document : null
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: null Processing Document # 1
>       at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>       at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
>       at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
>       at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
>       at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:262)
>       at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:203)
>       at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:183)
>       at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:586)
>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:612)
>       at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:266)
>       at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
>       at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
>       at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
>       at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
> Caused by: java.sql.SQLException: SQL statement to execute cannot be empty or null
>       at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:112)
>       at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:146)
>       at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:208)
>       at oracle.jdbc.driver.OracleSql.initialize(OracleSql.java:112)
>       at oracle.jdbc.driver.OracleStatement.executeInternal(OracleStatement.java:1683)
>       at oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java:1662)
>       at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:246)
>       ... 13 more
> 
> It looks like XPathEntityProcessor is not using the FieldReaderDataSource I 
> configured for the "node" entity. Instead it creates another JDBC connection 
> for the "node" entity and the stack trace indicates 
> XPathEntityProcessor.initQuery() invokes the getData() method of that data 
> source rather than the FieldReaderDataSource.
> 
> I see in initQuery():
> 
>  private void initQuery(String s) {
>    Reader data = null;
>    try {
>      final List<Map<String, Object>> rows = new ArrayList<Map<String, Object>>();
>      try {
>        data = dataSource.getData(s);
> ...
> 
> dataSource is set up as:
> 
>  @Override
>  @SuppressWarnings("unchecked")
>  public void init(Context context) {
>    super.init(context);
>    if (xpathReader == null)
>      initXpathReader();
>    pk = context.getEntityAttribute("pk");
>    dataSource = context.getDataSource();
>    rowIterator = null;
>  }
> 
> I'm not sure how all of these DIH components work, but it seems 
> context.getDataSource() must be returning the JDBC data source configured for 
> the outer entity (ipsNode), not the FieldReaderDataSource configured for the inner 
> entity (node) where I'm making use of XPathEntityProcessor.
> 
> What am I missing conceptually?  I've found a few references to the very 
> same problem, and I think I'm following the same pattern.
> 
> Thanks for any insights you can share,
> 
> Jeff
> --
> Jeff Schmidt
> 535 Consulting
> j...@535consulting.com
> http://www.535consulting.com
> 



--
Jeff Schmidt
535 Consulting
j...@535consulting.com
http://www.535consulting.com
