Large RDBMS dataset

Finotti Simone Wed, 14 Dec 2011 02:19:08 -0800

Hello,
I have a very large dataset (> 1 Mrecords) on the RDBMS which I want my Solr 
application to pull data from.


Problem is that the document fields which I have to index aren't in the same 
table, but I have to join records with two other tables. Well, in fact they are 
views, but I don't think that this makes any difference.

That's the data import handler that I've actually written:

<?xml version="1.0"?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="net.sourceforge.jtds.jdbc.Driver" 
url="jdbc:jtds:sqlserver://YSQLDEV01BLQ/YooxProcessCluster1" 
instance="SVCSQLDEV" />
  <document name="Products">
    <entity name="fd" query="SELECT * FROM clust_w_fast_dump ORDER BY 
endeca_id;">
      <entity name="fd2" query="SELECT macrocolor_id, color_descr, gsize_descr, 
size_descr FROM clust_w_fast_dump2_ByMarkets WHERE endeca_id='${fd.Endeca_ID}' 
ORDER BY endeca_id;"/>
      <entity name="cpd" query="SELECT DepartmentCode, Ranking, 
DepartmentPriceRangeCode FROM clust_w_CatalogProductsDepartments_ByMarket WHERE 
endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
      <entity name="env" query="SELECT Environment FROM clust_w_Environment 
WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
    </entity>
  </document>
</dataConfig>

It works, but it takes 1'38" to parse 100 records: it means 1 rec/s! That means 
that digesting the whole dataset would take 1 Ms (=> 12 days).

The problem is that for each record in "fd", Solr makes three distinct SELECT 
on the other three tables. Of course, this is absolutely inefficient.

Is there a way to have Solr loading every record in the four tables and join 
them when they are already loaded in memory?

TIA

Large RDBMS dataset

Reply via email to