Thanks, Bryan.

That clarifies a lot. 

But even with streaming, retrieving one row at a time and adding the resulting 
document to the IndexWriter seems to make the import essentially serial.

So maybe the DataImportHandler could be optimized to pull a batch of results 
from the query and add the Documents on separate threads from an Executor pool 
(with the pool size configurable, or perhaps taken from the system as the number 
of physical cores, to exploit maximum parallelism), since that single-threaded 
add looks like the bottleneck. A rough sketch of the idea is below.
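Something along these lines is what I have in mind - just a rough sketch, not 
DIH's actual internals (the buildDocument() mapping and the IndexWriter handle 
are placeholders for whatever DIH builds from data-config.xml):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class ParallelAddSketch {

  // Placeholder: map the current row to a Lucene Document (DIH would drive
  // this from the field mappings in data-config.xml).
  static Document buildDocument(ResultSet rs) throws Exception {
    Document doc = new Document();
    // ... copy the row's columns into document fields ...
    return doc;
  }

  public static void importAll(Connection conn, IndexWriter writer, String query)
      throws Exception {
    // Pool sized to the number of cores, as suggested; could be made configurable.
    int threads = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(threads);

    // Streaming result set for MySQL: forward-only, read-only, fetchSize = MIN_VALUE.
    Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                          ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);
    ResultSet rs = stmt.executeQuery(query);

    // The ResultSet itself stays on this single thread; only the already
    // materialized documents are handed off. IndexWriter.addDocument is
    // thread-safe, so the adds can proceed in parallel.
    while (rs.next()) {
      final Document doc = buildDocument(rs);
      pool.submit(() -> {
        try {
          writer.addDocument(doc);
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }

    rs.close();
    stmt.close();
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}

In practice the executor's work queue would have to be bounded (or the submits 
throttled), otherwise the streamed rows just pile up in the pool's queue and we 
are back to holding everything in memory.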

Any comments on this?



--- On Fri, 12/12/08, Bryan Talbot <btal...@aeriagames.com> wrote:
From: Bryan Talbot <btal...@aeriagames.com>
Subject: Re: Solr - DataImportHandler - Large Dataset results ?
To: solr-user@lucene.apache.org
Date: Friday, December 12, 2008, 5:26 PM

It only supports streaming if properly enabled, which is completely lame:
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html

 By default, ResultSets are completely retrieved and stored in memory. In most
cases this is the most efficient way to operate, and due to the design of the
MySQL network protocol is easier to implement. If you are working with
ResultSets that have a large number of rows or large values, and can not
allocate heap space in your JVM for the memory required, you can tell the driver
to stream the results back one row at a time.

To enable this functionality, you need to create a Statement instance in the
following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
              java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch size of
Integer.MIN_VALUE serves as a signal to the driver to stream result sets
row-by-row. After this any result sets created with the statement will be
retrieved row-by-row.



-Bryan




On Dec 12, 2008, at 2:15 PM, Kay Kay wrote:

> I am using MySQL. I believe MySQL (since version 5) supports streaming.
> 
> On streaming more generally - when a database driver supports streaming, can
> we assume that the resultset iterator is a forward-only iterator?
> 
> If, say, the streaming size is 10K records and we are trying to retrieve a
> total of 100K records - what exactly happens when that threshold is reached
> (say, once the first 10K records have been retrieved)?
> 
> Are the previous set of records thrown away and replaced in memory by the
> new batch of records?
> 
> 
> 
> --- On Fri, 12/12/08, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
> From: Shalin Shekhar Mangar <shalinman...@gmail.com>
> Subject: Re: Solr - DataImportHandler - Large Dataset results ?
> To: solr-user@lucene.apache.org
> Date: Friday, December 12, 2008, 9:41 PM
> 
> DataImportHandler is designed to stream rows one by one to create Solr
> documents. As long as your database driver supports streaming, you should be
> fine. Which database are you using?
> 
> On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay <kaykay.uni...@yahoo.com> wrote:
> 
>> As per the example in the wiki -
>> http://wiki.apache.org/solr/DataImportHandler - I am seeing the following
>> fragment.
>> 
>> <dataSource driver="org.hsqldb.jdbcDriver"
>>             url="jdbc:hsqldb:/temp/example/ex" user="sa">
>>   <document name="products">
>>       <entity name="item" query="select * from item">
>>           <field column="ID" name="id" />
>>           <field column="NAME" name="name" />
>>             ......................
>>       </entity>
>>   </document>
>> </dataSource>
>> 
>> My scaled-down application looks very similar to this, but my resultset is
>> so big that it cannot possibly fit within main memory.
>> 
>> So I was planning to split this single query into multiple subqueries, with
>> an additional conditional on the id (id > 0 and id < 100, say).
>> 
>> I am curious if there is any way to specify such a clause,
>> (<splitData Column="id" batch="10000" />, where the column is supposed to
>> be an integer value) - and internally, the implementation could actually
>> generate the subqueries:
>> 
>> i) get the min and max of the numeric column, and send queries to the
>> database based on the batch size
>> 
>> ii) add Documents for each batch and close the resultset.
>> 
>> This might end up putting more load on the database (but at least each
>> batch would fit in main memory).
>> 
>> Let me know if anyone else has run into similar issues and how this was
>> handled.
>> 
>> 
>> 
> 
> 
> 
> 
> --Regards,
> Shalin Shekhar Mangar.
> 