On Sat, Dec 13, 2008 at 4:51 AM, Kay Kay <kaykay.uni...@yahoo.com> wrote:
> Thanks Bryan.
>
> That clarifies a lot.
>
> But even with streaming - retrieving one document at a time and adding to
> the IndexWriter seems to make it more serialized.

We have experimented with making DataImportHandler multi-threaded in the past. We found that the improvement was very small (5-10%) because, with databases on the local network, the bottleneck is Lucene's ability to index documents rather than DIH's ability to create documents. Since that made the implementation much more complex, we did not go with it.

> So - maybe the DataImportHandler could be optimized to retrieve a bunch of
> results from the query and add the Documents in a separate thread, from an
> Executor pool (and make this number configurable / maybe retrieved from the
> System as the number of physical cores to exploit maximum parallelism),
> since that seems like a bottleneck.

For now, you can try creating multiple root entities with a LIMIT clause to fetch rows in batches. For example:

<entity name="first" query="select * from table LIMIT 0, 5000">
  ...
</entity>
<entity name="second" query="select * from table LIMIT 5000, 5000">
  ...
</entity>

and so on (note that MySQL's LIMIT takes an offset and a row count, so each batch uses the same count of 5000).

An alternate solution would be to use request parameters as variables in the LIMIT clause and call DIH full-import with different offset and count values. For example:

<entity name="x" query="select * from x LIMIT ${dataimporter.request.startAt}, ${dataimporter.request.count}">
  ...
</entity>

Then call:

http://host:port/solr/dataimport?command=full-import&startAt=0&count=5000

Wait for the import to complete (you'll have to monitor the output to figure out when the import ends), and then call:

http://host:port/solr/dataimport?command=full-import&startAt=5000&count=5000

and so on. Note that "start" and "rows" are parameter names already used by DIH, so don't use those names. I guess this will be more complex than using multiple root entities.

> Any comments on the same.
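The second approach can be driven from a small external script. As a minimal sketch, the helper below computes the (startAt, count) pairs and builds the full-import URLs for each batch, and parses a DIH status response to decide when an import has finished. The host/port, the `startAt`/`count` parameter names, and the assumption that the status response contains a `<str name="status">` element with values like "idle"/"busy" should be checked against your Solr setup; `clean=false` is added so a later batch does not wipe out documents from the previous one.

```python
# Hypothetical driver for batched DIH full-imports; verify URL and
# status-response details against your Solr version before relying on it.
import xml.etree.ElementTree as ET

SOLR_DIH_URL = "http://host:port/solr/dataimport"  # placeholder host/port

def batch_params(total_rows, batch_size):
    """Yield (startAt, count) pairs covering total_rows in batch_size chunks."""
    for start in range(0, total_rows, batch_size):
        yield start, min(batch_size, total_rows - start)

def import_url(start_at, count):
    """Build the full-import URL for one batch. startAt/count are custom
    parameter names (avoiding DIH's reserved "start" and "rows");
    clean=false keeps earlier batches from being deleted."""
    return ("%s?command=full-import&clean=false&startAt=%d&count=%d"
            % (SOLR_DIH_URL, start_at, count))

def is_idle(status_xml):
    """Return True if a DIH status response reports status 'idle'
    (assumes a <str name="status"> element in the response)."""
    root = ET.fromstring(status_xml)
    for node in root.iter("str"):
        if node.get("name") == "status":
            return node.text == "idle"
    return False
```

A polling loop would then fetch each `import_url(...)` in turn, repeatedly requesting the status URL and sleeping until `is_idle(...)` returns True before moving on to the next batch.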
A workaround for the streaming bug with the MySQL JDBC driver is detailed here:

http://wiki.apache.org/solr/DataImportHandlerFaq

If you try any of these tricks, do let us know whether they improve performance. If something gives a large improvement, we can figure out ways to implement it inside DataImportHandler itself.

-- 
Regards,
Shalin Shekhar Mangar.