On Sat, Dec 13, 2008 at 4:51 AM, Kay Kay <kaykay.uni...@yahoo.com> wrote:
> Thanks Bryan.
>
> That clarifies a lot.
>
> But even with streaming - retrieving one document at a time and adding to
> the IndexWriter seems to make it more serialized.

We have experimented with making DataImportHandler multi-threaded in the past. We found that the improvement was very small (5-10%) because, with databases on the local network, the bottleneck is Lucene's ability to index documents rather than DIH's ability to create documents. Since that made the implementation much more complex, we did not go with it.

> So - maybe the DataImportHandler could be optimized to retrieve a bunch of
> results from the query and add the Documents in a separate thread, from an
> Executor pool (and make this number configurable / maybe retrieved from the
> System as the number of physical cores to exploit maximum parallelism),
> since that seems like a bottleneck.

For now, you can try creating multiple root entities with a LIMIT clause to fetch rows in batches. For example:

<entity name="first" query="select * from table LIMIT 0, 5000">
  ...
</entity>
<entity name="second" query="select * from table LIMIT 5000, 5000">
  ...
</entity>

and so on (note that MySQL's LIMIT takes an offset and a row count, so each batch uses the same count of 5000).

An alternate solution would be to use request parameters as variables in the LIMIT clause and call DIH full-import with different offset and count values. For example:

<entity name="x" query="select * from x LIMIT ${dataimporter.request.startAt}, ${dataimporter.request.count}">
  ...
</entity>

Then call:

http://host:port/solr/dataimport?command=full-import&startAt=0&count=5000

Wait for the import to complete (you'll have to monitor the output to figure out when the import ends), and then call:

http://host:port/solr/dataimport?command=full-import&startAt=5000&count=5000

and so on. Note that "start" and "rows" are parameter names already used by DIH, so don't use those names. I guess this will be more complex than using multiple root entities.

> Any comments on the same.
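The second approach can be driven from a small external script. As a minimal sketch, the helper below computes the (startAt, count) pairs and builds the full-import URLs for each batch, and parses a DIH status response to decide when an import has finished. The host/port, the `startAt`/`count` parameter names, and the assumption that the status response contains a `<str name="status">` element with values like "idle"/"busy" should be checked against your Solr setup; `clean=false` is added so a later batch does not wipe out documents from the previous one.

```python
# Hypothetical driver for batched DIH full-imports; verify URL and
# status-response details against your Solr version before relying on it.
import xml.etree.ElementTree as ET

SOLR_DIH_URL = "http://host:port/solr/dataimport"  # placeholder host/port

def batch_params(total_rows, batch_size):
    """Yield (startAt, count) pairs covering total_rows in batch_size chunks."""
    for start in range(0, total_rows, batch_size):
        yield start, min(batch_size, total_rows - start)

def import_url(start_at, count):
    """Build the full-import URL for one batch. startAt/count are custom
    parameter names (avoiding DIH's reserved "start" and "rows");
    clean=false keeps earlier batches from being deleted."""
    return ("%s?command=full-import&clean=false&startAt=%d&count=%d"
            % (SOLR_DIH_URL, start_at, count))

def is_idle(status_xml):
    """Return True if a DIH status response reports status 'idle'
    (assumes a <str name="status"> element in the response)."""
    root = ET.fromstring(status_xml)
    for node in root.iter("str"):
        if node.get("name") == "status":
            return node.text == "idle"
    return False
```

A polling loop would then fetch each `import_url(...)` in turn, repeatedly requesting the status URL and sleeping until `is_idle(...)` returns True before moving on to the next batch.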
A workaround for the streaming bug with the MySQL JDBC driver is detailed here:

http://wiki.apache.org/solr/DataImportHandlerFaq

If you try any of these tricks, do let us know whether they improve performance. If something gives a large improvement, we can figure out ways to implement it inside DataImportHandler itself.

-- 
Regards,
Shalin Shekhar Mangar.