On 2/14/2014 10:45 PM, William Bell wrote:
> On virtual cores the DIH handler is really slow. On a 12 core box it only
> uses 1 core while indexing.
>
> Does anyone know how to do Java threading from a SQL query into Solr?
> Examples?
>
> I can use SolrJ to do it, or I might be able to modify DIH to enable
> threading.
>
> At some point in 3.x threading was enabled in DIH, but it was removed since
> people were having issues with it (we never did).

If you know how to fix DIH so it can do multiple indexing threads safely, please open an issue and upload a patch.

I'm still using DIH for full rebuilds, but I'd actually like to replace it with a rebuild routine written in SolrJ. I currently achieve decent speed by running DIH on all my shards at the same time.

I do use SolrJ for once-a-minute index maintenance, but the code that I've written to pull data out of SQL and write it to Solr is not able to index millions of documents in a single thread as fast as DIH does. I have been building a multithreaded design in my head, but I haven't had a chance to write real code and see whether it's actually a good design.

For me, the bottleneck is definitely Solr, not the database. I recently wrote a test program that uses my current SolrJ indexing method. If I skip the "server.add(docs)" line, it can read all 91 million docs from the database and build SolrInputDocument objects for them in 2.5 hours or less, all with a single thread.

When I do a real rebuild with DIH, it takes a little more than 4.5 hours -- and that is inherently multithreaded, because it's doing all the shards simultaneously. I have no idea how long it would take with a single-threaded SolrJ program.

Thanks,
Shawn
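
For illustration only, here is a rough sketch of one possible shape for the multithreaded design mentioned above: a single reader thread builds batches (the part that is already fast enough) and a small pool of workers pushes them to Solr (the bottleneck). This is not tested against Solr. ParallelIndexer, BatchSink, and indexAll are hypothetical names invented for this sketch; in a real program the sink would presumably wrap the "server.add(docs)" call, assuming the SolrJ server object is safe to share across threads.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

final class ParallelIndexer {

    /** Hypothetical stand-in for the Solr call, e.g. a wrapper around server.add(docs). */
    interface BatchSink<D> {
        void send(List<D> batch) throws Exception;
    }

    /**
     * Reads documents from a single-threaded source, groups them into batches,
     * and hands the batches to a fixed pool of sender threads.
     * Returns the number of documents successfully passed to the sink.
     */
    static <D> long indexAll(Iterator<D> source, int batchSize, int threads,
                             BatchSink<D> sink) throws Exception {
        // Bounded queue: the reader blocks instead of buffering millions of docs.
        final BlockingQueue<List<D>> queue = new ArrayBlockingQueue<>(threads * 2);
        // End-of-input marker, compared by identity; real batches are never empty.
        final List<D> poison = new ArrayList<>();
        final AtomicLong sent = new AtomicLong();
        final AtomicReference<Exception> failure = new AtomicReference<>();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> workers = new ArrayList<>();
        try {
            for (int i = 0; i < threads; i++) {
                workers.add(pool.submit(() -> {
                    while (true) {
                        List<D> batch = queue.take();
                        if (batch == poison) {
                            return null;
                        }
                        if (failure.get() != null) {
                            continue; // after an error, drain so the reader never blocks
                        }
                        try {
                            sink.send(batch);
                            sent.addAndGet(batch.size());
                        } catch (Exception e) {
                            failure.compareAndSet(null, e);
                        }
                    }
                }));
            }

            // Reader side: runs on the calling thread.
            List<D> batch = new ArrayList<>(batchSize);
            while (source.hasNext() && failure.get() == null) {
                batch.add(source.next());
                if (batch.size() == batchSize) {
                    queue.put(batch);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                queue.put(batch);
            }
            for (int i = 0; i < threads; i++) {
                queue.put(poison); // one marker per worker
            }
            for (Future<?> w : workers) {
                w.get();
            }
        } finally {
            pool.shutdownNow();
        }

        if (failure.get() != null) {
            throw failure.get();
        }
        return sent.get();
    }
}
```

The bounded queue is the main design choice: if Solr really is the slow side, the reader simply waits on put() and memory use stays flat. Whether this would beat running DIH on all shards at once is something only a real test can show.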