On Sat, Feb 15, 2014 at 1:07 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 2/14/2014 10:45 PM, William Bell wrote:
> > On virtual cores the DIH handler is really slow. On a 12 core box it only
> > uses 1 core while indexing.
> >
> > Does anyone know how to do Java threading from a SQL query into Solr?
> > Examples?
> >
> > I can use SolrJ to do it, or I might be able to modify DIH to enable
> > threading.
> >
> > At some point in 3.x threading was enabled in DIH, but it was removed since
> > people were having issues with it (we never did).
>
> If you know how to fix DIH so it can do multiple indexing threads
> safely, please open an issue and upload a patch.
>

Please! Don't do it. Never again!
https://issues.apache.org/jira/browse/SOLR-3011

As far as I understand, the general idea is to find the DIH successor:
https://issues.apache.org/jira/browse/SOLR-4799?focusedCommentId=13738424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13738424

>
> I'm still using DIH for full rebuilds, but I'd actually like to replace
> it with a rebuild routine written in SolrJ. I currently achieve decent
> speed by running DIH on all my shards at the same time.
>
> I do use SolrJ for once-a-minute index maintenance, but the code that
> I've written to pull data out of SQL and write it to Solr is not able to
> index millions of documents in a single thread as fast as DIH does. I
> have been building a multithreaded design in my head, but I haven't had
> a chance to write real code and see whether it's actually a good design.
>
> For me, the bottleneck is definitely Solr, not the database. I recently
> wrote a test program that uses my current SolrJ indexing method. If I
> skip the "server.add(docs)" line, it can read all 91 million docs from
> the database and build SolrInputDocument objects for them in 2.5 hours
> or less, all with a single thread. When I do a real rebuild with DIH,
> it takes a little more than 4.5 hours -- and that is inherently
> multithreaded, because it's doing all the shards simultaneously. I have
> no idea how long it would take with a single-threaded SolrJ program.
>
> Thanks,
> Shawn
>

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
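
For illustration only, here is a rough sketch of the kind of multithreaded SolrJ design discussed above: one thread reads rows and builds documents (the part Shawn reports is fast), while several workers send batches to Solr (the part he reports is the bottleneck). Nothing here comes from the thread itself. ParallelIndexerSketch, BatchSink and index() are hypothetical names, and BatchSink is only a stand-in for whatever wraps the "server.add(docs)" call, so the sketch compiles without SolrJ on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class ParallelIndexerSketch {

    /** Hypothetical stand-in for the call that sends one batch, e.g. server.add(docs). */
    public interface BatchSink<T> {
        void send(List<T> batch) throws Exception;
    }

    /** Reads docs on the calling thread, sends batches on `workers` threads; returns docs sent. */
    public static <T> long index(Iterable<T> source, BatchSink<T> sink,
                                 int workers, int batchSize) throws Exception {
        // Bounded queue: the reader blocks when the senders fall behind.
        final BlockingQueue<List<T>> queue = new ArrayBlockingQueue<>(workers * 2);
        final List<T> poison = new ArrayList<>(); // end-of-input marker, compared by identity
        final AtomicLong sent = new AtomicLong();
        final AtomicReference<Exception> failure = new AtomicReference<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        List<T> batch = queue.take();
                        if (batch == poison) {
                            return;
                        }
                        if (failure.get() != null) {
                            continue; // after an error, keep draining so the reader never blocks
                        }
                        try {
                            sink.send(batch);
                            sent.addAndGet(batch.size());
                        } catch (Exception e) {
                            failure.compareAndSet(null, e);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            List<T> batch = new ArrayList<>(batchSize);
            for (T doc : source) {
                if (failure.get() != null) {
                    break; // stop reading once a sender has failed
                }
                batch.add(doc);
                if (batch.size() == batchSize) {
                    queue.put(batch);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty() && failure.get() == null) {
                queue.put(batch);
            }
        } finally {
            for (int i = 0; i < workers; i++) {
                queue.put(poison); // one marker per worker
            }
            pool.shutdown();
        }
        pool.awaitTermination(1, TimeUnit.HOURS);
        if (failure.get() != null) {
            throw failure.get();
        }
        return sent.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> fakeDocs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            fakeDocs.add("doc-" + i);
        }
        // The sink here does nothing; a real one would call SolrJ.
        long n = index(fakeDocs, batch -> { }, 4, 100);
        System.out.println("sent " + n + " docs");
    }
}
```

In a real rebuild the Iterable would wrap the JDBC result set and the sink would call SolrJ. Whether this actually beats running DIH on all shards at once is exactly the open question in the thread, so it would need to be measured.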