Re: The 'threads' parameter in DIH - SOLR 4.3.0

Java One Fri, 14 Jun 2013 09:22:37 -0700

Hello,
 
       I'm more than happy to contribute to this effort as well.  
 
We are still on Solr 3.5 and never got the 'threads' parameter working 
properly. I've heard much of this was fixed in 3.6, but that it was still a bit 
buggy and was deprecated in later versions. Full support in 4.x is a major 
wish-list item, and I would be willing to sacrifice a few weekends to help 
write it.
 
Nevertheless, and related to this, it's interesting to see this topic open, 
because I just started experimenting with simulating a multi-threaded import 
via the DIH using the following approach, and it appears to be working fine, 
with one caveat:
 
Here are my steps:
 
I created a series of similar entities, partitioning the data I'm targeting by 
a logical range (i.e. WHERE somefield BETWEEN 'SOME VALUE' AND 'SOME VALUE').
 
I have a few of these, but depending on your data, you'll need to experiment. 
(You'll need to be careful not to bring your database to its knees!)
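
A minimal sketch of how such ranges might be generated, assuming the partition 
column is a numeric key with known minimum and maximum values (the post only 
shows placeholder values, so the column type is an assumption). The class and 
method names are hypothetical; each returned clause would go into the query of 
one entity.

```java
import java.util.ArrayList;
import java.util.List;

public class RangePartitioner {

    /**
     * Splits the inclusive range [min, max] into at most `parts` contiguous,
     * non-overlapping ranges and renders each one as a WHERE ... BETWEEN clause.
     * Assumes a numeric column; the column name is supplied by the caller.
     */
    public static List<String> betweenClauses(String column, long min, long max, int parts) {
        if (parts < 1 || max < min) {
            throw new IllegalArgumentException("need parts >= 1 and max >= min");
        }
        List<String> clauses = new ArrayList<>();
        long total = max - min + 1;   // number of key values to cover
        long base = total / parts;    // minimum size of each partition
        long extra = total % parts;   // the first `extra` partitions get one more value
        long start = min;
        for (int i = 0; i < parts && start <= max; i++) {
            long size = base + (i < extra ? 1 : 0);
            long end = start + size - 1;
            clauses.add("WHERE " + column + " BETWEEN " + start + " AND " + end);
            start = end + 1;
        }
        return clauses;
    }
}
```

Because BETWEEN is inclusive on both ends, each range starts one past the end of 
the previous one so that no row is imported twice.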
 
Within solrconfig.xml, I've created corresponding Data Import Handlers, one 
for each of these entities.
 
And when I initiate an import, I call each one, similar to the below 
(obviously I've stripped out my server and naming conventions):
 
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&commit=true
 
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]&commit=true
 
 
...etc
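
The calls above could also be issued from a small client instead of by hand. 
The following is only a sketch under assumptions: it uses the JDK's 
java.net.http client (Java 11+), the class and method names are hypothetical, 
and the handler URLs passed in stand for the bracketed placeholders above. It 
builds one full-import URL per handler/entity pair and sends all requests 
without waiting for one to return before sending the next.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ParallelImportTrigger {

    /** Builds the full-import URL for one handler and the entity it should import. */
    public static String importUrl(String handlerUrl, String entity) {
        return handlerUrl
                + "?command=full-import"
                + "&entity=" + URLEncoder.encode(entity, StandardCharsets.UTF_8)
                + "&commit=true";
    }

    /** Sends all requests concurrently and returns the HTTP status codes in input order. */
    public static List<Integer> triggerAll(HttpClient client, List<String> urls) {
        List<CompletableFuture<Integer>> pending = urls.stream()
                .map(url -> HttpRequest.newBuilder(URI.create(url)).GET().build())
                .map(request -> client
                        .sendAsync(request, HttpResponse.BodyHandlers.discarding())
                        .thenApply(response -> response.statusCode()))
                .collect(Collectors.toList());
        // join() blocks until each response has arrived (or rethrows the failure).
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}
```

The status codes only show that each request was accepted. To find out whether 
an import has actually finished, each handler would still need to be polled 
with command=status.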
 
The import seems to run fine, with SIGNIFICANT performance gains. The only 
caveat I haven't figured out yet is that one of the threads doesn't appear to 
commit its data (although it states that it did). I have turned off auto-commit 
as well, but I'm still playing with it.
 
Either way, an out-of-the-box solution would be much preferred.
 
Cheers!
Mike

From: Mikhail Khludnev <mkhlud...@griddynamics.com>
To: solr-user <solr-user@lucene.apache.org> 
Sent: Friday, June 14, 2013 9:15 AM
Subject: Re: The 'threads' parameter in DIH - SOLR 4.3.0

Hello,

Most of the time users end up coding a multithreaded SolrJ indexer, which I
consider a sad thing. As a contributor to the 3.x fix, I want to share my view
of the problem. While doing that work I realized that the join operation itself
is too hard, perhaps even impossible, to make concurrent. I propose adding
concurrency to the outbound and inbound streams instead.

My plan is:
1. Add threads to the outbound flow:
https://issues.apache.org/jira/browse/SOLR-3585
It allows DIH not to wait for Solr. I mostly like that code, but recently I
realized that it implements the ConcurrentUpdateSolrServer algorithm. Looking
forward, I would prefer to unify some core concurrent code between them, or to
use CUSS inside DIH's SolrWriter.
2. The next problem we faced is SQLEntityProcessor. It has two modes: one of
them gets miserable performance due to the N+1 problem; the cached version is
not production capable with the default heap cache. Our proposal for it is
https://issues.apache.org/jira/browse/SOLR-4799
Unfortunately, I have no time to polish the patch.
3. After that, the only thing DIH waits for is JDBC. It can easily be boosted
by implementing a DataSource wrapper with a producer thread and a bounded queue
as a buffer.
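
The producer-thread-plus-bounded-queue idea in step 3 can be sketched roughly 
as follows. This is not DIH's DataSource API and not code from either patch: it 
is a generic, hypothetical wrapper around a plain java.util.Iterator (standing 
in for a JDBC row iterator), assuming rows are never null. A background thread 
reads ahead into a fixed-size queue, so the consumer rarely waits on the source 
while memory use stays bounded.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PrefetchingIterator<T> implements Iterator<T> {
    private static final Object END = new Object();  // marks the end of the stream

    private final BlockingQueue<Object> buffer;      // bounded: producer blocks when full
    private volatile RuntimeException failure;       // error raised by the source, if any
    private Object next;                             // row fetched by hasNext(), not yet returned

    public PrefetchingIterator(Iterator<T> source, int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> produce(source), "prefetch-producer");
        producer.setDaemon(true);
        producer.start();
    }

    private void produce(Iterator<T> source) {
        try {
            try {
                while (source.hasNext()) {
                    buffer.put(source.next());       // waits while the buffer is full
                }
            } catch (RuntimeException e) {
                failure = e;                         // reported to the consumer at END
            }
            buffer.put(END);
        } catch (InterruptedException e) {
            // Nothing interrupts this thread in the sketch; a real wrapper would
            // need a way to unblock the consumer here.
            Thread.currentThread().interrupt();
        }
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = buffer.take();                // waits while the buffer is empty
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a row", e);
            }
        }
        if (next == END) {
            if (failure != null) {
                throw new IllegalStateException("source failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        @SuppressWarnings("unchecked")
        T row = (T) next;
        next = null;
        return row;
    }
}
```

The queue capacity is the tuning knob: a larger buffer hides more source 
latency at the cost of holding more rows in memory.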

If we complete this plan, we will never need to code SolrJ indexers.

My particular question to you is: what do you need to speed up?

On Thu, Jun 13, 2013 at 11:01 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 6/13/2013 12:08 PM, bbarani wrote:
>
>> I see that the threads parameter has been removed from DIH from all
>> versions starting SOLR 4.x. Can someone let me know the best way to
>> initiate indexing in multi threaded mode when using DIH now? Is there a
>> way to do that?
>>
>
> That parameter was removed because it didn't work right, and there was no
> apparent way to fix it.  The change that went into a later 3.6 version was
> a bandaid, not a fix.  I don't know all the details.
>
> There's no way to get multithreading with DIH directly, but you can do it
> indirectly:
>
> Create multiple request handlers with different names, such as
> /dataimport1, /dataimport2, etc.  Configure each handler with settings that
> will pull part of your data source.  Start them so they run concurrently.
>
> Depending on your environment, it may be easier to just write a
> multi-threaded indexing application using the Solr API for your language of
> choice.
>
> Thanks,
> Shawn
>
>

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<mkhlud...@griddynamics.com>
