Re: SolrCloud MatchAllDocsQuery returning different number of docs each request

Mark Miller Thu, 02 Aug 2012 16:07:57 -0700

FYI: I've committed the rest of the work I was doing on trunk in this area.


On Aug 2, 2012, at 4:42 PM, Timothy Potter <thelabd...@gmail.com> wrote:

> Yes, I can but won't get to it today unfortunately. I had my eval
> environment running on some very expensive EC2 instances and shut it
> down for the time being until I can focus on it again. Will try to get
> back to this either tomorrow or over the weekend. Sorry for the delay.
> 
> Tim
> 
> On Thu, Aug 2, 2012 at 1:35 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> Can you do me a favor and try not using the batch add for a run?
>> 
>> Just do the add one doc at a time. (solrServer.add(doc) rather than 
>> solrServer.add(collection))
>> 
>> I just fixed one issue with it this morning on trunk - it may be the cause 
>> of this oddity.
>> 
>> I'm also working on some performance issues around that method (good 
>> performance without starting thousands of threads).
>> 
>> Until I get all that straightened out (hopefully very soon), I think you 
>> will have better luck not using the bulk, collection add method.
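For a test run along those lines, the batch can be drained one document at a time. The following is only a sketch under assumptions: DocSink is a hypothetical stand-in for the single-document call (for example doc -> solrServer.add(doc)), and OneAtATime is a made-up helper name, not part of SolrJ.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class OneAtATime {
    /** Hypothetical stand-in for the single-document add, e.g. doc -> solrServer.add(doc). */
    interface DocSink<D> {
        void add(D doc) throws Exception;
    }

    /**
     * Sends each document individually instead of as one batch.
     * Returns the documents whose add threw, so they can be logged or retried.
     */
    static <D> List<D> addIndividually(Collection<D> batch, DocSink<D> sink) {
        List<D> failed = new ArrayList<>();
        for (D doc : batch) {
            try {
                sink.add(doc);
            } catch (Exception e) {
                failed.add(doc); // keep going; one bad doc should not stop the run
            }
        }
        return failed;
    }
}
```

Collecting failures rather than aborting makes it easier to tell afterwards whether any documents never reached the index.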
>> 
>> On Aug 2, 2012, at 2:16 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>> 
>>> Thanks Mark.
>>> 
>>> I'm actually using SolrJ 3.4.0, so using CommonsHttpSolrServer:
>>> 
>>> Collection<SolrInputDocument> batch = ...
>>> ... build up batch ...
>>> solrServer.add( batch );
>>> 
>>> Basically, I have a custom Pig StoreFunc that sends docs to Solr from
>>> our Hadoop analytics nodes. The reason I'm not using SolrJ 4.0.0-ALPHA
>>> is that I couldn't get it to run in my Hadoop environment. There's
>>> some classpath conflict with the Apache HttpClient. SolrJ 4 depends on
>>> HttpClient 4.1.3, but when I run it in my env, I get the following:
>>> 
>>> Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager: method <init>()V not found
>>>      at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:94)
>>>      at org.apache.solr.client.solrj.impl.CloudSolrServer.<init>(CloudSolrServer.java:70)
>>>      ... 16 more
>>> 
>>> I spent hours trying to resolve the classpath issue and finally had to
>>> bail and just used the 3.4 SolrJ client as I'm just at the evaluation
>>> stage at this point. So it sounds like this could be the cause of my
>>> problems.
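A NoSuchMethodError on a constructor usually means the class loaded at run time is a different version from the one the caller was compiled against. One way to narrow that down is to print which jar the class actually comes from inside the failing environment. This is a minimal JDK-only sketch. JarLocator and its method names are made up for illustration, and the idea that an older HttpClient jar sits earlier on the Hadoop classpath is an assumption, not something confirmed in this thread.

```java
import java.security.CodeSource;

final class JarLocator {
    /** Returns the jar or directory a class was loaded from, or a short note if unknown. */
    static String locate(String className) {
        try {
            // initialize=false: only load the class, do not run its static initializers
            Class<?> c = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    /** True if the loaded version of the class has a public no-argument constructor. */
    static boolean hasPublicNoArgConstructor(String className) {
        try {
            Class.forName(className, false, JarLocator.class.getClassLoader()).getConstructor();
            return true;
        } catch (ReflectiveOperationException | LinkageError e) {
            return false;
        }
    }
}
```

Calling both methods with "org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager" from inside the StoreFunc would show which jar wins and whether that version has the constructor SolrJ is asking for.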
>>> 
>>> One other thing ... I do have the _version_ field defined in my
>>> schema.xml but am not setting it on the client side when indexing.
>>> Should I be doing that?
>>> 
>>> Cheers,
>>> Tim
>>> 
>>> 
>>> On Thu, Aug 2, 2012 at 11:27 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>>> 
>>>> On Aug 2, 2012, at 11:08 AM, Timothy Potter <thelabd...@gmail.com> wrote:
>>>> 
>>>>> Just starting to get into SolrCloud using 4.0.0-ALPHA and am very
>>>>> impressed so far ...
>>>>> 
>>>>> I have a 12-shard index with ~104M docs, with each shard having
>>>>> 1 replica (so 24 Solr servers running).
>>>>> 
>>>>> Using the Query form on the Admin panel, I issue the MatchAllDocsQuery
>>>>> (*:*) and each time I send the request the value for numFound in the
>>>>> result is different. It's always close, but not exactly the same, as I
>>>>> would expect. Can anyone shed some light on this issue? I also tried a
>>>>> real query, such as "#olympics lochte" and same thing - different
>>>>> numFound each time. The first page of actual docs returned is the same
>>>>> so maybe I should just ignore the numFound issue?
>>>>> 
>>>>> Note that while experiencing this behavior, I am not adding any docs
>>>>> to the index and all docs have been committed with waitFlush=true and
>>>>> waitSearcher=true on the commit. Also, not doing soft commits at this
>>>>> point. In addition, after having committed all 104M docs, I hit the
>>>>> optimize button on the panel so I have only 1 segment. In other words,
>>>>> the index is not being updated and has been optimized at this point.
>>>> 
>>>> 
>>>> How are you adding docs? E.g., what client and what method in particular 
>>>> (what is your line of code that actually adds the doc)?
>>>> 
>>>> You can find the numFound result for each node by passing the param 
>>>> distrib=false. What does this tell you? Are your replicas in sync with the 
>>>> leader? What does the count for each shard add up to?
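The per-node comparison described here could be scripted roughly as follows. This is a sketch under assumptions: ReplicaCountCheck and its method names are invented for illustration, the core URL layout is a guess at a typical setup, and the regex is a shortcut for pulling numFound out of an XML or JSON response rather than proper parsing. Fetching the URLs is left to whatever HTTP client is at hand.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplicaCountCheck {
    // Matches numFound="123" (XML) as well as "numFound":123 (JSON).
    private static final Pattern NUM_FOUND = Pattern.compile("numFound\"?[=:]\\s*\"?(\\d+)");

    /** Builds a match-all query against a single core, with distributed search turned off. */
    static String localCountUrl(String coreUrl) {
        String base = coreUrl.endsWith("/") ? coreUrl.substring(0, coreUrl.length() - 1) : coreUrl;
        return base + "/select?q=*:*&rows=0&distrib=false";
    }

    /** Extracts numFound from a raw response body. */
    static long parseNumFound(String body) {
        Matcher m = NUM_FOUND.matcher(body);
        if (!m.find()) {
            throw new IllegalArgumentException("no numFound in response");
        }
        return Long.parseLong(m.group(1));
    }

    /** Returns the shards whose replicas report different local counts. */
    static List<String> shardsOutOfSync(Map<String, List<Long>> countsByShard) {
        List<String> outOfSync = new ArrayList<>();
        for (Map.Entry<String, List<Long>> e : countsByShard.entrySet()) {
            if (new HashSet<>(e.getValue()).size() > 1) {
                outOfSync.add(e.getKey());
            }
        }
        return outOfSync;
    }
}
```

If every shard's leader and replica agree, summing one count per shard should give the same total on every request; any shard listed by shardsOutOfSync is a candidate for the varying numFound.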
>>>> 
>>>> I would not ignore the issue - something must be off. It may somehow be 
>>>> user error, it may be a bug that has been fixed since the alpha, or it may 
>>>> be something new.
>>>> 
>>>> Are you sure every shard you are issuing the query *from* is active and 
>>>> live according to ZooKeeper? Eg when you look at the cloud admin view and 
>>>> look at the cluster visualization, are all the nodes green?
>>>> 
>>>> - Mark Miller
>>>> lucidimagination.com
>> 
>> - Mark Miller
>> lucidimagination.com

- Mark Miller
lucidimagination.com
