Gah! Don't you hate it when you spend days on something like this?

Slight clarification: _version_ is used for optimistic locking, not replication. Let's say you have two clients updating the same document and sending it to Solr at the same time. The _version_ field is filled out automagically and one of the updates will be rejected. Otherwise there'd be no good way to fail a document due to this kind of thing.
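For anyone who hasn't hit this before, here is a rough SolrJ sketch of that rejection. The collection URL, field names, and version value are made up for illustration; real code would read the current _version_ back from a query first.

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class VersionConflictSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical collection URL -- adjust for your own cluster.
            try (HttpSolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/journals_stage").build()) {

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                doc.addField("title", "second writer loses");
                // Supplying a _version_ turns on optimistic concurrency: if the stored
                // version no longer matches (someone else updated the doc first), Solr
                // rejects the update with an HTTP 409 "version conflict".
                doc.addField("_version_", 1578578283947098112L);

                try {
                    client.add(doc);
                    client.commit();
                } catch (HttpSolrClient.RemoteSolrException conflict) {
                    System.out.println("update rejected: " + conflict.getMessage());
                }
            }
        }
    }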
Thanks for letting us know what the problem really was.

Best,
Erick

On Fri, Sep 22, 2017 at 2:57 PM, Bill Oconnor <bocon...@plos.org> wrote:
>
> Thanks everyone for the responses.
>
> I believe I have found the problem.
>
> The type of _version_ is incorrect in our schema. This is a required field that is primarily used by Solr.
>
> Our schema has typed it as type=int instead of type=long.
>
> I believe that this number is used by the replication process to figure out what needs to be sync'd on an individual replica. In our case Solr puts the value in during indexing. It appears that Solr has chosen a number that cannot be represented by "int". As the replicas query the leader to determine whether a sync is necessary, the leader throws an error as it tries to format the response with the large _version_. This process continues until the replicas give up.
>
> I finally verified this by doing a simple query, _version_:*, which throws the same error but gives more helpful info: "re-index your documents".
>
> Thanks.
>
> ________________________________
> From: Rick Leir <rl...@leirtech.com>
> Sent: Friday, September 22, 2017 12:34:57 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Replicates not recovering after rolling restart
>
> Wunder, Erick
>
> $ dc
> 16o
> 1578578283947098112p
> 15E83C95E8D00000
>
> That is an interesting number. Is it, as a guess, machine instructions or an address pointer? It does not look like UTF-8 or ASCII. Machine code looks promising:
>
> Disassembly:
>
> 0: 15 e8 3c 95 e8    adc eax,0xe8953ce8
> 5: d0 00             rol BYTE PTR [rax],1
> ....
>
> ADC dest,src - modifies flags AF CF OF SF PF ZF; sums two binary operands, placing the result in the destination.
>
> ROL - Rotate Left
>
> Registers: the 64-bit extension of eax is called rax.
>
> Is that code possibly in the JVM executable? Or a random memory page?
>
> cheers -- Rick
>
> On 2017-09-20 07:21 PM, Walter Underwood wrote:
>> 1578578283947098112 needs 61 bits. Is it being parsed into a 32-bit target?
>>
>> That doesn’t explain where it came from, of course.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Sep 20, 2017, at 3:35 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> The NumberFormatException is...odd. Clearly that's too big a number for an integer; did anything in the underlying schema change?
>>>
>>> Best,
>>> Erick
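Walter's 61-bit arithmetic and Erick's "too big for an integer" hunch are easy to reproduce with nothing but the JDK; a quick sketch, using the value from the failing response:

    public class VersionOverflow {
        public static void main(String[] args) {
            String version = "1578578283947098112";  // the value from the 500 response below

            System.out.println(Long.parseLong(version));  // fine: fits in 64 bits (needs 61)
            System.out.println(Integer.MAX_VALUE);        // 2147483647 -- the ceiling for an int-typed field

            // The same failure the replicas are seeing: the value cannot be parsed as an int.
            try {
                Integer.parseInt(version);
            } catch (NumberFormatException e) {
                System.out.println(e);  // java.lang.NumberFormatException: For input string: "1578578283947098112"
            }
        }
    }

So any code path that funnels the value through Integer.parseInt, as an int-typed _version_ field does, fails in exactly this way.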
>>> On Wed, Sep 20, 2017 at 3:00 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>> Rolling restarts work fine for us. I often include installing new configs with that. Here is our script. Pass it any hostname in the cluster. I use the load balancer name. You’ll need to change the domain and the install directory of course.
>>>>
>>>> #!/bin/bash
>>>>
>>>> cluster=$1
>>>>
>>>> hosts=`curl -s "http://${cluster}:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json" | jq -r '.cluster.live_nodes[]' | sort`
>>>>
>>>> for host in $hosts
>>>> do
>>>>     host="${host}.cloud.cheggnet.com"
>>>>     echo restarting Solr on $host
>>>>     ssh $host 'cd /apps/solr6 ; sudo -u bin bin/solr stop; sudo -u bin bin/solr start -cloud -h `hostname`'
>>>> done
>>>>
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>>> On Sep 20, 2017, at 1:42 PM, Bill Oconnor <bocon...@plos.org> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Background:
>>>>>
>>>>> We have been successfully using Solr for over 5 years and we recently made the decision to move into SolrCloud. For the most part that has been easy, but we have repeated problems with our rolling restarts, where servers remain functional but stay in Recovery until they stop trying. We restarted because we increased the memory from 12GB to 16GB on the JVM.
>>>>>
>>>>> Does anyone have any insight as to what is going on here?
>>>>>
>>>>> Is there a special procedure I should use for starting and stopping hosts?
>>>>>
>>>>> Is it ok to do a rolling restart on all the nodes in a shard?
>>>>>
>>>>> Any insight would be appreciated.
>>>>>
>>>>> Configuration:
>>>>>
>>>>> We have a group of servers with multiple collections. Each collection consists of one shard and multiple replicas. We are running the latest stable version of SolrCloud, 6.6, on Ubuntu LTS and Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.8.0_66 25.66-b17.
>>>>>
>>>>> (collection) (shard) (replicas)
>>>>>
>>>>> journals_stage -> shard1 -> solr-220 (leader), solr-223, solr-221, solr-222 (replicas)
>>>>>
>>>>> Problem:
>>>>>
>>>>> Restarting the system puts the replicas in a recovery state they never exit from. They eventually give up after 500 tries. If I go to the individual replicas and execute a query, the data is still available.
>>>>>
>>>>> Using tcpdump I find the replicas sending this request to the leader (the leader appears to be active).
>>>>>
>>>>> The exchange goes like this:
>>>>>
>>>>> solr-220 is the leader.
>>>>>
>>>>> Solr-221 to Solr-220:
>>>>>
>>>>> 10:18:42.426823 IP solr-221:54341 > solr-220:8983:
>>>>>
>>>>> POST /solr/journals_stage_shard1_replica1/update HTTP/1.1
>>>>> Content-Type: application/x-www-form-urlencoded; charset=UTF-8
>>>>> User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
>>>>> Content-Length: 108
>>>>> Host: solr-220:8983
>>>>> Connection: Keep-Alive
>>>>>
>>>>> commit_end_point=true&openSearcher=false&commit=true&softCommit=false&waitSearcher=true&wt=javabin&version=2
>>>>>
>>>>> Solr-220 back to Solr-221:
>>>>>
>>>>> IP solr-220:8983 > solr-221:54341: Flags [P.], seq 1:5152, ack 385, win 235, options [nop,nop, TS val 858155553 ecr 858107069], length 5151
>>>>> ..HTTP/1.1 500 Server Error
>>>>> Content-Type: application/octet-stream
>>>>> Content-Length: 5060
>>>>>
>>>>> .responseHeader..&statusT..%QTimeC.%error..#msg?.For input string: "1578578283947098112".%trace?.&java.lang.NumberFormatException: For input string: "1578578283947098112"
>>>>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>     at java.lang.Integer.parseInt(Integer.java:583)
>>>>>     at java.lang.Integer.parseInt(Integer.java:615)
>>>>>     at org.apache.lucene.queries.function.docvalues.IntDocValues.getRangeScorer(IntDocValues.java:89)
>>>>>     at org.apache.solr.search.function.ValueSourceRangeFilter$1.iterator(ValueSourceRangeFilter.java:83)
>>>>>     at org.apache.solr.search.SolrConstantScoreQuery$ConstantWeight.scorer(SolrConstantScoreQuery.java:100)
>>>>>     at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:126)
>>>>>     at org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:400)
>>>>>     at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:381)
>>>>>     at org.apache.solr.update.DeleteByQueryWrapper$1.scorer(DeleteByQueryWrapper.java:90)
>>>>>     at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:709)
>>>>>     at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:267)
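For completeness, the verification step Bill describes (querying _version_:*) can also be run from SolrJ. This is only a sketch; the host and collection names are taken from the thread rather than anything verified, and Bill reports the query fails against the broken index with an error that even suggests re-indexing.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class VersionFieldCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical replica URL -- point this at one of your own nodes.
            try (HttpSolrClient client =
                     new HttpSolrClient.Builder("http://solr-221:8983/solr/journals_stage").build()) {

                // The query Bill used to verify the problem: any document with a _version_ value.
                SolrQuery query = new SolrQuery("_version_:*");
                query.setRows(1);
                query.setFields("id", "_version_");

                QueryResponse rsp = client.query(query);
                // Against the broken (int-typed) index this call errors out, per Bill's report.
                // After the field is re-typed as long and the documents are re-indexed, it
                // returns the stored 64-bit version values.
                System.out.println(rsp.getResults());
            }
        }
    }

Once _version_ is a long in the schema and the documents have been re-indexed, the same query should return normally instead of throwing.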