Gah! Don't you hate it when you spend days on something like this?

Slight clarification: _version_ is used for optimistic locking, not replication. Let's say you have two clients updating the same document and sending it to Solr at the same time. The _version_ field is filled out automagically and one of the updates will be rejected. Otherwise there'd be no good way to fail a document due to this kind of thing.
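For anyone who hasn't hit this before, here is a rough SolrJ sketch of that rejection. The collection URL, field names, and version value are made up for illustration; real code would read the current _version_ back from a query first.

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class VersionConflictSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical collection URL -- adjust for your own cluster.
            try (HttpSolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/journals_stage").build()) {

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                doc.addField("title", "second writer loses");
                // Supplying a _version_ turns on optimistic concurrency: if the stored
                // version no longer matches (someone else updated the doc first), Solr
                // rejects the update with an HTTP 409 "version conflict".
                doc.addField("_version_", 1578578283947098112L);

                try {
                    client.add(doc);
                    client.commit();
                } catch (HttpSolrClient.RemoteSolrException conflict) {
                    System.out.println("update rejected: " + conflict.getMessage());
                }
            }
        }
    }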
Thanks for letting us know what the problem really was.

Best,
Erick

On Fri, Sep 22, 2017 at 2:57 PM, Bill Oconnor <bocon...@plos.org> wrote:
>
> Thanks everyone for the responses.
>
> I believe I have found the problem.
>
> The type of _version_ is incorrect in our schema. This is a required field that is primarily used by Solr.
>
> Our schema has typed it as type=int instead of type=long.
>
> I believe that this number is used by the replication process to figure out what needs to be sync'd on an individual replica. In our case Solr puts the value in during indexing. It appears that Solr has chosen a number that cannot be represented by "int". As the replicas query the leader to determine whether a sync is necessary, the leader throws an error as it tries to format the response with the large _version_. This process continues until the replicas give up.
>
> I finally verified this by doing a simple query, _version_:*, which throws the same error but gives more helpful info: "re-index your documents".
>
> Thanks.
>
> ________________________________
> From: Rick Leir <rl...@leirtech.com>
> Sent: Friday, September 22, 2017 12:34:57 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Replicates not recovering after rolling restart
>
> Wunder, Erick
>
> $ dc
> 16o
> 1578578283947098112p
> 15E83C95E8D00000
>
> That is an interesting number. Is it, as a guess, machine instructions or an address pointer? It does not look like UTF-8 or ASCII. Machine code looks promising:
>
> Disassembly:
>
> 0: 15 e8 3c 95 e8    adc eax,0xe8953ce8
> 5: d0 00             rol BYTE PTR [rax],1
> ....
>
> ADC dest,src - modifies flags AF CF OF SF PF ZF; sums two binary operands, placing the result in the destination.
>
> ROL - Rotate Left
>
> Registers: the 64-bit extension of eax is called rax.
>
> Is that code possibly in the JVM executable? Or a random memory page?
>
> cheers -- Rick
>
> On 2017-09-20 07:21 PM, Walter Underwood wrote:
>> 1578578283947098112 needs 61 bits. Is it being parsed into a 32-bit target?
>>
>> That doesn’t explain where it came from, of course.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Sep 20, 2017, at 3:35 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> The NumberFormatException is...odd. Clearly that's too big a number for an integer; did anything in the underlying schema change?
>>>
>>> Best,
>>> Erick
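Walter's 61-bit arithmetic and Erick's "too big for an integer" hunch are easy to reproduce with nothing but the JDK; a quick sketch, using the value from the failing response:

    public class VersionOverflow {
        public static void main(String[] args) {
            String version = "1578578283947098112";  // the value from the 500 response below

            System.out.println(Long.parseLong(version));  // fine: fits in 64 bits (needs 61)
            System.out.println(Integer.MAX_VALUE);        // 2147483647 -- the ceiling for an int-typed field

            // The same failure the replicas are seeing: the value cannot be parsed as an int.
            try {
                Integer.parseInt(version);
            } catch (NumberFormatException e) {
                System.out.println(e);  // java.lang.NumberFormatException: For input string: "1578578283947098112"
            }
        }
    }

So any code path that funnels the value through Integer.parseInt, as an int-typed _version_ field does, fails in exactly this way.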
>>> On Wed, Sep 20, 2017 at 3:00 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>> Rolling restarts work fine for us. I often include installing new configs with that. Here is our script. Pass it any hostname in the cluster. I use the load balancer name. You’ll need to change the domain and the install directory of course.
>>>>
>>>> #!/bin/bash
>>>>
>>>> cluster=$1
>>>>
>>>> hosts=`curl -s "http://${cluster}:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json" | jq -r '.cluster.live_nodes[]' | sort`
>>>>
>>>> for host in $hosts
>>>> do
>>>>     host="${host}.cloud.cheggnet.com"
>>>>     echo restarting Solr on $host
>>>>     ssh $host 'cd /apps/solr6 ; sudo -u bin bin/solr stop; sudo -u bin bin/solr start -cloud -h `hostname`'
>>>> done
>>>>
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>>> On Sep 20, 2017, at 1:42 PM, Bill Oconnor <bocon...@plos.org> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Background:
>>>>>
>>>>> We have been successfully using Solr for over 5 years and we recently made the decision to move into SolrCloud. For the most part that has been easy, but we have repeated problems with our rolling restarts, where servers remain functional but stay in Recovery until they stop trying. We restarted because we increased the memory from 12GB to 16GB on the JVM.
>>>>>
>>>>> Does anyone have any insight as to what is going on here?
>>>>>
>>>>> Is there a special procedure I should use for starting and stopping hosts?
>>>>>
>>>>> Is it ok to do a rolling restart on all the nodes in a shard?
>>>>>
>>>>> Any insight would be appreciated.
>>>>>
>>>>> Configuration:
>>>>>
>>>>> We have a group of servers with multiple collections. Each collection consists of one shard and multiple replicas. We are running the latest stable version of SolrCloud, 6.6, on Ubuntu LTS and Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.8.0_66 25.66-b17.
>>>>>
>>>>> (collection) (shard) (replicas)
>>>>>
>>>>> journals_stage -> shard1 -> solr-220 (leader), solr-223, solr-221, solr-222 (replicas)
>>>>>
>>>>> Problem:
>>>>>
>>>>> Restarting the system puts the replicas in a recovery state they never exit from. They eventually give up after 500 tries. If I go to the individual replicas and execute a query, the data is still available.
>>>>>
>>>>> Using tcpdump I find the replicas sending this request to the leader (the leader appears to be active).
>>>>>
>>>>> The exchange goes like this:
>>>>>
>>>>> solr-220 is the leader.
>>>>>
>>>>> Solr-221 to Solr-220:
>>>>>
>>>>> 10:18:42.426823 IP solr-221:54341 > solr-220:8983:
>>>>>
>>>>> POST /solr/journals_stage_shard1_replica1/update HTTP/1.1
>>>>> Content-Type: application/x-www-form-urlencoded; charset=UTF-8
>>>>> User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
>>>>> Content-Length: 108
>>>>> Host: solr-220:8983
>>>>> Connection: Keep-Alive
>>>>>
>>>>> commit_end_point=true&openSearcher=false&commit=true&softCommit=false&waitSearcher=true&wt=javabin&version=2
>>>>>
>>>>> Solr-220 back to Solr-221:
>>>>>
>>>>> IP solr-220:8983 > solr-221:54341: Flags [P.], seq 1:5152, ack 385, win 235, options [nop,nop, TS val 858155553 ecr 858107069], length 5151
>>>>> ..HTTP/1.1 500 Server Error
>>>>> Content-Type: application/octet-stream
>>>>> Content-Length: 5060
>>>>>
>>>>> .responseHeader..&statusT..%QTimeC.%error..#msg?.For input string: "1578578283947098112".%trace?.&java.lang.NumberFormatException: For input string: "1578578283947098112"
>>>>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>     at java.lang.Integer.parseInt(Integer.java:583)
>>>>>     at java.lang.Integer.parseInt(Integer.java:615)
>>>>>     at org.apache.lucene.queries.function.docvalues.IntDocValues.getRangeScorer(IntDocValues.java:89)
>>>>>     at org.apache.solr.search.function.ValueSourceRangeFilter$1.iterator(ValueSourceRangeFilter.java:83)
>>>>>     at org.apache.solr.search.SolrConstantScoreQuery$ConstantWeight.scorer(SolrConstantScoreQuery.java:100)
>>>>>     at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:126)
>>>>>     at org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:400)
>>>>>     at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:381)
>>>>>     at org.apache.solr.update.DeleteByQueryWrapper$1.scorer(DeleteByQueryWrapper.java:90)
>>>>>     at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:709)
>>>>>     at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:267)
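For completeness, the verification step Bill describes (querying _version_:*) can also be run from SolrJ. This is only a sketch; the host and collection names are taken from the thread rather than anything verified, and Bill reports the query fails against the broken index with an error that even suggests re-indexing.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class VersionFieldCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical replica URL -- point this at one of your own nodes.
            try (HttpSolrClient client =
                     new HttpSolrClient.Builder("http://solr-221:8983/solr/journals_stage").build()) {

                // The query Bill used to verify the problem: any document with a _version_ value.
                SolrQuery query = new SolrQuery("_version_:*");
                query.setRows(1);
                query.setFields("id", "_version_");

                QueryResponse rsp = client.query(query);
                // Against the broken (int-typed) index this call errors out, per Bill's report.
                // After the field is re-typed as long and the documents are re-indexed, it
                // returns the stored 64-bit version values.
                System.out.println(rsp.getResults());
            }
        }
    }

Once _version_ is a long in the schema and the documents have been re-indexed, the same query should return normally instead of throwing.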