Thanks everyone for the responses.
I believe I have found the problem: the type of the _version_ field is incorrect in our schema. This is a required field used internally by Solr, and our schema types it as type=int instead of type=long. I believe this number is used by the replication process to figure out what needs to be synced on an individual replica; in our case Solr fills in the value during indexing. It appears that Solr has chosen a number that cannot be represented by an "int". As the replicas query the leader to determine whether a sync is necessary, the leader throws an error while formatting the response containing the large _version_ value. This process continues until the replicas give up.

I finally verified this with a simple query, _version_:*, which throws the same error but gives a more helpful message: "re-index your documents". A sketch of the corrected field definition is appended at the bottom of this message, below the quoted thread.

Thanks.

________________________________
From: Rick Leir <rl...@leirtech.com>
Sent: Friday, September 22, 2017 12:34:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Replicates not recovering after rolling restart

Wunder, Erick

$ dc
16o
1578578283947098112p
15E83C95E8D00000

That is an interesting number. Is it, as a guess, machine instructions or an address pointer? It does not look like UTF-8 or ASCII. Machine code looks promising:

Disassembly:
0: 15 e8 3c 95 e8    adc eax,0xe8953ce8
5: d0 00             rol BYTE PTR [rax],1
....

ADC dest,src
Modifies flags: AF CF OF SF PF ZF
Sums two binary operands, placing the result in the destination.

ROL - Rotate Left

Registers: the 64-bit extension of eax is called rax.

Is that code possibly in the JVM executable? Or a random memory page.

cheers -- Rick

On 2017-09-20 07:21 PM, Walter Underwood wrote:
> 1578578283947098112 needs 61 bits. Is it being parsed into a 32-bit target?
>
> That doesn’t explain where it came from, of course.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
>> On Sep 20, 2017, at 3:35 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> The NumberFormatException is... odd. Clearly that's too big a number
>> for an integer. Did anything in the underlying schema change?
>>
>> Best,
>> Erick
>>
>> On Wed, Sep 20, 2017 at 3:00 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>> Rolling restarts work fine for us. I often include installing new configs
>>> with that. Here is our script. Pass it any hostname in the cluster; I use
>>> the load balancer name. You’ll need to change the domain and the install
>>> directory, of course.
>>>
>>> #!/bin/bash
>>>
>>> cluster=$1
>>>
>>> hosts=`curl -s "http://${cluster}:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json" | jq -r '.cluster.live_nodes[]' | sort`
>>>
>>> for host in $hosts
>>> do
>>>     host="${host}.cloud.cheggnet.com"
>>>     echo restarting Solr on $host
>>>     ssh $host 'cd /apps/solr6 ; sudo -u bin bin/solr stop; sudo -u bin bin/solr start -cloud -h `hostname`'
>>> done
>>>
>>>
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>
>>>> On Sep 20, 2017, at 1:42 PM, Bill Oconnor <bocon...@plos.org> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Background:
>>>>
>>>> We have been successfully using Solr for over 5 years, and we recently made
>>>> the decision to move to SolrCloud. For the most part that has been easy,
>>>> but we have repeated problems with our rolling restarts, where servers remain
>>>> functional but stay in recovery until they stop trying. We restarted
>>>> because we increased the memory on the JVM from 12GB to 16GB.
>>>>
>>>> Does anyone have any insight as to what is going on here?
>>>>
>>>> Is there a special procedure I should use for starting or stopping a host?
>>>>
>>>> Is it OK to do a rolling restart on all the nodes in a shard?
>>>>
>>>> Any insight would be appreciated.
>>>>
>>>>
>>>> Configuration:
>>>>
>>>> We have a group of servers with multiple collections. Each collection
>>>> consists of one shard and multiple replicas. We are running the latest
>>>> stable version of SolrCloud (6.6) on Ubuntu LTS and Oracle Corporation
>>>> Java HotSpot(TM) 64-Bit Server VM 1.8.0_66 25.66-b17.
>>>>
>>>> (collection) -> (shard) -> (replicas)
>>>> journals_stage -> shard1 -> solr-220 (leader), solr-223, solr-221, solr-222 (replicas)
>>>>
>>>>
>>>> Problem:
>>>>
>>>> Restarting the system puts the replicas in a recovery state they never
>>>> exit from. They eventually give up after 500 tries. If I go to the
>>>> individual replicas and execute a query, the data is still available.
>>>>
>>>> Using tcpdump I find the replicas sending this request to the leader
>>>> (the leader appears to be active).
>>>>
>>>> The exchange goes like this:
>>>>
>>>> solr-220 is the leader.
>>>>
>>>> solr-221 to solr-220:
>>>>
>>>> 10:18:42.426823 IP solr-221:54341 > solr-220:8983:
>>>>
>>>> POST /solr/journals_stage_shard1_replica1/update HTTP/1.1
>>>> Content-Type: application/x-www-form-urlencoded; charset=UTF-8
>>>> User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
>>>> Content-Length: 108
>>>> Host: solr-220:8983
>>>> Connection: Keep-Alive
>>>>
>>>> commit_end_point=true&openSearcher=false&commit=true&softCommit=false&waitSearcher=true&wt=javabin&version=2
>>>>
>>>> solr-220 back to solr-221:
>>>>
>>>> IP solr-220:8983 > solr-221:54341: Flags [P.], seq 1:5152, ack 385, win 235,
>>>> options [nop,nop, TS val 858155553 ecr 858107069], length 5151
>>>> ..HTTP/1.1 500 Server Error
>>>> Content-Type: application/octet-stream
>>>> Content-Length: 5060
>>>>
>>>> .responseHeader..&statusT..%QTimeC.%error..#msg?.For input string:
>>>> "1578578283947098112".%trace?.&java.lang.NumberFormatException: For
>>>> input string: "1578578283947098112"
>>>> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>> at java.lang.Integer.parseInt(Integer.java:583)
>>>> at java.lang.Integer.parseInt(Integer.java:615)
>>>> at org.apache.lucene.queries.function.docvalues.IntDocValues.getRangeScorer(IntDocValues.java:89)
>>>> at org.apache.solr.search.function.ValueSourceRangeFilter$1.iterator(ValueSourceRangeFilter.java:83)
>>>> at org.apache.solr.search.SolrConstantScoreQuery$ConstantWeight.scorer(SolrConstantScoreQuery.java:100)
>>>> at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:126)
>>>> at org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:400)
>>>> at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:381)
>>>> at org.apache.solr.update.DeleteByQueryWrapper$1.scorer(DeleteByQueryWrapper.java:90)
>>>> at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:709)
>>>> at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:267)
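
For reference, here is a minimal sketch of how _version_ is typically declared in a stock Solr 6.x managed-schema. This is an assumption based on the default example schema; the exact indexed/stored/docValues attributes in your schema may differ. The essential point is that the field type backing _version_ must be a 64-bit long (solr.TrieLongField in 6.x), not an int: 1578578283947098112 is far larger than Integer.MAX_VALUE (2147483647), which is why the Integer.parseInt calls in the trace above fail, while the value fits easily in a long.

<!-- 64-bit long field type (type name "long" assumed; Solr 6.x ships solr.TrieLongField) -->
<fieldType name="long" class="solr.TrieLongField" docValues="true" precisionStep="0" positionIncrementGap="0"/>

<!-- _version_ must be backed by the long type, not an int type -->
<field name="_version_" type="long" indexed="false" stored="false"/>

As the error message itself suggests, the existing documents were indexed with the int-based type, so a full re-index of the collection is needed after correcting the schema.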