Thanks, everyone, for the responses.

I do not think we changed anything other than the JVM memory size. I did leave out one piece of info: one of the hosts is also a replicate in another collection's shard.

collection1 -> shard1 -> *h1, h2, h3, h4
collection2 -> shard1 -> *h5, h3

(* marks the leader)

When I restart, *h1 comes up fine. h2, h3, and h4 go into recovery but still respond to requests. *h1 starts getting the POSTs from the recovering servers and responds with the 500 Server Error until the servers quit trying. h3 is active and fine in collection2 even though it is recovering in collection1.

This happened before, and I resolved it by deleting the collection and then creating a new one.

I restart using the standard "sudo service solr start/stop".

I have to say I am not comfortable with having multiple shards share the same host. The production servers will not be configured this way, but these servers are for development.

________________________________
From: Erick Erickson <erickerick...@gmail.com>
Sent: Wednesday, September 20, 2017 3:35:16 PM
To: solr-user
Subject: Re: Replicates not recovering after rolling restart

The NumberFormatException is... odd. Clearly that's too big a number for an integer. Did anything in the underlying schema change?

Best,
Erick

On Wed, Sep 20, 2017 at 3:00 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> Rolling restarts work fine for us. I often include installing new configs
> with that. Here is our script. Pass it any hostname in the cluster. I use the
> load balancer name. You'll need to change the domain and the install
> directory of course.
>
> #!/bin/bash
>
> cluster=$1
>
> hosts=`curl -s "http://${cluster}:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json" | jq -r '.cluster.live_nodes[]' | sort`
>
> for host in $hosts
> do
>     host="${host}.cloud.cheggnet.com"
>     echo restarting Solr on $host
>     ssh $host 'cd /apps/solr6 ; sudo -u bin bin/solr stop; sudo -u bin bin/solr start -cloud -h `hostname`'
> done
>
>
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
>> On Sep 20, 2017, at 1:42 PM, Bill Oconnor <bocon...@plos.org> wrote:
>>
>> Hello,
>>
>> Background:
>>
>> We have been successfully using Solr for over 5 years and we recently made
>> the decision to move into SolrCloud. For the most part that has been easy,
>> but we have repeated problems with our rolling restart where servers remain
>> functional but stay in Recovery until they stop trying. We restarted because
>> we increased the memory from 12GB to 16GB on the JVM.
>>
>> Does anyone have any insight as to what is going on here?
>>
>> Is there a special procedure I should use for starting and stopping a host?
>>
>> Is it ok to do a rolling restart on all the nodes in a shard?
>>
>> Any insight would be appreciated.
>>
>> Configuration:
>>
>> We have a group of servers with multiple collections. Each collection
>> consists of one shard and multiple replicates. We are running the latest
>> stable version of SolrCloud 6.6 on Ubuntu LTS and Oracle Corporation Java
>> HotSpot(TM) 64-Bit Server VM 1.8.0_66 25.66-b17
>>
>> (collection) (shard) (replicates)
>>
>> journals_stage -> shard1 -> solr-220 (leader), solr-223, solr-221,
>> solr-222 (replicates)
>>
>> Problem:
>>
>> Restarting the system puts the replicates in a recovery state they never
>> exit from. They eventually give up after 500 tries. If I go to the
>> individual replicates and execute a query the data is still available.
>>
>> Using tcpdump I find the replicates sending this request to the leader (the
>> leader appears to be active).
>>
>> The exchange goes like this:
>>
>> solr-220 is the leader.
>>
>> solr-221 to solr-220:
>>
>> 10:18:42.426823 IP solr-221:54341 > solr-220:8983:
>>
>> POST /solr/journals_stage_shard1_replica1/update HTTP/1.1
>> Content-Type: application/x-www-form-urlencoded; charset=UTF-8
>> User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
>> Content-Length: 108
>> Host: solr-220:8983
>> Connection: Keep-Alive
>>
>> commit_end_point=true&openSearcher=false&commit=true&softCommit=false&waitSearcher=true&wt=javabin&version=2
>>
>> solr-220 back to solr-221:
>>
>> IP solr-220:8983 > solr-221:54341: Flags [P.], seq 1:5152, ack 385, win 235,
>> options [nop,nop,TS val 858155553 ecr 858107069], length 5151
>> ..HTTP/1.1 500 Server Error
>> Content-Type: application/octet-stream
>> Content-Length: 5060
>>
>> .responseHeader..&statusT..%QTimeC.%error..#msg?.For input string:
>> "1578578283947098112".%trace?.&java.lang.NumberFormatException: For
>> input string: "1578578283947098112"
>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>     at java.lang.Integer.parseInt(Integer.java:583)
>>     at java.lang.Integer.parseInt(Integer.java:615)
>>     at org.apache.lucene.queries.function.docvalues.IntDocValues.getRangeScorer(IntDocValues.java:89)
>>     at org.apache.solr.search.function.ValueSourceRangeFilter$1.iterator(ValueSourceRangeFilter.java:83)
>>     at org.apache.solr.search.SolrConstantScoreQuery$ConstantWeight.scorer(SolrConstantScoreQuery.java:100)
>>     at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:126)
>>     at org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:400)
>>     at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:381)
>>     at org.apache.solr.update.DeleteByQueryWrapper$1.scorer(DeleteByQueryWrapper.java:90)
>>     at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:709)
>>     at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:267)
>>
>
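
For what it's worth, the point about the number being too big for an integer is easy to sanity-check. The trace shows the string going through Integer.parseInt, and Java's int tops out at 2147483647. Below is a minimal bash sketch of that comparison. The function name java_type_for is made up for illustration, and the check assumes bash's usual 64-bit arithmetic and a non-negative input that itself fits in 64 bits.

```bash
#!/bin/bash

# Value copied from the NumberFormatException message in the trace.
value=1578578283947098112

# Upper bound of Java's 32-bit int (Integer.MAX_VALUE).
int_max=2147483647

# Hypothetical helper: report whether a non-negative decimal value fits
# in a Java int or needs a long. Relies on bash's 64-bit arithmetic, so
# the input must itself fit in a signed 64-bit integer.
java_type_for() {
  local n=$1
  if (( n <= int_max )); then
    echo int
  else
    echo long
  fi
}

echo "$value needs a Java $(java_type_for "$value")"
```

This only confirms that the value is out of int range. It says nothing about why an int-typed range check is being applied to it in the first place.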