problem with replication/solrcloud - getting 'missing required field' during update intermittently (SOLR-6251)
Issue was closed in Jira requesting it be discussed here first. Looking for any diagnostic assistance on this issue with 4.8.0, since it is intermittent and occurs without warning.

Setup is two nodes, with an external ZK ensemble. Nodes are accessed round-robin on EC2 behind an ELB. Schema has: ... omitNorms="true" /> ...

Most of the updates are working without issue, but randomly we'll get the above failure, even though searches before and after the update clearly indicate that the document had the timestamp field in it. The error occurs when the second node does its distrib operation against the first node. Diagnostic details are all in the Jira issue. Can provide more as needed, but would appreciate any suggestions on what to try or how to diagnose this, other than just throwing thousands of requests at it in round-robin between the two instances to see if the issue can be reproduced.

-- Nathan

--------
Nathan Neulinger                       nn...@neulinger.org
Neulinger Consulting                   (573) 612-1412
Re: problem with replication/solrcloud - getting 'missing required field' during update intermittently (SOLR-6251)
FYI. We finally tracked down the problem (at least 99.9% sure at this point), and it was staring me in the face the whole time; I just never noticed:

[{"id":"4b2c4d09-31e2-4fe2-b767-3868efbdcda1","channel": {"add": "preet"},"channel": {"add": "adam"}}]

Look at the JSON... It's trying to add two "channel" elements in the same object. Should have been:

[{"id":"4b2c4d09-31e2-4fe2-b767-3868efbdcda1","channel": {"add": "preet"}},
 {"id":"4b2c4d09-31e2-4fe2-b767-3868efbdcda1","channel": {"add": "adam"}}]

I half wonder how it chose to interpret that particular chunk of JSON, but either way, I think the origin of our issue is resolved. From what I'm reading, duplicate object keys are discouraged in JSON (the spec only says names "should" be unique), so what happens with them is parser-dependent. I'm guessing that Solr doesn't reject the duplicate key, and its parser just creates something weird in that situation, like a new request for a whole new document.

-- Nathan

On 07/15/2014 07:19 PM, Nathan Neulinger wrote:
> Issue was closed in Jira requesting it be discussed here first. Looking
> for any diagnostic assistance on this issue with 4.8.0 since it is
> intermittent and occurs without warning. [...]
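For what it's worth, most JSON parsers quietly accept duplicate keys and keep only one of the values, which is why nothing flagged the malformed update on the client side. A quick illustration in Python (this demonstrates generic parser behavior, not Solr's own parser, which may interpret the duplicate differently):

```python
import json

# Duplicate "channel" keys in one object: syntactically accepted,
# but Python silently keeps only the last value.
doc = '{"id": "doc1", "channel": {"add": "preet"}, "channel": {"add": "adam"}}'
parsed = json.loads(doc)
print(parsed)  # {'id': 'doc1', 'channel': {'add': 'adam'}}

# The unambiguous form: one update object per change.
docs = '[{"id": "doc1", "channel": {"add": "preet"}},' \
       ' {"id": "doc1", "channel": {"add": "adam"}}]'
print(len(json.loads(docs)))  # 2
```

So a client-side sanity check that merely round-trips the payload through a standard JSON library would not have caught this; the first "channel" value simply disappears.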
/solr/admin/ping causing exceptions in log?
Recently deployed haproxy in front of my Solr instances, and now seeing a large number of exceptions in the logs... Example below. I can pound the server with requests against /solr/admin/ping via curl with no obvious issue, but the haproxy checks appear to be aggravating something.

Solr 4.8.0 w/ SolrCloud, 2 nodes, 3 ZK, Linux x86_64.

It seems like when the issue occurs, I get a set of the errors all in a burst (below), never just one.

Suggestions?

-- Nathan

2014-07-26 23:04:36,506 ERROR qtp1532385072-4864 [g.apache.solr.servlet.SolrDispatchFilter] - null:org.eclipse.jetty.io.EofException
        at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)
        at org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:443)
        at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:100)
        at org.eclipse.jetty.server.AbstractHttpConnection$Output.flush(AbstractHttpConnection.java:1094)
        at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:297)
        at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
        at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
        at org.apache.solr.util.FastWriter.flush(FastWriter.java:137)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:763)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:431)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:339)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:368)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.eclipse.jetty.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:375)
        at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:164)
        at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:194)
        at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)
        ... 36 more
2014-07-26 23:04:36,513 ERROR qtp1532385072-4864 [g.apache.solr.servlet.SolrDispatchFilter] - null:org.eclipse.jetty.io.EofException
Re: /solr/admin/ping causing exceptions in log?
Tried changing to use /solr/admin/cores instead as a test - still see the same issue, though much less frequent.

-- Nathan

On Sat, Jul 26, 2014 at 6:15 PM, Nathan Neulinger wrote:
> Recently deployed haproxy in front of my solr instances, and seeing a
> large number of exceptions in the logs now... Example below. I can pound
> the server with requests against /solr/admin/ping via curl, with no obvious
> issue, but the haproxy checks appear to be aggravating something.
>
> Solr 4.8.0 w/ solr cloud, 2 nodes, 3 zk, linux x86_64
>
> It seems like when the issue occurs, I get a set of the errors all in a
> burst (below), never just one.
>
> Suggestions?
Re: /solr/admin/ping causing exceptions in log?
Cool. That's likely exactly it - since I don't have one set, it's using the check interval, and occasionally that must just be too short. Thank you!

-- Nathan

> I assume that this is the httpchk config to make sure that the server is
> operational. If so, you need to increase the "timeout check" value, because
> it is too small. The ping request is taking longer to run than you have
> allowed in the timeout. Here's part of my haproxy config: ...
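For readers hitting the same symptom: a backend stanza with an explicit health-check timeout might look like the following. This is a hypothetical sketch (hostnames, ports, and interval values are made up for illustration), not the config from the message above, which the list archive trimmed:

```
backend solr
    mode http
    # Health check against Solr's ping handler
    option httpchk GET /solr/admin/ping
    # Allow checks up to 5s instead of inheriting the "inter" interval
    timeout check 5s
    server solr1 10.0.0.11:8983 check inter 2s
    server solr2 10.0.0.12:8983 check inter 2s
```

Without "timeout check", haproxy bounds each check by the check interval, so a slow ping can be cut off even though the server is healthy.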
Re: /solr/admin/ping causing exceptions in log?
Unfortunately, it doesn't look like this clears the symptom. The ping is responding almost instantly every time. I've tried setting a 15 second timeout on the check, with no change in occurrences of the error.

Looking at a packet capture on the server side, there is a clear distinction between working and failing/error-triggering connections. In a "working" case, I see two packets immediately back to back (one with the header, and next a continuation with content) with no ack in between, followed by ack, rst+ack, rst. In the failing request, I see the GET request, acked, then the HTTP/1.1 200 OK response from Solr, a single ack, and then an almost instantaneous reset sent by the client.

I'm only seeing this on traffic to/from haproxy checks. If I do a simple:

    while true; do curl -s http://host:8983/solr/admin/ping; done

from the same box, that flood runs with generally 10-20ms request times and zero errors.

-- Nathan

On 07/27/2014 07:12 PM, Nathan Neulinger wrote:
> Cool. That's likely exactly it, since I don't have one set, it's using
> the check interval, and occasionally must just be too short. Thank you!

Attachments:
solr-working.cap (application/vnd.tcpdump.pcap)
solr-cutoff2.cap (application/vnd.tcpdump.pcap)
Re: /solr/admin/ping causing exceptions in log?
Either way, looks like this is not a Solr issue, but rather haproxy. Thanks.

-- Nathan

On 07/27/2014 08:23 PM, Nathan Neulinger wrote:
> Unfortunately, doesn't look like this clears the symptom. The ping is
> responding almost instantly every time. I've tried setting a 15 second
> timeout on the check, with no change in occurrences of the error.
> [...]
> I'm only seeing this on traffic to/from haproxy checks.
Re: /solr/admin/ping causing exceptions in log?
Thing is - I wouldn't expect any of the default options mentioned to change the behavior intermittently. I.e., it's working for 95% of the health check requests; it's just the intermittent ones that seem to be cut off... I'm inquiring with the haproxy devs, since it appears that at least one other person on #haproxy is seeing the same behavior. It doesn't appear to be specific to Solr.

-- Nathan

On 07/27/2014 10:44 PM, Shawn Heisey wrote:
> On 7/27/2014 7:23 PM, Nathan Neulinger wrote:
>> Unfortunately, doesn't look like this clears the symptom. The ping is
>> responding almost instantly every time. I've tried setting a 15 second
>> timeout on the check, with no change in occurrences of the error.
>> [...]
>
> I won't claim to understand what's going on here, but it might be a
> matter of the haproxy options.
> Here are the options I'm using in the "defaults" section of the config:
>
>     defaults
>         log             global
>         mode            http
>         option          httplog
>         option          dontlognull
>         option          redispatch
>         option          abortonclose
>         option          http-server-close
>         option          http-pretend-keepalive
>         retries         1
>         maxconn         1024
>         timeout connect 1s
>         timeout client  5s
>         timeout server  30s
>
> One bit of information I came across when I first started setting haproxy
> up for Solr is that servlet containers like Jetty and Tomcat require the
> "http-pretend-keepalive" option to work properly. Are you using this
> option?
>
> Thanks,
> Shawn
What is the "right" way to bring a failed SolrCloud node back online?
I have an environment where new collections are being added frequently (isolated per customer), and any backup is virtually guaranteed to be missing some of them. As it stands, bringing up the restored/out-of-date instance results in those collections being stuck in 'Recovering' state, because the cores don't exist on the resulting server. This can also be extended to the case of restoring a completely blank instance.

Is there any way to tell SolrCloud "try recreating any missing cores for this collection based on where you know they should be located"? Or do I need to actually determine a list of cores (..._shardX_replicaY) and trigger the core creates myself, at which point I gather that it will start recovery for each of them?

-- Nathan
Replica not consistent after update request?
How can we issue an update request and be certain that all of the replicas in the SolrCloud cluster are up to date?

I found this post: http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/79886 which seems to indicate that all replicas for a shard must finish/succeed before it returns to the client that the operation succeeded - but we've been seeing behavior lately (until we configured automatic soft commits) where the replicas were almost always "not current", i.e. the replicas were missing documents/etc.

Is this something wrong with our cloud setup/replication, or am I misinterpreting the way that updates in a cloud deployment are supposed to function? If it's a problem with our cloud setup, do you have any suggestions on diagnostics? Alternatively, are we perhaps just using it wrong?

-- Nathan
Re: Replica not consistent after update request?
Wow, the detail in that jira issue makes my brain hurt... Great to see it's got a quick answer/fix! Thank you!

-- Nathan

On 01/24/2014 09:43 PM, Joel Bernstein wrote:
> If you're on Solr 4.6 then this is likely the issue:
> https://issues.apache.org/jira/browse/SOLR-4260. The issue is resolved
> for Solr 4.6.1 which should be out next week.
>
> Joel Bernstein
> Search Engineer at Heliosearch
Re: Replica not consistent after update request?
It's 4.6.0. Pair of servers with an external 3-node ZK ensemble.

SOLR-4260 looks like a very promising answer. Will check it out as soon as 4.6.1 is released. May also check out the nightly builds, since this is still just development/prototype usage.

-- Nathan

On 01/24/2014 09:45 PM, Anshum Gupta wrote:
> Hi Nathan,
>
> It'd be great to have more information about your setup. Solr version?
> Depending upon your version, you might want to also look at:
> https://issues.apache.org/jira/browse/SOLR-4260 (which is now fixed).
Re: Replica not consistent after update request?
Ok, so our issue sounds like a combination of not having soft commits properly configured, combined with SOLR-4260. Thanks everyone!

On 01/24/2014 11:04 PM, Erick Erickson wrote:
> Right. The updates are guaranteed to be on the replicas and in their
> transaction logs. That doesn't mean they're searchable, however. For a
> document to be found in a search there must be a commit, either soft, or
> hard with openSearcher=true. Here's a post that outlines all this. If you
> still have discrepancies after commits, that's a problem.
>
> Best,
> Erick
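For anyone finding this thread later, the soft-commit configuration Erick refers to lives in solrconfig.xml. A sketch of the relevant elements (the interval values here are illustrative, not a recommendation; tune them for your latency and durability needs):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flushes the tlog to the index on disk;
       openSearcher=false keeps it from being a visibility event. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: makes recent updates visible to searches
       without the cost of a full hard commit. -->
  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

With no soft (or opening) commit configured, replicas can hold an update durably in their transaction logs while searches against them still miss it, which matches the "not current" behavior described above.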
Re: What is the "right" way to bring a failed SolrCloud node back online?
Thanks, yeah, I did just that - and sent the script in on SOLR-5665 if anyone wants a copy. The script is trivial, but you're welcome to stick it in contrib or something if it's at all useful to anyone.

-- Nathan

On 01/26/2014 08:28 AM, Mark Miller wrote:
> We are working on a new mode (which should become the default) where
> ZooKeeper will be treated as the truth for a cluster. This mode will be
> able to handle situations like this - if the cluster state says a core
> should exist on a node and it doesn't, it will be created on startup.
>
> The way things work currently is this kind of hybrid situation where the
> truth is partly in ZooKeeper and partly on each node. This is not ideal
> at all. I think this new mode is very important, and it will be coming
> shortly.
>
> Until then, I'd recommend writing this logic externally as you suggest
> (I've seen it done before).
>
> - Mark
> http://about.me/markrmiller
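The external logic Mark describes amounts to calling the CoreAdmin CREATE action for each replica that the cluster state says should live on the restored node; in SolrCloud, a core created with its collection and shard specified registers itself and recovers from the shard leader. A hedged sketch (host, collection, shard, and core names are all placeholders, and the script only prints the command rather than running it):

```shell
#!/bin/sh
# Placeholders for the restored node and one replica it should host.
HOST="localhost:8983"
COLLECTION="customer_a"
SHARD="shard1"
CORE="${COLLECTION}_${SHARD}_replica2"

# Print the CoreAdmin CREATE call; run one of these per missing core.
echo "curl 'http://${HOST}/solr/admin/cores?action=CREATE&name=${CORE}&collection=${COLLECTION}&shard=${SHARD}'"
```

In practice you would first diff clusterstate.json against the cores present on disk to build the list of missing `..._shardX_replicaY` names.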
Why does CLUSTERSTATUS return different information than the web cloud view?
In particular, a replica being 'active' vs. 'gone'. The web UI is clearly showing the given replicas as being in "Gone" state when I shut down a server, yet CLUSTERSTATUS says that each replica has state: "active". Is there any way to ask it for status that will reflect that the replica is gone? This is with 4.8.0.

-- Nathan
Re: Why does CLUSTERSTATUS return different information than the web cloud view?
Is there a way to query the 'live node' state without sending a query to every node myself? I.e., to get the same data that is used for that cloud status screen?

-- Nathan

On 08/23/2014 06:39 PM, Mark Miller wrote:
> The state is actually a combo of the state in clusterstate and the live
> nodes. If the live node is not there, it's gone regardless of the last
> state it published.
>
> - Mark
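Mark's combination rule is easy to apply client-side. A sketch in Python: the helper function is hypothetical, but its inputs mirror the shape of the CLUSTERSTATUS response (each replica carries "state" and "node_name", and the cluster section lists "live_nodes"); fetch that JSON from any one node and post-process it like this:

```python
def effective_replica_state(replica, live_nodes):
    """A replica's published state only counts if its node is live;
    if the node is absent from live_nodes, the replica is gone."""
    if replica["node_name"] not in live_nodes:
        return "gone"
    return replica["state"]

# Example inputs shaped like a CLUSTERSTATUS response (hypothetical values).
live_nodes = ["10.0.0.11:8983_solr"]
up = {"state": "active", "node_name": "10.0.0.11:8983_solr"}
down = {"state": "active", "node_name": "10.0.0.12:8983_solr"}

print(effective_replica_state(up, live_nodes))    # active
print(effective_replica_state(down, live_nodes))  # gone
```

This reproduces what the cloud status screen shows: the "active" published by a now-dead node is overridden by its absence from the live nodes list.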