57 million queries later, with constant indexing going on, 9 dummy collections in the mix, and the main collection I'm querying having 2 shards with 2 replicas each, I have no errors.
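(For reference, a 2x2 collection like the one above can be stood up through SolrJ's Collections API. This is only a sketch of that step; SolrJ 7.x, a node on localhost:8981, and the _default configset are my assumptions, not details from the thread:)

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateTestCollection {
    public static void main(String[] args) throws Exception {
        // Admin requests go to the node's base URL, not a collection URL
        try (SolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8981/solr").build()) {
            // numShards=2, replicationFactor=2: the topology described above
            CollectionAdminRequest.createCollection("eoe", "_default", 2, 2)
                                  .process(client);
        }
    }
}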
So unless this code doesn't exercise the path you're hitting, I'm not sure what more I can test. "It works on my machine" ;) Here's my querying code, does it look like what you're seeing?

while (Main.allStop.get() == false) {
    // open a brand-new connection for every query
    try (SolrClient client = new HttpSolrClient.Builder()
            // .withBaseSolrUrl("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")
            .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
        String lower = Integer.toString(rand.nextInt(1_000_000));
        SolrDocument rsp = client.getById(lower);
        if (rsp == null) {
            System.out.println("Got a null response!");
            Main.allStop.set(true);
            continue; // don't dereference the null response below
        }
        rsp = client.getById(lower);
        if (rsp.get("id").equals(lower) == false) {
            System.out.println("Got an invalid response, looking for " + lower
                + " got: " + rsp.get("id"));
            Main.allStop.set(true);
        }
        long queries = Main.eoeCounter.incrementAndGet();
        if ((queries % 100_000) == 0) {
            long seconds = (System.currentTimeMillis() - Main.start) / 1000;
            System.out.println("Query count: " + numFormatter.format(queries)
                + ", rate is " + numFormatter.format(queries / seconds) + " QPS");
        }
    } catch (Exception cle) {
        cle.printStackTrace();
        Main.allStop.set(true);
    }
}
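(One variant I have not run, sketched here only: the loop above pins HttpSolrClient to a single node, whereas a CloudSolrClient routes each request through ZooKeeper and exercises the cluster-aware path instead. The ZK address below is the default embedded-ZK port and is an assumption:)

import java.util.Collections;
import java.util.Optional;
import java.util.Random;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrDocument;

Random rand = new Random();
try (CloudSolrClient cloud = new CloudSolrClient.Builder(
        Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
    cloud.setDefaultCollection("eoe");
    // same real-time-get probe as the loop above, but ZooKeeper-routed
    SolrDocument doc = cloud.getById(Integer.toString(rand.nextInt(1_000_000)));
}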
On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson <erickerick...@gmail.com> wrote:
>
> Steve:
>
> bq. Basically, one core had data in it that should belong to another
> core. Here's my question about this: Is it possible that two requests to
> the /get API coming in at the same time would get confused and either
> both get the same result, or get their results swapped?
>
> Well, that shouldn't be happening; these are all supposed to be thread-safe
> calls.... All things are possible of course ;)
>
> If two replicas of the same shard have different documents, that could account
> for what you're seeing, while raising the question of why that is the case,
> since it should never be true for a quiescent index. Technically there _are_
> conditions where this is true on a very temporary basis: commits on the leader
> and follower can trigger at different wall-clock times. Say your soft commit
> (or hard commit with openSearcher=true) interval is 10 seconds. It should
> never be the case that s1r1 and s1r2 are out of sync 10 seconds after the
> last update was sent. This doesn't seem likely from what you've described,
> though...
>
> Hmmmm. I guess one other thing I can set up is to have a bunch of dummy
> collections lying around. Currently I have only the active one, and if
> there's some code path whereby the RTG request goes to a replica of a
> different collection, my test setup wouldn't reproduce it.
>
> Currently, I'm running a 2-shard, 1-replica setup, so if there's some way
> that the replicas get out of sync, that wouldn't show either.
>
> So I'm starting another run with these changes:
>
> - opening a new connection for each query
> - switching the collection I'm querying to 2x2
> - adding some dummy collections that are empty
>
> One nit: while "core" is exactly correct, when we talk about a core that's
> part of a collection we try to use "replica", to be clear we're talking
> about a core with some added characteristics, i.e. we're in SolrCloud-land.
> No big deal of course....
>
> Best,
> Erick
>
> On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > @Shawn
> > > We're running two instances on one machine for two reasons:
> > > 1. The box has plenty of resources (48 cores / 256GB RAM), and since
> > > I was reading that it's not recommended to use more than 31GB of heap
> > > in Solr, we figured 96 GB for keeping index data in OS cache plus
> > > 31 GB of heap per instance was a good idea.
> >
> > Do you know that these Solr instances actually DO need 31 GB of heap, or
> > are you following advice from somewhere saying "use one quarter of your
> > memory as the heap size"? That advice is not in the Solr documentation,
> > and never will be. Figuring out the right heap size requires
> > experimentation.
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> >
> > How big (on disk) is each of these nine cores, and how many documents
> > are in each one? Which of them is in each Solr instance? With that
> > information, we can make a *guess* about how big your heap should be.
> > Figuring out whether the guess is correct generally requires careful
> > analysis of a GC log.
> >
> > > 2. We're in a testing phase, so we wanted a SolrCloud configuration;
> > > we will most likely have a much bigger deployment once we go to
> > > production. In prod right now, we run a six-machine Riak cluster.
> > > Riak is a key/value document store that has Solr built in for search,
> > > but we are trying to push the key/value aspect of Riak into Solr.
> > > That way we would have one less piece to worry about in our system.
> >
> > Solr is not a database. It is not intended to be a data repository.
> > All of its optimizations (most of which are actually in Lucene) are
> > geared towards search. While technically it can be a key-value store,
> > that is not what it was MADE for. Software actually designed for that
> > role is going to be much better than Solr as a key-value store.
> >
> > > When I say null document, I mean the /get API returns: {doc: null}
> > >
> > > The problem is definitely not always there. We also have long periods
> > > of time (a few hours) where we have no problems. I'm just extremely
> > > hesitant to retry when I get a null document, because in some cases
> > > getting a null document is a valid outcome; our caching layer relies
> > > heavily on this, for example. If I were to retry every null, I'd pay
> > > a big penalty in performance.
> >
> > I've just done a little test with the 7.5.0 techproducts example. It
> > looks like returning doc:null actually is how the RTG handler says it
> > didn't find the document. This seems very wrong to me, but I didn't
> > design it, and that response needs SOME kind of format.
> >
> > Have you done any testing to see whether the standard searching handler
> > (typically /select, but many other URL paths are possible) returns
> > results when RTG doesn't? Do you know for these failures whether the
> > document has been committed or not?
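(For concreteness, a minimal SolrJ sketch of the cross-check suggested above. RTG should see at least everything /select sees, including uncommitted updates, so a document that /select finds but /get returns null for would point at a real problem. The client setup and the id field name are my assumptions:)

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

static void crossCheck(SolrClient client, String id) throws Exception {
    SolrDocument rtg = client.getById(id);   // the /get (real-time get) handler
    if (rtg == null) {
        // RTG says "not found"; ask the standard /select handler for the same id
        QueryResponse rsp = client.query(new SolrQuery("id:" + id));
        if (rsp.getResults().getNumFound() > 0) {
            // the anomalous case: committed and searchable, yet invisible to RTG
            System.out.println("/select finds id " + id + " but RTG returned null");
        }
    }
}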
> > > As for your last comment, part of our testing phase is also testing
> > > the limits. Our framework has auto-scaling built in, so if we have a
> > > burst of requests, the system will automatically spin up more clients.
> > > We're pushing 10% of our production traffic to that test server to see
> > > how it will handle it.
> >
> > To spin up another replica, Solr must copy all of its index data from
> > the leader replica. Not only can this take a long time if the index is
> > big, but it will put a lot of extra I/O load on the machine(s) with the
> > leader roles. So performance will actually be WORSE before it gets
> > better when you spin up another replica, and if the index is big, that
> > condition will persist for quite a while. Copying the index data will
> > be constrained by the speed of your network and by the speed of your
> > disks. Often the disks are slower than the network, but that is not
> > always the case.
> >
> > Thanks,
> > Shawn
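(As an aside, a hedged sketch of what "spinning up another replica" amounts to at the SolrJ level; the collection and shard names are placeholders, not from this thread. The new replica then recovers by pulling the full index from the shard leader, which is the copy cost described above:)

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

try (SolrClient client =
         new HttpSolrClient.Builder("http://localhost:8981/solr").build()) {
    // triggers full index replication from the shard1 leader to the new replica
    CollectionAdminRequest.addReplicaToShard("eoe", "shard1").process(client);
}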