Steve: bq. Basically, one core had data in it that should belong to another core. Here's my question about this: Is it possible that two requests to the /get API coming in at the same time would get confused and either both get the same result or get their results swapped?
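(A quick way to test exactly that scenario: a handful of threads sharing one SolrJ client, each repeatedly asking /get for its own id and checking that the document that comes back carries that id. The sketch below is untested, and the host, collection name, and ids are made up.)

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class ConcurrentRtgCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder URL and collection; point this at the real node.
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/test_collection").build();
    // Placeholder ids that are known to exist in the index.
    List<String> ids = Arrays.asList("doc_1", "doc_2", "doc_3", "doc_4");
    ExecutorService pool = Executors.newFixedThreadPool(ids.size());

    for (String id : ids) {
      pool.submit(() -> {
        for (int i = 0; i < 1000; i++) {
          SolrDocument doc = client.getById(id);  // real-time /get for this id
          if (doc == null) {
            System.out.println("null doc for " + id);
          } else if (!id.equals(doc.getFieldValue("id"))) {
            System.out.println("MISMATCH: asked for " + id
                + " but got " + doc.getFieldValue("id"));
          }
        }
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
    client.close();
  }
}

HttpSolrClient is meant to be shared between threads, so a mismatch printed here would point at the server side rather than at client misuse.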
Well, that shouldn't be happening; these are all supposed to be thread-safe calls... All things are possible of course ;)

If two replicas of the same shard have different documents, that could account for what you're seeing, which in turn raises the question of why that is the case, since it should never be true for a quiescent index. Technically there _are_ conditions where this is true on a very temporary basis: commits on the leader and follower can trigger at different wall-clock times. Say your soft commit (or hard commit with openSearcher=true) interval is 10 seconds. It should never be the case that s1r1 and s1r2 are out of sync 10 seconds after the last update was sent. This doesn't seem likely from what you've described though... Hmmmm.

I guess one other thing I can set up is to have a bunch of dummy collections lying around. Currently I have only the active one, and if there's some code path whereby the RTG request goes to a replica of a different collection, my test setup wouldn't reproduce it. I'm also running a 2-shard, 1-replica setup, so if there's some way the replicas get out of sync, that wouldn't show either. So I'm starting another run with these changes:

> opening a new connection for each query
> switched so the collection I'm querying is 2x2
> added some dummy collections that are empty

One nit: while "core" is technically correct, when we talk about a core that's part of a collection we try to use "replica", to be clear we're talking about a core with some added characteristics, i.e. we're in SolrCloud-land. No big deal of course...

Best,
Erick
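P.S. If you want to see directly whether two replicas of the same shard have drifted apart, you can query each core with distrib=false and compare what comes back for one of the problem ids. A rough, untested SolrJ sketch; the core URLs and id are made up, take the real ones from the Cloud view of the admin UI:

import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ReplicaSyncCheck {
  public static void main(String[] args) throws Exception {
    // Made-up core URLs for the two replicas of shard1.
    List<String> coreUrls = Arrays.asList(
        "http://host1:8983/solr/test_shard1_replica_n1",
        "http://host2:8983/solr/test_shard1_replica_n2");
    String docId = "doc_1";  // an id that came back as {doc: null} from /get

    for (String url : coreUrls) {
      try (HttpSolrClient core = new HttpSolrClient.Builder(url).build()) {
        SolrQuery q = new SolrQuery("id:\"" + docId + "\"");
        q.set("distrib", "false");  // ask only this core, no fan-out
        QueryResponse rsp = core.query(q);
        System.out.println(url + " -> numFound=" + rsp.getResults().getNumFound());
      }
    }
  }
}

Different counts for the same id on a quiescent index would point at replica sync rather than at the /get handler itself.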
On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 9/28/2018 8:11 PM, sgaron cse wrote:
> > @Shawn
> > We're running two instances on one machine for two reasons:
> > 1. The box has plenty of resources (48 cores / 256 GB RAM) and since I was
> > reading that it's not recommended to use more than 31 GB of heap in SOLR,
> > we figured 96 GB for keeping index data in OS cache + 31 GB of heap per
> > instance was a good idea.
>
> Do you know that these Solr instances actually DO need 31 GB of heap, or
> are you following advice from somewhere saying "use one quarter of your
> memory as the heap size"? That advice is not in the Solr documentation,
> and never will be. Figuring out the right heap size requires
> experimentation.
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
>
> How big (on disk) is each of these nine cores, and how many documents
> are in each one? Which of them is in each Solr instance? With that
> information, we can make a *guess* about how big your heap should be.
> Figuring out whether the guess is correct generally requires careful
> analysis of a GC log.
>
> > 2. We're in the testing phase, so we wanted a SolrCloud configuration; we
> > will most likely have a much bigger deployment once going to production.
> > In prod right now, we currently run a six-machine Riak cluster. Riak is a
> > key/value document store and has SOLR built in for search, but we are
> > trying to push the key/value aspect of Riak inside SOLR. That way we
> > would have one less piece to worry about in our system.
>
> Solr is not a database. It is not intended to be a data repository.
> All of its optimizations (most of which are actually in Lucene) are
> geared towards search. While technically it can be a key-value store,
> that is not what it was MADE for. Software actually designed for that
> role is going to be much better than Solr as a key-value store.
>
> > When I say null document, I mean the /get API returns: {doc: null}
> >
> > The problem is definitely not always there. We also have large periods of
> > time (a few hours) where we have no problems. I'm just extremely hesitant
> > to retry when I get a null document because in some cases getting a null
> > document is a valid outcome. Our caching layer relies heavily on this,
> > for example. If I were to retry every null I'd pay a big penalty in
> > performance.
>
> I've just done a little test with the 7.5.0 techproducts example. It
> looks like returning doc:null actually is how the RTG handler says it
> didn't find the document. This seems very wrong to me, but I didn't
> design it, and that response needs SOME kind of format.
>
> Have you done any testing to see whether the standard searching handler
> (typically /select, but many other URL paths are possible) returns
> results when RTG doesn't? Do you know for these failures whether the
> document has been committed or not?
>
> > As for your last comment, part of our testing phase is also testing the
> > limits. Our framework has auto-scaling built in, so if we have a burst of
> > requests, the system will automatically spin up more clients. We're
> > pushing 10% of our production load to that test server to see how it will
> > handle it.
>
> To spin up another replica, Solr must copy all its index data from the
> leader replica. Not only can this take a long time if the index is big,
> but it will put a lot of extra I/O load on the machine(s) with the
> leader role. So performance will actually be WORSE before it gets
> better when you spin up another replica, and if the index is big, that
> condition will persist for quite a while. Copying the index data will
> be constrained by the speed of your network and by the speed of your
> disks. Often the disks are slower than the network, but that is not
> always the case.
>
> Thanks,
> Shawn
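For reference, the comparison Shawn suggests (does /select find a document when /get returns null for it?) can be scripted in a few lines of SolrJ. This is only a sketch; the base URL and id are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class RtgVersusSelect {
  public static void main(String[] args) throws Exception {
    // Placeholder base URL and document id; adjust to the real collection.
    try (HttpSolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/test_collection").build()) {
      String id = "doc_1";

      // Real-time get: reads from the update log, so it sees uncommitted docs.
      SolrDocument rtgDoc = client.getById(id);

      // Standard /select search: only sees docs visible to an open searcher.
      long numFound = client.query(new SolrQuery("id:\"" + id + "\""))
          .getResults().getNumFound();

      System.out.println("/get found:    " + (rtgDoc != null));
      System.out.println("/select found: " + (numFound > 0));
    }
  }
}

Real-time get reads from the update log, so it should return an uncommitted document that /select cannot see yet; the suspicious case is the opposite one, where /select finds the id but /get comes back null.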