@Shawn

We're running two instances on one machine for two reasons:

1. The box has plenty of resources (48 cores / 256 GB of RAM), and since I had read that it's not recommended to use more than 31 GB of heap with SOLR, we figured 96 GB for keeping index data in the OS cache + 31 GB of heap per instance was a good idea.
2. We're in a testing phase, so we wanted a SOLR Cloud configuration; we will most likely have a much bigger deployment once we go to production.

In prod right now we run a six-machine Riak cluster. Riak is a key/value document store that has SOLR built in for search, but we are trying to push the key/value side of Riak into SOLR. That way we would have one less piece to worry about in our system.

When I say null document, I mean the /get API returns {doc: null}. The problem is definitely not always there; we also have long stretches of time (a few hours) where we see no problems. I'm extremely hesitant to retry when I get a null document, because in some cases a null document is a valid outcome. Our caching layer heavily relies on this, for example, so if I retried every null I'd pay a big performance penalty.
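To make the ambiguity concrete, here's roughly what our lookup path does. This is only a minimal SolrJ sketch, not our actual client; the base URL is one of our two instances, and the collection name ("config") and the id are placeholders:

import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class ConfigLookup {
    public static void main(String[] args) throws SolrServerException, IOException {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            // getById() goes through the real-time get handler (/get) and
            // returns null when the response is {doc: null}.
            SolrDocument doc = client.getById("config", "some-config-id");
            if (doc == null) {
                // A legitimate cache miss looks exactly like the bug,
                // so blindly retrying every null would be very expensive.
                System.out.println("no such document");
            } else {
                System.out.println("found: " + doc.getFieldValue("id"));
            }
        }
    }
}

From the /get response alone there is no way to tell a legitimate miss from the bug, which is why blanket retries are off the table for us.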
As for your last comment, part of our testing phase is also testing the limits. Our framework has auto-scaling built in, so if we get a burst of requests the system automatically spins up more clients. We're pushing 10% of our production traffic at that test server to see how it handles it.

@Erick

Thanks a lot for testing; there has to be a variable I don't understand for this scenario to happen. Let me try to reproduce it reliably on my side, and when I do I'll send you instructions on how to reproduce it. I don't want you to waste your time on this.

There might be one thing, though, that your test scenario does not account for: while we use the /get API a lot, we use it sporadically, and we most likely create a new connection to the API each time, because the calls come from newly spawned processes. I had no luck reproducing the problem using 50 threads while keeping a session open and hammering the /get API, so I need to find a better way to reproduce this.
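The next thing I want to try is something like the loop below. Again just a sketch assuming SolrJ, where the ports match our two instances but the collection name and id range are made up. The point is that every request gets a brand-new client, and therefore a cold connection, the way our newly spawned processes behave:

import java.util.concurrent.ThreadLocalRandom;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class ColdConnectionRepro {
    // Both instances live on the same box, like our test deployment.
    private static final String[] BASE_URLS = {
        "http://localhost:8983/solr",
        "http://localhost:8984/solr"
    };

    public static void main(String[] args) throws Exception {
        while (true) {
            String baseUrl = BASE_URLS[ThreadLocalRandom.current().nextInt(BASE_URLS.length)];
            // Assumes the collection was pre-loaded with docs whose ids
            // are 0-999,999, like Erick's test data.
            String id = Integer.toString(ThreadLocalRandom.current().nextInt(1_000_000));
            // A brand-new client per request: no keep-alive/session reuse,
            // which is how our newly spawned processes behave.
            try (SolrClient client = new HttpSolrClient.Builder(baseUrl).build()) {
                SolrDocument doc = client.getById("config", id);
                if (doc == null) {
                    System.err.println("got {doc: null} for id " + id);
                    return;
                }
            }
        }
    }
}

If the null documents only show up with cold connections, that would at least narrow the problem down.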
@Erick and @Shawn (both)

I saw something really weird today: there was a mix-up in some of the cores' data. Basically, one core had data in it that belonged to another core. Here's my question about this: is it possible that two requests to the /get API coming in at the same time could get confused and either both get the same result, or get each other's results? That could explain my {doc: null} problem, since the caching layer that looks up IDs in some cores is usually hit pretty hard.

Anyway, give me time to do more testing on Monday/Tuesday to try to pinpoint the issues and make them easily reproducible.

Thanks again for helping,
Steve

On Fri, Sep 28, 2018 at 4:19 PM Erick Erickson <erickerick...@gmail.com> wrote:

> Well, I flipped indexing on, and after another 7 million queries, no fails.
> No reason to stop just yet, but not encouraging so far...
>
> On Fri, Sep 28, 2018, 10:58 Erick Erickson <erickerick...@gmail.com> wrote:
>
> > I've set up a test program on a local machine; we'll see if I can reproduce.
> > Here's the setup:
> >
> > 1> created a 2-shard, leader (primary) only collection
> > 2> added 1M simple docs to it (ids 0-999,999) and some text
> > 3> re-added 100,000 docs with a random id between 0-999,999 (inclusive)
> >    to ensure there were deleted docs; don't have any clue whether that matters
> > 4> fired up a 16-thread query program doing RTG on random doc IDs;
> >    the program will stop when it either gets a null response or the
> >    response isn't the doc asked for
> > 5> running 7.3.1
> > 6> I'm using the SolrJ RTG code 'cause it was easy
> > 7> all of this is running locally on a Mac Pro, no network involved, which
> >    is another variable I suppose
> > 8> 7M queries later, no issues
> > 9> there's no indexing going on at all
> >
> > Steve and Chris:
> >
> > What about this test setup do you imagine doesn't reflect what your
> > setup is doing? Things I can think of, in order of things to test:
> >
> > - mimic how y'all are calling RTG more faithfully
> > - index to this collection, perhaps not at a high rate
> > - create another collection and actively index to it
> > - separate the machines running Solr from the one doing any querying
> >   or indexing
> > - ???
> >
> > And, of course, if it reproduces, then run it to death on 7.5 to see if
> > it's still a problem.
> >
> > Best,
> > Erick
> >
> > On Fri, Sep 28, 2018 at 10:21 AM Shawn Heisey <apa...@elyograg.org> wrote:
> > >
> > > On 9/28/2018 6:09 AM, sgaron cse wrote:
> > > > Because this is a test deployment, replica is set to 1, so as far as I
> > > > understand, data will not be replicated for this core. Basically we
> > > > have two SOLR instances running on the same box, one on port 8983, the
> > > > other on port 8984. We have 9 cores on this SOLR Cloud deployment, 5 of
> > > > which are on the instance on port 8983 and the other 4 on port 8984.
> > >
> > > A question that isn't really related to the problem you're investigating
> > > now: why are you running two Solr instances on the same machine? 9
> > > cores is definitely not too many for one Solr instance.
> > >
> > > > As far as I can tell, all cores suffer from the occasional null
> > > > document, but the one that I can easily see errors from is a config
> > > > core where we store configuration data for our system. Since the
> > > > configuration data should always be there, we throw exceptions as soon
> > > > as we get a null document, which is why I noticed the problem.
> > >
> > > When you say "null document", do you mean that you get no results, or
> > > that you get a result with a document, but that document has nothing in
> > > it? Are there any errors returned or logged by Solr when this happens?
> > >
> > > > Our client code that connects to the APIs randomly chooses between all
> > > > the different ports because it does not know which instance it should
> > > > ask. So no, we did not try sending directly to the instance that has
> > > > the data, but since there is no replica, there is no way that this
> > > > should get out of sync.
> > >
> > > I was suggesting this as a troubleshooting step, not a change to how you
> > > use Solr. Basically, trying to determine what happens if you send a
> > > request directly to the instance and core that contains the document,
> > > with distrib=false, to see if it behaves differently than a more generic
> > > collection-directed query. The idea was to try to narrow down exactly
> > > where to look for a problem.
> > >
> > > If you wait a few seconds, does the problem go away? When using real-time
> > > get, a new document must be written to a segment and a new realtime
> > > searcher must be created before you can get that document. These things
> > > typically happen very quickly, but it's not instantaneous.
> > >
> > > > To add to what Chris was saying, although the core that is seeing the
> > > > issue is not hit very hard, other cores in the setup will be. We are
> > > > building a clustering environment that has auto-scaling, so if we are
> > > > under heavy load, we can easily have 200-300 clients hitting the SOLR
> > > > instance simultaneously.
> > >
> > > That much traffic is going to need multiple replicas on separate
> > > hardware, with something in place to do load balancing. Unless your code
> > > is Java and you can use CloudSolrClient, I would recommend an external
> > > load balancer.
> > > Thanks,
> > > Shawn