Well, I flipped indexing on and after another 7 million queries, no fails. No reason to stop just yet, but not encouraging so far...
On Fri, Sep 28, 2018, 10:58 Erick Erickson <erickerick...@gmail.com> wrote:
> I've set up a test program on a local machine, we'll see if I can reproduce.
> Here's the setup:
>
> 1> created a 2-shard, leader(primary) only collection
> 2> added 1M simple docs to it (ids 0-999,999) and some text
> 3> re-added 100_000 docs with a random id between 0 - 999,999 (inclusive)
>    to ensure there were deleted docs. Don't have any clue whether
>    that matters.
> 4> fired up a 16 thread query program doing RTG on random doc IDs.
>    The program will stop when either it gets a null response or the
>    response isn't the doc asked for.
> 5> running 7.3.1
> 6> I'm using the SolrJ RTG code 'cause it was easy
> 7> All this is running locally on a Mac Pro, no network involved, which is
>    another variable I suppose
> 8> 7M queries later no issues
> 9> there's no indexing going on at all
>
> Steve and Chris:
>
> What about this test setup do you imagine doesn't reflect what your
> setup is doing? Things I can think of in order of things to test:
>
> > mimic how y'all are calling RTG more faithfully
> > index to this collection, perhaps not at a high rate
> > create another collection and actively index to it
> > separate the machines running solr from the one
>   doing any querying or indexing
> > ???
>
> And, of course, if it reproduces then run it to death on 7.5 to see if
> it's still a problem
>
> Best,
> Erick
> On Fri, Sep 28, 2018 at 10:21 AM Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > On 9/28/2018 6:09 AM, sgaron cse wrote:
> > > because this is a test deployment replica is set to 1 so as far as I
> > > understand, data will not be replicated for this core. Basically we have
> > > two SOLR instances running on the same box. One on port 8983, the other on
> > > port 8984. We have 9 cores on this SOLR cloud deployment, 5 of which are on the
> > > instance on port 8983 and the other 4 on port 8984.
> >
> > A question that isn't really related to the problem you're investigating
> > now: Why are you running two Solr instances on the same machine? 9
> > cores is definitely not too many for one Solr instance.
> >
> > > As far as I can tell
> > > all cores suffer from the occasional null document. But the one that I can
> > > easily see errors from is a config core where we store configuration data
> > > for our system. Since the configuration data should always be there we
> > > throw exceptions as soon as we get a null document which is why I noticed
> > > the problem.
> >
> > When you say "null document" do you mean that you get no results, or
> > that you get a result with a document, but that document has nothing in
> > it? Are there any errors returned or logged by Solr when this happens?
> >
> > > Our client code that connects to the APIs randomly chooses between all the
> > > different ports because it does not know which instance it should ask. So
> > > no, we did not try sending directly to the instance that has the data but
> > > since there is no replica there is no way that this should get out of sync.
> >
> > I was suggesting this as a troubleshooting step, not a change to how you
> > use Solr. Basically trying to determine what happens if you send a
> > request directly to the instance and core that contains the document
> > with distrib=false, to see if it behaves differently than when it's a
> > more generic collection-directed query. The idea was to try and narrow
> > down exactly where to look for a problem.
> >
> > If you wait a few seconds, does the problem go away? When using real
> > time get, a new document must be written to a segment and a new realtime
> > searcher must be created before you can get that document. These things
> > typically happen very quickly, but it's not instantaneous.
> >
> > > To add to what Chris was saying, although the core that is seeing the
> > > issue is not hit very hard, other cores in the setup will be. We are
> > > building a clustering environment that has auto-scaling so if we are under
> > > heavy load, we can easily have 200-300 clients hitting the SOLR instance
> > > simultaneously.
> >
> > That much traffic is going to need multiple replicas on separate
> > hardware, with something in place to do load balancing. Unless your code
> > is Java and you can use CloudSolrClient, I would recommend an external
> > load balancer.
> >
> > Thanks,
> > Shawn
> >
>
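
For anyone who wants to try reproducing this without SolrJ, here is a rough
Python sketch of the kind of check described in the quoted setup: several
threads doing RTG on random ids, stopping as soon as a response is null or
isn't the doc asked for. This is not the program used in the thread. The
function names, the "test" collection name, and the localhost base URL are
all made up for illustration. It assumes the stock /get handler answering
with JSON of the form {"doc": ...}, and the `distrib` flag is only there to
make it easy to try the distrib=false request suggested above.

```python
import json
import random
import threading
import urllib.parse
import urllib.request


def rtg_url(base_url, collection, doc_id, distrib=True):
    """Build a real-time get URL; distrib=False adds distrib=false."""
    params = {"id": doc_id, "wt": "json"}
    if not distrib:
        params["distrib"] = "false"
    return f"{base_url}/solr/{collection}/get?{urllib.parse.urlencode(params)}"


def http_fetch(base_url, collection, distrib=True):
    """Return a fetch(doc_id) function that does RTG over HTTP (hypothetical helper)."""
    def fetch(doc_id):
        url = rtg_url(base_url, collection, doc_id, distrib)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp).get("doc")  # None when Solr answers {"doc": null}
    return fetch


def run_check(fetch, max_id, threads=16, max_queries=1000, seed=0):
    """Query random ids until a null or mismatched doc shows up.

    Returns (queries_started, failure); failure is None or (doc_id, doc).
    """
    lock = threading.Lock()
    state = {"count": 0, "failure": None}

    def worker(rng):
        while True:
            with lock:
                # Stop once any thread has failed or the query budget is used up.
                if state["failure"] is not None or state["count"] >= max_queries:
                    return
                state["count"] += 1
            doc_id = str(rng.randint(0, max_id))
            doc = fetch(doc_id)
            if doc is None or str(doc.get("id")) != doc_id:
                with lock:
                    if state["failure"] is None:
                        state["failure"] = (doc_id, doc)
                return

    workers = [threading.Thread(target=worker, args=(random.Random(seed + i),))
               for i in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return state["count"], state["failure"]
```

Against a live instance it would be called along the lines of
run_check(http_fetch("http://localhost:8983", "test"), max_id=999_999,
max_queries=7_000_000). Passing the fetch function in keeps the loop
testable without a running Solr, and swapping in
http_fetch(..., distrib=False) pointed at the instance that holds the core
gives the direct-to-core comparison.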