Well, I flipped indexing on and after another 7 million queries, no fails.
No reason to stop just yet, but not encouraging so far...
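
For anyone who wants to try something similar without SolrJ, here is a rough Python sketch of the kind of query loop described in the setup below. It is not the actual test program. The base URL, the collection name "rtg_test", and the helper names (rtg_fetch, check, run) are made up for illustration, and it assumes the default /get (real-time get) handler is enabled and that docs carry an "id" field.

```python
import json
import random
import threading
import urllib.parse
import urllib.request

# Assumed base URL and collection name; adjust to your deployment.
SOLR_URL = "http://localhost:8983/solr/rtg_test"


def rtg_fetch(doc_id, base_url=SOLR_URL):
    """Fetch one document through Solr's real-time get handler (/get)."""
    query = urllib.parse.urlencode({"id": doc_id, "wt": "json"})
    with urllib.request.urlopen(f"{base_url}/get?{query}", timeout=10) as resp:
        # A single-id RTG response looks like {"doc": {...}} or {"doc": null}.
        return json.load(resp).get("doc")


def check(doc_id, doc):
    """Return an error string for a null or mismatched doc, else None."""
    if doc is None:
        return f"null response for id {doc_id}"
    if str(doc.get("id")) != str(doc_id):
        return f"asked for id {doc_id}, got {doc.get('id')}"
    return None


def run(fetch=rtg_fetch, threads=16, max_id=999_999, max_queries=None):
    """Query random ids from several threads until one check fails.

    Returns (queries_started, first_error). max_queries caps the total
    across all threads; None means keep going until a failure.
    """
    stop = threading.Event()
    lock = threading.Lock()
    state = {"count": 0, "error": None}

    def worker():
        rng = random.Random()
        while not stop.is_set():
            with lock:
                if max_queries is not None and state["count"] >= max_queries:
                    return
                state["count"] += 1
            doc_id = rng.randint(0, max_id)
            error = check(doc_id, fetch(doc_id))
            if error:
                with lock:
                    if state["error"] is None:
                        state["error"] = error
                stop.set()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return state["count"], state["error"]
```

Calling run() with the defaults hits the assumed local collection from 16 threads and returns as soon as any thread sees a null or mismatched doc. Passing a different fetch function makes it easy to point the same loop at a specific core or to dry-run it without a server.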


On Fri, Sep 28, 2018, 10:58 Erick Erickson <erickerick...@gmail.com> wrote:

> I've set up a test program on a local machine; we'll see if I can reproduce
> this. Here's the setup:
>
> 1> created a 2-shard, leader(primary) only collection
> 2> added 1M simple docs to it (ids 0-999,999) and some text
> 3> re-added 100_000 docs with a random id between 0 - 999,999 (inclusive)
>      to ensure there were deleted docs. Don't have any clue whether
>      that matters.
> 4> fired up a 16 thread query program doing RTG on random doc IDs.
>      The program will stop when either it gets a null response or the
>      response isn't the doc asked for.
> 5> running 7.3.1
> 6> I'm using the SolrJ RTG code 'cause it was easy
> 7> All this is running locally on a Mac Pro, no network involved which is
>      another variable I suppose
> 8> 7M queries later no issues
> 9> there's no indexing going on at all
>
> Steve and Chris:
>
> What about this test setup do you imagine doesn't reflect what your
> setup is doing? Things I can think of in order of things to test:
>
> > mimic how y'all are calling RTG more faithfully
> > index to this collection, perhaps not at a high rate
> > create another collection and actively index to it
> > separate the machines running solr from the one
>      doing any querying or indexing
> > ???
>
> And, of course if it reproduces then run it to death on 7.5 to see if
> it's still a problem
>
> Best,
> Erick
> On Fri, Sep 28, 2018 at 10:21 AM Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > On 9/28/2018 6:09 AM, sgaron cse wrote:
> > > because this is a test deployment replica is set to 1 so as far as I
> > > understand, data will not be replicated for this core. Basically we have
> > > two SOLR instances running on the same box. One on port 8983, the other on
> > > port 8984. We have 9 cores on this SOLR cloud deployment, 5 of which on the
> > > instance on port 8983 and the other 4 on port 8984.
> >
> > A question that isn't really related to the problem you're investigating
> > now:  Why are you running two Solr instances on the same machine?  9
> > cores is definitely not too many for one Solr instance.
> >
> > > As far as I can tell
> > > all cores suffer from the occasional null document. But the one that I can
> > > easily see error from is a config core where we store configuration data
> > > for our system. Since the configuration data should always be there we
> > > throw exceptions as soon as we get a null document which is why I noticed
> > > the problem.
> >
> > When you say "null document" do you mean that you get no results, or
> > that you get a result with a document, but that document has nothing in
> > it?  Are there any errors returned or logged by Solr when this happens?
> >
> > > Our client code that connects to the APIs randomly chooses between all the
> > > different ports because it does not know which instance it should ask. So
> > > no, we did not try sending directly to the instance that has the data but
> > > since there is no replica there is no way that this should get out of sync.
> >
> > I was suggesting this as a troubleshooting step, not a change to how you
> > use Solr.  Basically trying to determine what happens if you send a
> > request directly to the instance and core that contains the document
> > with distrib=false, to see if it behaves differently than when it's a
> > more generic collection-directed query.  The idea was to try and narrow
> > down exactly where to look for a problem.
> >
> > If you wait a few seconds, does the problem go away?  When using real
> > time get, a new document must be written to a segment and a new realtime
> > searcher must be created before you can get that document.  These things
> > typically happen very quickly, but it's not instantaneous.
> >
> > > To add up to what Chris was saying, although the core that is seeing the
> > > issue is not hit very hard, other core in the setup will be. We are
> > > building a clustering environment that has auto-scaling so if we are under
> > > heavy load, we can easily have 200-300 client hitting the SOLR instance
> > > simultaneously.
> >
> > That much traffic is going to need multiple replicas on separate
> > hardware, with something in place to do load balancing. Unless your code
> > is Java and you can use CloudSolrClient, I would recommend an external
> > load balancer.
> >
> > Thanks,
> > Shawn
> >
>
