I've set up a test program on a local machine; we'll see if I can reproduce
the problem. Here's the setup:

1> created a 2-shard, leader(primary) only collection
2> added 1M simple docs to it (ids 0-999,999) and some text
3> re-added 100,000 docs with random ids between 0 and 999,999 (inclusive)
     to ensure there were deleted docs. I have no idea whether that
     matters.
4> fired up a 16-thread query program doing RTG on random doc IDs
     The program will stop when either it gets a null response or the response
     isn't the doc asked for.
5> running 7.3.1
6> I'm using the SolrJ RTG code 'cause it was easy (see the sketch after
     this list)
7> All this is running locally on a Mac Pro, so no network is involved,
     which is another variable, I suppose
8> 7M queries later, no issues
9> there's no indexing going on at all
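
A minimal SolrJ sketch of roughly what that query program looks like
(the collection name, embedded-ZK address, and failure-reporting details
are assumptions, not the actual test code):

  import java.util.Collections;
  import java.util.Optional;
  import java.util.concurrent.*;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrDocument;

  public class RtgStressTest {
    static final int THREADS = 16;
    static final int MAX_ID = 1_000_000;     // ids 0-999,999 from step 2
    static volatile boolean failed = false;

    public static void main(String[] args) throws Exception {
      try (CloudSolrClient client = new CloudSolrClient.Builder(
              Collections.singletonList("localhost:9983"), Optional.empty())
              .build()) {
        client.setDefaultCollection("rtg_test");   // hypothetical name
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
          pool.submit(() -> {
            while (!failed) {
              String id = Integer.toString(
                  ThreadLocalRandom.current().nextInt(MAX_ID));
              try {
                // getById() issues a real-time get against the /get handler
                SolrDocument doc = client.getById(id);
                if (doc == null || !id.equals(doc.getFieldValue("id"))) {
                  failed = true;   // stop on null response or wrong doc
                  System.err.println("RTG failure for id " + id + ": " + doc);
                }
              } catch (Exception e) {
                failed = true;
                e.printStackTrace();
              }
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
      }
    }
  }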

Steve and Chris:

What about this test setup do you imagine doesn't reflect what your
setups are doing? Things I can think of to test, in order:

> mimic how y'all are calling RTG more faithfully
> index to this collection, perhaps not at a high rate (see the sketch
     after this list)
> create another collection and actively index to it
> separate the machines running solr from the one
     doing any querying or indexing
> ???
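
For the second item above, a minimal sketch of a low-rate background
indexer that could run alongside the RTG threads in the sketch above; it
reuses client, failed, and MAX_ID from there, the field name and rate are
assumptions, and it also needs org.apache.solr.common.SolrInputDocument:

  Thread indexer = new Thread(() -> {
    try {
      while (!failed) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(
            ThreadLocalRandom.current().nextInt(MAX_ID)));
        doc.addField("text_t", "updated " + System.currentTimeMillis());
        client.add(doc, 10_000);   // commitWithin 10 seconds
        Thread.sleep(100);         // ~10 docs/sec, i.e. not a high rate
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  });
  indexer.setDaemon(true);
  indexer.start();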

And, of course, if it reproduces, then run it to death on 7.5 to see if
it's still a problem.

Best,
Erick
On Fri, Sep 28, 2018 at 10:21 AM Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 9/28/2018 6:09 AM, sgaron cse wrote:
> > Because this is a test deployment, the replica count is set to 1, so as
> > far as I understand, data will not be replicated for this core. Basically
> > we have two SOLR instances running on the same box: one on port 8983, the
> > other on port 8984. We have 9 cores on this SOLR cloud deployment, 5 of
> > which are on the instance on port 8983 and the other 4 on port 8984.
>
> A question that isn't really related to the problem you're investigating
> now:  Why are you running two Solr instances on the same machine?  9
> cores is definitely not too many for one Solr instance.
>
> > As far as I can tell,
> > all cores suffer from the occasional null document. But the one I can
> > easily see errors from is a config core where we store configuration data
> > for our system. Since the configuration data should always be there, we
> > throw an exception as soon as we get a null document, which is why I
> > noticed the problem.
>
> When you say "null document" do you mean that you get no results, or
> that you get a result with a document, but that document has nothing in
> it?  Are there any errors returned or logged by Solr when this happens?
>
> > Our client code that connects to the APIs randomly chooses between all the
> > different ports because it does not know which instance it should ask. So
> > no, we did not try sending directly to the instance that has the data, but
> > since there is no replica, there is no way this should get out of sync.
>
> I was suggesting this as a troubleshooting step, not a change to how you
> use Solr.  Basically trying to determine what happens if you send a
> request directly to the instance and core that contains the document
> with distrib=false, to see if it behaves differently than when it's a
> more generic collection-directed query.  The idea was to try and narrow
> down exactly where to look for a problem.
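>
> A minimal SolrJ sketch of that troubleshooting step (the core URL and the
> doc id are hypothetical):
>
>   import org.apache.solr.client.solrj.SolrQuery;
>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>   import org.apache.solr.client.solrj.response.QueryResponse;
>
>   try (HttpSolrClient core = new HttpSolrClient.Builder(
>           "http://localhost:8983/solr/config_shard1_replica_n1").build()) {
>     SolrQuery q = new SolrQuery();
>     q.setRequestHandler("/get");    // real-time get handler
>     q.set("id", "someDocId");
>     q.set("distrib", "false");      // serve from this core only, no fan-out
>     QueryResponse rsp = core.query(q);
>     System.out.println(rsp.getResponse().get("doc"));
>   }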
>
> If you wait a few seconds, does the problem go away?  When using real
> time get, a new document must be written to a segment and a new realtime
> searcher must be created before you can get that document.  These things
> typically happen very quickly, but it's not instantaneous.
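>
> A quick way to test that timing window from SolrJ (the retry delay is
> arbitrary):
>
>   SolrDocument doc = client.getById(id);
>   if (doc == null) {
>     Thread.sleep(5000);           // wait a few seconds
>     doc = client.getById(id);     // does the doc show up now?
>   }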
>
> > To add to what Chris was saying, although the core that is seeing the
> > issue is not hit very hard, other cores in the setup will be. We are
> > building a clustered environment with auto-scaling, so if we are under
> > heavy load, we can easily have 200-300 clients hitting the SOLR instance
> > simultaneously.
>
> That much traffic is going to need multiple replicas on separate
> hardware, with something in place to do load balancing. Unless your code
> is Java and you can use CloudSolrClient, I would recommend an external
> load balancer.
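>
> If the code is Java, a minimal sketch of the CloudSolrClient option
> (the ZooKeeper address and collection name are assumptions); it watches
> ZooKeeper and spreads requests across live replicas itself:
>
>   try (CloudSolrClient client = new CloudSolrClient.Builder(
>           Collections.singletonList("zkhost:2181"), Optional.empty())
>           .build()) {
>     client.setDefaultCollection("config");
>     SolrDocument doc = client.getById("someDocId");
>   }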
>
> Thanks,
> Shawn
>
