On 9/28/2018 6:09 AM, sgaron cse wrote:
Because this is a test deployment, replica is set to 1, so as far as I
understand, data will not be replicated for this core. Basically we have
two SOLR instances running on the same box. One on port 8983, the other on
port 8984. We have 9 cores on this SOLR cloud deployment, 5 of which are on
the instance on port 8983 and the other 4 on port 8984.

A question that isn't really related to the problem you're investigating now:  Why are you running two Solr instances on the same machine?  9 cores is definitely not too many for one Solr instance.

As far as I can tell
all cores suffer from the occasional null document. But the one that I can
easily see errors from is a config core where we store configuration data
for our system. Since the configuration data should always be there we
throw exceptions as soon as we get a null document which is why I noticed
the problem.

When you say "null document" do you mean that you get no results, or that you get a result with a document, but that document has nothing in it?  Are there any errors returned or logged by Solr when this happens?
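
To make the distinction concrete, here is a rough Python sketch of how the two cases could be told apart on the client side. It assumes the JSON response has already been parsed into a dict, and that it has either the shape returned by real time get with a single id (a top-level "doc" key, which is null when nothing is found) or the usual "response"/"docs" shape of a normal query. The function name classify_result is made up for illustration and is not part of any Solr client.

```python
def classify_result(body):
    """Classify a parsed Solr JSON response body (a dict).

    Assumes either a real time get response ({"doc": ...}) or a
    standard query response ({"response": {"docs": [...]}}).
    """
    if "doc" in body:
        # Real time get with a single id: "doc" is null when nothing matches.
        doc = body["doc"]
        if doc is None:
            return "no document"
        return "document with fields" if doc else "empty document"

    # Standard query response: look at the list of returned documents.
    docs = body.get("response", {}).get("docs", [])
    if not docs:
        return "no results"
    return "document with fields" if docs[0] else "empty document"
```

Logging which of these cases occurs, together with the port the request went to, would help show whether the failures follow one instance.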

Our client code that connects to the APIs randomly chooses between all the
different ports because it does not know which instance it should ask. So
no, we did not try sending directly to the instance that has the data but
since there is no replica there is no way that this should get out of sync.

I was suggesting this as a troubleshooting step, not a change to how you use Solr.  The goal is to see what happens when you send a request with distrib=false directly to the instance and core that contain the document, and whether that behaves differently from a more generic collection-directed query.  That should help narrow down exactly where to look for a problem.
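
As a rough illustration of that troubleshooting step, the Python sketch below builds a query URL aimed at one specific core with distrib=false. The host, the core name, and the assumption that the uniqueKey field is called "id" are all placeholders; the real core name can be read from the Solr admin UI. The helper direct_core_url is hypothetical, written only for this example.

```python
from urllib.parse import urlencode


def direct_core_url(host, port, core, doc_id):
    """Build a /select URL for a single core with distributed search disabled.

    Assumes the uniqueKey field is named "id" (an assumption, not confirmed
    by the thread).
    """
    params = urlencode({
        "q": f'id:"{doc_id}"',   # look up one document by its key
        "distrib": "false",      # query only this core, no fan-out
        "wt": "json",
    })
    return f"http://{host}:{port}/solr/{core}/select?{params}"


# Example with placeholder values; the core name here is invented.
print(direct_core_url("localhost", 8983, "config_shard1_replica_n1", "abc"))
```

Sending the same lookup both this way and through the collection name, then comparing the results when the problem occurs, should show whether the issue is in the core itself or in how the request gets routed.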

If you wait a few seconds, does the problem go away?  When using real time get, a new document must be written to a segment and a new realtime searcher must be created before you can get that document.  These things typically happen very quickly, but they are not instantaneous.
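
One way to test the "wait a few seconds" idea is to retry the lookup a few times with a short pause, and record how many tries it took. The Python sketch below is only an outline: fetch stands in for whatever client call you already use to get the document (assumed to return None when nothing comes back), and the attempt count and delay are arbitrary values, not recommendations.

```python
import time


def get_with_retry(fetch, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None document or attempts run out.

    Returns (document, tries_used). The document is None if every
    attempt came back empty.
    """
    for attempt in range(attempts):
        doc = fetch()
        if doc is not None:
            return doc, attempt + 1
        if attempt < attempts - 1:
            sleep(delay_seconds)  # pause before asking again
    return None, attempts
```

If the document reliably shows up on the second or third try, that points at timing rather than at missing data.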

To add to what Chris was saying, although the core that is seeing the
issue is not hit very hard, other cores in the setup will be. We are
building a clustering environment that has auto-scaling so if we are under
heavy load, we can easily have 200-300 clients hitting the SOLR instance
simultaneously.

That much traffic is going to need multiple replicas on separate hardware, with something in place to do load balancing. Unless your code is Java and you can use CloudSolrClient, I would recommend an external load balancer.

Thanks,
Shawn
