On 9/28/2018 6:09 AM, sgaron cse wrote:
Because this is a test deployment, the replica count is set to 1, so as far as I understand, data will not be replicated for this core. Basically, we have two Solr instances running on the same box: one on port 8983, the other on port 8984. We have 9 cores in this SolrCloud deployment, 5 of them on the instance on port 8983 and the other 4 on port 8984.
A question that isn't really related to the problem you're investigating now: Why are you running two Solr instances on the same machine? 9 cores is definitely not too many for one Solr instance.
As far as I can tell, all cores suffer from the occasional null document. But the one where I can easily see errors is a config core where we store configuration data for our system. Since the configuration data should always be there, we throw exceptions as soon as we get a null document, which is why I noticed the problem.
When you say "null document" do you mean that you get no results, or that you get a result with a document, but that document has nothing in it? Are there any errors returned or logged by Solr when this happens?
Our client code that connects to the APIs randomly chooses between all the different ports because it does not know which instance it should ask. So no, we did not try sending directly to the instance that has the data, but since there is no replica, there is no way that this should get out of sync.
I was suggesting this as a troubleshooting step, not a change to how you use Solr. Basically trying to determine what happens if you send a request directly to the instance and core that contains the document with distrib=false, to see if it behaves differently than when it's a more generic collection-directed query. The idea was to try and narrow down exactly where to look for a problem.
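As a rough illustration of that troubleshooting step, here is a minimal Python sketch that builds a request URL aimed at one specific core with distrib=false. The host, the core name (config_shard1_replica_n1) and the document id are assumptions made up for the example; the real core name can be found in the admin UI. The function name is hypothetical as well.

```python
from urllib.parse import urlencode


def build_direct_query_url(base_url, core_name, query, handler="select"):
    """Build a URL that queries one specific core without distributing the request."""
    params = {
        "q": query,
        "distrib": "false",  # ask this core only; do not fan out to other shards
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/solr/{core_name}/{handler}?{urlencode(params)}"


# Hypothetical core name and document id; substitute the real ones.
url = build_direct_query_url(
    "http://localhost:8983", "config_shard1_replica_n1", "id:my-config-doc"
)
print(url)
```

Sending the same query to this URL and to the collection-level URL on each port, then comparing the responses, should show whether the direct request behaves differently.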
If you wait a few seconds, does the problem go away? When using real time get, a new document must be written to a segment and a new realtime searcher must be created before you can get that document. These things typically happen very quickly, but they're not instantaneous.
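One way to test whether this is a timing issue is to retry the lookup a few times with a short pause and record how many tries it took. The sketch below is only an illustration: fetch stands in for whatever call your client makes (it is assumed to return None when no document comes back), and the attempt count and delay are arbitrary.

```python
import time


def fetch_with_retry(fetch, attempts=5, delay_seconds=0.5, sleep=time.sleep):
    """Call fetch() until it returns a non-None document or attempts run out.

    Returns (document, tries_used); document is None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        doc = fetch()
        if doc is not None:
            return doc, attempt
        if attempt < attempts:
            sleep(delay_seconds)  # give Solr a moment before asking again
    return None, attempts


# Stand-in for a real lookup: returns None twice, then a document.
responses = iter([None, None, {"id": "my-config-doc"}])
doc, tries = fetch_with_retry(lambda: next(responses), delay_seconds=0.0)
print(doc, tries)
```

If the document reliably shows up on the second or third try, that points toward timing rather than missing data. If it never shows up, the problem is somewhere else.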
To add to what Chris was saying: although the core that is seeing the issue is not hit very hard, other cores in the setup will be. We are building a clustering environment that has auto-scaling, so if we are under heavy load, we can easily have 200-300 clients hitting the Solr instance simultaneously.
That much traffic is going to need multiple replicas on separate hardware, with something in place to do load balancing. Unless your code is Java and you can use CloudSolrClient, I would recommend an external load balancer.
Thanks, Shawn