@Shawn
We're running two instances on one machine for two reasons:
1. The box has plenty of resources (48 cores / 256GB RAM), and since I
had read that it's not recommended to use more than 31GB of heap with
Solr (past ~32GB the JVM loses compressed object pointers), we figured
96GB for keeping index data in the OS cache plus 31GB of heap per
instance was a good idea.
2. We're in the testing phase, so we wanted a SolrCloud configuration;
we will most likely have a much bigger deployment once we go to
production. In production right now we run a six-machine Riak cluster.
Riak is a key/value document store that has Solr built in for search,
but we are trying to push the key/value aspect of Riak into Solr. That
way we would have one less piece to worry about in our system.

When I say null document, I mean the /get API returns: {doc: null}
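
For reference, here is roughly how a client sees that (a minimal SolrJ
sketch, not our actual code; the core name and id are made up):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class RtgNullCheck {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr").build()) {
            // Real-time get: getById() maps to the /get handler and
            // returns null when the id is not found.
            SolrDocument doc = client.getById("config_core", "some-config-id");
            if (doc == null) {
                System.out.println("got {doc: null}");
            }
        }
    }
}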

The problem is definitely not always there; we also have long stretches
of time (a few hours) where we have no problems. I'm just extremely
hesitant to retry when I get a null document because in some cases a
null document is a valid outcome. Our caching layer relies heavily on
this, for example. If I were to retry every null I'd pay a big
performance penalty.
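
If we had to retry, it would have to be limited to lookups that are
known to exist, something like this hypothetical sketch (the mustExist
flag and the backoff numbers are invented for illustration):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrDocument;

public class GuardedGet {
    // Retry only when the caller knows the document has to exist
    // (e.g. the config core); plain cache misses return null right away.
    static SolrDocument get(SolrClient client, String core, String id,
                            boolean mustExist) throws Exception {
        int attempts = mustExist ? 3 : 1;
        for (int i = 0; i < attempts; i++) {
            SolrDocument doc = client.getById(core, id);
            if (doc != null || !mustExist) {
                return doc;
            }
            Thread.sleep(100L * (i + 1)); // brief backoff before retrying
        }
        throw new IllegalStateException("required doc missing: " + id);
    }
}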

As for your last comment, part of our testing phase is also testing the
limits. Our framework has auto-scaling built in, so if we have a burst
of requests the system will automatically spin up more clients. We're
pushing 10% of our production traffic to that test server to see how it
handles it.

@Erick
Thanks a lot for testing. There has to be a variable I don't understand
for this scenario to happen. Let me try to reproduce it reliably on my
side, and when I do I'll send you instructions on how to reproduce it;
I don't want you to waste your time on this. There might be one thing,
though, that your test scenario does not account for: while we use the
/get API a lot, we use it sporadically, and we most likely create a new
connection to the API each time because the calls come from newly
spawned processes. I had no luck reproducing the problem using 50
threads while keeping a session open and hammering the /get API, so I
need to find a better way to reproduce this.
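
To get closer to those short-lived processes, my next attempt will
probably build a brand-new client for every single call instead of
reusing a session, along these lines (a sketch; the URLs, core name,
and id range are placeholders):

import java.util.concurrent.ThreadLocalRandom;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class ColdConnectionRepro {
    public static void main(String[] args) throws Exception {
        String[] urls = {"http://localhost:8983/solr",
                         "http://localhost:8984/solr"};
        for (long i = 0; ; i++) {
            // New client per call, like our freshly spawned processes,
            // instead of one long-lived session with pooled connections.
            String url = urls[ThreadLocalRandom.current().nextInt(urls.length)];
            try (HttpSolrClient client = new HttpSolrClient.Builder(url).build()) {
                String id = Long.toString(
                        ThreadLocalRandom.current().nextLong(1_000_000));
                SolrDocument doc = client.getById("test_core", id);
                if (doc == null) {
                    System.out.println("null doc for id " + id
                            + " after " + i + " calls");
                    return;
                }
            }
        }
    }
}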

@both Erick and Shawn
I saw something really weird today: there was a mix-up in some of the
cores' data. Basically, one core had data in it that should belong to
another core. Here's my question about this: is it possible that two
requests to the /get API coming in at the same time would get confused
and either both get the same result or get their results swapped? That
could explain my {doc: null} problem, since our caching layer, which
looks up IDs in some cores, usually gets hit pretty hard.
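
When I get back to testing I'll also run something like the following
to check for that cross-wiring: many threads doing /get on random ids
and verifying that the id in each response matches the id that was
requested (a sketch; the core name, field name, and id range are
assumptions):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class CrossWireCheck {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int t = 0; t < 16; t++) {
            pool.submit(() -> {
                try (HttpSolrClient client = new HttpSolrClient.Builder(
                        "http://localhost:8983/solr").build()) {
                    for (int i = 0; i < 1_000_000; i++) {
                        String id = Integer.toString(
                                ThreadLocalRandom.current().nextInt(1_000_000));
                        SolrDocument doc = client.getById("test_core", id);
                        // If the id in the response ever differs from the id
                        // requested, two concurrent requests got cross-wired.
                        if (doc != null
                                && !id.equals(String.valueOf(doc.getFieldValue("id")))) {
                            System.out.println("asked " + id + " got "
                                    + doc.getFieldValue("id"));
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}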

Anyway, give me time to do more testing on Monday/Tuesday to try to
pinpoint the issues and make them easily reproducible.

Thanks again for helping,
Steve




On Fri, Sep 28, 2018 at 4:19 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> Well, I flipped indexing on and after another 7 million queries, no fails.
> No reason to stop just yet, but not encouraging so far...
>
>
> On Fri, Sep 28, 2018, 10:58 Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > I've set up a test program on a local machine, we'll see if I can
> reproduce
> > here's the setup:
> >
> > 1> created a 2-shard, leader(primary) only collection
> > 2> added 1M simple docs to it (ids 0-999,999) and some text
> > 3> re-added 100_000 docs with a random id between 0 - 999,999 (inclusive)
> >      to ensure there were deleted docs. Don't have any clue whether
> > that matters.
> > 4> fired up a 16 thread query program doing RTG on random doc IDs
> >      The program will stop when either it gets a null response or the
> > response
> >      isn't the doc asked for.
> > 5> running 7.3.1
> > 6> I'm using the SolrJ RTG code 'cause it was easy
> > 7> All this is running locally on a Mac Pro, no network involved which is
> >      another variable I suppose
> > 8> 7M queries later no issues
> > 9> there's no indexing going on at all
> >
> > Steve and Chris:
> >
> > What about this test setup do you imagine doesn't reflect what your
> > setup is doing? Things I can think of in order of things to test:
> >
> > > mimic how y'all are calling RTG more faithfully
> > > index to this collection, perhaps not at a high rate
> > > create another collection and actively index to it
> > > separate the machines running solr from the one
> >      doing any querying or indexing
> > > ???
> >
> > And, of course if it reproduces then run it to death on 7.5 to see if
> > it's still a problem
> >
> > Best,
> > Erick
> > On Fri, Sep 28, 2018 at 10:21 AM Shawn Heisey <apa...@elyograg.org>
> wrote:
> > >
> > > On 9/28/2018 6:09 AM, sgaron cse wrote:
> > > > because this is a test deployment replica is set to 1 so as far as I
> > > > understand, data will not be replicated for this core. Basically we
> > have
> > > > two SOLR instances running on the same box. One on port 8983, the
> > other on
> > > > port 8984. We have 9 cores on this SOLR cloud deployment, 5 of which
> > on the
> > > > instance on port 8983 and the other 4 on port 8984.
> > >
> > > A question that isn't really related to the problem you're
> investigating
> > > now:  Why are you running two Solr instances on the same machine?  9
> > > cores is definitely not too many for one Solr instance.
> > >
> > > > As far as I can tell
> > > > all cores suffer from the occasional null document. But the one that
> I
> > can
> > > > easily see error from is a config core where we store configuration
> > data
> > > > for our system. Since the configuration data should always be there
> we
> > > > throw exceptions as soon as we get a null document which is why I
> > noticed
> > > > the problem.
> > >
> > > When you say "null document" do you mean that you get no results, or
> > > that you get a result with a document, but that document has nothing in
> > > it?  Are there any errors returned or logged by Solr when this happens?
> > >
> > > > Our client code that connects to the APIs randomly chooses between
> all
> > the
> > > > different ports because it does not know which instance it should
> ask.
> > So
> > > > no, we did not try sending directly to the instance that has the data
> > but
> > > > since there is no replica there is no way that this should get out of
> > sync.
> > >
> > > I was suggesting this as a troubleshooting step, not a change to how
> you
> > > use Solr.  Basically trying to determine what happens if you send a
> > > request directly to the instance and core that contains the document
> > > with distrib=false, to see if it behaves differently than when it's a
> > > more generic collection-directed query.  The idea was to try and narrow
> > > down exactly where to look for a problem.
> > >
> > > If you wait a few seconds, does the problem go away?  When using real
> > > time get, a new document must be written to a segment and a new
> realtime
> > > searcher must be created before you can get that document.  These
> things
> > > typically happen very quickly, but it's not instantaneous.
> > >
> > > > To add up to what Chris was saying, although the core that is seeing
> > the
> > > > issue is not hit very hard, other core in the setup will be. We are
> > > > building a clustering environment that has auto-scaling so if we are
> > under
> > > > heavy load, we can easily have 200-300 client hitting the SOLR
> instance
> > > > simultaneously.
> > >
> > > That much traffic is going to need multiple replicas on separate
> > > hardware, with something in place to do load balancing. Unless your
> code
> > > is Java and you can use CloudSolrClient, I would recommend an external
> > > load balancer.
> > >
> > > Thanks,
> > > Shawn
> > >
> >
>
