Thanks. I'll be away for the rest of the week, so won't be able to try
anything more....
On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <culicny@iq.media> wrote:
>
> In our case, we are heavily indexing in the collection while the /get
> requests are happening, which is what we assumed was causing this very
> rare behavior. However, we have experienced the problem in a collection
> where the following happens in sequence, with minutes between each step.
>
> 1. Document id=1 is indexed
> 2. Document successfully retrieved with /get?id=1
> 3. Document failed to be retrieved with /get?id=1
> 4. Document successfully retrieved with /get?id=1
>
> We haven't looked at the issue in a while, so I don't have the exact
> timing of that sequence on hand right now. I'll try to find an actual
> example, although I'm relatively certain it was multiple minutes between
> each of those requests. However, our autocommit (and soft commit) times
> are 60s for both collections.
>
> I think the following two are probably the biggest differences for our
> setup, besides the version difference (v6.3.0):
>
> > index to this collection, perhaps not at a high rate
> > separate the machines running solr from the one doing any querying or
> > indexing
>
> The clients are on 3 hosts separate from the solr instances. The total
> number of threads making updates and /get requests is around 120-150,
> about 40-50 per host. Each of our two collections gets an average of
> 500 requests per second constantly for ~5 minutes, and then the number
> slowly tapers off to essentially 0 after ~15 minutes.
>
> Every thread attempts to make the same series of requests.
>
> -- Update with "_version_=-1". If successful, no other requests are made.
> -- On 409 Conflict failure, it makes a /get request for the id
> -- On doc:null failure, the client handles the error and moves on
>
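> To make that pattern concrete, here is a minimal SolrJ sketch of the
> per-thread sequence. The client variable, collection name, and logging
> are illustrative only (not our actual code), and depending on the
> SolrClient implementation the 409 may surface as a different exception
> type, so treat it as a sketch rather than drop-in code.
>
>     // Needs: org.apache.solr.client.solrj.SolrClient,
>     // org.apache.solr.common.SolrDocument, SolrException, SolrInputDocument
>     void insertIfAbsent(SolrClient client, String coll, String id) throws Exception {
>       SolrInputDocument doc = new SolrInputDocument();
>       doc.addField("id", id);
>       doc.addField("_version_", -1L);  // optimistic concurrency: add only if id is new
>       try {
>         client.add(coll, doc);         // success: no other requests are made
>       } catch (SolrException e) {
>         if (e.code() != 409) {         // only handle the version-conflict case
>           throw e;
>         }
>         SolrDocument existing = client.getById(coll, id);  // real-time /get
>         if (existing == null) {
>           // the rare failure in question: add says the doc exists (409),
>           // but /get returns doc:null anyway
>           System.err.println("409 on add but /get returned null for id=" + id);
>         }
>       }
>     }
>
> The point of the sketch is the ordering: the /get in the conflict branch
> should never legitimately return null, which is why the doc:null
> responses stand out.
>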
> Combining this with the previous series of /get requests, we end up with
> situations where an update fails as expected, but the subsequent /get
> request fails to retrieve the existing document:
>
> 1. Thread 1 updates id=1 successfully
> 2. Thread 2 tries to update id=1, fails (409)
> 3. Thread 2 tries to get id=1, succeeds.
>
> ...Minutes later...
>
> 4. Thread 3 tries to update id=1, fails (409)
> 5. Thread 3 tries to get id=1, fails (doc:null)
>
> ...Minutes later...
>
> 6. Thread 4 tries to update id=1, fails (409)
> 7. Thread 4 tries to get id=1, succeeds.
>
> As Steven mentioned, it happens very, very rarely. We tried to recreate it
> in a more controlled environment, but ran into the same problem you are
> seeing, Erick: every simplified setup we ran produced no problems. Since
> it's not a large issue for us and happens very rarely, we stopped trying
> to recreate it.
>
>
> On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > 57 million queries later, with constant indexing going on, 9 dummy
> > collections in the mix, and the main collection I'm querying having 2
> > shards with 2 replicas each, I have no errors.
> >
> > So unless this code doesn't exercise a path similar to what you're
> > doing, I'm not sure what more I can test. "It works on my machine" ;)
> >
> > Here's my querying code, does it look like what you're seeing?
> >
> >   // Needs: org.apache.solr.client.solrj.SolrClient,
> >   // org.apache.solr.client.solrj.impl.HttpSolrClient, org.apache.solr.common.SolrDocument
> >       while (Main.allStop.get() == false) {
> >         // deliberately open a fresh connection for every query
> >         try (SolrClient client = new HttpSolrClient.Builder()
> >             // ("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")
> >             .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
> >
> >           // pick a random id from the indexed range
> >           String lower = Integer.toString(rand.nextInt(1_000_000));
> >           SolrDocument rsp = client.getById(lower);
> >           if (rsp == null) {
> >             System.out.println("Got a null response!");
> >             Main.allStop.set(true);
> >             continue;                       // don't dereference a null doc below
> >           }
> >
> >           // fetch the same id a second time and verify the returned id matches
> >           rsp = client.getById(lower);
> >           if (rsp == null || rsp.get("id").equals(lower) == false) {
> >             System.out.println("Got an invalid response, looking for " + lower
> >                 + " got: " + (rsp == null ? null : rsp.get("id")));
> >             Main.allStop.set(true);
> >           }
> >
> >           long queries = Main.eoeCounter.incrementAndGet();
> >           if ((queries % 100_000) == 0) {
> >             long seconds = (System.currentTimeMillis() - Main.start) / 1000;
> >             System.out.println("Query count: " + numFormatter.format(queries)
> >                 + ", rate is " + numFormatter.format(queries / seconds) + " QPS");
> >           }
> >         } catch (Exception cle) {
> >           cle.printStackTrace();
> >           Main.allStop.set(true);
> >         }
> >       }
> >   }
> >
> > On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
> > <erickerick...@gmail.com> wrote:
> > >
> > > Steve:
> > >
> > > bq. Basically, one core had data in it that should belong to another
> > > core. Here's my question about this: is it possible that two requests
> > > to the /get API coming in at the same time would get confused and
> > > either both get the same result, or the results get inverted?
> > >
> > > Well, that shouldn't be happening; these are all supposed to be
> > > thread-safe calls.... All things are possible of course ;)
> > >
> > > If two replicas of the same shard have different documents, that could
> > > account for what you're seeing, meanwhile begging the question of why
> > > that is the case, since it should never be true for a quiescent index.
> > > Technically there _are_ conditions where this is true on a very
> > > temporary basis: commits on the leader and follower can trigger at
> > > different wall-clock times. Say your soft commit (or
> > > hard-commit-with-openSearcher-true) interval is 10 seconds. It should
> > > never be the case that s1r1 and s1r2 are still out of sync 10 seconds
> > > after the last update was sent. This doesn't seem likely from what
> > > you've described, though...
> > >
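> > > For what it's worth, one way to check that theory is to query each
> > > replica's core directly with distrib=false and compare counts. This is
> > > just a rough sketch; the host names, ports, and core names below are
> > > made up, not taken from anyone's setup.
> > >
> > >   // Needs: org.apache.solr.client.solrj.SolrClient, SolrQuery,
> > >   // org.apache.solr.client.solrj.impl.HttpSolrClient
> > >   void compareReplicas() throws Exception {
> > >     String[] cores = {
> > >         "http://host1:8981/solr/eoe_shard1_replica_n1",
> > >         "http://host2:8981/solr/eoe_shard1_replica_n2"};
> > >     for (String coreUrl : cores) {
> > >       try (SolrClient c = new HttpSolrClient.Builder()
> > >           .withBaseSolrUrl(coreUrl).build()) {
> > >         SolrQuery q = new SolrQuery("*:*");
> > >         q.set("distrib", "false");   // ask only this core, no fan-out
> > >         long found = c.query(q).getResults().getNumFound();
> > >         System.out.println(coreUrl + " numFound=" + found);
> > >       }
> > >     }
> > >   }
> > >
> > > A persistent mismatch well after the last commit would confirm the
> > > replicas have diverged; matching counts don't prove they're identical,
> > > but they make divergence less likely.
> > >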
> > > Hmmmm. I guess one other thing I can set up is to have a bunch of
> > > dummy collections lying around. Currently I have only the active one,
> > > and if there's some code path whereby the RTG request goes to a
> > > replica of a different collection, my test setup wouldn't reproduce it.
> > >
> > > Currently, I'm running a 2-shard, 1-replica setup, so if there's some
> > > way the replicas get out of sync, that wouldn't show up either.
> > >
> > > So I'm starting another run with these changes:
> > > > opening a new connection each query
> > > > switched so the collection I'm querying is 2x2
> > > > added some dummy collections that are empty
> > >
> > > One nit: while "core" is technically correct, when we talk about a
> > > core that's part of a collection we try to use "replica", to be clear
> > > we're talking about a core with some added characteristics, i.e. we're
> > > in SolrCloud-land. No big deal of course....
> > >
> > > Best,
> > > Erick
> > > On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <apa...@elyograg.org> wrote:
> > > >
> > > > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > > > @Shawn
> > > > > We're running two instances on one machine for two reasons:
> > > > > 1. The box has plenty of resources (48 cores / 256GB RAM), and
> > > > > since I was reading that it's not recommended to use more than
> > > > > 31GB of heap in SOLR, we figured 96GB for keeping index data in
> > > > > the OS cache + 31GB of heap per instance was a good idea.
> > > >
> > > > Do you know that these Solr instances actually DO need 31 GB of
> > > > heap, or are you following advice from somewhere saying "use one
> > > > quarter of your memory as the heap size"?  That advice is not in the
> > > > Solr documentation, and never will be.  Figuring out the right heap
> > > > size requires experimentation.
> > > >
> > > >
> > > > https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> > > >
> > > > How big (on disk) are each of these nine cores, and how many documents
> > > > are in each one?  Which of them is in each Solr instance?  With that
> > > > information, we can make a *guess* about how big your heap should be.
> > > > Figuring out whether the guess is correct generally requires careful
> > > > analysis of a GC log.
> > > >
> > > > > 2. We're in the testing phase, so we wanted a SOLR cloud
> > > > > configuration; we will most likely have a much bigger deployment
> > > > > once going to production. In prod right now, we run a six-machine
> > > > > Riak cluster. Riak is a key/value document store and has SOLR
> > > > > built-in for search, but we are trying to push the key/value aspect
> > > > > of Riak inside SOLR. That way we would have one less piece to worry
> > > > > about in our system.
> > > >
> > > > Solr is not a database.  It is not intended to be a data repository.
> > > > All of its optimizations (most of which are actually in Lucene) are
> > > > geared towards search.  While technically it can be a key-value store,
> > > > that is not what it was MADE for.  Software actually designed for that
> > > > role is going to be much better than Solr as a key-value store.
> > > >
> > > > > When I say null document, I mean the /get API returns: {doc: null}
> > > > >
> > > > > The problem is definitely not always there. We also have large
> > > > > periods of time (a few hours) where we have no problems. I'm just
> > > > > extremely hesitant to retry when I get a null document because, in
> > > > > some cases, getting a null document is a valid outcome. Our caching
> > > > > layer heavily relies on this, for example. If I were to retry every
> > > > > null, I'd pay a big penalty in performance.
> > > >
> > > > I've just done a little test with the 7.5.0 techproducts example.  It
> > > > looks like returning doc:null actually is how the RTG handler says it
> > > > didn't find the document.  This seems very wrong to me, but I didn't
> > > > design it, and that response needs SOME kind of format.
> > > >
> > > > Have you done any testing to see whether the standard searching handler
> > > > (typically /select, but many other URL paths are possible) returns
> > > > results when RTG doesn't?  Do you know for these failures whether the
> > > > document has been committed or not?
> > > >
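> > > > If it helps, a rough SolrJ sketch of that comparison might look like
> > > > the following. The client and id variables are assumptions, not
> > > > something from this thread, and a real id may need escaping before
> > > > being dropped into a query string.
> > > >
> > > >   // Needs: org.apache.solr.client.solrj.SolrClient, SolrQuery,
> > > >   // org.apache.solr.client.solrj.response.QueryResponse, org.apache.solr.common.SolrDocument
> > > >   void compareHandlers(SolrClient client, String id) throws Exception {
> > > >     SolrDocument rtg = client.getById(id);                       // real-time /get
> > > >     QueryResponse sel = client.query(new SolrQuery("id:" + id)); // standard /select
> > > >     System.out.println("/get found:    " + (rtg != null));
> > > >     System.out.println("/select found: " + (sel.getResults().getNumFound() > 0));
> > > >   }
> > > >
> > > > If /select can see the document while /get returns doc:null, the
> > > > document is committed and visible to searchers, which would point at
> > > > the RTG code path rather than at missing or uncommitted data.
> > > >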
> > > > > As for your last comment, part of our testing phase is also testing
> > > > > the limits. Our framework has auto-scaling built in, so if we have
> > > > > a burst of requests, the system will automatically spin up more
> > > > > clients. We're pushing 10% of our production load to that test
> > > > > server to see how it will handle it.
> > > >
> > > > To spin up another replica, Solr must copy all its index data from the
> > > > leader replica.  Not only can this take a long time if the index is
> > > > big, but it will put a lot of extra I/O load on the machine(s) with
> > > > the leader roles.  So performance will actually be WORSE before it gets
> > > > better when you spin up another replica, and if the index is big, that
> > > > condition will persist for quite a while.  Copying the index data will
> > > > be constrained by the speed of your network and by the speed of your
> > > > disks.  Often the disks are slower than the network, but that is not
> > > > always the case.
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
> >
