[ https://issues.apache.org/jira/browse/HBASE-29460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Roudnitsky updated HBASE-29460:
--------------------------------------
    Summary: Inconsistent query results with timerange filter (was: Inconsistent query behavior with timerange filter when there are multiple column versions)

> Inconsistent query results with timerange filter
> ------------------------------------------------
>
>                 Key: HBASE-29460
>                 URL: https://issues.apache.org/jira/browse/HBASE-29460
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0-beta-1, 2.5.12
>            Reporter: Daniel Roudnitsky
>            Assignee: Daniel Roudnitsky
>            Priority: Critical
>
> At my company a team reported that a query with a timerange filter which was previously returning a non-empty result began returning an empty result, with no deletions or major compactions having occurred between the time the query returned data and when it stopped returning data. Upon investigating we found that the behavior of GET/SCAN with a timerange filter is inconsistent when there are multiple versions of the same column lying around.
> The server accumulates excess versions until flush/major compaction, so by design there will be long periods of time where we have cells that physically exist but have logically versioned out and should not be visible/queryable by the user. The issue looks to boil down to the store scanner being able to return cells that have logically versioned out when:
> # A timerange filter is specified AND
> # The number of cells that fall in the specified timerange which have not logically versioned out is less than both the number of VERSIONS configured on the column family and the number of versions specified by the query
> Take the example of a user updating the same column over time with new versions and occasionally running queries to get the past version of the column that existed at a specific point in time.
> This user will very organically run into this scenario where a cell falling in the timerange of interest physically exists but has logically versioned out. Whether this user’s timerange query returns the matching but logically versioned out cell, and how long it continues to do so, varies depending on:
> * How many younger versions exist in the specified timerange (either in memstore or hfile)
> * How the cell got flushed - if the cell was flushed in the same batch as younger versions of the same column, the query may return data before the flush and stop returning data after the flush
> * If the cell survived the flush process, then the query may continue to return data until major compaction, after which it has physically versioned out and the query stops returning data
> More concretely, take the base case with default VERSIONS=>1 where we do two PUTS to the same column with PUT2 timestamp > PUT1 timestamp, and the two cells are flushed independently to different hfiles. We observe a few interesting things (HBase shell code in Jira comment):
> # A query with a timerange filter including only PUT1 timestamp returns PUT1 if executed before major compaction - we return a cell that has logically versioned out
> # A query to get all versions, without any timerange, only returns PUT2 - we respect logical versioning here and do not return the PUT1 cell
> # A query to get all versions, with a timerange filter which includes both PUT1 and PUT2 timestamps, only returns PUT2 - we respect logical versioning here
> # A query to get all versions, with a narrower timerange that includes only PUT1 timestamp, returns PUT1.
> This is odd behavior from the user’s perspective. This query is identical to query 3 but with a timerange that is a subinterval of the one in query 3. One would reasonably expect the result of the subinterval query to be a subset of the results when querying on the larger interval, but the results are completely disjoint in this case. To give a SQL example, one would not expect a SELECT * WHERE TIME < 10 to return anything that would not appear in SELECT * WHERE TIME < 20, which is what happens in our case.
> # After we major compact, PUT1 has physically versioned out and query 1 will stop returning a result
> For the default VERSIONS=>1 case these version visibility semantics are especially strange. A user with VERSIONS=>1 may very reasonably expect that only the latest version of a column can ever be returned by a query, regardless of filter, but the reality is that the same query with a different timerange filter can return an arbitrary number of different versions of the same column (up until major compaction). For a user with VERSIONS=>1 who does rely on the existing semantics, there is still the oddity that they cannot query for all versions of a column that exist: since we return at most 1 version for a given query, they can only slide the timerange around to get at most one version falling in the timerange (queries 3/4 in the example above).
> We have additional query nondeterminism when we have multiple versions in memstore. We keep all (recent) versions in memstore until flushing, and one can have a timerange query return logically versioned out cells while they are in memstore.
> At flush time we will flush at most VERSIONS number of cells - we do some “opportunistic” version pruning if we had more versions in memstore than needed - but this means that before the flush one can have a timerange query which returns data, and after the flush the same query no longer returns data, and the behavior is dependent on the number of versions that were in memstore at the time of flush.
> With NEW_VERSION_BEHAVIOR enabled (HBASE-15968) the query behavior when versions are in memstore changes - a timerange query where all versions are in memstore won't return logically versioned out cells, but if the versioned out cell was written out to an hfile then it is queryable. I have not tested NEW_VERSION_BEHAVIOR thoroughly. From my initial testing it changes some of the query behavior in question but does not resolve the issues described here.
> I am of the opinion that we should not return logically versioned out cells by default regardless of filter, so that query behavior is consistent/predictable and users can reason about how things will behave without deep diving into HBase internals and understanding the corner cases involved here. Timerange queries look to have behaved this way for a long time (HBASE-10102), so this would be an incompatible change to version visibility semantics. If we want to continue to support querying data that has been logically versioned out, we could have a new API/flag that allows one to do so if explicitly enabled, very similar to the raw scan option which allows one to read tombstoned data that is still hanging around.
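
The base case above (VERSIONS=>1, two PUTS flushed to separate hfiles) can be reproduced with a small toy model. This is a sketch for illustration only, not HBase code: `Cell`, `query`, `major_compact` and the `hide_versioned_out` flag are hypothetical names, and the model assumes the read path skips cells outside the timerange before counting versions, which is what the five observations suggest. The timerange is treated as half-open, [min, max).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    ts: int
    value: str


def query(cells, cf_versions, max_versions=1, timerange=None,
          hide_versioned_out=False):
    """Toy read path for one column (hypothetical model, not an HBase API).

    cells: every cell that physically exists (memstore plus hfiles).
    timerange: half-open (min_ts, max_ts) tuple, or None for no restriction.
    hide_versioned_out: False models the reported behavior; True models the
    proposed behavior where versioned-out cells are never visible.
    """
    newest_first = sorted(cells, key=lambda c: c.ts, reverse=True)
    if hide_versioned_out:
        # Drop logically versioned-out cells before looking at the timerange.
        newest_first = newest_first[:cf_versions]
    limit = min(cf_versions, max_versions)
    result = []
    for cell in newest_first:
        if timerange is not None and not (timerange[0] <= cell.ts < timerange[1]):
            continue  # outside the timerange: skipped, not counted as a version
        result.append(cell)
        if len(result) == limit:
            break
    return result


def major_compact(cells, cf_versions):
    """Physically drop everything beyond the newest cf_versions cells."""
    return sorted(cells, key=lambda c: c.ts, reverse=True)[:cf_versions]


put1, put2 = Cell(10, "v1"), Cell(20, "v2")
cells = [put1, put2]   # both physically present, VERSIONS => 1
ALL = 10**9            # stand-in for "all versions"

q1 = query(cells, 1, timerange=(10, 11))                     # [put1]
q2 = query(cells, 1, max_versions=ALL)                       # [put2]
q3 = query(cells, 1, max_versions=ALL, timerange=(10, 21))   # [put2]
q4 = query(cells, 1, max_versions=ALL, timerange=(10, 11))   # [put1]
q5 = query(major_compact(cells, 1), 1, timerange=(10, 11))   # []
proposed = query(cells, 1, timerange=(10, 11),
                 hide_versioned_out=True)                    # []
```

In this model q4 is not a subset of q3 even though its timerange is a subinterval, and q1 changes from PUT1 to empty purely because of compaction. With `hide_versioned_out=True` the result no longer depends on whether PUT1 still physically exists.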
> Where we need to preserve the existing version visibility semantics for compatibility reasons, I am of the opinion that those semantics should behave more predictably - I propose that we do not do version pruning at flush time, so that the timing of PUTS/flushes cannot change the query result, and that we update the docs to make it clear that major compaction can change the timerange query result.

--
This message was sent by Atlassian Jira (v8.20.10#820010)