[jira] [Updated] (HBASE-29460) Inconsistent query behavior with timerange filter when there are multiple column versions

Daniel Roudnitsky (Jira) Sun, 20 Jul 2025 13:34:08 -0700


     [ https://issues.apache.org/jira/browse/HBASE-29460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Daniel Roudnitsky updated HBASE-29460:
--------------------------------------
    Description: 
A team at $dayjob reported that a query with a timerange filter which was 
previously returning a non-empty result began returning an empty result, with 
no deletions or major compactions having occurred between the time the query 
returned data and when it stopped returning data. Upon investigating we found 
that the behavior of GET/SCAN with a timerange filter when there are multiple 
versions of the same column lying around is inconsistent.

The server accumulates excess versions until flush/major compaction, so by 
design there will be long periods of time where we have cells that physically 
exist but have logically versioned out and should not be visible/queryable by 
the user (at least that seems to have been the intention?). The issue looks to 
boil down to the store scanner being able to return cells that have logically 
versioned out when:
 # A timerange filter is specified AND
 # The number of cells that fall in the specified timerange which have not 
logically versioned out is less than both the number of VERSIONS configured on 
the column family and the number of versions specified by the query

Take the example of a user updating the same column over time with new versions 
and occasionally running queries to get the past version of the column that 
existed at a specific point in time. This user will very organically run into 
this scenario where a cell falling in the timerange of interest physically 
exists but has logically versioned out. Whether this user’s timerange query 
returns the matching but logically versioned out cell and how long it continues 
to do so varies depending on
 * How many younger versions exist in the specified timerange (either in 
memstore or hfile)
 * How the cell got flushed - if the cell was flushed in the same batch as 
younger versions of the same column the query may return data before the flush 
and stop returning data after the flush
 * If the cell survived the flush process, then the query may continue to 
return data until major compaction, after which it is physically versioned out 
and the query stops returning data

More concretely, take the base case with default VERSIONS=>1 where we do two 
PUTS to the same column with PUT2 timestamp > PUT1 timestamp, and the two cells 
are flushed independently to different hfiles. We observe a few interesting 
things (hbase shell code in jira comment):
 # A query with a timerange filter including only PUT1 timestamp returns PUT1 
if executed before major compaction - we return a cell that has logically 
versioned out
 # A query to get all versions, without any timerange, only returns PUT2 - we 
respect logical versioning here and do not return the PUT1 cell
 # A query to get all versions, with a timerange filter which includes both 
PUT1 and PUT2 timestamps, only returns PUT2 - we respect logical versioning here
 # A query to get all versions, with a narrower timerange that includes only 
PUT1 timestamp, returns PUT1. This is odd behavior from the user's perspective: 
this query is identical to query 3 except that its timerange is a subinterval 
of the one in query 3. One would reasonably expect the result of the 
subinterval query to be a subset of the result for the larger interval, but 
the results are completely disjoint in this case. To give a SQL example, one 
would not expect SELECT * WHERE TIME < 10 to return anything that would not 
appear in SELECT * WHERE TIME < 20, which is what happens in our case
 # After we major compact, PUT1 has physically versioned out and query 1 will 
stop returning a result
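
The five observations above can be reproduced with a small toy model. This is 
only a sketch in Python, not HBase code: it assumes the read path drops cells 
outside the timerange first and only then counts versions, which is consistent 
with the behavior described but is an assumption about the internals. The names 
scan, major_compact and ALL_VERSIONS are hypothetical, and the timerange is 
treated as half-open [min, max).

```python
# Toy model of one row/column. A cell is a (timestamp, value) tuple.
ALL_VERSIONS = 2**31 - 1  # hypothetical stand-in for "read all versions"


def scan(store_files, cf_versions, time_range=None, query_versions=1):
    """Merge all files, apply the timerange, THEN count versions (assumed order)."""
    cells = sorted((c for f in store_files for c in f), reverse=True)
    if time_range is not None:
        lo, hi = time_range  # half-open interval [lo, hi)
        cells = [c for c in cells if lo <= c[0] < hi]
    # The query can never see more versions than the column family allows.
    return cells[:min(cf_versions, query_versions)]


def major_compact(store_files, cf_versions):
    """Rewrite everything into one file, keeping only the newest cf_versions cells."""
    merged = sorted((c for f in store_files for c in f), reverse=True)
    return [merged[:cf_versions]]


PUT1, PUT2 = (10, "v1"), (20, "v2")
hfiles = [[PUT1], [PUT2]]  # the two cells were flushed independently

q1 = scan(hfiles, 1, time_range=(10, 11))
q2 = scan(hfiles, 1, query_versions=ALL_VERSIONS)
q3 = scan(hfiles, 1, time_range=(10, 21), query_versions=ALL_VERSIONS)
q4 = scan(hfiles, 1, time_range=(10, 11), query_versions=ALL_VERSIONS)
q5 = scan(major_compact(hfiles, 1), 1, time_range=(10, 11))

print(q1, q2, q3, q4, q5)
```

In this model q4 is not a subset of q3 even though its timerange is a 
subinterval, and q1 and q5 differ although nothing was deleted in between.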

We have additional query indeterminism when we have multiple versions in 
memstore. We keep all (recent) versions in memstore until flushing, and one can 
have a timerange query return logically versioned out cells while they are in 
memstore. At flush time we will flush at most VERSIONS number of cells - we do 
some “opportunistic” version pruning if we had more versions in memstore than 
needed - but this means that before the flush one can have a timerange query 
which returns data, and after the flush the same query no longer returns data, 
and the behavior is dependent on the number of versions that were in memstore 
at the time of flush.
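
To illustrate the flush dependence, here is a minimal Python sketch (again a 
toy model, not HBase code). It takes the pruning rule from the paragraph above, 
that a flush writes at most VERSIONS cells per column, and assumes the 
timerange is applied before versions are counted. The names flush and 
timerange_read are hypothetical.

```python
# Toy model of one row/column with VERSIONS => 1.
# A cell is a (timestamp, value) tuple.


def flush(memstore, cf_versions):
    """Write the memstore to a new file, keeping at most cf_versions newest cells."""
    return sorted(memstore, reverse=True)[:cf_versions]


def timerange_read(sources, cf_versions, lo, hi):
    """Filter on the half-open timerange [lo, hi) first, then count versions."""
    cells = sorted((c for s in sources for c in s if lo <= c[0] < hi), reverse=True)
    return cells[:cf_versions]


PUT1, PUT2 = (10, "v1"), (20, "v2")

# Case A: both versions sit in memstore and are flushed in the same batch.
memstore = [PUT1, PUT2]
before_flush = timerange_read([memstore], 1, 10, 11)
after_flush = timerange_read([flush(memstore, 1)], 1, 10, 11)

# Case B: each version is flushed on its own, so nothing is pruned at flush time.
hfile1 = flush([PUT1], 1)
hfile2 = flush([PUT2], 1)
separate_flushes = timerange_read([hfile1, hfile2], 1, 10, 11)

print(before_flush, after_flush, separate_flushes)
```

The same two PUTS and the same query give three different outcomes depending 
only on when the flushes happened.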

With NEW_VERSION_BEHAVIOR enabled (HBASE-15968) the query behavior when 
versions are in memstore changes - a timerange query where all versions are in 
memstore won't return logically versioned out cells, but if the versioned out 
cell was written out to an hfile then it is queryable. I have not tested 
NEW_VERSION_BEHAVIOR thoroughly; from my initial testing it does not resolve 
the issues here, although it does change some of the query behavior in 
question.

I am of the (possibly naive) opinion that we should not return logically 
versioned out cells by default so that query behavior is consistent/predictable 
and users can reason about how things will behave without deep diving HBase 
internals and understanding the corner cases involved here. I am not sure how 
long timerange queries have behaved this way, probably a long time. If we 
really want to preserve this behavior then I think at the very least it should 
behave predictably - timing of PUTS/flushes should not change query result and 
we should be clear in the docs that major compaction can change query result 
(even if you do not do any deletes).
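
One possible reading of "do not return logically versioned out cells" is to 
decide visibility by version count first and apply the timerange afterwards. 
The Python sketch below is a toy model of that ordering only, not a proposed 
patch; visible_cells and read_versions_first are hypothetical names.

```python
# Toy model of one row/column with VERSIONS => 1.
# A cell is a (timestamp, value) tuple.


def visible_cells(sources, cf_versions):
    """Newest cf_versions cells across all sources; the rest are logically gone."""
    merged = sorted((c for s in sources for c in s), reverse=True)
    return merged[:cf_versions]


def read_versions_first(sources, cf_versions, lo, hi):
    """Count versions first, then apply the half-open timerange [lo, hi)."""
    return [c for c in visible_cells(sources, cf_versions) if lo <= c[0] < hi]


PUT1, PUT2 = (10, "v1"), (20, "v2")

# Three physical layouts of the same two PUTS.
layouts = {
    "both in memstore": [[PUT1, PUT2]],
    "flushed separately": [[PUT1], [PUT2]],
    "after major compaction": [[PUT2]],
}

narrow = {name: read_versions_first(src, 1, 10, 11) for name, src in layouts.items()}
wide = {name: read_versions_first(src, 1, 10, 21) for name, src in layouts.items()}

print(narrow)
print(wide)
```

Under this ordering the result no longer depends on flush or compaction 
timing, and the narrow timerange result is always a subset of the wide one.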



> Inconsistent query behavior with timerange filter when there are multiple 
> column versions
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-29460
>                 URL: https://issues.apache.org/jira/browse/HBASE-29460
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0-beta-1, 2.5.12
>            Reporter: Daniel Roudnitsky
>            Assignee: Daniel Roudnitsky
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
