[jira] [Commented] (HBASE-29502) RegionReplicaReplicationEndpoint fails to forward mutations when meta cache does not contain secondary replica locations

Hudson (Jira) Sun, 31 Aug 2025 00:36:29 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-29502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017229#comment-18017229
 ]


Hudson commented on HBASE-29502:
--------------------------------

Results for branch branch-2.6
        [build #355 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/355/]:
 (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/355/General_20Nightly_20Build_20Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/355/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/355/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/355/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop3 checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/355/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop 3.3.5 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/355/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop 3.3.6 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/355/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop 3.4.0 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/355/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test for HBase 2 {color}
(/) {color:green}+1 client integration test for 3.3.5 {color}
(/) {color:green}+1 client integration test for 3.3.6 {color}
(/) {color:green}+1 client integration test for 3.4.0 {color}
(/) {color:green}+1 client integration test for 3.4.1 {color}


> RegionReplicaReplicationEndpoint fails to forward mutations when meta cache 
> does not contain secondary replica locations
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29502
>                 URL: https://issues.apache.org/jira/browse/HBASE-29502
>             Project: HBase
>          Issue Type: Bug
>          Components: read replicas
>    Affects Versions: 2.7.0, 2.6.3
>            Reporter: Charles Connell
>            Assignee: Charles Connell
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 2.6.4
>
>
> (this only affects 2.x versions)
> When region replicas are enabled in "asynchronous WAL replication" mode, each 
> RegionServer uses a {{RegionReplicaReplicationEndpoint}} object to tail its 
> own WAL. Each mutation in its WAL may be related to a region which has its 
> primary replica on this RegionServer, and has one or more secondary replicas 
> on other servers. So, for each mutation in the WAL, 
> {{RegionReplicaReplicationEndpoint}} decides whether any other servers are 
> hosting replicas of the relevant region, and if so, sends an RPC to those 
> servers containing the mutations they should apply to their memstores.
> When region replicas are enabled, a {{RegionReplicaReplicationEndpoint}} 
> instance is created, with its own {{ConnectionImplementation}} and therefore 
> its own {{MetaCache}}. This {{RegionReplicaReplicationEndpoint}} immediately 
> starts attempting to send mutations to secondary replica regions, even though 
> they will not be open for a few more seconds or minutes. During this window, the 
> {{MetaCache}} is populated with entries saying that most regions are 
> hosted on only one server. These cached lookups remain in use indefinitely, 
> even though they are incorrect for most of their lifetime. Without knowing 
> where the secondary replica regions are hosted, or whether they exist at all, the 
> {{RegionReplicaReplicationEndpoint}} cannot forward mutations to them. As a 
> result, the secondary replica regions' memstores do not receive updates, so 
> their data is even more stale than it should be. Users reading from those 
> replicas would get unnecessarily stale results.
> {{RegionReplicaReplicationEndpoint}} actually contains cache-busting logic 
> seemingly designed to fix this exact problem:
> {code:java}
> // Replicas can take a while to come online. The cache may have only the primary. If we
> // keep going to the cache, we will not learn of the replicas and their locations after
> // they come online.
> if (useCache && locations.size() == 1 && TableName.isMetaTableName(tableName)) {
>   if (tableDescriptors.get(tableName).getRegionReplication() > 1) {
>     // Make an obnoxious log here. See how bad this issue is. Add a timer if happening
>     // too much.
>     LOG.info("Skipping location cache; only one location found for {}", tableName);
>     useCache = false;
>     continue;
>   }
> }
> {code}
> However, because of the {{TableName.isMetaTableName(tableName)}} clause, the 
> cache-busting only takes effect if the mutation being forwarded belongs to 
> the meta table. I don't know why that restriction would make sense.
> In this ticket I plan to just remove the "is meta table" clause to fix this 
> bug.
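
The effect of dropping the "is meta table" clause can be sketched in isolation. The block below is a minimal standalone model of the condition only, not the actual HBase patch: the class name {{CacheBypassSketch}}, both method names, and the plain boolean/int parameters are hypothetical stand-ins for {{useCache}}, {{locations.size()}}, {{TableName.isMetaTableName(tableName)}} and {{getRegionReplication()}} from the snippet quoted above.

```java
// Hypothetical standalone model of the cache-bypass decision; not HBase code.
class CacheBypassSketch {

    // Condition as quoted in the ticket: the cache is bypassed only when the
    // mutation belongs to the meta table.
    static boolean shouldBypassCacheCurrent(boolean useCache, int cachedLocations,
                                            boolean isMetaTable, int regionReplication) {
        return useCache && cachedLocations == 1 && isMetaTable && regionReplication > 1;
    }

    // Condition with the "is meta table" clause removed, as the ticket proposes:
    // any table configured with more than one replica triggers a fresh lookup
    // when the cache knows only a single location.
    static boolean shouldBypassCacheProposed(boolean useCache, int cachedLocations,
                                             int regionReplication) {
        return useCache && cachedLocations == 1 && regionReplication > 1;
    }

    public static void main(String[] args) {
        // A user table with 3 configured replicas whose cached entry lists only the primary.
        System.out.println("current:  " + shouldBypassCacheCurrent(true, 1, false, 3));
        System.out.println("proposed: " + shouldBypassCacheProposed(true, 1, 3));
    }
}
```

For the user-table case in {{main}}, the quoted condition keeps using the stale cache entry, while the condition without the meta-table clause forces a fresh location lookup. Tables with a single configured replica are unaffected in both variants.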



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
