taklwu commented on code in PR #7477:
URL: https://github.com/apache/hbase/pull/7477#discussion_r2566597313
##########
dev-support/spotbugs-exclude.xml:
##########
@@ -271,4 +271,12 @@
<Bug pattern="ML_SYNC_ON_UPDATED_FIELD"/>
</Match>
+ <Match>
+ <Class name="org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool"/>
+ <Or>
+ <Method name="getInstance"/>
+ </Or>
+ <Bug pattern="MS_EXPOSE_REP"/>
Review Comment:
[nit] can we add a comment explaining why this should be ignored?
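As a concrete (hypothetical) shape for that, the exclusion could carry an inline XML comment, e.g.:

```xml
<Match>
  <!-- getInstance() intentionally exposes the shared singleton pool:
       all BlockCacheKey instances and the BucketCache must share the
       same pool, so MS_EXPOSE_REP is a false positive here. -->
  <Class name="org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool"/>
  <Or>
    <Method name="getInstance"/>
  </Or>
  <Bug pattern="MS_EXPOSE_REP"/>
</Match>
```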
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/FilePathStringPool.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.hfile.bucket;
+
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Pool of string values encoded to integer IDs for use in BlockCacheKey. This allows for avoiding
+ * duplicating string values for file names, region and CF values on various BlockCacheKey
+ * instances. Normally, single hfiles have many blocks. This means all blocks from the same file
+ * will have the very same file, region and CF names. On very large BucketCache setups (i.e. file
+ * based cache with TB size order), can save few GBs of memory by avoiding repeating these common
+ * string values on blocks from the same file. The FilePathStringPool is implemented as a singleton,
+ * since the same pool should be shared by all BlockCacheKey instances, as well as the BucketCache
+ * object itself. The Id for an encoded string is an integer. Any new String added to the pool is
+ * assigned the next available integer ID, starting from 0 upwards. That sets the total pool
+ * capacity to Integer.MAX_VALUE. In the event of ID exhaustion (integer overflow when Id values
+ * reach Integer.MAX_VALUE), the encode() method will restart iterating over int values
+ * incrementally from 0 until it finds an unused ID. Strings can be removed from the pool using the
+ * remove() method. BucketCache should call this when evicting all blocks for a given file (see
+ * BucketCache.evictFileBlocksFromCache()).
+ * <p>
+ * Thread-safe implementation that maintains bidirectional mappings between strings and IDs.
+ * </p>
+ */
[email protected]
+public class FilePathStringPool {
+ private static final Logger LOG = LoggerFactory.getLogger(FilePathStringPool.class);
+
+ // Bidirectional mappings for string objects re-use
+ private final ConcurrentHashMap<String, Integer> stringToId = new ConcurrentHashMap<>();
+ private final ConcurrentHashMap<Integer, String> idToString = new ConcurrentHashMap<>();
+ private final AtomicInteger nexId = new AtomicInteger(0);
Review Comment:
[nit] typos?
```suggestion
private final AtomicInteger nextId = new AtomicInteger(0);
```
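For reviewers following along, here is a minimal sketch of the encode/decode/remove scheme the class javadoc describes. This is a simplified standalone model, not the PR's implementation: it omits the singleton accessor, the ID-exhaustion wrap-around, and logging.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the pool: bidirectional String <-> int mappings,
// with IDs handed out from 0 upwards.
class StringPoolSketch {
  private final ConcurrentHashMap<String, Integer> stringToId = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<Integer, String> idToString = new ConcurrentHashMap<>();
  private final AtomicInteger nextId = new AtomicInteger(0);

  /** Returns the existing ID for the string, or assigns the next free one. */
  int encode(String value) {
    return stringToId.computeIfAbsent(value, v -> {
      int id = nextId.getAndIncrement();
      idToString.put(id, v);
      return id;
    });
  }

  /** Returns the string for the given ID, or null if unknown (e.g. -1). */
  String decode(int id) {
    return idToString.get(id);
  }

  /** Drops both mappings, e.g. when all blocks of a file are evicted. */
  void remove(String value) {
    Integer id = stringToId.remove(value);
    if (id != null) {
      idToString.remove(id);
    }
  }
}
```

Encoding the same file name twice yields the same ID, so all BlockCacheKey instances for one hfile end up sharing a single pooled string.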
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/FilePathStringPool.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.hfile.bucket;
+
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Pool of string values encoded to integer IDs for use in BlockCacheKey. This allows for avoiding
+ * duplicating string values for file names, region and CF values on various BlockCacheKey
+ * instances. Normally, single hfiles have many blocks. This means all blocks from the same file
+ * will have the very same file, region and CF names. On very large BucketCache setups (i.e. file
+ * based cache with TB size order), can save few GBs of memory by avoiding repeating these common
Review Comment:
I found the answer: it saves a few GBs of heap, thanks.
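To make the "few GBs" figure concrete, here is a back-of-the-envelope estimate. All numbers (a 1 TB file-based BucketCache, 64 KB blocks, 64-char hfile names, ~40 bytes of per-String overhead) are illustrative assumptions, not measurements from this PR:

```java
// Rough estimate of heap saved by pooling hfile name strings instead of
// keeping a dedicated String per BlockCacheKey.
public class SavingsEstimate {
  public static void main(String[] args) {
    long cacheBytes = 1024L * 1024 * 1024 * 1024; // 1 TB of cached data
    long blockSize = 64L * 1024;                  // 64 KB per block
    long blocks = cacheBytes / blockSize;         // ~16.8 million cache keys

    // Cost of a dedicated name String per key: ~40 bytes of object/array
    // headers and references, plus 2 bytes per char (non-compact strings).
    long perKeyStringBytes = 40 + 2L * 64;

    long savedBytes = blocks * perKeyStringBytes;
    System.out.printf("~%.1f GB saved%n", savedBytes / (1024.0 * 1024 * 1024));
    // → prints: ~2.6 GB saved
  }
}
```

Scaling the cache toward the multi-TB setups the javadoc mentions pushes this into the "few GBs" range.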
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java:
##########
@@ -398,6 +398,9 @@ public long getCacheCompactedBlocksOnWriteThreshold() {
* configuration.
*/
public boolean shouldReadBlockFromCache(BlockType blockType) {
+ if (!getBlockCache().isPresent()) {
Review Comment:
[nit] did we hit a bug? Why do we need to check that the blockcache instance exists after this change?
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:
##########
@@ -19,83 +19,164 @@
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool;
import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
import org.apache.yetus.audience.InterfaceAudience;
/**
* Cache Key for use with implementations of {@link BlockCache}
*/
@InterfaceAudience.Private
public class BlockCacheKey implements HeapSize, java.io.Serializable {
- private static final long serialVersionUID = -5199992013113130534L;
- private final String hfileName;
+ private static final long serialVersionUID = -5199992013113130535L; // Changed due to format
+ // change
+
+ // New compressed format using integer file ID (when codec is available)
+ private final int hfileNameId;
+
+ private transient final FilePathStringPool stringPool;
+
+ private final int regionId;
+
+ private final int cfId;
+
private final long offset;
+
private BlockType blockType;
+
private final boolean isPrimaryReplicaBlock;
- private Path filePath;
+ private final boolean archived;
/**
- * Construct a new BlockCacheKey
+ * Constructs a new BlockCacheKey with the file name and offset only. To be used for cache lookups
+ * only, DO NOT use this for creating keys when inserting into the cache. Use either the
+ * overriding constructors with the path parameter or the region and cf parameters, otherwise,
+ * region cache metrics won't be recorded properly.
* @param hfileName The name of the HFile this block belongs to.
* @param offset Offset of the block into the file
*/
public BlockCacheKey(String hfileName, long offset) {
- this(hfileName, offset, true, BlockType.DATA);
+ this(hfileName, null, null, offset, true, BlockType.DATA, false);
}
+ /**
+ * Constructs a new BlockCacheKey with the file name, offset, replica and type only. To be used
+ * for cache lookups only, DO NOT use this for creating keys when inserting into the cache. Use
+ * either the overriding constructors with the path parameter or the region and cf parameters,
+ * otherwise, region cache metrics won't be recorded properly.
+ * @param hfileName The name of the HFile this block belongs to.
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
public BlockCacheKey(String hfileName, long offset, boolean isPrimaryReplica,
BlockType blockType) {
+ this(hfileName, null, null, offset, isPrimaryReplica, blockType, false);
+ }
+
+ /**
+ * Construct a new BlockCacheKey, with file, column family and region information. This should be
+ * used when inserting keys into the cache, so that region cache metrics are recorded properly.
+ * @param hfileName The name of the HFile this block belongs to.
+ * @param cfName The column family name
+ * @param regionName The region name
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
+ public BlockCacheKey(String hfileName, String cfName, String regionName, long offset,
+ boolean isPrimaryReplica, BlockType blockType, boolean archived) {
this.isPrimaryReplicaBlock = isPrimaryReplica;
- this.hfileName = hfileName;
this.offset = offset;
this.blockType = blockType;
+ this.stringPool = FilePathStringPool.getInstance();
+ // Use string pool for file, region and cf values
+ this.hfileNameId = stringPool.encode(hfileName);
+ this.regionId = (regionName != null) ? stringPool.encode(regionName) : -1;
+ this.cfId = (cfName != null) ? stringPool.encode(cfName) : -1;
+ this.archived = archived;
}
+ /**
+ * Construct a new BlockCacheKey using a file path. File, column family and region information
+ * will be extracted from the passed path. This should be used when inserting keys into the cache,
+ * so that region cache metrics are recorded properly.
+ * @param hfilePath The path to the HFile
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
public BlockCacheKey(Path hfilePath, long offset, boolean isPrimaryReplica,
BlockType blockType) {
- this.filePath = hfilePath;
- this.isPrimaryReplicaBlock = isPrimaryReplica;
- this.hfileName = hfilePath.getName();
- this.offset = offset;
- this.blockType = blockType;
+ this(hfilePath.getName(), hfilePath.getParent().getName(),
+ hfilePath.getParent().getParent().getName(), offset, isPrimaryReplica, blockType,
+ HFileArchiveUtil.isHFileArchived(hfilePath));
}
@Override
public int hashCode() {
- return hfileName.hashCode() * 127 + (int) (offset ^ (offset >>> 32));
+ return hfileNameId * 127 + (int) (offset ^ (offset >>> 32));
}
@Override
public boolean equals(Object o) {
if (o instanceof BlockCacheKey) {
BlockCacheKey k = (BlockCacheKey) o;
- return offset == k.offset
- && (hfileName == null ? k.hfileName == null : hfileName.equals(k.hfileName));
- } else {
- return false;
+ if (offset != k.offset) {
+ return false;
+ }
+ return getHfileName().equals(k.getHfileName());
}
+ return false;
}
@Override
public String toString() {
- return this.hfileName + '_' + this.offset;
+ return getHfileName() + '_' + this.offset;
}
public static final long FIXED_OVERHEAD = ClassSize.estimateBase(BlockCacheKey.class, false);
/**
- * Strings have two bytes per character due to default Java Unicode encoding (hence length times
- * 2).
+ * With the compressed format using integer file IDs, the heap size is significantly reduced. We
+ * now only store a 4-byte integer instead of the full file name string.
*/
@Override
public long heapSize() {
- return ClassSize.align(FIXED_OVERHEAD + ClassSize.STRING + 2 * hfileName.length());
+ return ClassSize.align(FIXED_OVERHEAD);
}
- // can't avoid this unfortunately
- /** Returns The hfileName portion of this cache key */
+ /**
+ * Returns the hfileName portion of this cache key.
+ * @return The file name
+ */
public String getHfileName() {
- return hfileName;
+ return stringPool.decode(hfileNameId);
+ }
+
+ /**
+ * Returns the hfileName portion of this cache key.
+ * @return The region name
+ */
+ public String getRegionName() {
+ return stringPool.decode(regionId);
+ }
+
+ /**
+ * Returns the compressed file ID.
+ * @return the integer file ID
+ */
+ public int getRegionId() {
+ return regionId;
+ }
Review Comment:
[nit] this is unused
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DataTieringManager.java:
##########
@@ -96,11 +95,14 @@ public static synchronized DataTieringManager getInstance() {
* @throws DataTieringException if there is an error retrieving the HFile path or configuration
*/
public boolean isDataTieringEnabled(BlockCacheKey key) throws DataTieringException {
- Path hFilePath = key.getFilePath();
- if (hFilePath == null) {
- throw new DataTieringException("BlockCacheKey Doesn't Contain HFile Path");
+ if (key.getCfName() == null || key.getRegionName() == null) {
+ throw new DataTieringException(
+ "BlockCacheKey doesn't contain " + "Column Family Name or Region Name");
Review Comment:
```suggestion
"BlockCacheKey doesn't contain Column Family Name or Region Name");
```
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:
##########
@@ -19,83 +19,164 @@
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool;
import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
import org.apache.yetus.audience.InterfaceAudience;
/**
* Cache Key for use with implementations of {@link BlockCache}
*/
@InterfaceAudience.Private
public class BlockCacheKey implements HeapSize, java.io.Serializable {
- private static final long serialVersionUID = -5199992013113130534L;
- private final String hfileName;
+ private static final long serialVersionUID = -5199992013113130535L; // Changed due to format
+ // change
+
+ // New compressed format using integer file ID (when codec is available)
+ private final int hfileNameId;
+
+ private transient final FilePathStringPool stringPool;
+
+ private final int regionId;
+
+ private final int cfId;
+
private final long offset;
+
private BlockType blockType;
+
private final boolean isPrimaryReplicaBlock;
- private Path filePath;
+ private final boolean archived;
/**
- * Construct a new BlockCacheKey
+ * Constructs a new BlockCacheKey with the file name and offset only. To be used for cache lookups
+ * only, DO NOT use this for creating keys when inserting into the cache. Use either the
+ * overriding constructors with the path parameter or the region and cf parameters, otherwise,
+ * region cache metrics won't be recorded properly.
* @param hfileName The name of the HFile this block belongs to.
* @param offset Offset of the block into the file
*/
public BlockCacheKey(String hfileName, long offset) {
- this(hfileName, offset, true, BlockType.DATA);
+ this(hfileName, null, null, offset, true, BlockType.DATA, false);
}
+ /**
+ * Constructs a new BlockCacheKey with the file name, offset, replica and type only. To be used
+ * for cache lookups only, DO NOT use this for creating keys when inserting into the cache. Use
+ * either the overriding constructors with the path parameter or the region and cf parameters,
+ * otherwise, region cache metrics won't be recorded properly.
+ * @param hfileName The name of the HFile this block belongs to.
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
public BlockCacheKey(String hfileName, long offset, boolean isPrimaryReplica,
BlockType blockType) {
+ this(hfileName, null, null, offset, isPrimaryReplica, blockType, false);
+ }
+
+ /**
+ * Construct a new BlockCacheKey, with file, column family and region information. This should be
+ * used when inserting keys into the cache, so that region cache metrics are recorded properly.
+ * @param hfileName The name of the HFile this block belongs to.
+ * @param cfName The column family name
+ * @param regionName The region name
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
+ public BlockCacheKey(String hfileName, String cfName, String regionName, long offset,
+ boolean isPrimaryReplica, BlockType blockType, boolean archived) {
this.isPrimaryReplicaBlock = isPrimaryReplica;
- this.hfileName = hfileName;
this.offset = offset;
this.blockType = blockType;
+ this.stringPool = FilePathStringPool.getInstance();
+ // Use string pool for file, region and cf values
+ this.hfileNameId = stringPool.encode(hfileName);
+ this.regionId = (regionName != null) ? stringPool.encode(regionName) : -1;
+ this.cfId = (cfName != null) ? stringPool.encode(cfName) : -1;
+ this.archived = archived;
}
+ /**
+ * Construct a new BlockCacheKey using a file path. File, column family and region information
+ * will be extracted from the passed path. This should be used when inserting keys into the cache,
+ * so that region cache metrics are recorded properly.
+ * @param hfilePath The path to the HFile
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
public BlockCacheKey(Path hfilePath, long offset, boolean isPrimaryReplica,
BlockType blockType) {
- this.filePath = hfilePath;
- this.isPrimaryReplicaBlock = isPrimaryReplica;
- this.hfileName = hfilePath.getName();
- this.offset = offset;
- this.blockType = blockType;
+ this(hfilePath.getName(), hfilePath.getParent().getName(),
+ hfilePath.getParent().getParent().getName(), offset, isPrimaryReplica, blockType,
+ HFileArchiveUtil.isHFileArchived(hfilePath));
}
@Override
public int hashCode() {
- return hfileName.hashCode() * 127 + (int) (offset ^ (offset >>> 32));
+ return hfileNameId * 127 + (int) (offset ^ (offset >>> 32));
Review Comment:
[nit] how do we make sure this won't cause any conflicts with this encoded ID?
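For context on that question: in the diff above, the pooled ID is only mixed into hashCode(), while equals() compares the decoded file names, so hash collisions affect bucketing but never key identity. A small illustration of that hashCode/equals contract, with made-up IDs and offsets:

```java
// Two keys can collide on the hash formula used in the diff
// (id * 127 + folded offset) yet remain distinct under equals().
public class HashContractSketch {
  public static void main(String[] args) {
    int idA = 1;
    long offsetA = 0L;
    int idB = 0;
    long offsetB = 127L;
    int hashA = idA * 127 + (int) (offsetA ^ (offsetA >>> 32));
    int hashB = idB * 127 + (int) (offsetB ^ (offsetB >>> 32));
    // Both hashes are 127: the keys land in the same hash bucket, but a
    // map lookup then calls equals(), which compares the decoded names.
    System.out.println(hashA == hashB);
  }
}
```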
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]