Re: [PR] HBASE-29727 Introduce a String pool for repeating filename, region and cf string fields in BlockCacheKey [hbase]

via GitHub Wed, 26 Nov 2025 14:13:54 -0800


Copilot commented on code in PR #7477:
URL: https://github.com/apache/hbase/pull/7477#discussion_r2566612897



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DataTieringManager.java:
##########
@@ -96,11 +95,14 @@ public static synchronized DataTieringManager getInstance() 
{
    * @throws DataTieringException if there is an error retrieving the HFile 
path or configuration
    */
   public boolean isDataTieringEnabled(BlockCacheKey key) throws 
DataTieringException {
-    Path hFilePath = key.getFilePath();
-    if (hFilePath == null) {
-      throw new DataTieringException("BlockCacheKey Doesn't Contain HFile 
Path");
+    if (key.getCfName() == null || key.getRegionName() == null) {
+      throw new DataTieringException(
+        "BlockCacheKey doesn't contain " + "Column Family Name or Region 
Name");

Review Comment:
   The error message has an unnecessary string concatenation with a space: 
`"BlockCacheKey doesn't contain " + "Column Family Name or Region Name"`. This 
should be a single string: `"BlockCacheKey doesn't contain Column Family Name 
or Region Name"`.
   ```suggestion
           "BlockCacheKey doesn't contain Column Family Name or Region Name");
   ```



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:
##########
@@ -19,83 +19,164 @@
 
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool;
 import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
 import org.apache.yetus.audience.InterfaceAudience;
 
 /**
  * Cache Key for use with implementations of {@link BlockCache}
  */
 @InterfaceAudience.Private
 public class BlockCacheKey implements HeapSize, java.io.Serializable {
-  private static final long serialVersionUID = -5199992013113130534L;
-  private final String hfileName;
+  private static final long serialVersionUID = -5199992013113130535L; // 
Changed due to format
+                                                                      // change
+
+  // New compressed format using integer file ID (when codec is available)
+  private final int hfileNameId;
+
+  private transient final FilePathStringPool stringPool;

Review Comment:
   The `stringPool` field is marked as both `transient` and `final`, which can 
cause issues during deserialization. When an object is deserialized, transient 
fields are not restored, and final fields cannot be reassigned. This will 
result in `stringPool` being null after deserialization, causing 
NullPointerExceptions when calling methods like `getHfileName()`, 
`getRegionName()`, or `getCfName()`. 
   
   Consider implementing custom `readObject()` method to reinitialize the 
`stringPool` field after deserialization, or remove the `transient` modifier 
and implement proper serialization for the singleton pool reference.



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/FilePathStringPool.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.hfile.bucket;
+
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Pool of string values encoded to integer IDs for use in BlockCacheKey. This 
allows for avoiding
+ * duplicating string values for file names, region and CF values on various 
BlockCacheKey
+ * instances. Normally, single hfiles have many blocks. This means all blocks 
from the same file
+ * will have the very same file, region and CF names. On very large 
BucketCache setups (i.e. file
+ * based cache with TB size order), can save few GBs of memory by avoiding 
repeating these common
+ * string values on blocks from the same file. The FilePathStringPool is 
implemented as a singleton,
+ * since the same pool should be shared by all BlockCacheKey instances, as 
well as the BucketCache
+ * object itself. The Id for an encoded string is an integer. Any new String 
added to the pool is
+ * assigned the next available integer ID, starting from 0 upwards. That sets 
the total pool
+ * capacity to Integer.MAX_VALUE. In the event of ID exhaustion (integer 
overflow when Id values
+ * reach Integer.MAX_VALUE), the encode() method will restart iterating over 
int values
+ * incrementally from 0 until it finds an unused ID. Strings can be removed 
from the pool using the
+ * remove() method. BucketCache should call this when evicting all blocks for 
a given file (see
+ * BucketCache.evictFileBlocksFromCache()).
+ * <p>
+ * Thread-safe implementation that maintains bidirectional mappings between 
strings and IDs.
+ * </p>
+ */
[email protected]
+public class FilePathStringPool {
+  private static final Logger LOG = 
LoggerFactory.getLogger(FilePathStringPool.class);
+
+  // Bidirectional mappings for string objects re-use
+  private final ConcurrentHashMap<String, Integer> stringToId = new 
ConcurrentHashMap<>();
+  private final ConcurrentHashMap<Integer, String> idToString = new 
ConcurrentHashMap<>();
+  private final AtomicInteger nexId = new AtomicInteger(0);

Review Comment:
   Typo in variable name: `nexId` should be `nextId`.



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/FilePathStringPool.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.hfile.bucket;
+
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Pool of string values encoded to integer IDs for use in BlockCacheKey. This 
allows for avoiding
+ * duplicating string values for file names, region and CF values on various 
BlockCacheKey
+ * instances. Normally, single hfiles have many blocks. This means all blocks 
from the same file
+ * will have the very same file, region and CF names. On very large 
BucketCache setups (i.e. file
+ * based cache with TB size order), can save few GBs of memory by avoiding 
repeating these common
+ * string values on blocks from the same file. The FilePathStringPool is 
implemented as a singleton,
+ * since the same pool should be shared by all BlockCacheKey instances, as 
well as the BucketCache
+ * object itself. The Id for an encoded string is an integer. Any new String 
added to the pool is
+ * assigned the next available integer ID, starting from 0 upwards. That sets 
the total pool
+ * capacity to Integer.MAX_VALUE. In the event of ID exhaustion (integer 
overflow when Id values
+ * reach Integer.MAX_VALUE), the encode() method will restart iterating over 
int values
+ * incrementally from 0 until it finds an unused ID. Strings can be removed 
from the pool using the
+ * remove() method. BucketCache should call this when evicting all blocks for 
a given file (see
+ * BucketCache.evictFileBlocksFromCache()).
+ * <p>
+ * Thread-safe implementation that maintains bidirectional mappings between 
strings and IDs.
+ * </p>
+ */
[email protected]
+public class FilePathStringPool {
+  private static final Logger LOG = 
LoggerFactory.getLogger(FilePathStringPool.class);
+
+  // Bidirectional mappings for string objects re-use
+  private final ConcurrentHashMap<String, Integer> stringToId = new 
ConcurrentHashMap<>();
+  private final ConcurrentHashMap<Integer, String> idToString = new 
ConcurrentHashMap<>();
+  private final AtomicInteger nexId = new AtomicInteger(0);
+
+  private static FilePathStringPool instance;
+
+  public static FilePathStringPool getInstance() {
+    synchronized (FilePathStringPool.class) {
+      if (instance == null) {
+        instance = new FilePathStringPool();
+      }
+    }
+    return instance;
+  }
+
+  private FilePathStringPool() {
+    // Private constructor for singleton
+  }
+
+  /**
+   * Gets or creates an integer ID for the given String.
+   * @param string value for the file/region/CF name.
+   * @return the integer ID encoding this string in the pool.
+   */
+  public int encode(String string) {
+    if (string == null) {
+      throw new IllegalArgumentException("hfileName cannot be null");

Review Comment:
   The error message says "hfileName cannot be null" but this method is generic 
and encodes any string (file names, region names, CF names). The error message 
should be more generic like "string cannot be null".
   ```suggestion
         throw new IllegalArgumentException("string cannot be null");
   ```



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:
##########
@@ -19,83 +19,164 @@
 
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool;
 import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
 import org.apache.yetus.audience.InterfaceAudience;
 
 /**
  * Cache Key for use with implementations of {@link BlockCache}
  */
 @InterfaceAudience.Private
 public class BlockCacheKey implements HeapSize, java.io.Serializable {
-  private static final long serialVersionUID = -5199992013113130534L;
-  private final String hfileName;
+  private static final long serialVersionUID = -5199992013113130535L; // 
Changed due to format
+                                                                      // change
+
+  // New compressed format using integer file ID (when codec is available)
+  private final int hfileNameId;
+
+  private transient final FilePathStringPool stringPool;
+
+  private final int regionId;
+
+  private final int cfId;
+
   private final long offset;
+
   private BlockType blockType;
+
   private final boolean isPrimaryReplicaBlock;
 
-  private Path filePath;
+  private final boolean archived;
 
   /**
-   * Construct a new BlockCacheKey
+   * Constructs a new BlockCacheKey with the file name and offset only. To be 
used for cache lookups
+   * only, DO NOT use this for creating keys when inserting into the cache. 
Use either the
+   * overriding constructors with the path parameter or the region and cf 
parameters, otherwise,
+   * region cache metrics won't be recorded properly.
    * @param hfileName The name of the HFile this block belongs to.
    * @param offset    Offset of the block into the file
    */
   public BlockCacheKey(String hfileName, long offset) {
-    this(hfileName, offset, true, BlockType.DATA);
+    this(hfileName, null, null, offset, true, BlockType.DATA, false);
   }
 
+  /**
+   * Constructs a new BlockCacheKey with the file name, offset, replica and 
type only. To be used
+   * for cache lookups only, DO NOT use this for creating keys when inserting 
into the cache. Use
+   * either the overriding constructors with the path parameter or the region 
and cf parameters,
+   * otherwise, region cache metrics won't be recorded properly.
+   * @param hfileName        The name of the HFile this block belongs to.
+   * @param offset           Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType        Type of block
+   */
   public BlockCacheKey(String hfileName, long offset, boolean isPrimaryReplica,
     BlockType blockType) {
+    this(hfileName, null, null, offset, isPrimaryReplica, blockType, false);
+  }
+
+  /**
+   * Construct a new BlockCacheKey, with file, column family and region 
information. This should be
+   * used when inserting keys into the cache, so that region cache metrics are 
recorded properly.
+   * @param hfileName        The name of the HFile this block belongs to.
+   * @param cfName           The column family name
+   * @param regionName       The region name
+   * @param offset           Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType        Type of block
+   */
+  public BlockCacheKey(String hfileName, String cfName, String regionName, 
long offset,
+    boolean isPrimaryReplica, BlockType blockType, boolean archived) {
     this.isPrimaryReplicaBlock = isPrimaryReplica;
-    this.hfileName = hfileName;
     this.offset = offset;
     this.blockType = blockType;
+    this.stringPool = FilePathStringPool.getInstance();
+    // Use string pool for file, region and cf values
+    this.hfileNameId = stringPool.encode(hfileName);
+    this.regionId = (regionName != null) ? stringPool.encode(regionName) : -1;
+    this.cfId = (cfName != null) ? stringPool.encode(cfName) : -1;
+    this.archived = archived;
   }
 
+  /**
+   * Construct a new BlockCacheKey using a file path. File, column family and 
region information
+   * will be extracted from the passed path. This should be used when 
inserting keys into the cache,
+   * so that region cache metrics are recorded properly.
+   * @param hfilePath        The path to the HFile
+   * @param offset           Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType        Type of block
+   */
   public BlockCacheKey(Path hfilePath, long offset, boolean isPrimaryReplica, 
BlockType blockType) {
-    this.filePath = hfilePath;
-    this.isPrimaryReplicaBlock = isPrimaryReplica;
-    this.hfileName = hfilePath.getName();
-    this.offset = offset;
-    this.blockType = blockType;
+    this(hfilePath.getName(), hfilePath.getParent().getName(),
+      hfilePath.getParent().getParent().getName(), offset, isPrimaryReplica, 
blockType,
+      HFileArchiveUtil.isHFileArchived(hfilePath));
   }
 
   @Override
   public int hashCode() {
-    return hfileName.hashCode() * 127 + (int) (offset ^ (offset >>> 32));
+    return hfileNameId * 127 + (int) (offset ^ (offset >>> 32));
   }
 
   @Override
   public boolean equals(Object o) {
     if (o instanceof BlockCacheKey) {
       BlockCacheKey k = (BlockCacheKey) o;
-      return offset == k.offset
-        && (hfileName == null ? k.hfileName == null : 
hfileName.equals(k.hfileName));
-    } else {
-      return false;
+      if (offset != k.offset) {
+        return false;
+      }
+      return getHfileName().equals(k.getHfileName());
     }
+    return false;
   }
 
   @Override
   public String toString() {
-    return this.hfileName + '_' + this.offset;
+    return getHfileName() + '_' + this.offset;
   }
 
   public static final long FIXED_OVERHEAD = 
ClassSize.estimateBase(BlockCacheKey.class, false);
 
   /**
-   * Strings have two bytes per character due to default Java Unicode encoding 
(hence length times
-   * 2).
+   * With the compressed format using integer file IDs, the heap size is 
significantly reduced. We
+   * now only store a 4-byte integer instead of the full file name string.
    */
   @Override
   public long heapSize() {
-    return ClassSize.align(FIXED_OVERHEAD + ClassSize.STRING + 2 * 
hfileName.length());
+    return ClassSize.align(FIXED_OVERHEAD);
   }
 
-  // can't avoid this unfortunately
-  /** Returns The hfileName portion of this cache key */
+  /**
+   * Returns the hfileName portion of this cache key.
+   * @return The file name
+   */
   public String getHfileName() {
-    return hfileName;
+    return stringPool.decode(hfileNameId);
+  }
+
+  /**
+   * Returns the hfileName portion of this cache key.

Review Comment:
   The JavaDoc comment says "Returns the hfileName portion of this cache key" 
but the method returns the region name. The comment should say "Returns the 
region name portion of this cache key."
   ```suggestion
      * Returns the region name portion of this cache key.
   ```



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:
##########
@@ -19,83 +19,164 @@
 
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool;
 import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
 import org.apache.yetus.audience.InterfaceAudience;
 
 /**
  * Cache Key for use with implementations of {@link BlockCache}
  */
 @InterfaceAudience.Private
 public class BlockCacheKey implements HeapSize, java.io.Serializable {
-  private static final long serialVersionUID = -5199992013113130534L;
-  private final String hfileName;
+  private static final long serialVersionUID = -5199992013113130535L; // 
Changed due to format
+                                                                      // change
+
+  // New compressed format using integer file ID (when codec is available)
+  private final int hfileNameId;
+
+  private transient final FilePathStringPool stringPool;
+
+  private final int regionId;
+
+  private final int cfId;
+
   private final long offset;
+
   private BlockType blockType;
+
   private final boolean isPrimaryReplicaBlock;
 
-  private Path filePath;
+  private final boolean archived;
 
   /**
-   * Construct a new BlockCacheKey
+   * Constructs a new BlockCacheKey with the file name and offset only. To be 
used for cache lookups
+   * only, DO NOT use this for creating keys when inserting into the cache. 
Use either the
+   * overriding constructors with the path parameter or the region and cf 
parameters, otherwise,
+   * region cache metrics won't be recorded properly.
    * @param hfileName The name of the HFile this block belongs to.
    * @param offset    Offset of the block into the file
    */
   public BlockCacheKey(String hfileName, long offset) {
-    this(hfileName, offset, true, BlockType.DATA);
+    this(hfileName, null, null, offset, true, BlockType.DATA, false);
   }
 
+  /**
+   * Constructs a new BlockCacheKey with the file name, offset, replica and 
type only. To be used
+   * for cache lookups only, DO NOT use this for creating keys when inserting 
into the cache. Use
+   * either the overriding constructors with the path parameter or the region 
and cf parameters,
+   * otherwise, region cache metrics won't be recorded properly.
+   * @param hfileName        The name of the HFile this block belongs to.
+   * @param offset           Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType        Type of block
+   */
   public BlockCacheKey(String hfileName, long offset, boolean isPrimaryReplica,
     BlockType blockType) {
+    this(hfileName, null, null, offset, isPrimaryReplica, blockType, false);
+  }
+
+  /**
+   * Construct a new BlockCacheKey, with file, column family and region 
information. This should be
+   * used when inserting keys into the cache, so that region cache metrics are 
recorded properly.
+   * @param hfileName        The name of the HFile this block belongs to.
+   * @param cfName           The column family name
+   * @param regionName       The region name
+   * @param offset           Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType        Type of block
+   */
+  public BlockCacheKey(String hfileName, String cfName, String regionName, 
long offset,
+    boolean isPrimaryReplica, BlockType blockType, boolean archived) {
     this.isPrimaryReplicaBlock = isPrimaryReplica;
-    this.hfileName = hfileName;
     this.offset = offset;
     this.blockType = blockType;
+    this.stringPool = FilePathStringPool.getInstance();
+    // Use string pool for file, region and cf values
+    this.hfileNameId = stringPool.encode(hfileName);
+    this.regionId = (regionName != null) ? stringPool.encode(regionName) : -1;
+    this.cfId = (cfName != null) ? stringPool.encode(cfName) : -1;
+    this.archived = archived;
   }
 
+  /**
+   * Construct a new BlockCacheKey using a file path. File, column family and 
region information
+   * will be extracted from the passed path. This should be used when 
inserting keys into the cache,
+   * so that region cache metrics are recorded properly.
+   * @param hfilePath        The path to the HFile
+   * @param offset           Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType        Type of block
+   */
   public BlockCacheKey(Path hfilePath, long offset, boolean isPrimaryReplica, 
BlockType blockType) {
-    this.filePath = hfilePath;
-    this.isPrimaryReplicaBlock = isPrimaryReplica;
-    this.hfileName = hfilePath.getName();
-    this.offset = offset;
-    this.blockType = blockType;
+    this(hfilePath.getName(), hfilePath.getParent().getName(),
+      hfilePath.getParent().getParent().getName(), offset, isPrimaryReplica, 
blockType,
+      HFileArchiveUtil.isHFileArchived(hfilePath));
   }
 
   @Override
   public int hashCode() {
-    return hfileName.hashCode() * 127 + (int) (offset ^ (offset >>> 32));
+    return hfileNameId * 127 + (int) (offset ^ (offset >>> 32));
   }
 
   @Override
   public boolean equals(Object o) {
     if (o instanceof BlockCacheKey) {
       BlockCacheKey k = (BlockCacheKey) o;
-      return offset == k.offset
-        && (hfileName == null ? k.hfileName == null : 
hfileName.equals(k.hfileName));
-    } else {
-      return false;
+      if (offset != k.offset) {
+        return false;
+      }
+      return getHfileName().equals(k.getHfileName());
     }
+    return false;
   }
 
   @Override
   public String toString() {
-    return this.hfileName + '_' + this.offset;
+    return getHfileName() + '_' + this.offset;
   }
 
   public static final long FIXED_OVERHEAD = 
ClassSize.estimateBase(BlockCacheKey.class, false);
 
   /**
-   * Strings have two bytes per character due to default Java Unicode encoding 
(hence length times
-   * 2).
+   * With the compressed format using integer file IDs, the heap size is 
significantly reduced. We
+   * now only store a 4-byte integer instead of the full file name string.
    */
   @Override
   public long heapSize() {
-    return ClassSize.align(FIXED_OVERHEAD + ClassSize.STRING + 2 * 
hfileName.length());
+    return ClassSize.align(FIXED_OVERHEAD);
   }
 
-  // can't avoid this unfortunately
-  /** Returns The hfileName portion of this cache key */
+  /**
+   * Returns the hfileName portion of this cache key.
+   * @return The file name
+   */
   public String getHfileName() {
-    return hfileName;
+    return stringPool.decode(hfileNameId);
+  }
+
+  /**
+   * Returns the hfileName portion of this cache key.
+   * @return The region name
+   */
+  public String getRegionName() {
+    return stringPool.decode(regionId);
+  }
+
+  /**
+   * Returns the compressed file ID.
+   * @return the integer file ID

Review Comment:
   The JavaDoc comment says "Returns the compressed file ID" but the method 
returns the region ID, not the file ID. The comment should say "Returns the 
compressed region ID."
   ```suggestion
      * Returns the compressed region ID.
      * @return the integer region ID
   ```



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java:
##########
@@ -1828,24 +1837,27 @@ private void join() throws InterruptedException {
 
   @Override
   public void shutdown() {
-    disableCache();
-    LOG.info("Shutdown bucket cache: IO persistent=" + ioEngine.isPersistent() 
+ "; path to write="
-      + persistencePath);
-    if (ioEngine.isPersistent() && persistencePath != null) {
-      try {
-        join();
-        if (cachePersister != null) {
-          LOG.info("Shutting down cache persister thread.");
-          cachePersister.shutdown();
-          while (cachePersister.isAlive()) {
-            Thread.sleep(10);
+    if (isCacheEnabled()) {
+      disableCache();
+      LOG.info("Shutdown bucket cache: IO persistent=" + 
ioEngine.isPersistent()
+        + "; path to write=" + persistencePath);
+      if (ioEngine.isPersistent() && persistencePath != null) {
+        try {
+          join();
+          if (cachePersister != null) {
+            LOG.info("Shutting down cache persister thread.");
+            cachePersister.shutdown();
+            while (cachePersister.isAlive()) {
+              Thread.sleep(10);
+            }
           }
+          persistToFile();

Review Comment:
   The `FilePathStringPool.getInstance().clear()` is called at line 1855 after 
`persistToFile()` at line 1854. However, the string pool data is needed for 
serialization during `persistToFile()` because BlockCacheKeys need to decode 
their string IDs when being serialized. Clearing the pool before persistence 
completes could cause issues. The clear should happen only after successful 
persistence, or not at all if the pool will be reused on restart.
   ```suggestion
             persistToFile();
             // Only clear the string pool after successful persistence
   ```



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java:
##########
@@ -1927,33 +1939,28 @@ public int evictBlocksByHfileName(String hfileName) {
   }
 
   @Override
-  public int evictBlocksByHfilePath(Path hfilePath) {
-    return evictBlocksRangeByHfileName(hfilePath.getName(), hfilePath, 0, 
Long.MAX_VALUE);
-  }
-
-  public int evictBlocksRangeByHfileName(String hfileName, Path filePath, long 
initOffset,
-    long endOffset) {
+  public int evictBlocksRangeByHfileName(String hfileName, long initOffset, 
long endOffset) {
     Set<BlockCacheKey> keySet = getAllCacheKeysForFile(hfileName, initOffset, 
endOffset);
     LOG.debug("found {} blocks for file {}, starting offset: {}, end offset: 
{}", keySet.size(),
       hfileName, initOffset, endOffset);
     int numEvicted = 0;
     for (BlockCacheKey key : keySet) {
-      if (filePath != null) {
-        key.setFilePath(filePath);
-      }
       if (evictBlock(key)) {
         ++numEvicted;
       }
     }
+    if (numEvicted > 0) {
+      // We need to make sure whether we are evicting all blocks for this 
given file
+      int totalFileKeys = getAllCacheKeysForFile(hfileName, 0, 
Long.MAX_VALUE).size();
+      if (totalFileKeys == numEvicted) {
+        FilePathStringPool.getInstance().remove(hfileName);

Review Comment:
   The logic at lines 1952-1958 checks if all blocks for a file have been 
evicted and only then removes the filename from the string pool. However, 
there's a potential issue: if the method is called with a restricted range 
(`initOffset` to `endOffset`), it might evict only a subset of blocks, but then 
check against the full range to determine if removal is appropriate. 
   
   The check at line 1954 queries `getAllCacheKeysForFile(hfileName, 0, 
Long.MAX_VALUE)` after eviction to see if all blocks are gone. But this creates 
a window where:
   1. Some blocks in range [initOffset, endOffset] are evicted (numEvicted 
blocks)
   2. The check compares numEvicted against totalFileKeys from the full range
   3. If they match, it removes the filename from the pool
   
   However, if `evictBlocksRangeByHfileName` is called with a partial range and 
that partial range happens to contain all remaining blocks (because others were 
evicted previously), the condition will be true and the filename will be 
removed, even though more blocks might be added later for the same file. 
Additionally, other BlockCacheKeys for this file (that aren't evicted) will 
still have the ID encoded, so their `getHfileName()` calls will return null 
after the pool entry is removed.
   ```suggestion
         // Only remove the filename from the pool if we are evicting the full 
range
         if (initOffset == 0 && endOffset == Long.MAX_VALUE) {
           int totalFileKeys = getAllCacheKeysForFile(hfileName, 0, 
Long.MAX_VALUE).size();
           if (totalFileKeys == numEvicted) {
             FilePathStringPool.getInstance().remove(hfileName);
           }
   ```



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/FilePathStringPool.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.hfile.bucket;
+
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Pool of string values encoded to integer IDs for use in BlockCacheKey. This 
allows for avoiding
+ * duplicating string values for file names, region and CF values on various 
BlockCacheKey
+ * instances. Normally, single hfiles have many blocks. This means all blocks 
from the same file
+ * will have the very same file, region and CF names. On very large 
BucketCache setups (i.e. file
+ * based cache with TB size order), can save few GBs of memory by avoiding 
repeating these common
+ * string values on blocks from the same file. The FilePathStringPool is 
implemented as a singleton,
+ * since the same pool should be shared by all BlockCacheKey instances, as 
well as the BucketCache
+ * object itself. The Id for an encoded string is an integer. Any new String 
added to the pool is
+ * assigned the next available integer ID, starting from 0 upwards. That sets 
the total pool
+ * capacity to Integer.MAX_VALUE. In the event of ID exhaustion (integer 
overflow when Id values
+ * reach Integer.MAX_VALUE), the encode() method will restart iterating over 
int values
+ * incrementally from 0 until it finds an unused ID. Strings can be removed 
from the pool using the
+ * remove() method. BucketCache should call this when evicting all blocks for 
a given file (see
+ * BucketCache.evictFileBlocksFromCache()).
+ * <p>
+ * Thread-safe implementation that maintains bidirectional mappings between 
strings and IDs.
+ * </p>
+ */
[email protected]
+public class FilePathStringPool {
+  private static final Logger LOG = 
LoggerFactory.getLogger(FilePathStringPool.class);
+
+  // Bidirectional mappings for string objects re-use
+  private final ConcurrentHashMap<String, Integer> stringToId = new 
ConcurrentHashMap<>();
+  private final ConcurrentHashMap<Integer, String> idToString = new 
ConcurrentHashMap<>();
+  private final AtomicInteger nexId = new AtomicInteger(0);
+
+  private static FilePathStringPool instance;
+
+  public static FilePathStringPool getInstance() {
+    synchronized (FilePathStringPool.class) {
+      if (instance == null) {
+        instance = new FilePathStringPool();
+      }
+    }
+    return instance;
+  }
+
+  private FilePathStringPool() {
+    // Private constructor for singleton
+  }
+
+  /**
+   * Gets or creates an integer ID for the given String.
+   * @param string value for the file/region/CF name.
+   * @return the integer ID encoding this string in the pool.
+   */
+  public int encode(String string) {
+    if (string == null) {
+      throw new IllegalArgumentException("hfileName cannot be null");
+    }
+    if (stringToId.size() == Integer.MAX_VALUE) {
+      throw new IllegalStateException(
+        "String pool has reached maximum capacity of " + Integer.MAX_VALUE + " 
unique strings.");
+    }
+    return stringToId.computeIfAbsent(string, name -> {

Review Comment:
   There's a potential race condition where the check `stringToId.size() == 
Integer.MAX_VALUE` could pass, but then multiple threads could enter 
`computeIfAbsent` and try to insert new strings, exceeding the maximum 
capacity. The size check should either be inside the computeIfAbsent lambda or 
use a more robust mechanism to prevent overflow.
   ```suggestion
       return stringToId.computeIfAbsent(string, name -> {
         if (stringToId.size() == Integer.MAX_VALUE) {
           throw new IllegalStateException(
             "String pool has reached maximum capacity of " + Integer.MAX_VALUE 
+ " unique strings.");
         }
   ```



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java:
##########
@@ -775,19 +779,21 @@ public void fileCacheCompleted(Path filePath, long size) {
     fullyCachedFiles.put(filePath.getName(), pair);
   }
 
-  private void updateRegionCachedSize(Path filePath, long cachedSize) {
-    if (filePath != null) {
-      if (HFileArchiveUtil.isHFileArchived(filePath)) {
-        LOG.trace("Skipping region cached size update for archived file: {}", 
filePath);
+  private void updateRegionCachedSize(BlockCacheKey key, long cachedSize) {
+    if (key.getRegionName() != null) {
+      if (key.isArchived()) {
+        LOG.trace("Skipping region cached size update for archived file:{} 
from region: {}",
+          key.getHfileName(), key.getRegionName());
       } else {
-        String regionName = filePath.getParent().getParent().getName();
+        String regionName = key.getRegionName();
         regionCachedSize.merge(regionName, cachedSize,
           (previousSize, newBlockSize) -> previousSize + newBlockSize);
         LOG.trace("Updating region cached size for region: {}", regionName);
         // If all the blocks for a region are evicted from the cache,
         // remove the entry for that region from regionCachedSize map.
         if (regionCachedSize.get(regionName) <= 0) {
           regionCachedSize.remove(regionName);
+          FilePathStringPool.getInstance().remove(regionName);

Review Comment:
   Removing the region name from the string pool when 
`regionCachedSize.get(regionName) <= 0` (line 796) is problematic. The region 
name might still be referenced by BlockCacheKeys that are in the cache but 
haven't been evicted yet. Additionally, other regions might share the same 
encoded region name string in the pool. Once removed from the pool, any 
subsequent calls to `getRegionName()` on existing BlockCacheKeys will return 
null, breaking functionality that depends on the region name being available.
   
   Consider tracking reference counts for each pooled string across all 
BlockCacheKeys, or only remove region names during cache shutdown/clear 
operations.
   ```suggestion
             // Do not remove regionName from FilePathStringPool here; it may 
still be referenced.
   ```



##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java:
##########
@@ -599,7 +599,8 @@ protected void cacheBlockWithWaitInternal(BlockCacheKey 
cacheKey, Cacheable cach
     if (cacheKey.getBlockType() == null && cachedItem.getBlockType() != null) {
       cacheKey.setBlockType(cachedItem.getBlockType());
     }
-    LOG.trace("Caching key={}, item={}", cacheKey, cachedItem);
+    LOG.debug("Caching key={}, item={}, key keap size={}", cacheKey, 
cachedItem,

Review Comment:
   Typo in log message: "keap" should be "heap".
   ```suggestion
       LOG.debug("Caching key={}, item={}, key heap size={}", cacheKey, 
cachedItem,
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] HBASE-29727 Introduce a String pool for repeating filename, region and cf string fields in BlockCacheKey [hbase]

Reply via email to