Copilot commented on code in PR #7477:
URL: https://github.com/apache/hbase/pull/7477#discussion_r2566612897
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DataTieringManager.java:
##########
@@ -96,11 +95,14 @@ public static synchronized DataTieringManager getInstance() {
   * @throws DataTieringException if there is an error retrieving the HFile path or configuration
   */
  public boolean isDataTieringEnabled(BlockCacheKey key) throws DataTieringException {
-    Path hFilePath = key.getFilePath();
-    if (hFilePath == null) {
-      throw new DataTieringException("BlockCacheKey Doesn't Contain HFile Path");
+    if (key.getCfName() == null || key.getRegionName() == null) {
+      throw new DataTieringException(
+        "BlockCacheKey doesn't contain " + "Column Family Name or Region Name");
Review Comment:
The error message has an unnecessary string concatenation with a space: `"BlockCacheKey doesn't contain " + "Column Family Name or Region Name"`. This should be a single string: `"BlockCacheKey doesn't contain Column Family Name or Region Name"`.
```suggestion
"BlockCacheKey doesn't contain Column Family Name or Region Name");
```
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:
##########
@@ -19,83 +19,164 @@
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool;
import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
import org.apache.yetus.audience.InterfaceAudience;
/**
* Cache Key for use with implementations of {@link BlockCache}
*/
@InterfaceAudience.Private
public class BlockCacheKey implements HeapSize, java.io.Serializable {
- private static final long serialVersionUID = -5199992013113130534L;
- private final String hfileName;
+  private static final long serialVersionUID = -5199992013113130535L; // Changed due to format change
+
+ // New compressed format using integer file ID (when codec is available)
+ private final int hfileNameId;
+
+ private transient final FilePathStringPool stringPool;
Review Comment:
The `stringPool` field is marked as both `transient` and `final`, which can cause issues during deserialization. When an object is deserialized, transient fields are not restored, and final fields cannot be reassigned. This will result in `stringPool` being null after deserialization, causing NullPointerExceptions when calling methods like `getHfileName()`, `getRegionName()`, or `getCfName()`.

Consider implementing a custom `readObject()` method to reinitialize the `stringPool` field after deserialization, or remove the `transient` modifier and implement proper serialization for the singleton pool reference.
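A minimal sketch of the first option, assuming the field is relaxed from `final` so it can be reassigned after default deserialization (nothing below is in the PR as-is):
```java
// Sketch only: declared without final so readObject() can reassign it.
private transient FilePathStringPool stringPool;

private void readObject(java.io.ObjectInputStream in)
    throws java.io.IOException, ClassNotFoundException {
  in.defaultReadObject();
  // Transient fields come back null after default deserialization,
  // so re-acquire the shared singleton pool here.
  this.stringPool = FilePathStringPool.getInstance();
}
```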
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/FilePathStringPool.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.hfile.bucket;
+
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Pool of string values encoded to integer IDs for use in BlockCacheKey. This allows for avoiding
+ * duplicating string values for file names, region and CF values on various BlockCacheKey
+ * instances. Normally, single hfiles have many blocks. This means all blocks from the same file
+ * will have the very same file, region and CF names. On very large BucketCache setups (i.e. file
+ * based cache with TB size order), can save few GBs of memory by avoiding repeating these common
+ * string values on blocks from the same file. The FilePathStringPool is implemented as a singleton,
+ * since the same pool should be shared by all BlockCacheKey instances, as well as the BucketCache
+ * object itself. The Id for an encoded string is an integer. Any new String added to the pool is
+ * assigned the next available integer ID, starting from 0 upwards. That sets the total pool
+ * capacity to Integer.MAX_VALUE. In the event of ID exhaustion (integer overflow when Id values
+ * reach Integer.MAX_VALUE), the encode() method will restart iterating over int values
+ * incrementally from 0 until it finds an unused ID. Strings can be removed from the pool using the
+ * remove() method. BucketCache should call this when evicting all blocks for a given file (see
+ * BucketCache.evictFileBlocksFromCache()).
+ * <p>
+ * Thread-safe implementation that maintains bidirectional mappings between
strings and IDs.
+ * </p>
+ */
+@InterfaceAudience.Private
+public class FilePathStringPool {
+  private static final Logger LOG = LoggerFactory.getLogger(FilePathStringPool.class);
+
+  // Bidirectional mappings for string objects re-use
+  private final ConcurrentHashMap<String, Integer> stringToId = new ConcurrentHashMap<>();
+  private final ConcurrentHashMap<Integer, String> idToString = new ConcurrentHashMap<>();
+ private final AtomicInteger nexId = new AtomicInteger(0);
Review Comment:
Typo in variable name: `nexId` should be `nextId`.
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/FilePathStringPool.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.hfile.bucket;
+
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Pool of string values encoded to integer IDs for use in BlockCacheKey. This allows for avoiding
+ * duplicating string values for file names, region and CF values on various BlockCacheKey
+ * instances. Normally, single hfiles have many blocks. This means all blocks from the same file
+ * will have the very same file, region and CF names. On very large BucketCache setups (i.e. file
+ * based cache with TB size order), can save few GBs of memory by avoiding repeating these common
+ * string values on blocks from the same file. The FilePathStringPool is implemented as a singleton,
+ * since the same pool should be shared by all BlockCacheKey instances, as well as the BucketCache
+ * object itself. The Id for an encoded string is an integer. Any new String added to the pool is
+ * assigned the next available integer ID, starting from 0 upwards. That sets the total pool
+ * capacity to Integer.MAX_VALUE. In the event of ID exhaustion (integer overflow when Id values
+ * reach Integer.MAX_VALUE), the encode() method will restart iterating over int values
+ * incrementally from 0 until it finds an unused ID. Strings can be removed from the pool using the
+ * remove() method. BucketCache should call this when evicting all blocks for a given file (see
+ * BucketCache.evictFileBlocksFromCache()).
+ * <p>
+ * Thread-safe implementation that maintains bidirectional mappings between
strings and IDs.
+ * </p>
+ */
+@InterfaceAudience.Private
+public class FilePathStringPool {
+  private static final Logger LOG = LoggerFactory.getLogger(FilePathStringPool.class);
+
+  // Bidirectional mappings for string objects re-use
+  private final ConcurrentHashMap<String, Integer> stringToId = new ConcurrentHashMap<>();
+  private final ConcurrentHashMap<Integer, String> idToString = new ConcurrentHashMap<>();
+ private final AtomicInteger nexId = new AtomicInteger(0);
+
+ private static FilePathStringPool instance;
+
+ public static FilePathStringPool getInstance() {
+ synchronized (FilePathStringPool.class) {
+ if (instance == null) {
+ instance = new FilePathStringPool();
+ }
+ }
+ return instance;
+ }
+
+ private FilePathStringPool() {
+ // Private constructor for singleton
+ }
+
+ /**
+ * Gets or creates an integer ID for the given String.
+ * @param string value for the file/region/CF name.
+ * @return the integer ID encoding this string in the pool.
+ */
+ public int encode(String string) {
+ if (string == null) {
+ throw new IllegalArgumentException("hfileName cannot be null");
Review Comment:
The error message says "hfileName cannot be null" but this method is generic and encodes any string (file names, region names, CF names). The error message should be more generic, like "string cannot be null".
```suggestion
throw new IllegalArgumentException("string cannot be null");
```
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:
##########
@@ -19,83 +19,164 @@
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool;
import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
import org.apache.yetus.audience.InterfaceAudience;
/**
* Cache Key for use with implementations of {@link BlockCache}
*/
@InterfaceAudience.Private
public class BlockCacheKey implements HeapSize, java.io.Serializable {
- private static final long serialVersionUID = -5199992013113130534L;
- private final String hfileName;
+  private static final long serialVersionUID = -5199992013113130535L; // Changed due to format change
+
+ // New compressed format using integer file ID (when codec is available)
+ private final int hfileNameId;
+
+ private transient final FilePathStringPool stringPool;
+
+ private final int regionId;
+
+ private final int cfId;
+
private final long offset;
+
private BlockType blockType;
+
private final boolean isPrimaryReplicaBlock;
- private Path filePath;
+ private final boolean archived;
  /**
-   * Construct a new BlockCacheKey
+   * Constructs a new BlockCacheKey with the file name and offset only. To be used for cache lookups
+   * only, DO NOT use this for creating keys when inserting into the cache. Use either the
+   * overriding constructors with the path parameter or the region and cf parameters, otherwise,
+   * region cache metrics won't be recorded properly.
   * @param hfileName The name of the HFile this block belongs to.
   * @param offset Offset of the block into the file
   */
  public BlockCacheKey(String hfileName, long offset) {
-    this(hfileName, offset, true, BlockType.DATA);
+    this(hfileName, null, null, offset, true, BlockType.DATA, false);
  }
+  /**
+   * Constructs a new BlockCacheKey with the file name, offset, replica and type only. To be used
+   * for cache lookups only, DO NOT use this for creating keys when inserting into the cache. Use
+   * either the overriding constructors with the path parameter or the region and cf parameters,
+   * otherwise, region cache metrics won't be recorded properly.
+   * @param hfileName The name of the HFile this block belongs to.
+   * @param offset Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType Type of block
+   */
  public BlockCacheKey(String hfileName, long offset, boolean isPrimaryReplica, BlockType blockType) {
+    this(hfileName, null, null, offset, isPrimaryReplica, blockType, false);
+  }
+
+  /**
+   * Construct a new BlockCacheKey, with file, column family and region information. This should be
+   * used when inserting keys into the cache, so that region cache metrics are recorded properly.
+   * @param hfileName The name of the HFile this block belongs to.
+   * @param cfName The column family name
+   * @param regionName The region name
+   * @param offset Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType Type of block
+   */
+  public BlockCacheKey(String hfileName, String cfName, String regionName, long offset,
+    boolean isPrimaryReplica, BlockType blockType, boolean archived) {
    this.isPrimaryReplicaBlock = isPrimaryReplica;
-    this.hfileName = hfileName;
    this.offset = offset;
    this.blockType = blockType;
+    this.stringPool = FilePathStringPool.getInstance();
+    // Use string pool for file, region and cf values
+    this.hfileNameId = stringPool.encode(hfileName);
+    this.regionId = (regionName != null) ? stringPool.encode(regionName) : -1;
+    this.cfId = (cfName != null) ? stringPool.encode(cfName) : -1;
+    this.archived = archived;
  }
+  /**
+   * Construct a new BlockCacheKey using a file path. File, column family and region information
+   * will be extracted from the passed path. This should be used when inserting keys into the cache,
+   * so that region cache metrics are recorded properly.
+   * @param hfilePath The path to the HFile
+   * @param offset Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType Type of block
+   */
  public BlockCacheKey(Path hfilePath, long offset, boolean isPrimaryReplica, BlockType blockType) {
-    this.filePath = hfilePath;
-    this.isPrimaryReplicaBlock = isPrimaryReplica;
-    this.hfileName = hfilePath.getName();
-    this.offset = offset;
-    this.blockType = blockType;
+    this(hfilePath.getName(), hfilePath.getParent().getName(),
+      hfilePath.getParent().getParent().getName(), offset, isPrimaryReplica, blockType,
+      HFileArchiveUtil.isHFileArchived(hfilePath));
  }
  @Override
  public int hashCode() {
-    return hfileName.hashCode() * 127 + (int) (offset ^ (offset >>> 32));
+    return hfileNameId * 127 + (int) (offset ^ (offset >>> 32));
  }
  @Override
  public boolean equals(Object o) {
    if (o instanceof BlockCacheKey) {
      BlockCacheKey k = (BlockCacheKey) o;
-      return offset == k.offset
-        && (hfileName == null ? k.hfileName == null : hfileName.equals(k.hfileName));
-    } else {
-      return false;
+      if (offset != k.offset) {
+        return false;
+      }
+      return getHfileName().equals(k.getHfileName());
    }
+    return false;
  }
  @Override
  public String toString() {
-    return this.hfileName + '_' + this.offset;
+    return getHfileName() + '_' + this.offset;
  }
  public static final long FIXED_OVERHEAD = ClassSize.estimateBase(BlockCacheKey.class, false);
  /**
-   * Strings have two bytes per character due to default Java Unicode encoding (hence length times
-   * 2).
+   * With the compressed format using integer file IDs, the heap size is significantly reduced. We
+   * now only store a 4-byte integer instead of the full file name string.
   */
  @Override
  public long heapSize() {
-    return ClassSize.align(FIXED_OVERHEAD + ClassSize.STRING + 2 * hfileName.length());
+    return ClassSize.align(FIXED_OVERHEAD);
  }
- // can't avoid this unfortunately
- /** Returns The hfileName portion of this cache key */
+ /**
+ * Returns the hfileName portion of this cache key.
+ * @return The file name
+ */
public String getHfileName() {
- return hfileName;
+ return stringPool.decode(hfileNameId);
+ }
+
+ /**
+ * Returns the hfileName portion of this cache key.
Review Comment:
The JavaDoc comment says "Returns the hfileName portion of this cache key" but the method returns the region name. The comment should say "Returns the region name portion of this cache key."
```suggestion
* Returns the region name portion of this cache key.
```
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:
##########
@@ -19,83 +19,164 @@
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool;
import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
import org.apache.yetus.audience.InterfaceAudience;
/**
* Cache Key for use with implementations of {@link BlockCache}
*/
@InterfaceAudience.Private
public class BlockCacheKey implements HeapSize, java.io.Serializable {
- private static final long serialVersionUID = -5199992013113130534L;
- private final String hfileName;
+  private static final long serialVersionUID = -5199992013113130535L; // Changed due to format change
+
+ // New compressed format using integer file ID (when codec is available)
+ private final int hfileNameId;
+
+ private transient final FilePathStringPool stringPool;
+
+ private final int regionId;
+
+ private final int cfId;
+
private final long offset;
+
private BlockType blockType;
+
private final boolean isPrimaryReplicaBlock;
- private Path filePath;
+ private final boolean archived;
  /**
-   * Construct a new BlockCacheKey
+   * Constructs a new BlockCacheKey with the file name and offset only. To be used for cache lookups
+   * only, DO NOT use this for creating keys when inserting into the cache. Use either the
+   * overriding constructors with the path parameter or the region and cf parameters, otherwise,
+   * region cache metrics won't be recorded properly.
   * @param hfileName The name of the HFile this block belongs to.
   * @param offset Offset of the block into the file
   */
  public BlockCacheKey(String hfileName, long offset) {
-    this(hfileName, offset, true, BlockType.DATA);
+    this(hfileName, null, null, offset, true, BlockType.DATA, false);
  }
+  /**
+   * Constructs a new BlockCacheKey with the file name, offset, replica and type only. To be used
+   * for cache lookups only, DO NOT use this for creating keys when inserting into the cache. Use
+   * either the overriding constructors with the path parameter or the region and cf parameters,
+   * otherwise, region cache metrics won't be recorded properly.
+   * @param hfileName The name of the HFile this block belongs to.
+   * @param offset Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType Type of block
+   */
  public BlockCacheKey(String hfileName, long offset, boolean isPrimaryReplica, BlockType blockType) {
+    this(hfileName, null, null, offset, isPrimaryReplica, blockType, false);
+  }
+
+  /**
+   * Construct a new BlockCacheKey, with file, column family and region information. This should be
+   * used when inserting keys into the cache, so that region cache metrics are recorded properly.
+   * @param hfileName The name of the HFile this block belongs to.
+   * @param cfName The column family name
+   * @param regionName The region name
+   * @param offset Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType Type of block
+   */
+  public BlockCacheKey(String hfileName, String cfName, String regionName, long offset,
+    boolean isPrimaryReplica, BlockType blockType, boolean archived) {
    this.isPrimaryReplicaBlock = isPrimaryReplica;
-    this.hfileName = hfileName;
    this.offset = offset;
    this.blockType = blockType;
+    this.stringPool = FilePathStringPool.getInstance();
+    // Use string pool for file, region and cf values
+    this.hfileNameId = stringPool.encode(hfileName);
+    this.regionId = (regionName != null) ? stringPool.encode(regionName) : -1;
+    this.cfId = (cfName != null) ? stringPool.encode(cfName) : -1;
+    this.archived = archived;
  }
+  /**
+   * Construct a new BlockCacheKey using a file path. File, column family and region information
+   * will be extracted from the passed path. This should be used when inserting keys into the cache,
+   * so that region cache metrics are recorded properly.
+   * @param hfilePath The path to the HFile
+   * @param offset Offset of the block into the file
+   * @param isPrimaryReplica Whether this is from primary replica
+   * @param blockType Type of block
+   */
  public BlockCacheKey(Path hfilePath, long offset, boolean isPrimaryReplica, BlockType blockType) {
-    this.filePath = hfilePath;
-    this.isPrimaryReplicaBlock = isPrimaryReplica;
-    this.hfileName = hfilePath.getName();
-    this.offset = offset;
-    this.blockType = blockType;
+    this(hfilePath.getName(), hfilePath.getParent().getName(),
+      hfilePath.getParent().getParent().getName(), offset, isPrimaryReplica, blockType,
+      HFileArchiveUtil.isHFileArchived(hfilePath));
  }
  @Override
  public int hashCode() {
-    return hfileName.hashCode() * 127 + (int) (offset ^ (offset >>> 32));
+    return hfileNameId * 127 + (int) (offset ^ (offset >>> 32));
  }
  @Override
  public boolean equals(Object o) {
    if (o instanceof BlockCacheKey) {
      BlockCacheKey k = (BlockCacheKey) o;
-      return offset == k.offset
-        && (hfileName == null ? k.hfileName == null : hfileName.equals(k.hfileName));
-    } else {
-      return false;
+      if (offset != k.offset) {
+        return false;
+      }
+      return getHfileName().equals(k.getHfileName());
    }
+    return false;
  }
  @Override
  public String toString() {
-    return this.hfileName + '_' + this.offset;
+    return getHfileName() + '_' + this.offset;
  }
  public static final long FIXED_OVERHEAD = ClassSize.estimateBase(BlockCacheKey.class, false);
  /**
-   * Strings have two bytes per character due to default Java Unicode encoding (hence length times
-   * 2).
+   * With the compressed format using integer file IDs, the heap size is significantly reduced. We
+   * now only store a 4-byte integer instead of the full file name string.
   */
  @Override
  public long heapSize() {
-    return ClassSize.align(FIXED_OVERHEAD + ClassSize.STRING + 2 * hfileName.length());
+    return ClassSize.align(FIXED_OVERHEAD);
  }
- // can't avoid this unfortunately
- /** Returns The hfileName portion of this cache key */
+ /**
+ * Returns the hfileName portion of this cache key.
+ * @return The file name
+ */
public String getHfileName() {
- return hfileName;
+ return stringPool.decode(hfileNameId);
+ }
+
+ /**
+ * Returns the hfileName portion of this cache key.
+ * @return The region name
+ */
+ public String getRegionName() {
+ return stringPool.decode(regionId);
+ }
+
+ /**
+ * Returns the compressed file ID.
+ * @return the integer file ID
Review Comment:
The JavaDoc comment says "Returns the compressed file ID" but the method returns the region ID, not the file ID. The comment should say "Returns the compressed region ID."
```suggestion
* Returns the compressed region ID.
* @return the integer region ID
```
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java:
##########
@@ -1828,24 +1837,27 @@ private void join() throws InterruptedException {
@Override
public void shutdown() {
- disableCache();
-    LOG.info("Shutdown bucket cache: IO persistent=" + ioEngine.isPersistent() + "; path to write="
-      + persistencePath);
- if (ioEngine.isPersistent() && persistencePath != null) {
- try {
- join();
- if (cachePersister != null) {
- LOG.info("Shutting down cache persister thread.");
- cachePersister.shutdown();
- while (cachePersister.isAlive()) {
- Thread.sleep(10);
+ if (isCacheEnabled()) {
+ disableCache();
+      LOG.info("Shutdown bucket cache: IO persistent=" + ioEngine.isPersistent()
+        + "; path to write=" + persistencePath);
+ if (ioEngine.isPersistent() && persistencePath != null) {
+ try {
+ join();
+ if (cachePersister != null) {
+ LOG.info("Shutting down cache persister thread.");
+ cachePersister.shutdown();
+ while (cachePersister.isAlive()) {
+ Thread.sleep(10);
+ }
}
+ persistToFile();
Review Comment:
The `FilePathStringPool.getInstance().clear()` is called at line 1855 after `persistToFile()` at line 1854. However, the string pool data is needed for serialization during `persistToFile()` because BlockCacheKeys need to decode their string IDs when being serialized. Clearing the pool before persistence completes could cause issues. The clear should happen only after successful persistence, or not at all if the pool will be reused on restart.
```suggestion
persistToFile();
// Only clear the string pool after successful persistence
```
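A minimal sketch of that ordering, assuming `persistToFile()` signals failure by throwing `IOException`:
```java
try {
  persistToFile();
  // The pool was still needed above to decode key strings during
  // serialization; clear it only once the cache state is safely on disk.
  FilePathStringPool.getInstance().clear();
} catch (IOException e) {
  LOG.error("Cache persistence failed; keeping string pool intact", e);
}
```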
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java:
##########
@@ -1927,33 +1939,28 @@ public int evictBlocksByHfileName(String hfileName) {
}
@Override
-  public int evictBlocksByHfilePath(Path hfilePath) {
-    return evictBlocksRangeByHfileName(hfilePath.getName(), hfilePath, 0, Long.MAX_VALUE);
-  }
-
-  public int evictBlocksRangeByHfileName(String hfileName, Path filePath, long initOffset,
-    long endOffset) {
+  public int evictBlocksRangeByHfileName(String hfileName, long initOffset, long endOffset) {
    Set<BlockCacheKey> keySet = getAllCacheKeysForFile(hfileName, initOffset, endOffset);
    LOG.debug("found {} blocks for file {}, starting offset: {}, end offset: {}", keySet.size(),
      hfileName, initOffset, endOffset);
    int numEvicted = 0;
    for (BlockCacheKey key : keySet) {
-      if (filePath != null) {
-        key.setFilePath(filePath);
-      }
      if (evictBlock(key)) {
        ++numEvicted;
      }
    }
+    if (numEvicted > 0) {
+      // We need to make sure whether we are evicting all blocks for this given file
+      int totalFileKeys = getAllCacheKeysForFile(hfileName, 0, Long.MAX_VALUE).size();
+      if (totalFileKeys == numEvicted) {
+        FilePathStringPool.getInstance().remove(hfileName);
Review Comment:
The logic at lines 1952-1958 checks if all blocks for a file have been evicted and only then removes the filename from the string pool. However, there's a potential issue: if the method is called with a restricted range (`initOffset` to `endOffset`), it might evict only a subset of blocks, but then check against the full range to determine if removal is appropriate.
The check at line 1954 queries `getAllCacheKeysForFile(hfileName, 0, Long.MAX_VALUE)` after eviction to see if all blocks are gone. But this creates a window where:
1. Some blocks in range [initOffset, endOffset] are evicted (numEvicted blocks)
2. The check compares numEvicted against totalFileKeys from the full range
3. If they match, it removes the filename from the pool
However, if `evictBlocksRangeByHfileName` is called with a partial range and that partial range happens to contain all remaining blocks (because others were evicted previously), the condition will be true and the filename will be removed, even though more blocks might be added later for the same file. Additionally, other BlockCacheKeys for this file (that aren't evicted) will still have the ID encoded, so their `getHfileName()` calls will return null after the pool entry is removed.
```suggestion
    // Only remove the filename from the pool if we are evicting the full range
    if (initOffset == 0 && endOffset == Long.MAX_VALUE) {
      int totalFileKeys = getAllCacheKeysForFile(hfileName, 0, Long.MAX_VALUE).size();
      if (totalFileKeys == numEvicted) {
        FilePathStringPool.getInstance().remove(hfileName);
      }
```
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/FilePathStringPool.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.hfile.bucket;
+
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Pool of string values encoded to integer IDs for use in BlockCacheKey. This allows for avoiding
+ * duplicating string values for file names, region and CF values on various BlockCacheKey
+ * instances. Normally, single hfiles have many blocks. This means all blocks from the same file
+ * will have the very same file, region and CF names. On very large BucketCache setups (i.e. file
+ * based cache with TB size order), can save few GBs of memory by avoiding repeating these common
+ * string values on blocks from the same file. The FilePathStringPool is implemented as a singleton,
+ * since the same pool should be shared by all BlockCacheKey instances, as well as the BucketCache
+ * object itself. The Id for an encoded string is an integer. Any new String added to the pool is
+ * assigned the next available integer ID, starting from 0 upwards. That sets the total pool
+ * capacity to Integer.MAX_VALUE. In the event of ID exhaustion (integer overflow when Id values
+ * reach Integer.MAX_VALUE), the encode() method will restart iterating over int values
+ * incrementally from 0 until it finds an unused ID. Strings can be removed from the pool using the
+ * remove() method. BucketCache should call this when evicting all blocks for a given file (see
+ * BucketCache.evictFileBlocksFromCache()).
+ * <p>
+ * Thread-safe implementation that maintains bidirectional mappings between
strings and IDs.
+ * </p>
+ */
+@InterfaceAudience.Private
+public class FilePathStringPool {
+  private static final Logger LOG = LoggerFactory.getLogger(FilePathStringPool.class);
+
+  // Bidirectional mappings for string objects re-use
+  private final ConcurrentHashMap<String, Integer> stringToId = new ConcurrentHashMap<>();
+  private final ConcurrentHashMap<Integer, String> idToString = new ConcurrentHashMap<>();
+ private final AtomicInteger nexId = new AtomicInteger(0);
+
+ private static FilePathStringPool instance;
+
+ public static FilePathStringPool getInstance() {
+ synchronized (FilePathStringPool.class) {
+ if (instance == null) {
+ instance = new FilePathStringPool();
+ }
+ }
+ return instance;
+ }
+
+ private FilePathStringPool() {
+ // Private constructor for singleton
+ }
+
+ /**
+ * Gets or creates an integer ID for the given String.
+ * @param string value for the file/region/CF name.
+ * @return the integer ID encoding this string in the pool.
+ */
+ public int encode(String string) {
+ if (string == null) {
+ throw new IllegalArgumentException("hfileName cannot be null");
+ }
+ if (stringToId.size() == Integer.MAX_VALUE) {
+ throw new IllegalStateException(
+        "String pool has reached maximum capacity of " + Integer.MAX_VALUE + " unique strings.");
+ }
+ return stringToId.computeIfAbsent(string, name -> {
Review Comment:
There's a potential race condition where the check `stringToId.size() == Integer.MAX_VALUE` could pass, but then multiple threads could enter `computeIfAbsent` and try to insert new strings, exceeding the maximum capacity. The size check should either be moved inside the computeIfAbsent lambda or replaced with a more robust mechanism to prevent overflow.
```suggestion
return stringToId.computeIfAbsent(string, name -> {
if (stringToId.size() == Integer.MAX_VALUE) {
throw new IllegalStateException(
"String pool has reached maximum capacity of " + Integer.MAX_VALUE
+ " unique strings.");
}
```
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java:
##########
@@ -775,19 +779,21 @@ public void fileCacheCompleted(Path filePath, long size) {
fullyCachedFiles.put(filePath.getName(), pair);
}
- private void updateRegionCachedSize(Path filePath, long cachedSize) {
- if (filePath != null) {
- if (HFileArchiveUtil.isHFileArchived(filePath)) {
-      LOG.trace("Skipping region cached size update for archived file: {}", filePath);
+ private void updateRegionCachedSize(BlockCacheKey key, long cachedSize) {
+ if (key.getRegionName() != null) {
+ if (key.isArchived()) {
+        LOG.trace("Skipping region cached size update for archived file:{} from region: {}",
+          key.getHfileName(), key.getRegionName());
} else {
- String regionName = filePath.getParent().getParent().getName();
+ String regionName = key.getRegionName();
regionCachedSize.merge(regionName, cachedSize,
(previousSize, newBlockSize) -> previousSize + newBlockSize);
LOG.trace("Updating region cached size for region: {}", regionName);
// If all the blocks for a region are evicted from the cache,
// remove the entry for that region from regionCachedSize map.
if (regionCachedSize.get(regionName) <= 0) {
regionCachedSize.remove(regionName);
+ FilePathStringPool.getInstance().remove(regionName);
Review Comment:
Removing the region name from the string pool when `regionCachedSize.get(regionName) <= 0` (line 796) is problematic. The region name might still be referenced by BlockCacheKeys that are in the cache but haven't been evicted yet. Additionally, other regions might share the same encoded region name string in the pool. Once removed from the pool, any subsequent calls to `getRegionName()` on existing BlockCacheKeys will return null, breaking functionality that depends on the region name being available.
Consider tracking reference counts for each pooled string across all BlockCacheKeys, or only removing region names during cache shutdown/clear operations.
```suggestion
        // Do not remove regionName from FilePathStringPool here; it may still be referenced.
```
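A rough sketch of the reference-counting alternative; the class and method names below are hypothetical, not part of the PR, and coarse synchronization is used so the zero-transition in `release()` cannot race with a concurrent `encode()` of the same string:
```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical refcounted pool: an entry is freed only when the last
// BlockCacheKey referencing it has released its ID.
class RefCountedStringPool {
  private final Map<String, Integer> stringToId = new HashMap<>();
  private final Map<Integer, String> idToString = new HashMap<>();
  private final Map<Integer, Integer> refCounts = new HashMap<>();
  private int nextId = 0;

  public synchronized int encode(String s) {
    Integer id = stringToId.get(s);
    if (id == null) {
      id = nextId++;
      stringToId.put(s, id);
      idToString.put(id, s);
    }
    // Every key that encodes the string takes one reference.
    refCounts.merge(id, 1, Integer::sum);
    return id;
  }

  public synchronized String decode(int id) {
    return idToString.get(id);
  }

  public synchronized void release(int id) {
    Integer count = refCounts.get(id);
    if (count == null) {
      return; // unknown or already freed ID
    }
    if (count == 1) {
      // Last reference gone: safe to drop both mappings.
      refCounts.remove(id);
      stringToId.remove(idToString.remove(id));
    } else {
      refCounts.put(id, count - 1);
    }
  }
}
```
Under this shape, eviction paths would call `release()` per evicted key instead of the cache deciding when to `remove()` a name outright.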
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java:
##########
@@ -599,7 +599,8 @@ protected void cacheBlockWithWaitInternal(BlockCacheKey cacheKey, Cacheable cach
if (cacheKey.getBlockType() == null && cachedItem.getBlockType() != null) {
cacheKey.setBlockType(cachedItem.getBlockType());
}
- LOG.trace("Caching key={}, item={}", cacheKey, cachedItem);
+    LOG.debug("Caching key={}, item={}, key keap size={}", cacheKey, cachedItem,
Review Comment:
Typo in log message: "keap" should be "heap".
```suggestion
    LOG.debug("Caching key={}, item={}, key heap size={}", cacheKey, cachedItem,
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]