taklwu commented on code in PR #7477:
URL: https://github.com/apache/hbase/pull/7477#discussion_r2566597313
##########
dev-support/spotbugs-exclude.xml:
##########
@@ -271,4 +271,12 @@
<Bug pattern="ML_SYNC_ON_UPDATED_FIELD"/>
</Match>
+ <Match>
+ <Class name="org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool"/>
+ <Or>
+ <Method name="getInstance"/>
+ </Or>
+ <Bug pattern="MS_EXPOSE_REP"/>
Review Comment:
[nit] can we add a comment explaining why this should be ignored?
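As a concrete (hypothetical) shape for that, the exclusion could carry an inline XML comment, e.g.:

```xml
<Match>
  <!-- getInstance() intentionally exposes the shared singleton pool:
       all BlockCacheKey instances and the BucketCache must share the
       same pool, so MS_EXPOSE_REP is a false positive here. -->
  <Class name="org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool"/>
  <Or>
    <Method name="getInstance"/>
  </Or>
  <Bug pattern="MS_EXPOSE_REP"/>
</Match>
```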
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/FilePathStringPool.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.hfile.bucket;
+
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Pool of string values encoded to integer IDs for use in BlockCacheKey. This allows for avoiding
+ * duplicating string values for file names, region and CF values on various BlockCacheKey
+ * instances. Normally, single hfiles have many blocks. This means all blocks from the same file
+ * will have the very same file, region and CF names. On very large BucketCache setups (i.e. file
+ * based cache with TB size order), can save few GBs of memory by avoiding repeating these common
+ * string values on blocks from the same file. The FilePathStringPool is implemented as a singleton,
+ * since the same pool should be shared by all BlockCacheKey instances, as well as the BucketCache
+ * object itself. The Id for an encoded string is an integer. Any new String added to the pool is
+ * assigned the next available integer ID, starting from 0 upwards. That sets the total pool
+ * capacity to Integer.MAX_VALUE. In the event of ID exhaustion (integer overflow when Id values
+ * reach Integer.MAX_VALUE), the encode() method will restart iterating over int values
+ * incrementally from 0 until it finds an unused ID. Strings can be removed from the pool using the
+ * remove() method. BucketCache should call this when evicting all blocks for a given file (see
+ * BucketCache.evictFileBlocksFromCache()).
+ * <p>
+ * Thread-safe implementation that maintains bidirectional mappings between strings and IDs.
+ * </p>
+ */
[email protected]
+public class FilePathStringPool {
+ private static final Logger LOG = LoggerFactory.getLogger(FilePathStringPool.class);
+
+ // Bidirectional mappings for string objects re-use
+ private final ConcurrentHashMap<String, Integer> stringToId = new ConcurrentHashMap<>();
+ private final ConcurrentHashMap<Integer, String> idToString = new ConcurrentHashMap<>();
+ private final AtomicInteger nexId = new AtomicInteger(0);
Review Comment:
[nit] typos?
```suggestion
private final AtomicInteger nextId = new AtomicInteger(0);
```
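For reviewers following along, here is a minimal sketch of the encode/decode/remove scheme the class javadoc describes. This is a simplified standalone model, not the PR's implementation: it omits the singleton accessor, the ID-exhaustion wrap-around, and logging.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the pool: bidirectional String <-> int mappings,
// with IDs handed out from 0 upwards.
class StringPoolSketch {
  private final ConcurrentHashMap<String, Integer> stringToId = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<Integer, String> idToString = new ConcurrentHashMap<>();
  private final AtomicInteger nextId = new AtomicInteger(0);

  /** Returns the existing ID for the string, or assigns the next free one. */
  int encode(String value) {
    return stringToId.computeIfAbsent(value, v -> {
      int id = nextId.getAndIncrement();
      idToString.put(id, v);
      return id;
    });
  }

  /** Returns the string for the given ID, or null if unknown (e.g. -1). */
  String decode(int id) {
    return idToString.get(id);
  }

  /** Drops both mappings, e.g. when all blocks of a file are evicted. */
  void remove(String value) {
    Integer id = stringToId.remove(value);
    if (id != null) {
      idToString.remove(id);
    }
  }
}
```

Encoding the same file name twice yields the same ID, so all BlockCacheKey instances for one hfile end up sharing a single pooled string.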
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/FilePathStringPool.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.hfile.bucket;
+
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Pool of string values encoded to integer IDs for use in BlockCacheKey. This allows for avoiding
+ * duplicating string values for file names, region and CF values on various BlockCacheKey
+ * instances. Normally, single hfiles have many blocks. This means all blocks from the same file
+ * will have the very same file, region and CF names. On very large BucketCache setups (i.e. file
+ * based cache with TB size order), can save few GBs of memory by avoiding repeating these common
Review Comment:
I found the answer: it saves a few GBs of heap, thanks.
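To make the "few GBs" figure concrete, here is a back-of-the-envelope estimate. All numbers (a 1 TB file-based BucketCache, 64 KB blocks, 64-char hfile names, ~40 bytes of per-String overhead) are illustrative assumptions, not measurements from this PR:

```java
// Rough estimate of heap saved by pooling hfile name strings instead of
// keeping a dedicated String per BlockCacheKey.
public class SavingsEstimate {
  public static void main(String[] args) {
    long cacheBytes = 1024L * 1024 * 1024 * 1024; // 1 TB of cached data
    long blockSize = 64L * 1024;                  // 64 KB per block
    long blocks = cacheBytes / blockSize;         // ~16.8 million cache keys

    // Cost of a dedicated name String per key: ~40 bytes of object/array
    // headers and references, plus 2 bytes per char (non-compact strings).
    long perKeyStringBytes = 40 + 2L * 64;

    long savedBytes = blocks * perKeyStringBytes;
    System.out.printf("~%.1f GB saved%n", savedBytes / (1024.0 * 1024 * 1024));
    // → prints: ~2.6 GB saved
  }
}
```

Scaling the cache toward the multi-TB setups the javadoc mentions pushes this into the "few GBs" range.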
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java:
##########
@@ -398,6 +398,9 @@ public long getCacheCompactedBlocksOnWriteThreshold() {
* configuration.
*/
public boolean shouldReadBlockFromCache(BlockType blockType) {
+ if (!getBlockCache().isPresent()) {
Review Comment:
[nit] did we hit a bug? Why do we need to check that the blockcache instance exists after this change?
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:
##########
@@ -19,83 +19,164 @@
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool;
import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
import org.apache.yetus.audience.InterfaceAudience;
/**
* Cache Key for use with implementations of {@link BlockCache}
*/
@InterfaceAudience.Private
public class BlockCacheKey implements HeapSize, java.io.Serializable {
- private static final long serialVersionUID = -5199992013113130534L;
- private final String hfileName;
+ private static final long serialVersionUID = -5199992013113130535L; // Changed due to format
+ // change
+
+ // New compressed format using integer file ID (when codec is available)
+ private final int hfileNameId;
+
+ private transient final FilePathStringPool stringPool;
+
+ private final int regionId;
+
+ private final int cfId;
+
private final long offset;
+
private BlockType blockType;
+
private final boolean isPrimaryReplicaBlock;
- private Path filePath;
+ private final boolean archived;
/**
- * Construct a new BlockCacheKey
+ * Constructs a new BlockCacheKey with the file name and offset only. To be used for cache lookups
+ * only, DO NOT use this for creating keys when inserting into the cache. Use either the
+ * overriding constructors with the path parameter or the region and cf parameters, otherwise,
+ * region cache metrics won't be recorded properly.
* @param hfileName The name of the HFile this block belongs to.
* @param offset Offset of the block into the file
*/
public BlockCacheKey(String hfileName, long offset) {
- this(hfileName, offset, true, BlockType.DATA);
+ this(hfileName, null, null, offset, true, BlockType.DATA, false);
}
+ /**
+ * Constructs a new BlockCacheKey with the file name, offset, replica and type only. To be used
+ * for cache lookups only, DO NOT use this for creating keys when inserting into the cache. Use
+ * either the overriding constructors with the path parameter or the region and cf parameters,
+ * otherwise, region cache metrics won't be recorded properly.
+ * @param hfileName The name of the HFile this block belongs to.
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
public BlockCacheKey(String hfileName, long offset, boolean isPrimaryReplica,
BlockType blockType) {
+ this(hfileName, null, null, offset, isPrimaryReplica, blockType, false);
+ }
+
+ /**
+ * Construct a new BlockCacheKey, with file, column family and region information. This should be
+ * used when inserting keys into the cache, so that region cache metrics are recorded properly.
+ * @param hfileName The name of the HFile this block belongs to.
+ * @param cfName The column family name
+ * @param regionName The region name
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
+ public BlockCacheKey(String hfileName, String cfName, String regionName, long offset,
+ boolean isPrimaryReplica, BlockType blockType, boolean archived) {
this.isPrimaryReplicaBlock = isPrimaryReplica;
- this.hfileName = hfileName;
this.offset = offset;
this.blockType = blockType;
+ this.stringPool = FilePathStringPool.getInstance();
+ // Use string pool for file, region and cf values
+ this.hfileNameId = stringPool.encode(hfileName);
+ this.regionId = (regionName != null) ? stringPool.encode(regionName) : -1;
+ this.cfId = (cfName != null) ? stringPool.encode(cfName) : -1;
+ this.archived = archived;
}
+ /**
+ * Construct a new BlockCacheKey using a file path. File, column family and region information
+ * will be extracted from the passed path. This should be used when inserting keys into the cache,
+ * so that region cache metrics are recorded properly.
+ * @param hfilePath The path to the HFile
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
public BlockCacheKey(Path hfilePath, long offset, boolean isPrimaryReplica,
BlockType blockType) {
- this.filePath = hfilePath;
- this.isPrimaryReplicaBlock = isPrimaryReplica;
- this.hfileName = hfilePath.getName();
- this.offset = offset;
- this.blockType = blockType;
+ this(hfilePath.getName(), hfilePath.getParent().getName(),
+ hfilePath.getParent().getParent().getName(), offset, isPrimaryReplica, blockType,
+ HFileArchiveUtil.isHFileArchived(hfilePath));
}
@Override
public int hashCode() {
- return hfileName.hashCode() * 127 + (int) (offset ^ (offset >>> 32));
+ return hfileNameId * 127 + (int) (offset ^ (offset >>> 32));
}
@Override
public boolean equals(Object o) {
if (o instanceof BlockCacheKey) {
BlockCacheKey k = (BlockCacheKey) o;
- return offset == k.offset
- && (hfileName == null ? k.hfileName == null : hfileName.equals(k.hfileName));
- } else {
- return false;
+ if (offset != k.offset) {
+ return false;
+ }
+ return getHfileName().equals(k.getHfileName());
}
+ return false;
}
@Override
public String toString() {
- return this.hfileName + '_' + this.offset;
+ return getHfileName() + '_' + this.offset;
}
public static final long FIXED_OVERHEAD = ClassSize.estimateBase(BlockCacheKey.class, false);
/**
- * Strings have two bytes per character due to default Java Unicode encoding (hence length times
- * 2).
+ * With the compressed format using integer file IDs, the heap size is significantly reduced. We
+ * now only store a 4-byte integer instead of the full file name string.
*/
@Override
public long heapSize() {
- return ClassSize.align(FIXED_OVERHEAD + ClassSize.STRING + 2 * hfileName.length());
+ return ClassSize.align(FIXED_OVERHEAD);
}
- // can't avoid this unfortunately
- /** Returns The hfileName portion of this cache key */
+ /**
+ * Returns the hfileName portion of this cache key.
+ * @return The file name
+ */
public String getHfileName() {
- return hfileName;
+ return stringPool.decode(hfileNameId);
+ }
+
+ /**
+ * Returns the hfileName portion of this cache key.
+ * @return The region name
+ */
+ public String getRegionName() {
+ return stringPool.decode(regionId);
+ }
+
+ /**
+ * Returns the compressed file ID.
+ * @return the integer file ID
+ */
+ public int getRegionId() {
+ return regionId;
+ }
Review Comment:
[nit] this is unused
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DataTieringManager.java:
##########
@@ -96,11 +95,14 @@ public static synchronized DataTieringManager getInstance() {
* @throws DataTieringException if there is an error retrieving the HFile path or configuration
*/
public boolean isDataTieringEnabled(BlockCacheKey key) throws DataTieringException {
- Path hFilePath = key.getFilePath();
- if (hFilePath == null) {
- throw new DataTieringException("BlockCacheKey Doesn't Contain HFile Path");
+ if (key.getCfName() == null || key.getRegionName() == null) {
+ throw new DataTieringException(
+ "BlockCacheKey doesn't contain " + "Column Family Name or Region Name");
Review Comment:
```suggestion
"BlockCacheKey doesn't contain Column Family Name or Region Name");
```
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:
##########
@@ -19,83 +19,164 @@
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.io.hfile.bucket.FilePathStringPool;
import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
import org.apache.yetus.audience.InterfaceAudience;
/**
* Cache Key for use with implementations of {@link BlockCache}
*/
@InterfaceAudience.Private
public class BlockCacheKey implements HeapSize, java.io.Serializable {
- private static final long serialVersionUID = -5199992013113130534L;
- private final String hfileName;
+ private static final long serialVersionUID = -5199992013113130535L; // Changed due to format
+ // change
+
+ // New compressed format using integer file ID (when codec is available)
+ private final int hfileNameId;
+
+ private transient final FilePathStringPool stringPool;
+
+ private final int regionId;
+
+ private final int cfId;
+
private final long offset;
+
private BlockType blockType;
+
private final boolean isPrimaryReplicaBlock;
- private Path filePath;
+ private final boolean archived;
/**
- * Construct a new BlockCacheKey
+ * Constructs a new BlockCacheKey with the file name and offset only. To be used for cache lookups
+ * only, DO NOT use this for creating keys when inserting into the cache. Use either the
+ * overriding constructors with the path parameter or the region and cf parameters, otherwise,
+ * region cache metrics won't be recorded properly.
* @param hfileName The name of the HFile this block belongs to.
* @param offset Offset of the block into the file
*/
public BlockCacheKey(String hfileName, long offset) {
- this(hfileName, offset, true, BlockType.DATA);
+ this(hfileName, null, null, offset, true, BlockType.DATA, false);
}
+ /**
+ * Constructs a new BlockCacheKey with the file name, offset, replica and type only. To be used
+ * for cache lookups only, DO NOT use this for creating keys when inserting into the cache. Use
+ * either the overriding constructors with the path parameter or the region and cf parameters,
+ * otherwise, region cache metrics won't be recorded properly.
+ * @param hfileName The name of the HFile this block belongs to.
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
public BlockCacheKey(String hfileName, long offset, boolean isPrimaryReplica,
BlockType blockType) {
+ this(hfileName, null, null, offset, isPrimaryReplica, blockType, false);
+ }
+
+ /**
+ * Construct a new BlockCacheKey, with file, column family and region information. This should be
+ * used when inserting keys into the cache, so that region cache metrics are recorded properly.
+ * @param hfileName The name of the HFile this block belongs to.
+ * @param cfName The column family name
+ * @param regionName The region name
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
+ public BlockCacheKey(String hfileName, String cfName, String regionName, long offset,
+ boolean isPrimaryReplica, BlockType blockType, boolean archived) {
this.isPrimaryReplicaBlock = isPrimaryReplica;
- this.hfileName = hfileName;
this.offset = offset;
this.blockType = blockType;
+ this.stringPool = FilePathStringPool.getInstance();
+ // Use string pool for file, region and cf values
+ this.hfileNameId = stringPool.encode(hfileName);
+ this.regionId = (regionName != null) ? stringPool.encode(regionName) : -1;
+ this.cfId = (cfName != null) ? stringPool.encode(cfName) : -1;
+ this.archived = archived;
}
+ /**
+ * Construct a new BlockCacheKey using a file path. File, column family and region information
+ * will be extracted from the passed path. This should be used when inserting keys into the cache,
+ * so that region cache metrics are recorded properly.
+ * @param hfilePath The path to the HFile
+ * @param offset Offset of the block into the file
+ * @param isPrimaryReplica Whether this is from primary replica
+ * @param blockType Type of block
+ */
public BlockCacheKey(Path hfilePath, long offset, boolean isPrimaryReplica,
BlockType blockType) {
- this.filePath = hfilePath;
- this.isPrimaryReplicaBlock = isPrimaryReplica;
- this.hfileName = hfilePath.getName();
- this.offset = offset;
- this.blockType = blockType;
+ this(hfilePath.getName(), hfilePath.getParent().getName(),
+ hfilePath.getParent().getParent().getName(), offset, isPrimaryReplica, blockType,
+ HFileArchiveUtil.isHFileArchived(hfilePath));
}
@Override
public int hashCode() {
- return hfileName.hashCode() * 127 + (int) (offset ^ (offset >>> 32));
+ return hfileNameId * 127 + (int) (offset ^ (offset >>> 32));
Review Comment:
[nit] how do we make sure this won't cause any conflicts with this encoded ID?
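For context on that question: in the diff above, the pooled ID is only mixed into hashCode(), while equals() compares the decoded file names, so hash collisions affect bucketing but never key identity. A small illustration of that hashCode/equals contract, with made-up IDs and offsets:

```java
// Two keys can collide on the hash formula used in the diff
// (id * 127 + folded offset) yet remain distinct under equals().
public class HashContractSketch {
  public static void main(String[] args) {
    int idA = 1;
    long offsetA = 0L;
    int idB = 0;
    long offsetB = 127L;
    int hashA = idA * 127 + (int) (offsetA ^ (offsetA >>> 32));
    int hashB = idB * 127 + (int) (offsetB ^ (offsetB >>> 32));
    // Both hashes are 127: the keys land in the same hash bucket, but a
    // map lookup then calls equals(), which compares the decoded names.
    System.out.println(hashA == hashB);
  }
}
```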
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]