[
https://issues.apache.org/jira/browse/GEODE-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571905#comment-17571905
]
ASF subversion and git services commented on GEODE-10401:
---------------------------------------------------------
Commit 649015b7a880c4e1cff42126bf368cad0c0ec1bc in geode's branch
refs/heads/develop from Jakov Varenina
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=649015b7a8 ]
GEODE-10401: Configurable .drf recovery HashMap overflow threshold (#7828)
Configurable with the JVM parameter:
gemfire.disk.drfHashMapOverflowThreshold
Default value: 805306368
When the configured threshold value is reached, the server overflows to a
new hashmap during the recovery of .drf files. Warning: if you set the
threshold parameter above 805306368, an unneeded delay will occur due to a
bug in the fastutil dependency.
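A minimal sketch of how such a threshold could be set and read, assuming it is an ordinary JVM system property; the gfsh --J syntax and the Long.getLong accessor are illustrative and not taken from the Geode sources:
{code:java}
// Hypothetical example: lowering the overflow threshold when starting a server.
// gfsh> start server --name=server1 --J=-Dgemfire.disk.drfHashMapOverflowThreshold=402653184
public class DrfThresholdExample {
  // Default value quoted from the commit message above.
  private static final long DEFAULT_THRESHOLD = 805306368L;

  static long drfHashMapOverflowThreshold() {
    // Assumption: the property is read as a plain JVM system property.
    return Long.getLong("gemfire.disk.drfHashMapOverflowThreshold", DEFAULT_THRESHOLD);
  }

  public static void main(String[] args) {
    System.out.println("Effective threshold: " + drfHashMapOverflowThreshold());
  }
}
{code}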
> Oplog recovery takes too long due to fault in fastutil library
> --------------------------------------------------------------
>
> Key: GEODE-10401
> URL: https://issues.apache.org/jira/browse/GEODE-10401
> Project: Geode
> Issue Type: Bug
> Reporter: Jakov Varenina
> Assignee: Jakov Varenina
> Priority: Major
> Labels: pull-request-available
>
> As we already know, .drf file delete operations contain only the
> OplogEntryID. During recovery, the server reads each OplogEntryID (byte by
> byte) and stores it in a hash set to use later when recovering .crf files.
> Two set types are used: IntOpenHashSet and LongOpenHashSet. An OplogEntryID
> that fits in an _integer_ is stored in the IntOpenHashSet, and a _long
> integer_ in the LongOpenHashSet, probably for memory and performance
> reasons. The OplogEntryID starts at zero and increments over time. A sketch
> of this split is shown below.
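> A minimal sketch of the split described above, using the fastutil set
> types; this is illustrative only and not the actual Geode recovery code
> (the class and method names are invented):
> {code:java}
> import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
> import it.unimi.dsi.fastutil.longs.LongOpenHashSet;
>
> class RecoveredDeletedIds {
>   private final IntOpenHashSet intIds = new IntOpenHashSet();
>   private final LongOpenHashSet longIds = new LongOpenHashSet();
>
>   // OplogEntryIDs start at zero, so a non-negative id that fits in an int
>   // goes into the int set; everything else goes into the long set.
>   void add(long oplogEntryId) {
>     if (oplogEntryId <= Integer.MAX_VALUE) {
>       intIds.add((int) oplogEntryId);
>     } else {
>       longIds.add(oplogEntryId);
>     }
>   }
>
>   boolean contains(long oplogEntryId) {
>     return oplogEntryId <= Integer.MAX_VALUE
>         ? intIds.contains((int) oplogEntryId)
>         : longIds.contains(oplogEntryId);
>   }
> }
> {code}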
> We have observed in the logs that more than 4 minutes (sometimes even
> longer) pass between the warning ("There is a large number of deleted
> entries") and the preceding log message.
> {code:java}
> {"timestamp":"2022-06-14T21:41:43.772+08:00","severity":"info","message":"Recovering oplog#271 /opt/dbservice/data/datastore/BACKUPdataDiskStore_271.drf for disk store dataDiskStore.","metadata":
> {"timestamp":"2022-06-14T21:46:02.152+08:00","severity":"warning","message":"There is a large number of deleted entries within the disk-store, please execute an offline compaction.","metadata":
> {code}
> When the above warning occurs, it means that the limit of _805306401_
> entries in the IntOpenHashSet has been reached. In that case, the server
> rolls over to a new IntOpenHashSet, where the warning and the delay can
> happen again.
> The problem is that, due to a fault in the fastutil dependency
> (IntOpenHashSet and LongOpenHashSet), unnecessary rehashing happens multiple
> times before the maximum size is reached. From 805306368 entries onwards, a
> rehash is triggered for each new entry until the maximum size is reached.
> This rehashing adds several minutes to .drf oplog recovery but accomplishes
> nothing, since the maximum size has already been reached.
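> A sketch of the overflow idea behind the fix: roll over to a fresh
> IntOpenHashSet once the configured threshold is reached, so a single set
> never grows into the range where fastutil rehashes on every insertion. The
> class name and structure are illustrative and not the actual patch:
> {code:java}
> import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
> import java.util.ArrayList;
> import java.util.List;
>
> class OverflowingIntIdSet {
>   private final int overflowThreshold;  // e.g. 805306368, see the commit message above
>   private final List<IntOpenHashSet> fullSets = new ArrayList<>();
>   private IntOpenHashSet current = new IntOpenHashSet();
>
>   OverflowingIntIdSet(int overflowThreshold) {
>     this.overflowThreshold = overflowThreshold;
>   }
>
>   void add(int id) {
>     if (current.size() >= overflowThreshold) {
>       // Roll over before the set reaches the range where every insert rehashes.
>       fullSets.add(current);
>       current = new IntOpenHashSet();
>     }
>     current.add(id);
>   }
>
>   boolean contains(int id) {
>     if (current.contains(id)) {
>       return true;
>     }
>     for (IntOpenHashSet s : fullSets) {
>       if (s.contains(id)) {
>         return true;
>       }
>     }
>     return false;
>   }
> }
> {code}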
--
This message was sent by Atlassian Jira
(v8.20.10#820010)