Jakov Varenina created GEODE-10401:
--------------------------------------
Summary: Oplog recovery takes too long due to fault in fastutil
library
Key: GEODE-10401
URL: https://issues.apache.org/jira/browse/GEODE-10401
Project: Geode
Issue Type: Bug
Reporter: Jakov Varenina
As we already know, delete operations in the .drf file contain only the
OplogEntryID. During recovery, the server reads each OplogEntryID (byte by
byte) and stores it in a HashSet for later use when recovering .crf files.
There are two types of HashSet: IntOpenHashSet and LongOpenHashSet. An
OplogEntryID of type _integer_ is stored in IntOpenHashSet, and one of type
_long integer_ in LongOpenHashSet, probably for memory and performance
reasons. OplogEntryID starts at zero and increases over time. Recovery speed
can differ depending on which HashSet is used, so please take that into
account when estimating .drf recovery time.
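For illustration, the int/long split described above could look roughly like
the sketch below. The class name DeletedEntryIds is hypothetical, java.util
sets stand in for the fastutil sets, and the rule "fits in a signed 32-bit
int" is an assumption; Geode's actual boundary may differ.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the recovery-time ID set; not Geode's actual class.
final class DeletedEntryIds {
    // java.util sets stand in for fastutil's IntOpenHashSet / LongOpenHashSet.
    private final Set<Integer> intIds = new HashSet<>();
    private final Set<Long> longIds = new HashSet<>();

    // Assumption: an ID counts as "integer" when it fits in a signed 32-bit int.
    private static boolean fitsInInt(long id) {
        return id >= Integer.MIN_VALUE && id <= Integer.MAX_VALUE;
    }

    // Route the ID to the narrower set when possible, else to the long set.
    void add(long id) {
        if (fitsInInt(id)) {
            intIds.add((int) id);
        } else {
            longIds.add(id);
        }
    }

    // Look the ID up in the same set that add() would have used.
    boolean contains(long id) {
        return fitsInInt(id) ? intIds.contains((int) id) : longIds.contains(id);
    }

    int intCount() { return intIds.size(); }

    int longCount() { return longIds.size(); }
}
```

With this sketch, add(42L) lands in the int set and add(5_000_000_000L) in the
long set.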
In the logs we have observed that more than 4 minutes (sometimes even more)
pass between the "There is a large number of deleted entries" warning and the
previous log entry.
{"timestamp":"2022-06-14T21:41:43.772+08:00","severity":"info","message":"Recovering
oplog#271 /opt/dbservice/data/datastore/BACKUPdataDiskStore_271.drf for disk
store dataDiskStore.","metadata":
{"timestamp":"2022-06-14T21:46:02.152+08:00","severity":"warning","message":"There
is a large number of deleted entries within the disk-store, please execute an
offline compaction.","metadata":

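The gap between the two log entries can be checked from their timestamps. A
minimal sketch (the class and method names are only illustrative):

```java
import java.time.Duration;
import java.time.OffsetDateTime;

// Illustrative helper for measuring the time between two log entries.
final class LogGap {
    // Parses two ISO-8601 timestamps with offsets and returns the elapsed time.
    static Duration between(String earlier, String later) {
        return Duration.between(OffsetDateTime.parse(earlier),
                                OffsetDateTime.parse(later));
    }
}
```

For the two timestamps above, LogGap.between(...) gives 258380 ms, that is
4 minutes and 18.38 seconds.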
When the above warning is logged, it means that the limit of _805306401_
entries in the IntOpenHashSet has been reached. In that case, the server rolls
over to a new IntOpenHashSet, where the exception and the delay can happen
again.
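The roll-over can be pictured with the sketch below. It is not Geode's code:
RollingIdSet is a hypothetical name, java.util.HashSet stands in for
IntOpenHashSet, and a plain size check stands in for the exception that
triggers the roll-over. The limit is a constructor parameter so that the
behaviour can be tried with small numbers.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the roll-over behaviour; not Geode's actual code.
final class RollingIdSet {
    private final int maxPerSet; // stand-in for the 805306401 limit
    private final List<Set<Integer>> sets = new ArrayList<>();

    RollingIdSet(int maxPerSet) {
        this.maxPerSet = maxPerSet;
        sets.add(new HashSet<>());
    }

    // Adds the ID to the newest set, starting a fresh set once the newest is full.
    void add(int id) {
        if (contains(id)) {
            return; // already recorded in one of the sets
        }
        Set<Integer> current = sets.get(sets.size() - 1);
        if (current.size() >= maxPerSet) {
            current = new HashSet<>();
            sets.add(current);
        }
        current.add(id);
    }

    // An ID is present if any of the rolled sets holds it.
    boolean contains(int id) {
        for (Set<Integer> set : sets) {
            if (set.contains(id)) {
                return true;
            }
        }
        return false;
    }

    int setCount() { return sets.size(); }
}
```

Every roll-over starts a new set that grows towards the same limit, which is
why the delay can repeat.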
The problem is that, due to a fault in the fastutil dependency (IntOpenHashSet
and LongOpenHashSet), unnecessary rehashing happens multiple times before the
max size is reached. The rehashing starts at 805306368 entries and is repeated
for each new entry until the max size is reached. This rehashing adds several
minutes to .drf Oplog recovery.
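One possible explanation for the two numbers is sketched below. It assumes a
load factor of 0.75, a largest backing array of 2^30 slots, and a new table
size computed from expected / loadFactor in float precision. These are
assumptions about fastutil's internals, not facts taken from this report, and
the helper names are hypothetical rather than fastutil's API. Under these
assumptions 0.75 * 2^30 = 805306368, and float rounding makes every requested
size from 805306369 up to 805306400 resolve to the same 2^30 slots, so the
table would be rehashed without growing.

```java
// Hypothetical helpers modelling the sizing arithmetic; not fastutil's API.
final class RehashArithmetic {
    static final long MAX_TABLE = 1L << 30; // assumed largest backing array

    // Entries a table of n slots may hold before the next add triggers a rehash.
    static int maxFill(int n, float loadFactor) {
        return (int) Math.min((long) Math.ceil(n * loadFactor), n - 1);
    }

    // Smallest power of two that is greater than or equal to x.
    static long nextPowerOfTwo(long x) {
        if (x <= 1) {
            return 1;
        }
        return Long.highestOneBit(x - 1) << 1;
    }

    // Slots requested for `expected` entries. The division happens in float
    // precision, so nearby values of `expected` collapse to the same result.
    static long requestedSlots(int expected, float loadFactor) {
        long needed = (long) Math.ceil(expected / loadFactor);
        return Math.max(2, nextPowerOfTwo(needed));
    }
}
```

In this model requestedSlots(805306400, 0.75f) is still 2^30, while
requestedSlots(805306401, 0.75f) exceeds it, which would match the limit of
805306401 entries mentioned above.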
--
This message was sent by Atlassian Jira
(v8.20.10#820010)