[ https://issues.apache.org/jira/browse/GEODE-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jakov Varenina updated GEODE-10401:
-----------------------------------
Description: As we already know, the .drf file delete operations contain only the OplogEntryID. During recovery, the server reads (byte by byte) each OplogEntryID and stores it in a hash set that is used later when recovering the .crf files. Two kinds of hash sets are used: IntOpenHashSet and LongOpenHashSet. An OplogEntryID that fits in an _integer_ is stored in an IntOpenHashSet, and a _long integer_ in a LongOpenHashSet, presumably for memory and performance reasons. The OplogEntryID starts at zero and increases over time. Recovery speed can differ depending on which hash set is used, so please take that into account when estimating .drf recovery time.

We have observed in the logs that more than 4 minutes (sometimes considerably more) pass between the warning (There is a large number of deleted entries) and the preceding log message:

{code:java}
{"timestamp":"2022-06-14T21:41:43.772+08:00","severity":"info","message":"Recovering oplog#271 /opt/dbservice/data/datastore/BACKUPdataDiskStore_271.drf for disk store dataDiskStore.","metadata":
{"timestamp":"2022-06-14T21:46:02.152+08:00","severity":"warning","message":"There is a large number of deleted entries within the disk-store, please execute an offline compaction.","metadata":
{code}

When the above warning is logged, it means that the limit of _805306401_ entries in the IntOpenHashSet has been reached. In that case, the server rolls over to a new IntOpenHashSet, where the warning and the delay can occur again.

The problem is that, due to a fault in the fastutil dependency (IntOpenHashSet and LongOpenHashSet), unnecessary rehashing happens multiple times before the maximum size is reached: rehashing is triggered for _each_ new entry from 805306368 onwards, up to the maximum size. This rehashing adds several minutes to .drf oplog recovery.
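To make the two-set scheme concrete, here is a minimal sketch (hypothetical class and names, not Geode's actual recovery code) of how deleted IDs could be dispatched to IntOpenHashSet or LongOpenHashSet, with a rollover to a fresh set once the limit described above is reached:

{code:java}
import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
import it.unimi.dsi.fastutil.longs.LongOpenHashSet;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two-set scheme described above; the real
// implementation differs in how it detects that a set has hit its limit.
public class DeletedEntryIdSet {
  // Rollover threshold taken from the description above.
  private static final int MAX_SET_SIZE = 805306401;

  private final List<IntOpenHashSet> intSets = new ArrayList<>();
  private final List<LongOpenHashSet> longSets = new ArrayList<>();

  public DeletedEntryIdSet() {
    intSets.add(new IntOpenHashSet());
    longSets.add(new LongOpenHashSet());
  }

  // Records one deleted OplogEntryID read from a .drf file.
  public void add(long id) {
    if (id >= 0 && id <= Integer.MAX_VALUE) {
      IntOpenHashSet current = intSets.get(intSets.size() - 1);
      if (current.size() >= MAX_SET_SIZE) {
        current = new IntOpenHashSet(); // roll over to a fresh set
        intSets.add(current);
      }
      current.add((int) id);
    } else {
      LongOpenHashSet current = longSets.get(longSets.size() - 1);
      if (current.size() >= MAX_SET_SIZE) {
        current = new LongOpenHashSet();
        longSets.add(current);
      }
      current.add(id);
    }
  }

  // Used while recovering .crf files: has this entry been deleted?
  public boolean contains(long id) {
    if (id >= 0 && id <= Integer.MAX_VALUE) {
      for (IntOpenHashSet s : intSets) {
        if (s.contains((int) id)) {
          return true;
        }
      }
      return false;
    }
    for (LongOpenHashSet s : longSets) {
      if (s.contains(id)) {
        return true;
      }
    }
    return false;
  }
}
{code}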
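The 805306368 figure is consistent with fastutil's fill limit for its largest backing array: the open hash sets use power-of-two tables capped at 2^30 slots, and with the default load factor of 0.75 the fill threshold is 0.75 * 2^30 = 805306368. Past that point the table cannot grow any further, so (per the description above) every additional entry triggers another rehash of the maximum-size table. The arithmetic can be checked with fastutil's own HashCommon helper, assuming fastutil is on the classpath:

{code:java}
import it.unimi.dsi.fastutil.Hash;
import it.unimi.dsi.fastutil.HashCommon;

public class MaxFillCheck {
  public static void main(String[] args) {
    int n = 1 << 30; // largest power-of-two table fastutil can allocate
    // maxFill(n, f) = number of entries a table of n slots may hold before rehashing.
    System.out.println(HashCommon.maxFill(n, Hash.DEFAULT_LOAD_FACTOR)); // prints 805306368
  }
}
{code}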
> Oplog recovery takes too long due to fault in fastutil library
> --------------------------------------------------------------
>
>                 Key: GEODE-10401
>                 URL: https://issues.apache.org/jira/browse/GEODE-10401
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major
>              Labels: needsTriage
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)