wchevreuil commented on PR #6661: URL: https://github.com/apache/hbase/pull/6661#issuecomment-2782754537
> > > I have a few points on this:
> > >
> > > 1. Class names should follow @anmolnar suggestions;
> > > 2. Doing bulkload at the mapper writer would be extremely costly for medium to large bulkloads. I wonder if that could cause WAL player jobs to time out and retry, which would be a disaster. Rather than embedding this in the WAL player, I would rather do it as a separate, independent tool that just "scans" the WALs searching for bulkload markers, then maybe joins all related files and triggers a single, bigger bulkload operation?
> >
> > 1. Addressed
> > 2. IMHO restoring bulkloads along with Put/Delete mutations is essential to maintaining the original order of WAL entries. If an entry in a bulkloaded HFile is later modified or deleted, the restore process must follow the same sequence: first applying the bulkload, then executing Put/Delete mutations in their original order. However, a potential issue with this approach is that bulkload operations take time to complete, and during this period, incoming Put/Delete mutations might be ignored if the corresponding HFiles have not yet been fully loaded. @anmolnar @vinayakphegde @wchevreuil thoughts please
>
> This is the same as currently happens for replication. Because we rely on cell timestamps, it's only a problem for DELETE operations, and only if a major compaction runs between the time a DELETE is applied and the time the bulkload completes. That's mitigated by enabling the KEEP_DELETED_CELLS flag.

IMO, bulkload should be done independently of the normal WAL replay. Maybe it should also be made optional in PITR? Even with a separate tool that aggregates the bulkload markers in the WAL into a single bulkload event (as I described above), replaying bulkloads can be extremely costly and slow down the PITR. I think the better approach for PITR is to monitor for bulkload operations and take a snapshot whenever a bulkload occurs, so that WAL replays are free of bulkloads.
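
To make the "separate tool" idea a bit more concrete, here is a rough sketch of just the aggregation step: walk the WAL entries, ignore Put/Delete mutations, and group the HFiles referenced by bulkload markers per table and column family so they could be handed to one bigger bulkload. `WalEntry` and `BulkLoadAggregator` are hypothetical, simplified stand-ins invented for illustration; they are not HBase classes, and a real tool would have to read the actual bulkload markers from the WAL files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified view of a WAL entry. Not an HBase class. */
final class WalEntry {
  final long sequenceId;
  final String table;
  final String family;
  final boolean bulkLoadMarker;
  final List<String> storeFiles; // empty for Put/Delete entries

  private WalEntry(long sequenceId, String table, String family,
      boolean bulkLoadMarker, List<String> storeFiles) {
    this.sequenceId = sequenceId;
    this.table = table;
    this.family = family;
    this.bulkLoadMarker = bulkLoadMarker;
    this.storeFiles = storeFiles;
  }

  /** A regular Put/Delete entry; it carries no HFile references. */
  static WalEntry mutation(long sequenceId, String table, String family) {
    return new WalEntry(sequenceId, table, family, false, new ArrayList<>());
  }

  /** A bulkload marker referencing the HFiles that were loaded. */
  static WalEntry bulkLoad(long sequenceId, String table, String family,
      String... files) {
    return new WalEntry(sequenceId, table, family, true, Arrays.asList(files));
  }
}

/** Hypothetical aggregator: collects bulkload markers into one plan. */
final class BulkLoadAggregator {
  /**
   * Returns table -> column family -> HFiles, in the order the markers
   * appear in the input. Non-bulkload entries are skipped.
   */
  static Map<String, Map<String, List<String>>> aggregate(List<WalEntry> entries) {
    Map<String, Map<String, List<String>>> plan = new LinkedHashMap<>();
    for (WalEntry entry : entries) {
      if (!entry.bulkLoadMarker) {
        continue; // Put/Delete mutations are left to the normal WAL replay
      }
      plan.computeIfAbsent(entry.table, t -> new LinkedHashMap<>())
          .computeIfAbsent(entry.family, f -> new ArrayList<>())
          .addAll(entry.storeFiles);
    }
    return plan;
  }
}
```

Note that this sketch deliberately drops the relative ordering between bulkloads and Put/Delete mutations. Whether that is acceptable depends on the cell-timestamp argument in the quoted reply above.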
