keith-turner commented on issue #4538:
URL: https://github.com/apache/accumulo/issues/4538#issuecomment-2104870011

   Chatted w/ @cshannon about this.  One challenge we identified is large compaction operations.  For example, if a large number of external compaction processes are temporarily stood up and then a large compaction operation is initiated, it may start generating large numbers of files that should be deleted in a short time.  In the worst case, if the GC delays deleting files, the compaction operation could fill up DFS.  This implies the delay may need to adjust based on DFS free space and the amount of data in files that could be deleted but are delayed.  Dynamically adjusting the delay makes it harder for the scan servers to reason about it.
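   A rough sketch of what a dynamic delay could look like — all class names, method names, and thresholds below are hypothetical, not existing Accumulo APIs — shrinking the delay toward zero as the bytes pending deletion approach the free space left in DFS:

   ```java
   import java.time.Duration;

   // Hypothetical policy: scale the GC deletion delay by how close the
   // delayed-delete files are to exhausting remaining DFS free space.
   class GcDelayPolicy {
     static final Duration MAX_DELAY = Duration.ofMinutes(60);

     static Duration deletionDelay(long dfsFreeBytes, long delayedDeleteBytes) {
       if (delayedDeleteBytes <= 0) {
         return MAX_DELAY; // nothing pending, no pressure
       }
       // Fraction of free space the delayed files represent.
       double pressure = (double) delayedDeleteBytes / dfsFreeBytes;
       if (pressure >= 1.0) {
         return Duration.ZERO; // must delete now or risk filling DFS
       }
       // Linearly shrink the delay as pressure rises.
       return Duration.ofMillis((long) (MAX_DELAY.toMillis() * (1.0 - pressure)));
     }
   }
   ```

   The downside noted above applies directly: because the returned delay varies with cluster state, a scan server cannot assume any fixed window during which a candidate file is still safe to read.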
   
   Another thing discussed was race conditions.  This change would conceptually create another set of files for the GC to track.
   
    * File references (existing set)
    * GC candidates (existing set)
    * Delayed delete files (new set)
    * Deleting files (existing set)
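   The four sets above imply a per-file lifecycle, which could be modeled roughly as follows (a sketch only; the names are hypothetical, not existing Accumulo types):

   ```java
   // Hypothetical per-file lifecycle implied by the four sets the GC tracks.
   enum GcFileState {
     REFERENCED,      // file references (existing set)
     GC_CANDIDATE,    // GC candidates (existing set)
     DELAYED_DELETE,  // delayed delete files (new set)
     DELETING;        // deleting files (existing set)

     // Normal forward progression once a file becomes unreferenced.
     GcFileState next() {
       switch (this) {
         case REFERENCED:     return GC_CANDIDATE;
         case GC_CANDIDATE:   return DELAYED_DELETE;
         default:             return DELETING; // DELAYED_DELETE and terminal
       }
     }
   }
   ```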
   
   When the GC moves a file from the new delayed_delete set to the deleting_files set, it must be done in such a way that considers what the scan servers are using and avoids race conditions.  We talked through a few ways to do this, but each of those possible solutions had race conditions, so we still need to figure out something for that.
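   To illustrate the kind of race involved (not any proposed solution — the names and types here are hypothetical): a naive check-then-act promotion leaves a window between checking scan server refs and moving the file, during which a scan server can start using it.

   ```java
   import java.util.Set;

   // Hypothetical, deliberately UNSAFE promotion from delayed-delete to
   // deleting, showing the check-then-act race window.
   class GcPromotion {
     static boolean promoteUnsafely(String file, Set<String> scanServerRefs,
         Set<String> delayedDelete, Set<String> deleting) {
       if (!scanServerRefs.contains(file)) { // step 1: check refs
         // <-- race window: a scan server ref can appear here, after the
         //     check but before the move, so GC may delete a file in use.
         delayedDelete.remove(file);         // step 2: act on stale check
         deleting.add(file);
         return true;
       }
       return false;
     }
   }
   ```

   Any workable scheme would need the check and the move to be effectively atomic with respect to scan servers adding refs, which is the part still unresolved.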
   
   This offers a potential speedup for scan servers with respect to writing out scan server refs to the metadata table.  The scan server will still have to read tablet files from the metadata table, which is an extra cost over tablet servers.  That leads to another potential solution for lowering scan time: some way to preload tablets on select scan servers, which may be something to explore in tandem.
   
   

