steveloughran commented on issue #1359: HADOOP-16430.S3AFilesystem.delete to 
incrementally update s3guard with deletions
URL: https://github.com/apache/hadoop/pull/1359#issuecomment-526350261
 
 
   thanks. I;ve just pushed up another comment for you to look at, and I am 
making sure I run it without s3guard as well as with. Tested, S3 ireland
   
   # undeleted files
   
   A `ITestS3AContractRootDir` run failed failing with undeleted file 
"fork-0005/test/testFile". That filename is used in too many tests to identify 
the problem.
   The latest patch uses a unique name for each test case, so if the problem 
recurs, I can start tracking down the issue.
   
   I don't think it's related to my changes in deletion, but given how critical 
delete is for cleanup, I am not ignoring it.
   
   At the same time, I think with this patch, we are actually being more 
rigourous in cleanup. We use see guard to identify and delete all files, even 
when S3 being inconsistent. And when the client is using s3guard as an 
authoritative store, we still bypass it for a final bit of due diligence.
   
   ```
   
   [ERROR] 
testListEmptyRootDirectory(org.apache.hadoop.fs.contract.s3a.ITestS3AContractRootDir)
  Time elapsed: 38.181 s  <<< FAILURE!
   java.lang.AssertionError: 
   Expected no results from listFiles(/, true), but got 1 elements:
   
S3ALocatedFileStatus{path=s3a://hwdev-steve-ireland-new/fork-0005/test/testFile;
 isDirectory=false; length=1; replication=1; blocksize=33554432;
      modification_time=1567098043000; access_time=0; owner=stevel; 
group=stevel; permission=rw-rw-rw-; isSymlink=false; hasAcl=false; 
isEncrypted=true;
       isErasureCoded=false}[eTag='55a54008ad1ba589aa210d2629c1df41', 
versionId='']
        at org.junit.Assert.fail(Assert.java:88)
        at 
org.apache.hadoop.fs.contract.AbstractContractRootDirectoryTest.assertNoElements(AbstractContractRootDirectoryTest.java:218)
        at 
org.apache.hadoop.fs.contract.AbstractContractRootDirectoryTest.testListEmptyRootDirectory(AbstractContractRootDirectoryTest.java:200)
        at 
org.apache.hadoop.fs.contract.s3a.ITestS3AContractRootDir.testListEmptyRootDirectory(ITestS3AContractRootDir.java:82)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
        at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
        at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
        at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.lang.Thread.run(Thread.java:748)
   
   ```
   
   BTW, note in the stack how S3ALocatedFileStatus now prints etag and version? 
There's now lossless conversion between that and S3AFileStatus, which I needed 
for the isDirectory flag to not get lost, as that is used in the delete 
operation to choose whether to add a / on a path. This why there's a new 
package-private ctor for S3AFileStatus.
   
   With the changes in, the file at fault appears to be 
`testFailedMetadataUpdate` from `ITestS3AMetadataPersistenceException`; doing 
more cleanup there. That test is deliberately creating failures in the 
metastore update process; maybe if the normal test FS doesn't know about the 
file then it's test clenanup doesn't find it. Now I know that empty dir markers 
will stop a scan for and delete of objects, I am starting to wonder if that is 
the cause of some intermittent test failures we've had in the past -though 
those could just have come from S3 list inconsistency not finding files to 
delete.
   
   ## Speed
   
   I added a section in the DeleteOperation about opportunities to speak up 
that process through better parallelisation.
   
   I also make it clear that you should only derive such changing from data 
created in benchmarks running in ECT itself.
   If you test remotely, latency can dominate, but also you are less prone to 
encountering throttling 
   on AWS services, because you are not generating enough load.
   
   I don't immediately plan to do such performance tuning as I can see 
opportunities in speeding up directory listings which are on critical path for 
query planning. This patch here already make things faster especially very 
large directories. A question I have no information off is how important is 
directory deletion performance to applications? That's large directories. 
There's always going to be a fixed amount of needless overhead for deleting 
even a small directory: Head List delete calls, and setting up fake parent 
directories as needed. It's a larger directories that Dynamo DB calls come to 
dominate, and only once you have many thousands of files in a directory tree, 
that the advantages of more parallel batches of deletions probably start to 
deliver benefits.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to