[
https://issues.apache.org/jira/browse/HADOOP-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383164#comment-16383164
]
Aaron Fabbri commented on HADOOP-13761:
---------------------------------------
v12 patch attached which eliminates nested retries. Changes from v11:
- reopen() RetryTranslated -> OnceTranslated. Use once() instead of retry().
All callers (lazySeek() and callers of onReadFailure()) have their own
higher-level retries (see the sketch after this list).
- onReadFailure() RetryTranslated -> OnceTranslated, because reopen() no longer
has internal retries.
- Move lazySeek() in read() above the retry'd lambda; eliminates nested retry
and matches what read(b, off, len) does.
- Remove reopen() from read(b, off, len) after the wrapped stream throws EOF.
*This was existing code I changed*: I don't understand why you'd reopen() on
EOF, [[email protected]]?
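To make the nested-retry problem concrete, here is a minimal, self-contained
sketch (the retry()/once() helpers and MAX_ATTEMPTS below are invented for
illustration and are not Hadoop's Invoker) showing how an outer retry wrapped
around an operation that retries internally multiplies the number of S3 GETs,
and how downgrading the inner call to a single attempt keeps the work bounded:
{noformat}
import java.io.IOException;
import java.util.concurrent.Callable;

public class NestedRetryDemo {
  static final int MAX_ATTEMPTS = 3;

  /** Retry the operation up to MAX_ATTEMPTS times (Invoker-style retry()). */
  static <T> T retry(String action, Callable<T> op) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
      try {
        return once(action, op);
      } catch (IOException e) {
        last = e;
      }
    }
    throw last;
  }

  /** Execute the operation exactly once, translating failures (Invoker-style once()). */
  static <T> T once(String action, Callable<T> op) throws IOException {
    try {
      return op.call();
    } catch (Exception e) {
      throw new IOException(action + " failed", e);
    }
  }

  public static void main(String[] args) {
    int[] gets = {0};
    Callable<Integer> failingGet = () -> {
      gets[0]++;
      throw new IOException("simulated GET failure");
    };

    // Nested retries: retry() around a call that itself calls retry() turns
    // 3 attempts into 3 * 3 = 9 GETs before the caller sees the failure.
    try {
      retry("read", () -> retry("reopen", failingGet));
    } catch (IOException expected) {
    }
    System.out.println("GETs with nested retries: " + gets[0]);       // 9

    // With the inner call downgraded to once(), only the outer loop retries.
    gets[0] = 0;
    try {
      retry("read", () -> once("reopen", failingGet));
    } catch (IOException expected) {
    }
    System.out.println("GETs with a single retry level: " + gets[0]); // 3
  }
}
{noformat}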
Tested in us-west-2. I was able to reproduce the test timeout in
ITestS3AFailureHandling; with this patch it now passes for me.
Running on little sleep, but I wanted to get a patch posted to give it a chance
at European Friday code review.
Here is the delta from v11:
{noformat}
diff -ur ./src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java /Users/fabbri/Code/hadoop/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
--- ./src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java	2018-03-01 20:34:33.000000000 -0800
+++ /Users/fabbri/Code/hadoop/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java	2018-03-01 20:09:05.000000000 -0800
@@ -154,7 +154,7 @@
    * @param length length requested
    * @throws IOException on any failure to open the object
    */
-  @Retries.RetryTranslated
+  @Retries.OnceTranslated
   private synchronized void reopen(String reason, long targetPos, long length)
       throws IOException {
 
@@ -178,7 +178,7 @@
     }
     String text = String.format("Failed to %s %s at %d",
         (opencount == 0 ? "open" : "re-open"), uri, targetPos);
-    S3Object object = context.invoker.retry(text, uri, true,
+    S3Object object = context.getReadInvoker().once(text, uri,
         () -> client.getObject(request));
     wrappedStream = object.getObjectContent();
     contentRangeStart = targetPos;
@@ -349,6 +349,12 @@
       return -1;
     }
 
+    try {
+      lazySeek(nextReadPos, 1);
+    } catch (EOFException e) {
+      return -1;
+    }
+
     // With S3Guard, the metadatastore gave us metadata for the file in
     // open(), so we use a slightly different retry policy.
     // read() may not be likely to fail, but reopen() does a GET which
@@ -358,7 +364,6 @@
         () -> {
           int b;
           try {
-            lazySeek(nextReadPos, 1);
             b = wrappedStream.read();
           } catch (EOFException e) {
             return -1;
@@ -387,7 +392,7 @@
    * @param length length of data being attempted to read
    * @throws IOException any exception thrown on the re-open attempt.
    */
-  @Retries.RetryTranslated
+  @Retries.OnceTranslated
   private void onReadFailure(IOException ioe, int length) throws IOException {
 
     LOG.info("Got exception while trying to read from stream {}" +
@@ -439,7 +444,6 @@
           try {
             bytes = wrappedStream.read(buf, off, len);
           } catch (EOFException e) {
-            onReadFailure(e, len);
             // the base implementation swallows EOFs.
             return -1;
           } catch (IOException e) {
{noformat}
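For reviewers, this is a rough, self-contained sketch of the read() control
flow after the patch (a toy class, not the real S3AInputStream; the stand-in
names and the simple for-loop retry are mine): lazySeek() runs before the
retried block, EOF is returned rather than retried, and onReadFailure()
re-opens exactly once per failed attempt, leaving all repetition to the single
outer retry:
{noformat}
import java.io.EOFException;
import java.io.IOException;

public class SingleLevelRetryReadDemo {
  private static final int MAX_ATTEMPTS = 3;

  private int failuresRemaining = 1;  // first wrapped read fails, the next succeeds
  private int reopens = 0;

  /** Stand-in for wrappedStream.read(): one simulated connection failure. */
  private int readWrapped() throws IOException {
    if (failuresRemaining-- > 0) {
      throw new IOException("simulated connection reset");
    }
    return 42;
  }

  /** Stand-in for lazySeek(): can fail, so it stays outside the retried block. */
  private void lazySeek(long targetPos, long length) throws IOException {
    // no-op in this sketch; the real method may open or re-open the stream
  }

  /** Stand-in for onReadFailure(): a single reopen, no internal retry. */
  private void onReadFailure(IOException ioe, int length) throws IOException {
    reopens++;  // stand-in for a single reopen("failure recovery", pos, length)
  }

  public int read() throws IOException {
    // Seek before the retry loop so its failures are not retried a second time here.
    try {
      lazySeek(0, 1);
    } catch (EOFException e) {
      return -1;
    }

    IOException last = null;
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {  // the only retry loop
      try {
        return readWrapped();
      } catch (EOFException e) {
        return -1;               // EOF is a result, not a retryable failure
      } catch (IOException e) {
        onReadFailure(e, 1);     // exactly one reopen per failed attempt
        last = e;
      }
    }
    throw last;
  }

  public static void main(String[] args) throws IOException {
    SingleLevelRetryReadDemo demo = new SingleLevelRetryReadDemo();
    System.out.println("byte read: " + demo.read());   // 42, on the second attempt
    System.out.println("reopens:   " + demo.reopens);  // 1
  }
}
{noformat}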
> S3Guard: implement retries for DDB failures and throttling; translate
> exceptions
> --------------------------------------------------------------------------------
>
> Key: HADOOP-13761
> URL: https://issues.apache.org/jira/browse/HADOOP-13761
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.0.0-beta1
> Reporter: Aaron Fabbri
> Assignee: Aaron Fabbri
> Priority: Blocker
> Attachments: HADOOP-13761-004-to-005.patch,
> HADOOP-13761-005-to-006-approx.diff.txt, HADOOP-13761-005.patch,
> HADOOP-13761-006.patch, HADOOP-13761-007.patch, HADOOP-13761-008.patch,
> HADOOP-13761-009.patch, HADOOP-13761-010.patch, HADOOP-13761-010.patch,
> HADOOP-13761-011.patch, HADOOP-13761-012.patch, HADOOP-13761.001.patch,
> HADOOP-13761.002.patch, HADOOP-13761.003.patch, HADOOP-13761.004.patch
>
>
> Following the S3AFileSystem integration patch in HADOOP-13651, we need to add
> retry logic.
> In HADOOP-13651, I added TODO comments in most of the places retry loops are
> needed, including:
> - open(path). If MetadataStore reflects recent create/move of file path, but
> we fail to read it from S3, retry.
> - delete(path). If deleteObject() on S3 fails, but MetadataStore shows the
> file exists, retry.
> - rename(src,dest). If source path is not visible in S3 yet, retry.
> - listFiles(). Skip for now. Not currently implemented in S3Guard. I will
> create a separate JIRA for this as it will likely require interface changes
> (i.e. prefix or subtree scan).
> We may miss some cases initially and we should do failure injection testing
> to make sure we're covered. Failure injection tests can be a separate JIRA
> to make this easier to review.
> We also need basic configuration parameters around retry policy. There
> should be a way to specify a maximum retry duration, as some applications
> would prefer to receive an error eventually rather than wait indefinitely. We
> should also keep statistics on when inconsistency is detected and we enter a
> retry loop.
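As a loose illustration of the "maximum retry duration" point in the
description above, here is a self-contained sketch of a deadline-bounded retry
loop (the helper name, sleep interval, and failure simulation are all invented;
in S3A the limit would presumably be read from a configuration property rather
than hard-coded):
{noformat}
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class BoundedRetryDemo {

  /** Retry until success or until maxMillis has elapsed, then surface the error. */
  static <T> T retryUntilDeadline(Callable<T> op, long maxMillis, long sleepMillis)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + maxMillis;
    while (true) {
      IOException last;
      try {
        return op.call();
      } catch (IOException e) {
        last = e;
      } catch (Exception e) {
        throw new IOException("operation failed", e);
      }
      if (System.currentTimeMillis() >= deadline) {
        throw last;  // give the application an error eventually, not an endless wait
      }
      TimeUnit.MILLISECONDS.sleep(sleepMillis);
    }
  }

  public static void main(String[] args) throws Exception {
    // In S3A this limit would come from configuration rather than a literal.
    long maxMillis = 2_000;
    long entries = retryUntilDeadline(() -> {
      // Stand-in for "MetadataStore says the path exists but S3 cannot see it yet".
      if (Math.random() < 0.5) {
        throw new IOException("S3 listing not yet consistent");
      }
      return 1L;
    }, maxMillis, 100);
    System.out.println("eventually read " + entries + " entry");
  }
}
{noformat}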