[
https://issues.apache.org/jira/browse/HADOOP-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366433#comment-16366433
]
Aaron Fabbri commented on HADOOP-13761:
---------------------------------------
[[email protected]] on your question about changing the retry() to enclose
lazySeek() instead of wrapping stream.read(): that is not sufficient with the
current failure model (i.e. how I'm injecting failures). I think the failure
injection needs work.
{noformat}
2018-02-15 15:27:00,907 [JUnit-testOpenFailOnRead] ERROR s3a.AbstractS3ATestBase (ITestS3AInconsistency.java:testOpenFailOnRead(129)) - Error:
java.io.FileNotFoundException: read(b, 0, 4) on key test/ancestor/file-to-read-DELAY_LISTING_ME failed: injecting error 3/5 for test.
    at org.apache.hadoop.fs.s3a.InconsistentS3Object$InconsistentS3InputStream.readFailpoint(InconsistentS3Object.java:169)
    at org.apache.hadoop.fs.s3a.InconsistentS3Object$InconsistentS3InputStream.read(InconsistentS3Object.java:159)
    at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$read$56(S3AInputStream.java:426)
    at org.apache.hadoop.fs.s3a.S3AInputStream$$Lambda$31/1279551328.execute(Unknown Source)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$7(Invoker.java:260)
    at org.apache.hadoop.fs.s3a.Invoker$$Lambda$12/999989609.execute(Unknown Source)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:317)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:256)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:231)
    at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:438)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.fs.s3a.ITestS3AInconsistency.testOpenFailOnRead(ITestS3AInconsistency.java:126)
{noformat}
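To make the placement concrete, here is a toy sketch of the structure in question. This is not the actual S3AInputStream code, and the retry helper is only a stand-in for Invoker.retry(); names and numbers are illustrative.
{code:java}
import java.io.IOException;
import java.io.InputStream;

/**
 * Toy illustration (not the real S3AInputStream) of the retry placement
 * being discussed. The point: whichever call the fault injection can throw
 * from (lazySeek()/skip() or the wrapped read()) has to sit inside a retry;
 * moving the retry to enclose only lazySeek() would leave the injected
 * read() failure un-retried, which is what the trace above shows.
 */
public class RetryPlacementSketch {

  /** Stand-in for Invoker.retry(); the real one consults a RetryPolicy and sleeps. */
  interface IOOp<T> { T run() throws IOException; }

  static <T> T retry(String action, int attempts, IOOp<T> op) throws IOException {
    IOException last = null;
    for (int i = 0; i < attempts; i++) {
      try {
        return op.run();
      } catch (IOException e) {
        last = e;                                     // real code: check policy, sleep
      }
    }
    throw last;
  }

  private final InputStream wrappedStream;

  RetryPlacementSketch(InputStream wrappedStream) {
    this.wrappedStream = wrappedStream;
  }

  /** Pretend lazySeek(): (re)open/skip to the target position; can also fail. */
  private void lazySeek(long targetPos, long len) throws IOException {
    // reopen()/skip() would happen here in the real stream
  }

  /** Shape of the current code: lazySeek() outside the retry, read() inside it. */
  public int read() throws IOException {
    lazySeek(0, 1);                                   // un-retried today
    return retry("read", 5, wrappedStream::read);     // retried today
  }
}
{code}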
If we can agree on where to inject failures, I think I can come up with a good
solution.
Maybe:
- Do we need both the lazySeek() and the stream.read() retries?
- The failure injection for InconsistentS3InputStream should also have a
failpoint in skip(), which would have exposed the missing retry around
lazySeek() (see the sketch below).
- AmazonS3Client.getObject() currently never fails; it just returns an
InconsistentS3Object with the read/skip/etc. failpoints mentioned above. It
seems like getObject() should be able to fail as well, since, looking at the
SDK code, I believe it is what actually performs the GET request.
Does this sound right?
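As a rough illustration of the skip() failpoint idea in the second bullet, something along these lines; the class and field names here are hypothetical, not the actual InconsistentS3Object/InconsistentS3InputStream code:
{code:java}
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical delegating stream that injects failures from skip() as well
 * as read(), so a failure can surface inside lazySeek() and not only inside
 * the retried read().
 */
class FlakySkipInputStream extends InputStream {
  private final InputStream wrapped;
  private int failuresLeft;   // fail this many calls, then behave normally

  FlakySkipInputStream(InputStream wrapped, int failures) {
    this.wrapped = wrapped;
    this.failuresLeft = failures;
  }

  private void failpoint(String op) throws IOException {
    if (failuresLeft > 0) {
      failuresLeft--;
      throw new IOException("injected failure in " + op);
    }
  }

  @Override
  public int read() throws IOException {
    failpoint("read");
    return wrapped.read();
  }

  @Override
  public long skip(long n) throws IOException {
    failpoint("skip");        // the failpoint a lazySeek()-driven skip would hit
    return wrapped.skip(n);
  }
}
{code}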
Also: my current approach of failing read 5 times and then succeeding (5 < 20,
the retry max) is not going to expose all of the code paths that can fail. I
need a loop that runs the test multiple times and either (1) increases the max
failure count, or the failure offset, by one on each iteration (the way
traceroute uses increasing TTLs to probe router hops, here used to probe
failure points) or (2) uses randomization so different failure points are
likely to be hit.
#1 actually seems more deterministic.
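A minimal sketch of option #1, assuming a hypothetical hook for configuring the injected failure count (setInjectedFailureLimit() and runReadSequence() are placeholders, not existing S3A test APIs):
{code:java}
/**
 * Sketch of the "traceroute" sweep: rerun the same read sequence, raising the
 * injected failure count by one each iteration, so every run pushes the first
 * failure (and first success) to a different point in the code path.
 */
public class FailureSweepSketch {
  private static final int RETRY_LIMIT = 20;   // the assumed retry max mentioned above

  private int injectedFailureLimit;

  /** Placeholder: tell the fault injection to fail the first N calls. */
  void setInjectedFailureLimit(int n) {
    this.injectedFailureLimit = n;
  }

  /** Placeholder: the open/seek/read sequence under test. */
  void runReadSequence() throws Exception {
    // open the DELAY_LISTING_ME path, read it, assert the contents
  }

  public void sweepFailurePoints() throws Exception {
    for (int maxFailures = 1; maxFailures <= RETRY_LIMIT; maxFailures++) {
      setInjectedFailureLimit(maxFailures);   // fail the first N calls, then succeed
      runReadSequence();                      // should still succeed via retries
    }
  }
}
{code}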
> S3Guard: implement retries for DDB failures and throttling; translate
> exceptions
> --------------------------------------------------------------------------------
>
> Key: HADOOP-13761
> URL: https://issues.apache.org/jira/browse/HADOOP-13761
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.0.0-beta1
> Reporter: Aaron Fabbri
> Assignee: Aaron Fabbri
> Priority: Blocker
> Attachments: HADOOP-13761-004-to-005.patch, HADOOP-13761-005.patch,
> HADOOP-13761.001.patch, HADOOP-13761.002.patch, HADOOP-13761.003.patch,
> HADOOP-13761.004.patch
>
>
> Following the S3AFileSystem integration patch in HADOOP-13651, we need to add
> retry logic.
> In HADOOP-13651, I added TODO comments in most of the places retry loops are
> needed, including:
> - open(path). If MetadataStore reflects recent create/move of file path, but
> we fail to read it from S3, retry.
> - delete(path). If deleteObject() on S3 fails, but MetadataStore shows the
> file exists, retry.
> - rename(src,dest). If source path is not visible in S3 yet, retry.
> - listFiles(). Skip for now. Not currently implemented in S3Guard. I will
> create a separate JIRA for this as it will likely require interface changes
> (i.e. prefix or subtree scan).
> We may miss some cases initially and we should do failure injection testing
> to make sure we're covered. Failure injection tests can be a separate JIRA
> to make this easier to review.
> We also need basic configuration parameters around retry policy. There
> should be a way to specify a maximum retry duration, as some applications
> would prefer to receive an error eventually rather than wait indefinitely.
> We should also keep statistics on how often inconsistency is detected and
> we enter a retry loop.