[
https://issues.apache.org/jira/browse/HADOOP-18759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anuj Modi updated HADOOP-18759:
-------------------------------
Description:
Today, when a request fails with a connection timeout, it falls back into the
exponential retry loop. Unlike with Azure Storage, there is no guarantee that
an exponentially retried request will succeed, and there are no recommendations
for ideal retry policies for Azure network or other general failures. Failing
and retrying faster might be more beneficial for such generic connection
timeout failures.
This PR introduces a new Static Retry Policy, which will currently be used only
for Connection Timeout failures. All requests that fail with a Connection
Timeout error will be retried after a constant retry (sleep) interval,
independent of how many times the request has already failed. The Max Retry
Count check will still be in place.
The following configurations will be introduced in this change:
# "fs.azure.static.retry.for.connection.timeout.enabled" - default: true,
true: static retry will be used for connection timeout (CT) failures, false:
exponential retry will be used.
# "fs.azure.static.retry.interval" - default: 1000ms.
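
As a rough illustration of how the two settings above could drive the sleep interval, here is a minimal Java sketch. It is not the ABFS driver code: the class name, the plain Map standing in for the Hadoop configuration, the assumption that the interval is stored as a plain number of milliseconds, and the exponential base and cap values are all hypothetical. Only the two configuration keys and their defaults come from the description.

```java
import java.util.Map;

/** Illustrative sketch only; names other than the two config keys are hypothetical. */
final class ConnectionTimeoutRetrySketch {
    // Keys and defaults as listed in the description.
    static final String STATIC_RETRY_ENABLED_KEY =
        "fs.azure.static.retry.for.connection.timeout.enabled";
    static final String STATIC_RETRY_INTERVAL_KEY = "fs.azure.static.retry.interval";
    static final boolean DEFAULT_STATIC_RETRY_ENABLED = true;
    static final long DEFAULT_STATIC_RETRY_INTERVAL_MS = 1000L;

    // Assumed values for the exponential fallback, chosen only for illustration.
    static final long ASSUMED_EXP_BASE_MS = 500L;
    static final long ASSUMED_EXP_MAX_MS = 30_000L;

    /**
     * Returns the sleep interval in milliseconds before the given retry
     * (retryCount starts at 1) of a request that failed with a connection timeout.
     */
    static long sleepIntervalMs(Map<String, String> conf, int retryCount) {
        boolean staticEnabled = Boolean.parseBoolean(conf.getOrDefault(
            STATIC_RETRY_ENABLED_KEY, String.valueOf(DEFAULT_STATIC_RETRY_ENABLED)));
        if (staticEnabled) {
            // Static policy: the same interval regardless of the retry count.
            return Long.parseLong(conf.getOrDefault(
                STATIC_RETRY_INTERVAL_KEY, String.valueOf(DEFAULT_STATIC_RETRY_INTERVAL_MS)));
        }
        // Simplified exponential fallback: base * 2^(retryCount - 1), capped.
        int shift = Math.min(Math.max(retryCount - 1, 0), 16);
        return Math.min(ASSUMED_EXP_BASE_MS << shift, ASSUMED_EXP_MAX_MS);
    }

    public static void main(String[] args) {
        Map<String, String> defaults = Map.of();
        Map<String, String> exponential = Map.of(STATIC_RETRY_ENABLED_KEY, "false");
        for (int retry = 1; retry <= 4; retry++) {
            System.out.printf("retry %d: static=%d ms, exponential=%d ms%n",
                retry, sleepIntervalMs(defaults, retry), sleepIntervalMs(exponential, retry));
        }
    }
}
```

The Max Retry Count check is deliberately left out of the sketch; it would sit in the calling retry loop and apply to both policies.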
This also introduces a new field in x-ms-client-request-id, only for requests
that are being retried after a connection timeout failure. The new field tells
which retry policy was used to compute the sleep interval before making this
request.
The "x-ms-client-request-id" header currently carries only the retryCount and
retryReason of the particular API call. For example:
:eb06d8f6-5693-461b-b63c-5858fa7655e6:29cb0d19-2b68-4409-bc35-cb7160b90dd8:::CF:1_CT.
Going forward, for retryReason "CT" it will also carry the retry policy
abbreviation.
For example:
:eb06d8f6-5693-461b-b63c-5858fa7655e6:29cb0d19-2b68-4409-bc35-cb7160b90dd8:::CF:1_CT_E.
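
The shape of that last header field can be sketched as follows. This is an assumption-laden illustration, not the driver's implementation: the class and method names are hypothetical, "XYZ" in the usage is a made-up retry reason, and the only policy abbreviation taken from the description is the "E" that appears in the example.

```java
/** Illustrative sketch only; class and method names are hypothetical. */
final class RetryFieldSketch {
    /**
     * Builds the retry field as "<retryCount>_<retryReason>", appending
     * "_<policyAbbreviation>" only when the retry reason is "CT".
     */
    static String retryField(int retryCount, String retryReason, String policyAbbreviation) {
        String field = retryCount + "_" + retryReason;
        if ("CT".equals(retryReason)) {
            field += "_" + policyAbbreviation;
        }
        return field;
    }

    /** Extracts the text after the last ':' of a header value. */
    static String lastField(String headerValue) {
        return headerValue.substring(headerValue.lastIndexOf(':') + 1);
    }

    public static void main(String[] args) {
        String prefix =
            ":eb06d8f6-5693-461b-b63c-5858fa7655e6:29cb0d19-2b68-4409-bc35-cb7160b90dd8:::CF:";
        String header = prefix + retryField(1, "CT", "E");
        System.out.println(header);
        System.out.println(lastField(header));
        // A non-CT reason ("XYZ" is made up) gets no policy abbreviation.
        System.out.println(retryField(2, "XYZ", "E"));
    }
}
```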
was:
Today, when a request fails with a connection timeout, it falls back into the
exponential retry loop. Unlike with Azure Storage, there is no guarantee that
an exponentially retried request will succeed, and there are no recommendations
for ideal retry policies for Azure network or other general failures. Failing
and retrying faster might be more beneficial for such generic connection
timeout failures.
This PR introduces a new Linear Retry Policy which will currently be used only
for Connection Timeout failures.
Two types of Linear Backoff calculations will be supported:
# The back-off starts at a minimum of 500 ms and doubles with each attempted
retry, capped at a maximum of 30 sec.
# The back-off starts at a minimum of 500 ms and increases by 1 sec with each
attempted retry, capped at a maximum of 30 sec.
Following Configurations will be introduced in the change:
1. "fs.azure.linear.retry.for.connection.timeout.enabled" - default: true,
true: linear retry will be used for CT, false: Exponential retry will be used.
2. "fs.azure.io.retry.min.backoff.interval.for.connection.timeout" -
default: 500ms
3. "fs.azure.io.retry.max.backoff.interval.for.connection.timeout" -
default: 30s
4. "fs.azure.linear.retry.double.step.up.enabled" - default: true,
true: Double up the interval for every retry count, false: Add 1s to interval
for every retry count
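
For reference, one possible reading of the two back-off calculations in this earlier description is sketched below. The class and method names are hypothetical, and the sketch assumes the first retry uses the minimum interval and that each later retry applies one step; the earlier description does not pin down these details.

```java
/** Illustrative sketch of the superseded linear back-off; names are hypothetical. */
final class LinearBackoffSketch {
    /**
     * Returns the back-off in milliseconds for the given retry (retryCount starts at 1).
     * doubleStepUp = true doubles the interval per retry; false adds 1000 ms per retry.
     * The result never exceeds maxMs.
     */
    static long backoffMs(int retryCount, boolean doubleStepUp, long minMs, long maxMs) {
        long interval = minMs;
        // Apply one step per retry after the first, stopping once the cap is reached.
        for (int i = 1; i < retryCount && interval < maxMs; i++) {
            interval = doubleStepUp ? interval * 2 : interval + 1000L;
        }
        return Math.min(interval, maxMs);
    }

    public static void main(String[] args) {
        for (int retry = 1; retry <= 8; retry++) {
            System.out.printf("retry %d: double=%d ms, add-1s=%d ms%n", retry,
                backoffMs(retry, true, 500L, 30_000L),
                backoffMs(retry, false, 500L, 30_000L));
        }
    }
}
```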
> [ABFS][Backoff-Optimization] Have a Linear retry policy for connection
> timeout failures
> ---------------------------------------------------------------------------------------
>
> Key: HADOOP-18759
> URL: https://issues.apache.org/jira/browse/HADOOP-18759
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Affects Versions: 3.3.4
> Reporter: Anuj Modi
> Assignee: Anuj Modi
> Priority: Major