This is an automated email from the ASF dual-hosted git repository.
ivandika pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/ozone.git
The following commit(s) were added to refs/heads/master by this push:
new 4b90304b0a7 HDDS-13919. Design Doc for S3 Conditional Writes
(PutObject)
4b90304b0a7 is described below
commit 4b90304b0a731274c0d4f8fe38dcc38676ff647c
Author: Peter Lee <[email protected]>
AuthorDate: Thu Jan 15 09:25:11 2026 +0800
HDDS-13919. Design Doc for S3 Conditional Writes (PutObject)
---
.../docs/content/design/s3-conditional-requests.md | 205 +++++++++++++++++++++
1 file changed, 205 insertions(+)
diff --git a/hadoop-hdds/docs/content/design/s3-conditional-requests.md
b/hadoop-hdds/docs/content/design/s3-conditional-requests.md
new file mode 100644
index 00000000000..c7e51708381
--- /dev/null
+++ b/hadoop-hdds/docs/content/design/s3-conditional-requests.md
@@ -0,0 +1,205 @@
+---
+title: "S3 Conditional Requests"
+summary: Design to support S3 conditional requests for atomic operations.
+date: 2025-11-20
+jira: HDDS-13117
+status: draft
+author: Chu Cheng Li
+---
+<!--
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License. See accompanying LICENSE file.
+-->
+
+# S3 Conditional Requests Design
+
+## Background
+
+AWS S3 supports conditional requests using HTTP conditional headers, enabling
atomic operations, cache optimization, and preventing race conditions. This
includes:
+
+- **Conditional Writes** (PutObject): `If-Match` and `If-None-Match` headers
for atomic operations
+- **Conditional Reads** (GetObject, HeadObject): `If-Match`, `If-None-Match`,
`If-Modified-Since`, `If-Unmodified-Since` for cache validation
+- **Conditional Copy** (CopyObject): Conditions on both source and destination
objects
+
+### Current State
+
+- HDDS-10656 implemented atomic rewrite using `expectedDataGeneration`
+- OM HA uses single Raft group with single applier thread (Ratis
StateMachineUpdater)
+- S3 gateway doesn't expose conditional headers to OM layer
+
+## Use Cases
+
+### Conditional Writes
+
+- **Atomic key rewrites**: Prevent race conditions when updating existing
objects
+- **Create-only semantics**: Prevent accidental overwrites (`If-None-Match: *`)
+- **Optimistic locking**: Enable concurrent access with conflict detection
+- **Leader election**: Implement distributed coordination using S3 as backing
store
+
+### Conditional Reads
+
+- **Bandwidth optimization**: Avoid downloading unchanged objects (304 Not
Modified)
+- **HTTP caching**: Support standard browser/CDN caching semantics
+- **Conditional processing**: Only process objects that meet specific criteria
+
+### Conditional Copy
+
+- **Atomic copy operations**: Copy only if source/destination meets specific
conditions
+- **Prevent overwrite**: Copy only if destination doesn't exist
+
+## Specification
+
+### AWS S3 Conditional Write Specification
+
+#### If-None-Match Header
+
+```
+If-None-Match: *
+```
+
+- Succeeds only if object does NOT exist
+- Returns `412 Precondition Failed` if object exists
+- Primary use case: Create-only semantics
+
+#### If-Match Header
+
+```
+If-Match: "<etag>"
+```
+
+- Succeeds only if object EXISTS and ETag matches
+- Returns `412 Precondition Failed` if object doesn't exist or ETag mismatches
+- Primary use case: Atomic updates (compare-and-swap)
+
+#### Restrictions
+
+- Cannot use both headers together in same request
+- No additional charges for failed conditional requests
+
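The header rules above could be enforced in the gateway before any request reaches OM. The following is a minimal sketch, not Ozone code: the `WritePrecondition` class, its `Kind` enum, and the use of `IllegalArgumentException` for rejected requests are assumptions made for illustration. The S3 error actually returned in those cases is not covered by this document.

```java
/** Hypothetical gateway-side parsing of the two write preconditions; not Ozone code. */
final class WritePrecondition {
  enum Kind { NONE, CREATE_ONLY, MATCH_ETAG }

  final Kind kind;
  final String expectedETag; // set only for MATCH_ETAG, without surrounding quotes

  private WritePrecondition(Kind kind, String expectedETag) {
    this.kind = kind;
    this.expectedETag = expectedETag;
  }

  /** Either argument may be null, meaning the header was not sent. */
  static WritePrecondition parse(String ifMatch, String ifNoneMatch) {
    if (ifMatch != null && ifNoneMatch != null) {
      // The two headers cannot be used together in the same request.
      throw new IllegalArgumentException("If-Match and If-None-Match cannot be combined");
    }
    if (ifNoneMatch != null) {
      // Only the wildcard form is described for conditional writes.
      if (!"*".equals(stripQuotes(ifNoneMatch.trim()))) {
        throw new IllegalArgumentException("only If-None-Match: * is supported for writes");
      }
      return new WritePrecondition(Kind.CREATE_ONLY, null);
    }
    if (ifMatch != null) {
      return new WritePrecondition(Kind.MATCH_ETAG, stripQuotes(ifMatch.trim()));
    }
    return new WritePrecondition(Kind.NONE, null);
  }

  /** Removes one pair of surrounding double quotes, if present. */
  private static String stripQuotes(String value) {
    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
      return value.substring(1, value.length() - 1);
    }
    return value;
  }
}
```

For example, `WritePrecondition.parse("\"abc\"", null)` yields a `MATCH_ETAG` precondition with `expectedETag` equal to `abc`, while passing both headers is rejected.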
+### AWS S3 Conditional Read Specification
+
+TODO
+
+### AWS S3 Conditional Copy Specification
+
+TODO
+
+## Implementation
+
+### AWS S3 Conditional Write Implementation
+
+The implementation aims to minimize redundant RPC round trips while ensuring strict atomicity for conditional operations.
+
+- **If-None-Match** utilizes the atomic "Create-If-Not-Exists" capability ([HDDS-13963](https://issues.apache.org/jira/browse/HDDS-13963)).
+- **If-Match** optimizes the happy path by pushing ETag validation directly
into the Ozone Manager's write path, avoiding preliminary read operations.
+
+#### If-None-Match Implementation
+
+This implementation ensures strict create-only semantics by utilizing a
specific generation ID marker.
+
+In `OzoneConsts.java`, add a named constant for the sentinel value `-1` to improve readability:
+```java
+/**
+ * Special value for expectedDataGeneration to indicate "Create-If-Not-Exists"
+ * semantics.
+ * When used with If-None-Match conditional requests, this ensures atomicity:
+ * if a concurrent write commits between Create and Commit phases, the commit
+ * fails the validation check, preserving strict create-if-not-exists
+ * semantics.
+ */
+public static final long EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS = -1L;
+```
+
+##### S3 Gateway Layer
+
+1. Parse `If-None-Match: *`.
+2. Set `existingKeyGeneration =
OzoneConsts.EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS`.
+3. Call `RpcClient.rewriteKey()`.
+
+##### OM Create Phase
+
+1. OM receives request with `expectedDataGeneration ==
OzoneConsts.EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS`.
+2. **Pre-check**: If key is already in the OpenKeyTable or KeyTable, throw
`KEY_ALREADY_EXISTS`.
+3. If not exists, proceed to create the open key entry.
+
+##### OM Commit Phase (Atomicity)
+
+1. During the commit phase (or strict atomic create), the OM validates that
the key still does not exist.
+2. If a concurrent client created the key between the Create and Commit phases, the transaction fails with `KEY_GENERATION_MISMATCH`.
+
+##### Race Condition Handling
+
+Setting `expectedDataGeneration` to `OzoneConsts.EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS` (`-1`) makes the commit conditional on the key still being absent. If a concurrent write (Client B) commits between Client A's Create and Commit phases, Client A's commit fails the create-if-not-exists validation check, preserving strict create-if-not-exists semantics.
+
+> **Note**: This ability will be added along with
[HDDS-13963](https://issues.apache.org/jira/browse/HDDS-13963) (Atomic
Create-If-Not-Exists).
+
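The create and commit phases described above can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not Ozone code: the `CreateIfNotExistsSketch` class, its maps standing in for `KeyTable` and `OpenKeyTable`, the session ids, and the boolean commit result are all hypothetical. Only the sentinel value `-1` and the `KEY_ALREADY_EXISTS` name come from this design.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model of the create-if-not-exists flow; not Ozone code. */
final class CreateIfNotExistsSketch {
  /** Same sentinel value the design proposes for OzoneConsts. */
  static final long EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS = -1L;

  /** Hypothetical open-key entry: the key name plus whether it is create-only. */
  private static final class OpenKey {
    final String key;
    final boolean createOnly;

    OpenKey(String key, boolean createOnly) {
      this.key = key;
      this.createOnly = createOnly;
    }
  }

  private final Map<String, String> keyTable = new HashMap<>();    // key -> committed data
  private final Map<Long, OpenKey> openKeyTable = new HashMap<>(); // session id -> open key
  private long nextSessionId = 1;

  /** Create phase. A null expectedDataGeneration models an unconditional write. */
  long open(String key, Long expectedDataGeneration) {
    boolean createOnly = expectedDataGeneration != null
        && expectedDataGeneration == EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS;
    // Pre-check: reject if the key is already committed or already open.
    if (createOnly && (keyTable.containsKey(key) || isOpen(key))) {
      throw new IllegalStateException("KEY_ALREADY_EXISTS");
    }
    long sessionId = nextSessionId++;
    openKeyTable.put(sessionId, new OpenKey(key, createOnly));
    return sessionId;
  }

  /** Commit phase: a create-only commit succeeds only if the key is still absent. */
  boolean commit(long sessionId, String data) {
    OpenKey open = openKeyTable.remove(sessionId);
    if (open == null) {
      throw new IllegalArgumentException("unknown session " + sessionId);
    }
    if (open.createOnly && keyTable.containsKey(open.key)) {
      return false; // another writer committed between Create and Commit
    }
    keyTable.put(open.key, data);
    return true;
  }

  String read(String key) {
    return keyTable.get(key);
  }

  private boolean isOpen(String key) {
    for (OpenKey open : openKeyTable.values()) {
      if (open.key.equals(key)) {
        return true;
      }
    }
    return false;
  }
}
```

In this model, if Client A opens a key with the sentinel and an unconditional Client B opens and commits the same key before A commits, A's `commit` returns `false` and B's data is preserved.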
+#### If-Match Implementation
+
+To reduce latency, we avoid a pre-flight check (GetS3KeyDetails) and instead validate the ETag during the OM write operation. This requires adding an optional `expectedETag` field to `KeyArgs`, and it removes one network round trip from the "happy path" (successful match).
+Failing requests still incur the cost of a write RPC and a Raft log entry, which is acceptable under optimistic concurrency control, where conflicts are assumed to be rare.
+
+##### S3 Gateway Layer
+
+1. Parse `If-Match: "<etag>"` header.
+2. Populate `KeyArgs` with the parsed `expectedETag`.
+3. Send the write request (CreateKey) to OM.
+
+##### OM Create Phase
+
+Validation is performed within the `validateAndUpdateCache` method to ensure
atomicity within the Ratis state machine application.
+
+1. **Locking**: The OM acquires the write lock for the bucket/key.
+2. **Key Lookup**: Retrieve the existing key from `KeyTable`.
+3. **Validation**:
+ - **Key Not Found**: If the key does not exist, throw `KEY_NOT_FOUND`
(maps to S3 412).
+ - **No ETag Metadata**: If the existing key (e.g., uploaded via OFS) does
not have an ETag property, throw `ETAG_NOT_AVAILABLE` (maps to S3 412). The
precondition cannot be evaluated, so we must fail rather than silently proceed.
+ - **ETag Mismatch**: Compare `existingKey.ETag` with `expectedETag`. If
they do not match, throw `ETAG_MISMATCH` (maps to S3 412).
+4. **Extract Generation**: If ETag matches, extract `existingKey.updateID`.
+5. **Create Open Key**: Create open key entry with `expectedDataGeneration =
existingKey.updateID`.
+
+##### OM Commit Phase
+
+The commit phase reuses the existing atomic-rewrite validation logic from
HDDS-10656:
+
+1. Read open key entry (contains `expectedDataGeneration` set during create
phase).
+2. Read current committed key from `KeyTable`.
+3. Validate `currentKey.updateID == openKey.expectedDataGeneration`.
+4. If match, commit succeeds. If mismatch (concurrent modification), throw
`KEY_NOT_FOUND` (maps to S3 412).
+
+This approach ensures end-to-end atomicity: even if another client modifies
the key between Create and Commit phases, the commit will fail.
+
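The create-phase validation and the commit-phase generation check for `If-Match` can be sketched together as follows. This is a toy model, not Ozone code: the `IfMatchSketch`, `KeyInfo`, and `OmError` types are hypothetical, and a simple counter stands in for `updateID`. The error names are the ones listed in this design.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of If-Match handling; types are hypothetical, error names follow the design. */
final class IfMatchSketch {
  /** Hypothetical stand-in for a committed key entry. */
  static final class KeyInfo {
    final String eTag;   // null models a key written without an ETag (e.g. via OFS)
    final long updateID; // generation of this committed version

    KeyInfo(String eTag, long updateID) {
      this.eTag = eTag;
      this.updateID = updateID;
    }
  }

  /** Hypothetical exception carrying an OM result code as its message. */
  static final class OmError extends RuntimeException {
    OmError(String code) {
      super(code);
    }
  }

  private final Map<String, KeyInfo> keyTable = new HashMap<>();
  private long nextUpdateId = 1; // assumption: a counter stands in for the real updateID

  /** Unconditional write, used here to seed a key or model a concurrent modification. */
  void put(String key, String eTag) {
    keyTable.put(key, new KeyInfo(eTag, nextUpdateId++));
  }

  /** Create phase: validate the ETag and return the generation to pin for commit. */
  long createWithExpectedETag(String key, String expectedETag) {
    KeyInfo existing = keyTable.get(key);
    if (existing == null) {
      throw new OmError("KEY_NOT_FOUND");
    }
    if (existing.eTag == null) {
      throw new OmError("ETAG_NOT_AVAILABLE");
    }
    if (!existing.eTag.equals(expectedETag)) {
      throw new OmError("ETAG_MISMATCH");
    }
    return existing.updateID;
  }

  /** Commit phase: succeed only if the committed key still has the pinned generation. */
  void commit(String key, long expectedDataGeneration, String newETag) {
    KeyInfo current = keyTable.get(key);
    if (current == null || current.updateID != expectedDataGeneration) {
      throw new OmError("KEY_NOT_FOUND");
    }
    keyTable.put(key, new KeyInfo(newETag, nextUpdateId++));
  }
}
```

In this model, a write that passes the ETag check but is overtaken by another `put` before it commits fails with `KEY_NOT_FOUND`, which is the concurrent-modification case described above.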
+#### Error Mapping
+
+| OM Error | S3 Status | S3 Error Code | Scenario |
+|---|---|---|---|
+| `KEY_ALREADY_EXISTS` | 412 | PreconditionFailed | If-None-Match failed (key exists) |
+| `KEY_NOT_FOUND` | 412 | PreconditionFailed | If-Match failed (key missing or concurrent modification) |
+| `ETAG_NOT_AVAILABLE` | 412 | PreconditionFailed | If-Match failed (key has no ETag, e.g., created via OFS) |
+| `ETAG_MISMATCH` | 412 | PreconditionFailed | If-Match failed (ETag mismatch) |
+
+### AWS S3 Conditional Read Implementation
+
+TODO
+
+### AWS S3 Conditional Copy Implementation
+
+TODO
+
+## References
+
+- [AWS S3 Conditional
Requests](https://docs.aws.amazon.com/AmazonS3/latest/userguide/conditional-requests.html)
+- [RFC 7232 - HTTP Conditional Requests](https://tools.ietf.org/html/rfc7232)
+- [HDDS-10656 - Atomic Rewrite
Key](https://issues.apache.org/jira/browse/HDDS-10656)
+- [HDDS-13963 - Atomic
Create-If-Not-Exists](https://issues.apache.org/jira/browse/HDDS-13963)
+- [Leader Election with S3 Conditional
Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/)
+- [An MVCC-like columnar table on S3 with constant-time
deletes](https://simonwillison.net/2025/Oct/11/mvcc-s3/)