Scrooge-McDucks opened a new pull request, #11058:
URL: https://github.com/apache/nifi/pull/11058
…e fragment attributes from MergeContent in Defragment mode
## Summary
This change adds optional fragment attribute support to `UnpackContent` so
unpacked FlowFiles can be regrouped downstream using `MergeContent` in
`Defragment` mode.
It also updates `MergeContent` to remove reassembly-related attributes from
the merged FlowFile once defragmentation has completed successfully, including:
- `fragment.identifier`
- `fragment.index`
- `fragment.count`
- `segment.original.filename`
## Motivation
A common dataflow pattern is:
1. Data is packaged to optimise transport
2. `UnpackContent` extracts individual FlowFiles
3. Files are enriched or transformed independently
4. Files are regrouped and repackaged
This works well conceptually, but today `UnpackContent` does not provide a
built-in way to assign the fragment attributes needed for downstream reassembly
across formats such as ZIP, TAR, and FlowFile Package.
Without those attributes, users need custom logic to preserve grouping and
ordering, which adds complexity and can lead to inconsistent behaviour.
This change makes that workflow easier by allowing `UnpackContent` to
optionally generate fragment attributes, while ensuring `MergeContent` removes
the temporary reassembly metadata once the final merged FlowFile has been
produced.
## Changes Included
### UnpackContent
Added optional support for assigning fragment attributes to unpacked
FlowFiles.
#### New Properties
**Add Fragment Attributes**
- When enabled, assigns:
  - `fragment.identifier`
  - `fragment.index`
  - `fragment.count`
**Fragment Identifier Value**
- Specifies the value used for `fragment.identifier`
- Supports Expression Language evaluated against the incoming packed FlowFile
- Evaluated once per source FlowFile, with the resulting value applied to
all unpacked FlowFiles derived from that source
- Default: `${UUID()}`
Examples:
- `${UUID()}` for a unique grouping per archive (default)
- `${filename}` for grouping based on the original filename
- `${archive.filename}` when an explicit archive attribute is available
#### Behaviour
When enabled:
- All FlowFiles produced from a single archive share the same
`fragment.identifier`
- `fragment.index` is assigned based on entry order within the archive
- `fragment.count` is set to the total number of unpacked entries
- The identifier expression is evaluated once per parent FlowFile
When disabled:
- No change to current `UnpackContent` behaviour
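The behaviour listed above can be illustrated with a minimal, framework-free sketch. It is not the code from this pull request: the class `FragmentAttributeSketch` and the method `assignFragmentAttributes` are hypothetical names, plain maps stand in for FlowFile attributes, and the identifier is assumed to have been evaluated once already before the entries are iterated. The zero-based index follows the table in the Example section.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; the real processor works on FlowFiles, not maps.
class FragmentAttributeSketch {

    // Attribute names as listed in this pull request description.
    static final String FRAGMENT_ID = "fragment.identifier";
    static final String FRAGMENT_INDEX = "fragment.index";
    static final String FRAGMENT_COUNT = "fragment.count";

    /**
     * Builds one attribute map per unpacked entry. The identifier is passed in
     * already evaluated, so every entry from the same archive shares one value.
     */
    public static List<Map<String, String>> assignFragmentAttributes(
            String identifier, List<String> entryNames) {
        List<Map<String, String>> result = new ArrayList<>();
        int count = entryNames.size();
        for (int index = 0; index < count; index++) {
            Map<String, String> attributes = new LinkedHashMap<>();
            attributes.put("filename", entryNames.get(index));
            attributes.put(FRAGMENT_ID, identifier);
            // Index follows entry order within the archive, starting at 0.
            attributes.put(FRAGMENT_INDEX, String.valueOf(index));
            // Count is the total number of unpacked entries.
            attributes.put(FRAGMENT_COUNT, String.valueOf(count));
            result.add(attributes);
        }
        return result;
    }

    public static void main(String[] args) {
        assignFragmentAttributes("archive.zip",
                List.of("file1.txt", "file2.txt", "file3.txt"))
                .forEach(System.out::println);
    }
}
```

Passing the identifier in as an argument mirrors the "evaluated once per parent FlowFile" rule: with an expression such as `${UUID()}`, evaluating it per entry would give every fragment a different group.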
### MergeContent
Updated `MergeContent` so that after a successful defragmentation, the
merged FlowFile no longer retains temporary reassembly metadata.
When operating in `Defragment` mode, `MergeContent` now removes the following attributes from the merged FlowFile:
- `fragment.identifier`
- `fragment.index`
- `fragment.count`
- `segment.original.filename`
This ensures the final merged output reflects the completed repackaged
artifact rather than the intermediate fragmentation state used to drive
regrouping.
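As a rough sketch of the cleanup described above, the following uses a plain attribute map in place of a FlowFile. The class `DefragmentCleanupSketch` and the method `removeReassemblyAttributes` are hypothetical names chosen for illustration; the actual change in `MergeContent` may be structured differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the MergeContent implementation.
class DefragmentCleanupSketch {

    // The four reassembly attributes named in this pull request description.
    static final Set<String> REASSEMBLY_ATTRIBUTES = Set.of(
            "fragment.identifier",
            "fragment.index",
            "fragment.count",
            "segment.original.filename");

    /**
     * Returns a copy of the merged attributes without the reassembly metadata.
     * The input map is left untouched.
     */
    public static Map<String, String> removeReassemblyAttributes(
            Map<String, String> mergedAttributes) {
        Map<String, String> cleaned = new LinkedHashMap<>(mergedAttributes);
        cleaned.keySet().removeAll(REASSEMBLY_ATTRIBUTES);
        return cleaned;
    }

    public static void main(String[] args) {
        Map<String, String> merged = new LinkedHashMap<>();
        merged.put("filename", "archive.zip");
        merged.put("fragment.identifier", "archive.zip");
        merged.put("fragment.index", "0");
        merged.put("fragment.count", "3");
        merged.put("segment.original.filename", "archive.zip");
        System.out.println(removeReassemblyAttributes(merged));
    }
}
```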
## Compatibility
- Fully backward compatible
- Fragment attribute generation in `UnpackContent` is opt-in
- Existing flows are unchanged unless the new property is enabled
- The `MergeContent` cleanup only applies after successful defragmentation
## Example
Input:
- `archive.zip` containing 3 files
Unpack output when enabled with `${filename}` as the identifier:
| filename | fragment.identifier | fragment.index | fragment.count |
|-----------|---------------------|----------------|----------------|
| file1.txt | archive.zip | 0 | 3 |
| file2.txt | archive.zip | 1 | 3 |
| file3.txt | archive.zip | 2 | 3 |
After processing and successful defragmentation in `MergeContent`, the
merged FlowFile no longer retains:
- `fragment.identifier`
- `fragment.index`
- `fragment.count`
- `segment.original.filename`
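Between the unpack output and the merged result in this example, the three attributes are what lets a defragmenting step decide that a group is complete and in which order to reassemble it. The sketch below is a simplified assumption of that check, not the `MergeContent` implementation: the class `DefragmentOrderSketch` and the method `orderIfComplete` are hypothetical, maps stand in for FlowFiles, every fragment is assumed to carry all three attributes, and indices are assumed to be zero-based as in the table.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; a simplified view of defragment-style regrouping.
class DefragmentOrderSketch {

    /**
     * Returns the filenames in fragment.index order when the group is complete,
     * or an empty Optional when fragments are missing, duplicated, or mixed.
     */
    public static Optional<List<String>> orderIfComplete(List<Map<String, String>> fragments) {
        if (fragments.isEmpty()) {
            return Optional.empty();
        }
        String identifier = fragments.get(0).get("fragment.identifier");
        int expected = Integer.parseInt(fragments.get(0).get("fragment.count"));
        if (fragments.size() != expected) {
            return Optional.empty();
        }
        boolean[] seen = new boolean[expected];
        for (Map<String, String> fragment : fragments) {
            // Every fragment must belong to the same group.
            if (!identifier.equals(fragment.get("fragment.identifier"))) {
                return Optional.empty();
            }
            // Each index in 0..count-1 must appear exactly once.
            int index = Integer.parseInt(fragment.get("fragment.index"));
            if (index < 0 || index >= expected || seen[index]) {
                return Optional.empty();
            }
            seen[index] = true;
        }
        List<Map<String, String>> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt(f -> Integer.parseInt(f.get("fragment.index"))));
        List<String> names = new ArrayList<>();
        for (Map<String, String> fragment : sorted) {
            names.add(fragment.get("filename"));
        }
        return Optional.of(names);
    }

    public static void main(String[] args) {
        List<Map<String, String>> group = List.of(
                Map.of("filename", "file3.txt", "fragment.identifier", "archive.zip",
                        "fragment.index", "2", "fragment.count", "3"),
                Map.of("filename", "file1.txt", "fragment.identifier", "archive.zip",
                        "fragment.index", "0", "fragment.count", "3"),
                Map.of("filename", "file2.txt", "fragment.identifier", "archive.zip",
                        "fragment.index", "1", "fragment.count", "3"));
        System.out.println(orderIfComplete(group));
    }
}
```

Fragments may arrive in any order after independent processing, which is why the sketch sorts by index rather than relying on arrival order.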
# Summary
[NIFI-15758](https://issues.apache.org/jira/browse/NIFI-15758)
# Tracking
Please complete the following tracking steps prior to pull request creation.
### Issue Tracking
- [x] [Apache NiFi Jira](https://issues.apache.org/jira/browse/NIFI) issue
created
### Pull Request Tracking
- [x] Pull Request title starts with Apache NiFi Jira issue number, such as
`NIFI-00000`
- [x] Pull Request commit message starts with Apache NiFi Jira issue number,
such as `NIFI-00000`
- [x] Pull request contains [commits
signed](https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits)
with a registered key indicating `Verified` status
### Pull Request Formatting
- [x] Pull Request based on current revision of the `main` branch
- [x] Pull Request refers to a feature branch with one commit containing
changes
# Verification
Please indicate the verification steps performed prior to pull request
creation.
### Build
- [x] Build completed using `./mvnw clean install -P contrib-check`
- [x] JDK 21
- [ ] JDK 25
### Licensing
- [x] New dependencies are compatible with the [Apache License
2.0](https://apache.org/licenses/LICENSE-2.0) according to the [License
Policy](https://www.apache.org/legal/resolved.html)
- [x] New dependencies are documented in applicable `LICENSE` and `NOTICE`
files
### Documentation
- [x] Documentation formatting appears as expected in rendered files