[ https://issues.apache.org/jira/browse/NIFI-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Scott updated NIFI-15758:
---------------------------------
Description:
h2. Summary
Currently, the *UnpackContent* processor extracts FlowFiles from archive
formats such as ZIP, TAR, and FlowFile Package, but it does not provide a
built-in mechanism to assign the fragment attributes required for downstream
reassembly using *MergeContent* in *Defragment* mode.
The relevant attributes are:
* *fragment.identifier*
* *fragment.index*
* *fragment.count*
This makes it difficult to support a common dataflow pattern where content is
packed to optimise transport, unpacked for enrichment or processing, and then
repacked into the original or an equivalent archive structure.
Without fragment attributes, users must implement custom logic to track
grouping and ordering, which introduces unnecessary complexity and
inconsistency.
In addition, once *MergeContent* completes defragmentation, the fragment
attributes used for reassembly are no longer needed on the merged FlowFile.
These attributes should be removed from the merged result so the output
reflects the completed package rather than the intermediate fragmentation state.
h2. Use Case
A common dataflow pattern is:
# Data is packaged (for example ZIP, TAR, or FlowFile Package) to optimise
transport
# *UnpackContent* extracts individual FlowFiles
# Files are enriched or transformed independently
# Files are regrouped and repackaged using *MergeContent*
To support correct regrouping, each unpacked FlowFile must carry:
* *fragment.identifier* — groups entries from the same archive
* *fragment.index* — preserves ordering
* *fragment.count* — total number of entries
Currently, this either requires custom logic or behaves inconsistently
depending on the format being unpacked.
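The role of the three attributes can be illustrated with a minimal sketch. It is a plain-Python model, not NiFi code: FlowFiles are represented as attribute dictionaries, and the function name *regroup* is hypothetical.

```python
from collections import defaultdict


def regroup(flowfiles):
    """Group attribute dicts by fragment.identifier.

    Returns only the groups whose size matches fragment.count,
    with members sorted by fragment.index.
    """
    groups = defaultdict(list)
    for ff in flowfiles:
        groups[ff["fragment.identifier"]].append(ff)

    complete = {}
    for identifier, members in groups.items():
        expected = int(members[0]["fragment.count"])
        if len(members) == expected:
            # Attribute values are strings, so compare indexes numerically.
            complete[identifier] = sorted(
                members, key=lambda f: int(f["fragment.index"])
            )
    return complete
```

A group is released only once every expected member has arrived, which is why the count is needed in addition to the identifier and the index.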
h2. Proposed Enhancement
h3. 1. Add support in UnpackContent for optional fragment attribute generation
Add the following optional properties to *UnpackContent*:
*Add Fragment Attributes*
*Property Name:* Add Fragment Attributes
*Description:* When enabled, assigns *fragment.identifier*,
*fragment.index*, and *fragment.count* to all unpacked FlowFiles.
*Allowable Values:* true / false
*Default:* false
*Fragment Identifier Value*
*Property Name:* Fragment Identifier Value
*Description:* Specifies the value used for *fragment.identifier*.
Supports Expression Language evaluated against the incoming packed FlowFile.
The expression is evaluated once per source FlowFile, and the resulting value
is applied to all unpacked FlowFiles derived from that source.
*Default Value:* ${UUID()}
*Examples:*
* ${UUID()} — unique grouping per archive
* ${filename} — stable grouping based on original filename
* ${archive.filename} — explicit archive attribute, if present
h3. UnpackContent Behaviour
When enabled:
* All FlowFiles produced from a single archive share the same
*fragment.identifier*
* *fragment.index* is assigned based on entry order within the archive
* *fragment.count* is set to the total number of entries extracted
* The identifier expression is evaluated once per parent FlowFile
When disabled:
* No change to current *UnpackContent* behaviour
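The enabled and disabled behaviour described above could be modelled roughly as follows. This is a sketch under stated assumptions, not the UnpackContent implementation: it handles ZIP only, represents FlowFiles as dictionaries, and the names *unpack_with_fragments* and *identifier_fn* are hypothetical stand-ins for the processor and for the evaluated Fragment Identifier Value expression. The zero-based index follows the example table in this issue.

```python
import io
import uuid
import zipfile


def unpack_with_fragments(archive_bytes, parent_attrs,
                          add_fragment_attributes=False, identifier_fn=None):
    """Unpack a ZIP held in memory into a list of child FlowFile dicts."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # Keep archive entry order and skip directory entries.
        entries = [info for info in zf.infolist() if not info.is_dir()]
        children = [
            {"attributes": {"filename": info.filename}, "content": zf.read(info)}
            for info in entries
        ]

    if not add_fragment_attributes:
        return children  # disabled: no fragment attributes are added

    # Evaluate the identifier once per parent, then reuse it for every child.
    if identifier_fn is not None:
        identifier = identifier_fn(parent_attrs)
    else:
        identifier = str(uuid.uuid4())

    count = str(len(children))
    for index, child in enumerate(children):
        child["attributes"].update({
            "fragment.identifier": identifier,
            "fragment.index": str(index),
            "fragment.count": count,
        })
    return children
```

Passing `identifier_fn=lambda attrs: attrs["filename"]` corresponds to configuring ${filename} as the identifier. Note that this sketch reads all entries before it knows the count, so it does not address streaming of large archives.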
h3. 2. Update MergeContent to remove fragment attributes after defragmentation
*MergeContent* already supports *Defragment* mode using:
* *fragment.identifier*
* *fragment.index*
* *fragment.count*
However, after a successful defragmentation, these attributes are no longer
needed on the merged FlowFile and should be removed from the output.
Proposed *MergeContent* behaviour:
* When operating in *Defragment* mode, after the merged FlowFile is created,
remove:
** *fragment.identifier*
** *fragment.index*
** *fragment.count*
This ensures the merged FlowFile represents the final repackaged artifact
rather than retaining temporary grouping metadata from the unpack/repack
workflow.
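A minimal sketch of the proposed clean-up step, again as a plain-Python model rather than MergeContent code. The function name *defragment* is hypothetical, content is simply concatenated instead of being written to an archive, and attribute handling is simplified to keeping only the attributes that have the same value on every fragment.

```python
FRAGMENT_KEYS = ("fragment.identifier", "fragment.index", "fragment.count")


def defragment(fragments):
    """Merge a complete fragment group; return None if it is incomplete.

    On success the merged FlowFile carries no fragment.* attributes.
    On failure the input fragments are left untouched.
    """
    if not fragments:
        return None

    expected = int(fragments[0]["attributes"]["fragment.count"])
    indexes = {int(f["attributes"]["fragment.index"]) for f in fragments}
    if len(fragments) != expected or len(indexes) != expected:
        return None  # missing or duplicate fragments: do not merge

    ordered = sorted(fragments, key=lambda f: int(f["attributes"]["fragment.index"]))

    # Keep only attributes whose value is identical on every fragment.
    common = dict(ordered[0]["attributes"])
    for fragment in ordered[1:]:
        common = {k: v for k, v in common.items()
                  if fragment["attributes"].get(k) == v}

    # fragment.identifier and fragment.count are common to the whole group,
    # so they would survive the step above unless removed explicitly.
    for key in FRAGMENT_KEYS:
        common.pop(key, None)

    return {"attributes": common,
            "content": b"".join(f["content"] for f in ordered)}
```

The removal happens only on the success path, so fragments that are still waiting for the rest of their group keep the attributes they need for a later merge attempt.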
h2. Example
*Input:*
archive.zip containing 3 files
*UnpackContent output* (with feature enabled and *${filename}* as identifier):
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|
After processing and successful defragmentation in *MergeContent*, the
merged FlowFile would no longer retain:
* *fragment.identifier*
* *fragment.index*
* *fragment.count*
This leaves the merged FlowFile as the completed repackaged output, without
temporary fragmentation metadata.
was:
h2. Summary
Currently, the *UnpackContent* processor extracts FlowFiles from archive
formats such as ZIP, TAR, and FlowFile Package, but it does not provide a
built-in mechanism to assign the fragment attributes required for downstream
reassembly using *MergeContent* in *Defragment* mode.
The relevant attributes are:
* *fragment.identifier*
* *fragment.index*
* *fragment.count*
This makes it difficult to support a common dataflow pattern where content is
packed to optimise transport, unpacked for enrichment or processing, and then
repacked back into the original or equivalent archive structure.
Without fragment attributes, users must implement custom logic to track
grouping and ordering, which introduces unnecessary complexity and
inconsistency.
In addition, once *MergeContent* completes defragmentation, the fragment
attributes used for reassembly are no longer needed on the merged FlowFile.
These attributes should be removed from the merged result so the output
reflects the completed package rather than the intermediate fragmentation state.
h2. Use Case
A common dataflow pattern is:
# Data is packaged (for example ZIP, TAR, or FlowFile Package) to optimise
transport
# *UnpackContent* extracts individual FlowFiles
# Files are enriched or transformed independently
# Files are regrouped and repackaged using *MergeContent*
To support correct regrouping, all unpacked FlowFiles must share:
* *fragment.identifier* — groups entries from the same archive
* *fragment.index* — preserves ordering
* *fragment.count* — total number of entries
Currently, this either requires custom logic or behaves inconsistently
depending on the format being unpacked.
h2. Proposed Enhancement
h3. 1. Add support in UnpackContent for optional fragment attribute generation
Add the following optional properties to *UnpackContent*:
*Add Fragment Attributes*
*Property Name:* Add Fragment Attributes
*Description:* When enabled, assigns *fragment.identifier*, *fragment.index*,
and *fragment.count* to all unpacked FlowFiles.
*Allowable Values:* true / false
*Default:* false
*Fragment Identifier Value*
*Property Name:* Fragment Identifier Value
*Description:* Specifies the value used for *fragment.identifier*.
Supports Expression Language evaluated against the incoming packed FlowFile.
The expression is evaluated once per source FlowFile, and the resulting value
is applied to all unpacked FlowFiles derived from that source.
*Default Value:* ${uuid()}
*Examples:*
* ${uuid()} — unique grouping per archive
* ${filename} — stable grouping based on original filename
* ${archive.filename} — explicit archive attribute, if present
h3. UnpackContent Behaviour
When enabled:
* All FlowFiles produced from a single archive share the same
*fragment.identifier*
* *fragment.index* is assigned based on entry order within the archive
* *fragment.count* is set to the total number of entries extracted
* The identifier expression is evaluated once per parent FlowFile
When disabled:
* No change to current *UnpackContent* behaviour
h3. 2. Update MergeContent to remove fragment attributes after defragmentation
*MergeContent* already supports *Defragment* mode using:
* *fragment.identifier*
* *fragment.index*
* *fragment.count*
However, after a successful defragmentation, these attributes are no longer
needed on the merged FlowFile and should be removed from the output.
Proposed *MergeContent* behaviour:
* When operating in *Defragment* mode, after the merged FlowFile is created,
remove:
** *fragment.identifier*
** *fragment.index*
** *fragment.count*
This ensures the merged FlowFile represents the final repackaged artifact
rather than retaining temporary grouping metadata from the unpack/repack
workflow.
h2. Compatibility and Scope
* Fully backward compatible
* Fragment attribute generation in *UnpackContent* is opt-in
* Existing flows remain unchanged unless the new property is enabled
* The *MergeContent* change only affects the output of successful *Defragment*
operations
* Applies consistently across supported unpack/archive formats such as ZIP,
TAR, and FlowFile Package
h2. Benefits
* Enables standard unpack → enrich → repack workflows
* Eliminates the need for custom scripting or manual attribute tracking
* Provides consistent fragment-based behaviour across formats
* Aligns with existing NiFi fragment processing patterns
* Ensures final merged FlowFiles do not retain intermediate fragmentation
metadata
* Keeps configuration simple by leveraging Expression Language instead of
introducing additional strategy modes
h2. Implementation Notes
* Ensure consistent ordering across archive formats when assigning
*fragment.index*
* Avoid overwriting existing fragment attributes unless the feature is
explicitly enabled
* Evaluate the *fragment.identifier* expression once per parent FlowFile
* Avoid full in-memory buffering where possible when determining
*fragment.count*
* Ensure *MergeContent* removes fragment attributes only after a successful
defragmentation, not during partial or failed merge scenarios
h2. Example
*Input:*
archive.zip containing 3 files
*UnpackContent output* (with feature enabled and *${filename}* as identifier):
|| filename || fragment.identifier || fragment.index || fragment.count ||
| file1.txt | archive.zip | 0 | 3 |
| file2.txt | archive.zip | 1 | 3 |
| file3.txt | archive.zip | 2 | 3 |
After processing and successful defragmentation in *MergeContent*, the merged
FlowFile would no longer retain:
* *fragment.identifier*
* *fragment.index*
* *fragment.count*
This leaves the merged FlowFile as the completed repackaged output, without
temporary fragmentation metadata.
> Enhance UnpackContent to Optionally Add Fragment Attributes for Repackaging Use Case
> ------------------------------------------------------------------------------------
>
> Key: NIFI-15758
> URL: https://issues.apache.org/jira/browse/NIFI-15758
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Richard Scott
> Assignee: Richard Scott
> Priority: Minor