Richard Scott created NIFI-15758:
------------------------------------
Summary: Enhance UnpackContent to Optionally Add Fragment
Attributes for Repackaging Use Cases
Key: NIFI-15758
URL: https://issues.apache.org/jira/browse/NIFI-15758
Project: Apache NiFi
Issue Type: Improvement
Reporter: Richard Scott
Assignee: Richard Scott
Currently, the {{UnpackContent}} processor extracts FlowFiles from archive
formats (e.g. ZIP, TAR, etc.) but does not provide a built-in mechanism to
assign fragment attributes ({{fragment.identifier}},
{{fragment.index}}, {{fragment.count}}) required for downstream
reassembly using {{MergeContent}}.
This makes it difficult to support a common dataflow pattern where content is:
# Packed to optimise transport
# Unpacked for enrichment or processing of individual entries
# Repacked back into the original (or equivalent) archive structure
Without fragment attributes, users must implement custom logic to track
grouping and ordering, which introduces complexity and inconsistency.
h2. *Use Case*
A common dataflow pattern:
# Data is packaged (e.g. ZIP/TAR/FFv3) to optimise transport
# {{UnpackContent}} extracts individual FlowFiles
# Files are enriched or transformed independently
# Files are regrouped and repackaged
To support correct regrouping, all unpacked FlowFiles must share:
* {{fragment.identifier}} → groups entries from the same archive
* {{fragment.index}} → preserves ordering
* {{fragment.count}} → total number of entries
Currently, achieving this requires custom logic, and the result is inconsistent across archive formats.
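To make the role of the three attributes concrete, here is a minimal Python sketch of the regrouping step. It is only an illustration of the idea, not how {{MergeContent}} is implemented; the {{regroup}} function and the sample attribute maps are hypothetical, and attribute values are kept as strings because FlowFile attributes are strings.

```python
from collections import defaultdict


def regroup(flowfiles):
    """Group attribute maps by fragment.identifier and return only the
    complete groups, each ordered by fragment.index."""
    groups = defaultdict(list)
    for attrs in flowfiles:
        groups[attrs["fragment.identifier"]].append(attrs)

    complete = {}
    for identifier, members in groups.items():
        expected = int(members[0]["fragment.count"])
        indexes = {int(m["fragment.index"]) for m in members}
        # A group is complete when every expected entry arrived exactly once.
        if len(members) == expected and len(indexes) == expected:
            complete[identifier] = sorted(
                members, key=lambda m: int(m["fragment.index"])
            )
    return complete


# Hypothetical sample: one complete archive arriving out of order,
# plus one archive that is still missing an entry.
flowfiles = [
    {"filename": "file3.txt", "fragment.identifier": "archive.zip",
     "fragment.index": "2", "fragment.count": "3"},
    {"filename": "a.txt", "fragment.identifier": "other.zip",
     "fragment.index": "0", "fragment.count": "2"},
    {"filename": "file1.txt", "fragment.identifier": "archive.zip",
     "fragment.index": "0", "fragment.count": "3"},
    {"filename": "file2.txt", "fragment.identifier": "archive.zip",
     "fragment.index": "1", "fragment.count": "3"},
]

result = regroup(flowfiles)
```

Without {{fragment.count}} the completeness check is impossible, and without {{fragment.index}} the original entry order cannot be restored.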
----
h2. *Proposed Enhancement*
Add the following optional properties to {{UnpackContent}}:
h3. *1. Add Fragment Attributes*
* *Property Name:* {{Add Fragment Attributes}}
* *Description:* When enabled, assigns {{fragment.identifier}},
{{fragment.index}}, and {{fragment.count}} to all unpacked FlowFiles.
* *Allowable Values:* {{true}} / {{false}}
* *Default:* {{false}} (no change to existing behaviour)
----
h3. *2. Fragment Identifier Value*
* *Property Name:* {{Fragment Identifier Value}}
* *Description:*
Specifies the value used for {{fragment.identifier}}.
Supports Expression Language evaluated against the incoming (packed) FlowFile.
The expression is evaluated *once per source FlowFile*, and the resulting
value is applied to all unpacked FlowFiles derived from that source.
* *Default Value:* {{${UUID()}}}
* *Examples:*
** {{${UUID()}}} → unique grouping per archive (default)
** {{${filename}}} → stable grouping based on original filename
** {{${archive.filename}}} → explicit archive attribute (if present)
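The once-per-source evaluation matters most for the UUID default: evaluating it per entry would give every unpacked FlowFile a different identifier. The Python sketch below illustrates this under stated assumptions. The {{evaluate}} function is a hypothetical stand-in that understands only a UUID call and plain attribute references; it is not NiFi's Expression Language engine.

```python
import re
import uuid


def evaluate(expression, attributes):
    """Hypothetical stand-in for Expression Language: resolves ${UUID()}
    and ${attributeName} only; unknown attributes become an empty string."""
    def substitute(match):
        token = match.group(1)
        if token == "UUID()":
            return str(uuid.uuid4())
        return attributes.get(token, "")
    return re.sub(r"\$\{([^}]+)\}", substitute, expression)


def fragment_identifiers(expression, source_attributes, entry_count):
    # Evaluate once against the packed FlowFile, then reuse the result
    # for every FlowFile unpacked from it.
    identifier = evaluate(expression, source_attributes)
    return [identifier] * entry_count


source = {"filename": "archive.zip"}
by_uuid = fragment_identifiers("${UUID()}", source, 3)
by_name = fragment_identifiers("${filename}", source, 3)
```

Here {{by_uuid}} holds three copies of one random UUID, while {{by_name}} holds {{archive.zip}} three times. Note that a filename-based identifier is only unique if no two archives in flight share a filename.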
----
h2. *Behaviour Details*
When *enabled*:
* All FlowFiles produced from a single archive share the same
{{fragment.identifier}}
* {{fragment.index}} is assigned based on entry order within the archive
* {{fragment.count}} is set to the total number of entries extracted
* The identifier expression is evaluated once per parent FlowFile
When *disabled*:
* No change to current {{UnpackContent}} behaviour
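The enabled and disabled behaviour can be sketched in Python for the ZIP case. This is an illustration only: the {{unpack}} and {{build_zip}} helpers and the {{add_fragment_attributes}} flag are hypothetical names, the identifier is assumed to have been resolved already, and indexes start at 0 to match the example in this ticket.

```python
import io
import zipfile


def build_zip(names):
    """Build a small in-memory ZIP so the sketch is self-contained."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "content of " + name)
    return buffer.getvalue()


def unpack(zip_bytes, add_fragment_attributes=False, identifier=None):
    """Return one (attributes, content) pair per file entry."""
    results = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Directory entries produce no output, so they are not counted.
        entries = [info for info in archive.infolist() if not info.is_dir()]
        for index, info in enumerate(entries):
            attributes = {"filename": info.filename}
            if add_fragment_attributes:
                attributes["fragment.identifier"] = identifier
                attributes["fragment.index"] = str(index)
                attributes["fragment.count"] = str(len(entries))
            results.append((attributes, archive.read(info)))
    return results


packed = build_zip(["file1.txt", "file2.txt", "file3.txt"])
enabled = unpack(packed, add_fragment_attributes=True, identifier="archive.zip")
disabled = unpack(packed)
```

With the flag off, each result carries only {{filename}}, which mirrors the requirement that existing behaviour is unchanged.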
----
h2. *Compatibility & Scope*
* Fully backward compatible (feature is opt-in)
* No changes required to {{MergeContent}}
** {{MergeContent}} already supports Defragment mode using:
*** {{fragment.identifier}}
*** {{fragment.index}}
*** {{fragment.count}}
* Applies consistently across all supported archive formats (ZIP, TAR, etc.)
----
h2. *Benefits*
* Enables standard unpack → enrich → repack workflows
* Eliminates need for custom scripting or attribute tracking
* Provides consistent behaviour across formats
* Aligns with existing NiFi fragment-based processing patterns
* Keeps configuration simple by leveraging Expression Language instead of
strategy modes
----
h2. *Implementation Notes*
* Ensure consistent ordering across archive formats when assigning
{{fragment.index}}
* Do not overwrite fragment attributes already present on the incoming FlowFile unless the feature is explicitly enabled
* Expression for {{fragment.identifier}} must be evaluated once per parent
FlowFile
* Avoid full in-memory buffering where possible when determining
{{fragment.count}}
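One possible way to satisfy the last note for stream-oriented formats such as TAR, where the entry count is unknown until the stream ends, is to write each entry's content out as it is read and stamp {{fragment.count}} only after the final entry. The Python sketch below assumes this deferred approach; {{stream_unpack}}, {{build_tar}} and the {{sink}} callback are hypothetical names, and the sink stands in for wherever content would actually be persisted.

```python
import io
import tarfile


def build_tar(names):
    """Build a small in-memory TAR so the sketch is self-contained."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in names:
            data = ("content of " + name).encode()
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def stream_unpack(stream, identifier, sink):
    """Single pass over a TAR stream. Content goes straight to the sink;
    only the small attribute maps are held until the count is known."""
    pending = []
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        for member in archive:
            if not member.isfile():
                continue
            sink(member.name, archive.extractfile(member))
            pending.append({
                "filename": member.name,
                "fragment.identifier": identifier,
                "fragment.index": str(len(pending)),
            })
    # The total is only known once the stream is exhausted.
    for attributes in pending:
        attributes["fragment.count"] = str(len(pending))
    return pending


def discard_sink(name, fileobj):
    # Placeholder for persisting content; reads in fixed-size chunks.
    while fileobj.read(8192):
        pass


attrs = stream_unpack(io.BytesIO(build_tar(["a.txt", "b.txt"])), "id-1", discard_sink)
```

Memory use then grows with the number of entries rather than with their size.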
----
h2. *Example*
*Input:*
{{archive.zip}} containing 3 files
*Output (with feature enabled and {{${filename}}} as identifier):*
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|