Richard Scott created NIFI-15758:
------------------------------------

             Summary: Enhance UnpackContent to Optionally Add Fragment 
Attributes for Repackaging Use Cases
                 Key: NIFI-15758
                 URL: https://issues.apache.org/jira/browse/NIFI-15758
             Project: Apache NiFi
          Issue Type: Improvement
            Reporter: Richard Scott
            Assignee: Richard Scott


Currently, the {{UnpackContent}} processor extracts FlowFiles from archive 
formats (e.g. ZIP, TAR, etc.) but does not provide a built-in mechanism to 
assign fragment attributes ({{fragment.identifier}}, 
{{fragment.index}}, {{fragment.count}}) required for downstream 
reassembly using {{MergeContent}}.

This makes it difficult to support a common dataflow pattern where content is:
 # Packed to optimise transport
 # Unpacked for enrichment or processing of individual entries
 # Repacked back into the original (or equivalent) archive structure

Without fragment attributes, users must implement custom logic to track 
grouping and ordering, which introduces complexity and inconsistency.
h2. *Use Case*

A common dataflow pattern:
 # Data is packaged (e.g. ZIP/TAR/FFv3) to optimise transport
 # {{UnpackContent}} extracts individual FlowFiles
 # Files are enriched or transformed independently
 # Files are regrouped and repackaged

To support correct regrouping, all unpacked FlowFiles must share:
 * {{fragment.identifier}} → groups entries from the same archive
 * {{fragment.index}} → preserves ordering
 * {{fragment.count}} → total number of entries

Currently, this requires custom logic, and the result varies by archive format.
----
h2. *Proposed Enhancement*

Add the following optional properties to {{UnpackContent}}:
h3. *1. Add Fragment Attributes*
 * *Property Name:* {{Add Fragment Attributes}}
 * *Description:* When enabled, assigns {{fragment.identifier}}, 
{{fragment.index}}, and {{fragment.count}} to all unpacked FlowFiles.
 * *Allowable Values:* {{true}} / {{false}}
 * *Default:* {{false}} (no change to existing behaviour)

----
h3. *2. Fragment Identifier Value*
 * *Property Name:* {{Fragment Identifier Value}}
 * *Description:*
Specifies the value used for {{fragment.identifier}}.

Supports Expression Language evaluated against the incoming (packed) FlowFile.
The expression is evaluated *once per source FlowFile*, and the resulting 
value is applied to all unpacked FlowFiles derived from that source.

 * *Default Value:* {{${UUID()}}}
 * *Examples:*
 ** {{${UUID()}}} → unique grouping per archive (default)
 ** {{${filename}}} → stable grouping based on original filename
 ** {{${archive.filename}}} → explicit archive attribute (if present)
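
The "evaluate once per source FlowFile" rule can be illustrated with a minimal Python sketch. This is not NiFi code: {{resolve_identifier}} and {{identifiers_for_entries}} are hypothetical names, and the resolver is a toy stand-in for Expression Language that only understands the random-UUID default, a single attribute reference, or a literal.

```python
import uuid


def resolve_identifier(expression, source_attributes):
    """Toy stand-in for Expression Language evaluation (an assumption, not NiFi's evaluator)."""
    text = expression.strip()
    if text.lower() == "${uuid()}":
        # Random identifier: a new value on every evaluation.
        return str(uuid.uuid4())
    if text.startswith("${") and text.endswith("}"):
        # Plain attribute reference such as ${filename}.
        return source_attributes.get(text[2:-1], "")
    # Anything else is treated as a literal value.
    return text


def identifiers_for_entries(expression, source_attributes, entry_count):
    """Evaluate the expression once, then reuse the value for every unpacked entry."""
    identifier = resolve_identifier(expression, source_attributes)
    return [identifier] * entry_count
```

Evaluating once matters most for the random default: evaluating it per entry would give every unpacked FlowFile a different identifier, and the entries could never be regrouped.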

----
h2. *Behaviour Details*

When *enabled*:
 * All FlowFiles produced from a single archive share the same 
{{fragment.identifier}}
 * {{fragment.index}} is assigned based on entry order within the archive
 * {{fragment.count}} is set to the total number of entries extracted
 * The identifier expression is evaluated once per parent FlowFile
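
The enabled behaviour can be sketched in Python for the ZIP case. This is an illustration only, not the processor's implementation: {{unpack_with_fragments}} and {{build_zip}} are hypothetical helpers, FlowFiles are modelled as {{(content, attributes)}} pairs, and the zero-based index is an assumption taken from the example table in this issue.

```python
import io
import zipfile


def build_zip(files):
    """Helper for the sketch: build an in-memory ZIP from (name, bytes) pairs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files:
            archive.writestr(name, data)
    return buffer.getvalue()


def unpack_with_fragments(archive_bytes, identifier):
    """Return (content, attributes) pairs, one per file entry, in archive order."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # Directory entries carry no content, so they are not counted as fragments.
        entries = [info for info in archive.infolist() if not info.is_dir()]
        count = len(entries)
        unpacked = []
        for index, info in enumerate(entries):
            attributes = {
                "filename": info.filename,
                "fragment.identifier": identifier,  # shared by every entry
                "fragment.index": str(index),       # entry order within the archive
                "fragment.count": str(count),       # total entries extracted
            }
            unpacked.append((archive.read(info), attributes))
        return unpacked
```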

When *disabled*:
 * No change to current {{UnpackContent}} behaviour

----
h2. *Compatibility & Scope*
 * Fully backward compatible (feature is opt-in)
 * No changes required to {{MergeContent}}
 ** {{MergeContent}} already supports Defragment mode using:
 *** {{fragment.identifier}}
 *** {{fragment.index}}
 *** {{fragment.count}}
 * Applies consistently across all supported archive formats (ZIP, TAR, etc.)
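
How the three attributes allow downstream regrouping can be sketched as follows. This is a simplified Python illustration of the defragment idea, not {{MergeContent}}'s actual logic: {{defragment}} is a hypothetical function, FlowFiles are modelled as {{(content, attributes)}} pairs, and timeouts, bins, and duplicate-index handling are ignored.

```python
from collections import defaultdict


def defragment(flowfiles):
    """Group by fragment.identifier and return only complete groups, ordered by index."""
    groups = defaultdict(list)
    for content, attributes in flowfiles:
        groups[attributes["fragment.identifier"]].append((content, attributes))

    complete = {}
    for identifier, members in groups.items():
        expected = int(members[0][1]["fragment.count"])
        if len(members) != expected:
            # Some fragments have not arrived yet; leave this group out.
            continue
        members.sort(key=lambda member: int(member[1]["fragment.index"]))
        complete[identifier] = [content for content, _ in members]
    return complete
```

Because grouping relies only on the attribute values, enriched entries can arrive in any order and still be reassembled in their original sequence.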

----
h2. *Benefits*
 * Enables standard unpack → enrich → repack workflows
 * Eliminates need for custom scripting or attribute tracking
 * Provides consistent behaviour across formats
 * Aligns with existing NiFi fragment-based processing patterns
 * Keeps configuration simple by leveraging Expression Language instead of 
strategy modes

----
h2. *Implementation Notes*
 * Ensure consistent ordering across archive formats when assigning 
{{fragment.index}}
 * Do not overwrite fragment attributes already present on a FlowFile unless the feature is explicitly enabled
 * Expression for {{fragment.identifier}} must be evaluated once per parent 
FlowFile
 * Avoid full in-memory buffering where possible when determining 
{{fragment.count}}
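
One way to avoid buffering entry content while still setting {{fragment.count}} is to stream each entry to its destination as it is read, keep only the small attribute maps, and stamp the count after the last entry. The Python sketch below shows this for TAR under stated assumptions: {{unpack_streaming}}, {{build_tar}}, and the {{sink}} callback are hypothetical, and the sink must consume each entry before the next one is read.

```python
import io
import tarfile


def build_tar(files):
    """Helper for the sketch: build an in-memory TAR from (name, bytes) pairs."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def unpack_streaming(stream, identifier, sink):
    """Stream entries to sink(name, fileobj); return attribute maps with the count set last."""
    pending = []  # attribute maps only; entry content is never held here
    with tarfile.open(fileobj=stream, mode="r|") as archive:  # "r|" reads sequentially
        for member in archive:
            if not member.isfile():
                continue
            sink(member.name, archive.extractfile(member))
            pending.append({
                "filename": member.name,
                "fragment.identifier": identifier,
                "fragment.index": str(len(pending)),  # zero-based position
            })
    # The total is only known once the archive has been read to the end.
    for attributes in pending:
        attributes["fragment.count"] = str(len(pending))
    return pending
```

The trade-off is that no unpacked entry can be released downstream with a complete attribute set until the whole archive has been read.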

----
h2. *Example*

*Input:*
{{archive.zip}} containing 3 files

*Output (with feature enabled and {{${filename}}} as identifier):*
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|


