[ 
https://issues.apache.org/jira/browse/NIFI-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Scott updated NIFI-15758:
---------------------------------
    Summary: Enhance UnpackContent to Optionally Add Fragment Attributes for 
Repackaging Use Case  (was: Enhance UnpackContent to Optionally Add Fragment 
Attributes for Repackaging Use Cases)

> Enhance UnpackContent to Optionally Add Fragment Attributes for Repackaging 
> Use Case
> ------------------------------------------------------------------------------------
>
>                 Key: NIFI-15758
>                 URL: https://issues.apache.org/jira/browse/NIFI-15758
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Richard Scott
>            Assignee: Richard Scott
>            Priority: Minor
>
> Currently, the {{UnpackContent}} processor extracts FlowFiles from archive 
> formats (e.g. ZIP, TAR) but does not provide a built-in mechanism to 
> assign the fragment attributes ({{fragment.identifier}}, 
> {{fragment.index}}, {{fragment.count}}) required for downstream 
> reassembly using {{MergeContent}} for formats like FlowFile Package.
> This makes it difficult to support a common dataflow pattern where content is:
>  # Packed to optimise transport
>  # Unpacked for enrichment or processing of individual entries
>  # Repacked back into the original (or equivalent) archive structure
> Without fragment attributes, users must implement custom logic to track 
> grouping and ordering, which introduces complexity and inconsistency.
> h2. *Use Case*
> A common dataflow pattern:
>  # Data is packaged (e.g. ZIP/TAR/FFv3) to optimise transport
>  # {{UnpackContent}} extracts individual FlowFiles
>  # Files are enriched or transformed independently
>  # Files are regrouped and repackaged
> To support correct regrouping, all unpacked FlowFiles must share:
>  * {{fragment.identifier}} → groups entries from the same archive
>  * {{fragment.index}} → preserves ordering
>  * {{fragment.count}} → total number of entries
> Currently, this requires custom logic or is inconsistent depending on format.
> ----
> h2. *Proposed Enhancement*
> Add the following optional properties to {{UnpackContent}}:
> h3. *1. Add Fragment Attributes*
>  * *Property Name:* {{Add Fragment Attributes}}
>  * *Description:* When enabled, assigns {{fragment.identifier}}, 
> {{fragment.index}}, and {{fragment.count}} to all unpacked FlowFiles.
>  * *Allowable Values:* {{true}} / {{false}}
>  * *Default:* {{false}} (no change to existing behaviour)
> ----
> h3. *2. Fragment Identifier Value*
>  * *Property Name:* {{Fragment Identifier Value}}
>  * *Description:*
> Specifies the value used for {{fragment.identifier}}.
> Supports Expression Language evaluated against the incoming (packed) FlowFile.
> The expression is evaluated *once per source FlowFile*, and the resulting 
> value is applied to all unpacked FlowFiles derived from that source.
>  * *Default Value:* {{${uuid()}}}
>  * *Examples:*
>  ** {{${uuid()}}} → unique grouping per archive (default)
>  ** {{${filename}}} → stable grouping based on original filename
>  ** {{${archive.filename}}} → explicit archive attribute (if present)
> ----
> h2. *Behaviour Details*
> When *enabled*:
>  * All FlowFiles produced from a single archive share the same 
> {{fragment.identifier}}
>  * {{fragment.index}} is assigned based on entry order within the archive
>  * {{fragment.count}} is set to the total number of entries extracted
>  * The identifier expression is evaluated once per parent FlowFile
> When *disabled*:
>  * No change to current {{UnpackContent}} behaviour
> ----
> h2. *Compatibility & Scope*
>  * Fully backward compatible (feature is opt-in)
>  * No changes required to {{MergeContent}}
>  ** {{MergeContent}} already supports Defragment mode using:
>  *** {{fragment.identifier}}
>  *** {{fragment.index}}
>  *** {{fragment.count}}
>  * Applies consistently across all supported archive formats (ZIP, TAR, etc.)
> ----
> h2. *Benefits*
>  * Enables standard unpack → enrich → repack workflows
>  * Eliminates need for custom scripting or attribute tracking
>  * Provides consistent behaviour across formats
>  * Aligns with existing NiFi fragment-based processing patterns
>  * Keeps configuration simple by leveraging Expression Language instead of 
> strategy modes
> ----
> h2. *Implementation Notes*
>  * Ensure consistent ordering across archive formats when assigning 
> {{fragment.index}}
>  * Do not overwrite existing fragment attributes unless {{Add Fragment Attributes}} is explicitly enabled
>  * Expression for {{fragment.identifier}} must be evaluated once per parent 
> FlowFile
>  * Avoid full in-memory buffering where possible when determining 
> {{fragment.count}}
> ----
> h2. *Example*
> *Input:*
> {{archive.zip}} containing 3 files
> *Output (with feature enabled and {{${filename}}} as identifier):*
> ||filename||fragment.identifier||fragment.index||fragment.count||
> |file1.txt|archive.zip|0|3|
> |file2.txt|archive.zip|1|3|
> |file3.txt|archive.zip|2|3|
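
The attribute assignment in the example table above can be sketched in plain Java as follows. This is only an illustration of the proposed behaviour, not code from UnpackContent or MergeContent: the class FragmentAttributeSketch and its methods assignFragmentAttributes and reassemble are hypothetical names, FlowFile attributes are modelled as plain maps, and the identifier is assumed to have been evaluated once against the source FlowFile before the call. The reassemble method is a simplified stand-in for a downstream defragment step, included only to show why all three attributes are needed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed behaviour; not NiFi source code. */
final class FragmentAttributeSketch {

    /**
     * Builds one attribute map per archive entry. The identifier is passed in
     * already evaluated, so every entry from the same archive shares it.
     * Indexes are zero-based, following the example table in the issue.
     */
    static List<Map<String, String>> assignFragmentAttributes(
            String fragmentIdentifier, List<String> entryNames) {
        int count = entryNames.size();
        List<Map<String, String>> result = new ArrayList<>(count);
        for (int index = 0; index < count; index++) {
            Map<String, String> attributes = new LinkedHashMap<>();
            attributes.put("filename", entryNames.get(index));
            attributes.put("fragment.identifier", fragmentIdentifier);
            attributes.put("fragment.index", String.valueOf(index));
            attributes.put("fragment.count", String.valueOf(count));
            result.add(attributes);
        }
        return result;
    }

    /**
     * Simplified stand-in for a defragment step: returns the filenames in
     * fragment.index order, or an empty list if the group is incomplete
     * (the number of fragments does not match fragment.count).
     */
    static List<String> reassemble(List<Map<String, String>> fragments) {
        if (fragments.isEmpty()) {
            return List.of();
        }
        int expected = Integer.parseInt(fragments.get(0).get("fragment.count"));
        if (fragments.size() != expected) {
            return List.of();
        }
        List<Map<String, String>> ordered = new ArrayList<>(fragments);
        ordered.sort(Comparator.comparingInt(
                (Map<String, String> f) -> Integer.parseInt(f.get("fragment.index"))));
        List<String> names = new ArrayList<>();
        for (Map<String, String> fragment : ordered) {
            names.add(fragment.get("filename"));
        }
        return names;
    }

    public static void main(String[] args) {
        // Mirrors the example: archive.zip with three entries, filename as identifier.
        List<Map<String, String>> fragments = assignFragmentAttributes(
                "archive.zip", List.of("file1.txt", "file2.txt", "file3.txt"));
        fragments.forEach(System.out::println);
        System.out.println(reassemble(fragments));
    }
}
```

Passing the identifier in as an argument, rather than computing it per entry, reflects the requirement that the expression is evaluated once per parent FlowFile. Note that this sketch needs the full entry list up front to know fragment.count; a streaming implementation would have to handle that differently.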



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
