[ 
https://issues.apache.org/jira/browse/NIFI-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Scott updated NIFI-15758:
---------------------------------
    Description: 
Currently, the {{UnpackContent}} processor extracts FlowFiles from archive 
formats (e.g. ZIP, TAR, etc.) but does not provide a built-in mechanism to 
assign fragment attributes ({{fragment.identifier}}, 
{{fragment.index}}, {{fragment.count}}) required for downstream 
reassembly using {{MergeContent}} for formats like FlowFile Package.

This makes it difficult to support a common dataflow pattern where content is:
 # Packed to optimise transport
 # Unpacked for enrichment or processing of individual entries
 # Repacked back into the original (or equivalent) archive structure

Without fragment attributes, users must implement custom logic to track 
grouping and ordering, which introduces complexity and inconsistency.
h2. *Use Case*

A common dataflow pattern:
 # Data is packaged (e.g. ZIP/TAR/FFv3) to optimise transport
 # {{UnpackContent}} extracts individual FlowFiles
 # Files are enriched or transformed independently
 # Files are regrouped and repackaged

To support correct regrouping, every unpacked FlowFile must carry:
 * {{fragment.identifier}} → groups entries from the same archive
 * {{fragment.index}} → preserves ordering
 * {{fragment.count}} → total number of entries

Currently, this requires custom logic or is inconsistent depending on format.
----
h2. *Proposed Enhancement*

Add the following optional properties to {{UnpackContent}}:
h3. *1. Add Fragment Attributes*
 * *Property Name:* {{Add Fragment Attributes}}
 * *Description:* When enabled, assigns {{fragment.identifier}}, 
{{fragment.index}}, and {{fragment.count}} to all unpacked FlowFiles.
 * *Allowable Values:* {{true}} / {{false}}
 * *Default:* {{false}} (no change to existing behaviour)

----
h3. *2. Fragment Identifier Value*
 * *Property Name:* {{Fragment Identifier Value}}
 * *Description:*
Specifies the value used for {{fragment.identifier}}.

Supports Expression Language evaluated against the incoming (packed) FlowFile.
The expression is evaluated *once per source FlowFile*, and the resulting 
value is applied to all unpacked FlowFiles derived from that source.
 * *Default Value:* {{${uuid()}}}
 * *Examples:*
 ** {{${uuid()}}} → unique grouping per archive (default)
 ** {{${filename}}} → stable grouping based on original filename
 ** {{${archive.filename}}} → explicit archive attribute (if present)
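
The once-per-source evaluation described above can be sketched in plain Python. This is only an illustration, not NiFi code: {{evaluate_identifier}} and {{tag_children}} are hypothetical names, and the tiny evaluator handles nothing beyond {{${uuid()}}} and a bare attribute reference, which is an assumption made to keep the sketch self-contained.

```python
import re
import uuid
from typing import Dict, List

# Matches a whole-string expression such as ${uuid()} or ${filename}.
_EXPRESSION = re.compile(r"^\$\{([^}]*)\}$")


def evaluate_identifier(expression: str, source_attrs: Dict[str, str]) -> str:
    """Toy stand-in for Expression Language (hypothetical, not the NiFi evaluator)."""
    match = _EXPRESSION.match(expression)
    if match is None:
        return expression  # treat anything else as a literal value
    name = match.group(1)
    if name == "uuid()":
        return str(uuid.uuid4())
    # In this sketch a missing attribute resolves to an empty string.
    return source_attrs.get(name, "")


def tag_children(expression: str,
                 source_attrs: Dict[str, str],
                 child_names: List[str]) -> List[Dict[str, str]]:
    """Evaluate the expression once, then reuse the value for every child."""
    identifier = evaluate_identifier(expression, source_attrs)  # once per source
    return [{"filename": name, "fragment.identifier": identifier}
            for name in child_names]
```

Evaluating inside the loop instead would give every child its own UUID under the default expression, which would defeat grouping; that is presumably why the description insists on a single evaluation per source FlowFile.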

----
h2. *Behaviour Details*

When *enabled*:
 * All FlowFiles produced from a single archive share the same 
{{fragment.identifier}}
 * {{fragment.index}} is assigned based on entry order within the archive
 * {{fragment.count}} is set to the total number of entries extracted
 * The identifier expression is evaluated once per parent FlowFile

When *disabled*:
 * No change to current {{UnpackContent}} behaviour

----
h2. *Compatibility & Scope*
 * Fully backward compatible (feature is opt-in)
 * No changes required to {{MergeContent}}
 ** {{MergeContent}} already supports Defragment mode using:
 *** {{fragment.identifier}}
 *** {{fragment.index}}
 *** {{fragment.count}}
 * Applies consistently across all supported archive formats (ZIP, TAR, etc.)
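
To show why the three attributes are sufficient for regrouping, here is a simplified model of defragment-style reassembly in plain Python. It is not {{MergeContent}}'s implementation (which also handles timeouts, bins and failure routing); {{defragment}} is a hypothetical name, and the check assumes zero-based, contiguous indexes as in the example table of this issue.

```python
from collections import defaultdict
from typing import Dict, List

FlowFile = Dict[str, str]  # attribute map standing in for a FlowFile


def defragment(flowfiles: List[FlowFile]) -> Dict[str, List[FlowFile]]:
    """Return only complete groups, each ordered by fragment.index."""
    groups: Dict[str, List[FlowFile]] = defaultdict(list)
    for flowfile in flowfiles:
        groups[flowfile["fragment.identifier"]].append(flowfile)

    complete: Dict[str, List[FlowFile]] = {}
    for identifier, members in groups.items():
        expected = int(members[0]["fragment.count"])
        indexes = sorted(int(m["fragment.index"]) for m in members)
        # A group is complete when every index 0..count-1 is present exactly once.
        if indexes == list(range(expected)):
            complete[identifier] = sorted(
                members, key=lambda m: int(m["fragment.index"]))
    return complete
```

Fragments may arrive in any order after enrichment; the identifier restores the grouping, the index restores the order, and the count tells the merger when a group is ready.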

----
h2. *Benefits*
 * Enables standard unpack → enrich → repack workflows
 * Eliminates need for custom scripting or attribute tracking
 * Provides consistent behaviour across formats
 * Aligns with existing NiFi fragment-based processing patterns
 * Keeps configuration simple by leveraging Expression Language instead of 
strategy modes

----
h2. *Implementation Notes*
 * Ensure consistent ordering across archive formats when assigning 
{{fragment.index}}
 * Avoid overwriting existing fragment attributes unless explicitly enabled
 * Expression for {{fragment.identifier}} must be evaluated once per parent 
FlowFile
 * Avoid full in-memory buffering where possible when determining 
{{fragment.count}}
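
One possible way to learn {{fragment.count}} without holding the whole archive in memory is a two-pass read: count the entry headers first, then stream the entries. The sketch below uses Python's standard {{tarfile}} module on an uncompressed, seekable TAR; {{iter_fragments}} is a hypothetical name, and the assumption that the input can be read twice may not hold for every source.

```python
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_fragments(archive: BinaryIO) -> Iterator[Tuple[str, int, int, bytes]]:
    """Yield (name, index, count, content) for each regular file in a TAR."""
    # Pass 1: walk the headers only; entry data is skipped, not loaded.
    archive.seek(0)
    with tarfile.open(fileobj=archive, mode="r:") as tar:
        count = sum(1 for member in tar if member.isfile())

    # Pass 2: read one entry at a time, in archive order.
    archive.seek(0)
    with tarfile.open(fileobj=archive, mode="r:") as tar:
        index = 0
        for member in tar:
            if not member.isfile():
                continue  # directories and links do not become fragments
            content = tar.extractfile(member).read()  # only this entry is in memory
            yield member.name, index, count, content
            index += 1
```

The trade-off is a second read of the archive in exchange for bounded memory. Formats that carry an entry listing, such as the ZIP central directory, can supply the count up front without a second pass.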

----
h2. *Example*

*Input:*
{{archive.zip}} containing 3 files

*Output (with feature enabled and {{${filename}}} as identifier):*
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|
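
The table above can be reproduced with a short, self-contained Python sketch built on the standard {{zipfile}} module. It only models the attribute assignment; {{unpack_zip_with_fragments}} and {{build_demo_zip}} are hypothetical helpers, and the identifier is passed in already resolved (here the source filename).

```python
import io
import zipfile
from typing import Dict, List


def build_demo_zip(names: List[str]) -> bytes:
    """Create an in-memory ZIP with one small text entry per name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w") as zf:
        for name in names:
            zf.writestr(name, "content of " + name)
    return buffer.getvalue()


def unpack_zip_with_fragments(archive: bytes, identifier: str) -> List[Dict[str, str]]:
    """Return one attribute map per file entry, in archive order."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        entries = [info for info in zf.infolist() if not info.is_dir()]
        count = len(entries)  # known up front from the ZIP entry listing
        return [
            {
                "filename": info.filename,
                "fragment.identifier": identifier,
                "fragment.index": str(index),
                "fragment.count": str(count),
            }
            for index, info in enumerate(entries)
        ]


rows = unpack_zip_with_fragments(
    build_demo_zip(["file1.txt", "file2.txt", "file3.txt"]), "archive.zip")
for row in rows:
    print(row)
```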

  was:
Currently, the {{UnpackContent}} processor extracts FlowFiles from archive 
formats (e.g. ZIP, TAR, etc.) but does not provide a built-in mechanism to 
assign fragment attributes ({{fragment.identifier}}, 
{{fragment.index}}, {{fragment.count}}) required for downstream 
reassembly using {{MergeContent}}.

This makes it difficult to support a common dataflow pattern where content is:
 # Packed to optimise transport
 # Unpacked for enrichment or processing of individual entries
 # Repacked back into the original (or equivalent) archive structure

Without fragment attributes, users must implement custom logic to track 
grouping and ordering, which introduces complexity and inconsistency.
h2. *Use Case*

A common dataflow pattern:
 # Data is packaged (e.g. ZIP/TAR/FFv3) to optimise transport
 # {{UnpackContent}} extracts individual FlowFiles
 # Files are enriched or transformed independently
 # Files are regrouped and repackaged

To support correct regrouping, all unpacked FlowFiles must share:
 * {{fragment.identifier}} → groups entries from the same archive
 * {{fragment.index}} → preserves ordering
 * {{fragment.count}} → total number of entries

Currently, this requires custom logic or is inconsistent depending on format.
----
h2. *Proposed Enhancement*

Add the following optional properties to {{UnpackContent}}:
h3. *1. Add Fragment Attributes*
 * *Property Name:* {{Add Fragment Attributes}}
 * *Description:* When enabled, assigns {{fragment.identifier}}, 
{{fragment.index}}, and {{fragment.count}} to all unpacked FlowFiles.
 * *Allowable Values:* {{true}} / {{false}}
 * *Default:* {{false}} (no change to existing behaviour)

----
h3. *2. Fragment Identifier Value*
 * *Property Name:* {{Fragment Identifier Value}}
 * *Description:*
Specifies the value used for {{fragment.identifier}}.

Supports Expression Language evaluated against the incoming (packed) FlowFile.
The expression is evaluated *once per source FlowFile*, and the resulting 
value is applied to all unpacked FlowFiles derived from that source.

 * *Default Value:* {{${uuid()}}}
 * *Examples:*
 ** {{${uuid()}}} → unique grouping per archive (default)
 ** {{${filename}}} → stable grouping based on original filename
 ** {{${archive.filename}}} → explicit archive attribute (if present)

----
h2. *Behaviour Details*

When *enabled*:
 * All FlowFiles produced from a single archive share the same 
{{fragment.identifier}}
 * {{fragment.index}} is assigned based on entry order within the archive
 * {{fragment.count}} is set to the total number of entries extracted
 * The identifier expression is evaluated once per parent FlowFile

When *disabled*:
 * No change to current {{UnpackContent}} behaviour

----
h2. *Compatibility & Scope*
 * Fully backward compatible (feature is opt-in)
 * No changes required to {{MergeContent}}
 ** {{MergeContent}} already supports Defragment mode using:
 *** {{fragment.identifier}}
 *** {{fragment.index}}
 *** {{fragment.count}}
 * Applies consistently across all supported archive formats (ZIP, TAR, etc.)

----
h2. *Benefits*
 * Enables standard unpack → enrich → repack workflows
 * Eliminates need for custom scripting or attribute tracking
 * Provides consistent behaviour across formats
 * Aligns with existing NiFi fragment-based processing patterns
 * Keeps configuration simple by leveraging Expression Language instead of 
strategy modes

----
h2. *Implementation Notes*
 * Ensure consistent ordering across archive formats when assigning 
{{fragment.index}}
 * Avoid overwriting existing fragment attributes unless explicitly enabled
 * Expression for {{fragment.identifier}} must be evaluated once per parent 
FlowFile
 * Avoid full in-memory buffering where possible when determining 
{{fragment.count}}

----
h2. *Example*

*Input:*
{{archive.zip}} containing 3 files

*Output (with feature enabled and {{${filename}}} as identifier):*
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|


> Enhance UnpackContent to Optionally Add Fragment Attributes for Repackaging 
> Use Cases
> -------------------------------------------------------------------------------------
>
>                 Key: NIFI-15758
>                 URL: https://issues.apache.org/jira/browse/NIFI-15758
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Richard Scott
>            Assignee: Richard Scott
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
