[ https://issues.apache.org/jira/browse/NIFI-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Scott updated NIFI-15758:
---------------------------------
    Description: 
h2. Summary

Currently, the *UnpackContent* processor extracts FlowFiles from archive 
formats such as ZIP, TAR, and FlowFile Package, but it does not provide a 
built-in mechanism to assign the fragment attributes required for downstream 
reassembly using *MergeContent* in *Defragment* mode.

The relevant attributes are:

* *fragment.identifier*
* *fragment.index*
* *fragment.count*

This makes it difficult to support a common dataflow pattern where content is 
packed to optimise transport, unpacked for enrichment or processing, and then 
repacked back into the original or equivalent archive structure.

Without fragment attributes, users must implement custom logic to track 
grouping and ordering, which introduces unnecessary complexity and 
inconsistency.

In addition, once *MergeContent* completes defragmentation, the fragment 
attributes used for reassembly are no longer needed on the merged FlowFile. 
These attributes should be removed from the merged result so the output 
reflects the completed package rather than the intermediate fragmentation state.

h2. Use Case

A common dataflow pattern is:

# Data is packaged (for example ZIP, TAR, or FlowFile Package) to optimise 
transport
# *UnpackContent* extracts individual FlowFiles
# Files are enriched or transformed independently
# Files are regrouped and repackaged using *MergeContent*

To support correct regrouping, all unpacked FlowFiles must share:

* *fragment.identifier* — groups entries from the same archive
* *fragment.index* — preserves ordering
* *fragment.count* — total number of entries

Currently, this either requires custom logic or behaves inconsistently 
depending on the format being unpacked.
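
To make the role of the three attributes concrete, here is a minimal Python sketch of how a Defragment-style merge could regroup fragments. It is an illustration only, not NiFi's implementation: FlowFiles are modelled as plain attribute dicts, indices are assumed to start at 0, and regroup is a hypothetical helper name.

```python
from collections import defaultdict


def regroup(flowfiles):
    """Group attribute dicts by fragment.identifier and return complete groups.

    A group counts as complete when its fragment.index values cover
    0..fragment.count-1 with no gaps or duplicates (0-based is an assumption).
    """
    groups = defaultdict(list)
    for attrs in flowfiles:
        groups[attrs["fragment.identifier"]].append(attrs)

    complete = {}
    for identifier, members in groups.items():
        expected = int(members[0]["fragment.count"])
        indices = sorted(int(m["fragment.index"]) for m in members)
        if indices == list(range(expected)):
            # Restore the original entry order before repackaging.
            complete[identifier] = sorted(
                members, key=lambda m: int(m["fragment.index"])
            )
    return complete


# Fragments arrive out of order after independent enrichment.
arrived = [
    {"filename": "file3.txt", "fragment.identifier": "archive.zip",
     "fragment.index": "2", "fragment.count": "3"},
    {"filename": "file1.txt", "fragment.identifier": "archive.zip",
     "fragment.index": "0", "fragment.count": "3"},
    {"filename": "file2.txt", "fragment.identifier": "archive.zip",
     "fragment.index": "1", "fragment.count": "3"},
    # Only one of two fragments has arrived for this group.
    {"filename": "a.txt", "fragment.identifier": "other.tar",
     "fragment.index": "0", "fragment.count": "2"},
]
ready = regroup(arrived)
```

In this sketch the archive.zip group is returned in index order, while the incomplete other.tar group is held back until its remaining fragment arrives.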

h2. Proposed Enhancement

h3. 1. Add support in UnpackContent for optional fragment attribute generation

Add the following optional properties to *UnpackContent*:

*Add Fragment Attributes*  
*Property Name:* Add Fragment Attributes  
*Description:* When enabled, assigns *fragment.identifier*, *fragment.index*, 
and *fragment.count* to all unpacked FlowFiles.  
*Allowable Values:* true / false  
*Default:* false

*Fragment Identifier Value*  
*Property Name:* Fragment Identifier Value  
*Description:* Specifies the value used for *fragment.identifier*.

Supports Expression Language evaluated against the incoming packed FlowFile. 
The expression is evaluated once per source FlowFile, and the resulting value 
is applied to all unpacked FlowFiles derived from that source.

*Default Value:* ${uuid()}

*Examples:*
* ${uuid()} — unique grouping per archive
* ${filename} — stable grouping based on original filename
* ${archive.filename} — explicit archive attribute, if present

h3. UnpackContent Behaviour

When enabled:

* All FlowFiles produced from a single archive share the same 
*fragment.identifier*
* *fragment.index* is assigned based on entry order within the archive
* *fragment.count* is set to the total number of entries extracted
* The identifier expression is evaluated once per parent FlowFile

When disabled:

* No change to current *UnpackContent* behaviour
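
The enabled and disabled behaviour above can be sketched in a few lines of Python. This is a hedged illustration rather than the proposed processor code: add_fragment_attributes and identifier_fn are hypothetical names, identifier_fn stands in for the Expression Language evaluation, and 0-based indices are an assumption.

```python
import uuid


def add_fragment_attributes(parent_attrs, entry_names, identifier_fn, enabled=True):
    """Return one attribute dict per unpacked entry.

    identifier_fn is called exactly once per parent, so every child
    derived from that parent shares the same fragment.identifier.
    """
    children = [{"filename": name} for name in entry_names]
    if not enabled:
        return children  # disabled: no fragment attributes are added

    identifier = identifier_fn(parent_attrs)  # evaluated once per parent
    count = len(children)
    for index, child in enumerate(children):
        child["fragment.identifier"] = identifier
        child["fragment.index"] = str(index)
        child["fragment.count"] = str(count)
    return children


parent = {"filename": "archive.zip"}
entries = ["file1.txt", "file2.txt", "file3.txt"]

# Stand-in for ${uuid()}: a fresh value per archive, shared by its children.
by_uuid = add_fragment_attributes(parent, entries, lambda attrs: str(uuid.uuid4()))
# Stand-in for ${filename}: a stable value taken from the parent.
by_name = add_fragment_attributes(parent, entries, lambda attrs: attrs["filename"])
disabled = add_fragment_attributes(parent, entries, lambda attrs: "unused", enabled=False)
```

Evaluating the identifier once matters most for a uuid-style expression: if it were evaluated per child, each unpacked FlowFile would receive a different identifier and the group could never be reassembled.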

h3. 2. Update MergeContent to remove fragment attributes after defragmentation

*MergeContent* already supports *Defragment* mode using:

* *fragment.identifier*
* *fragment.index*
* *fragment.count*

However, after a successful defragmentation, these attributes are no longer 
needed on the merged FlowFile and should be removed from the output.

Proposed *MergeContent* behaviour:

* When operating in *Defragment* mode, after the merged FlowFile is created, 
remove:
** *fragment.identifier*
** *fragment.index*
** *fragment.count*

This ensures the merged FlowFile represents the final repackaged artifact 
rather than retaining temporary grouping metadata from the unpack/repack 
workflow.
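
As a rough sketch of the proposed clean-up, the Python below drops the three fragment attributes from the merged result only when defragmentation succeeded. FlowFile attributes are modelled as a dict, and finish_defragment and its success flag are hypothetical names used for illustration, not part of MergeContent.

```python
FRAGMENT_ATTRIBUTES = ("fragment.identifier", "fragment.index", "fragment.count")


def finish_defragment(merged_attrs, success):
    """Return the attributes for the FlowFile leaving the merge step.

    Fragment attributes are removed only after a successful defragmentation;
    on failure the attributes are returned unchanged.
    """
    if not success:
        return dict(merged_attrs)
    return {k: v for k, v in merged_attrs.items() if k not in FRAGMENT_ATTRIBUTES}


attrs = {
    "filename": "archive.zip",
    "fragment.identifier": "archive.zip",
    "fragment.index": "0",
    "fragment.count": "3",
}
merged = finish_defragment(attrs, success=True)
failed = finish_defragment(attrs, success=False)
```

Returning a new dict instead of mutating the input keeps the failure path trivially safe: a failed or partial merge still carries the metadata needed for a retry.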

h2. Compatibility and Scope

* Fully backward compatible
* Fragment attribute generation in *UnpackContent* is opt-in
* Existing flows remain unchanged unless the new property is enabled
* The *MergeContent* change only affects the output of successful *Defragment* 
operations
* Applies consistently across supported unpack/archive formats such as ZIP, 
TAR, and FlowFile Package

h2. Benefits

* Enables standard unpack → enrich → repack workflows
* Eliminates the need for custom scripting or manual attribute tracking
* Provides consistent fragment-based behaviour across formats
* Aligns with existing NiFi fragment processing patterns
* Ensures final merged FlowFiles do not retain intermediate fragmentation 
metadata
* Keeps configuration simple by leveraging Expression Language instead of 
introducing additional strategy modes

h2. Implementation Notes

* Ensure consistent ordering across archive formats when assigning 
*fragment.index*
* Avoid overwriting existing fragment attributes unless the feature is 
explicitly enabled
* Evaluate the *fragment.identifier* expression once per parent FlowFile
* Avoid full in-memory buffering where possible when determining 
*fragment.count*
* Ensure *MergeContent* removes fragment attributes only after a successful 
defragmentation, not during partial or failed merge scenarios
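
One way to reconcile the ordering and buffering notes above is to stream entries in archive order, keep only their attribute maps in memory, and backfill *fragment.count* once the last entry has been read. The Python sketch below shows that idea for TAR using the standard tarfile module in forward-only mode. It is an assumption about a possible approach, not the planned implementation; unpack_streaming, build_tar, and the bytes.read key are hypothetical.

```python
import io
import tarfile


def build_tar(files):
    """Create a small in-memory TAR so the sketch is self-contained."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def unpack_streaming(stream, identifier):
    """Read entries sequentially, keeping only attribute dicts in memory."""
    children = []
    with tarfile.open(fileobj=stream, mode="r|") as tar:  # "r|" is forward-only
        for member in tar:
            if not member.isfile():
                continue
            entry = tar.extractfile(member)
            size = 0
            # Content is consumed in chunks; a real processor would write
            # each chunk to its output instead of discarding it.
            while chunk := entry.read(8192):
                size += len(chunk)
            children.append({
                "filename": member.name,
                "bytes.read": str(size),  # illustrative only
                "fragment.identifier": identifier,
                "fragment.index": str(len(children)),  # archive order, 0-based
            })
    # The total is only known after the last entry, so it is backfilled.
    for child in children:
        child["fragment.count"] = str(len(children))
    return children


children = unpack_streaming(build_tar({"a.txt": b"aaa", "b.txt": b"bb"}), "grp")
```

Formats with an up-front directory, such as ZIP, may allow the count to be known before extraction, whereas a sequential format like TAR only reveals it at the end, which is why the backfill step is shown here.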

h2. Example

*Input:*  
archive.zip containing 3 files

*UnpackContent output* (with feature enabled and *${filename}* as identifier):

|| filename || fragment.identifier || fragment.index || fragment.count ||
| file1.txt | archive.zip | 0 | 3 |
| file2.txt | archive.zip | 1 | 3 |
| file3.txt | archive.zip | 2 | 3 |

After processing and successful defragmentation in *MergeContent*, the merged 
FlowFile would no longer retain:

* *fragment.identifier*
* *fragment.index*
* *fragment.count*

This leaves the merged FlowFile as the completed repackaged output, without 
temporary fragmentation metadata.


  was:
Currently, the {{UnpackContent}} processor extracts FlowFiles from archive 
formats (e.g. ZIP, TAR) but does not provide a built-in mechanism to 
assign the fragment attributes ({{fragment.identifier}}, 
{{fragment.index}}, {{fragment.count}}) required for downstream 
reassembly using {{MergeContent}} for formats like FlowFile Package.

This makes it difficult to support a common dataflow pattern where content is:
 # Packed to optimise transport
 # Unpacked for enrichment or processing of individual entries
 # Repacked back into the original (or equivalent) archive structure

Without fragment attributes, users must implement custom logic to track 
grouping and ordering, which introduces complexity and inconsistency.
h2. *Use Case*

A common dataflow pattern:
 # Data is packaged (e.g. ZIP/TAR/FFv3) to optimise transport
 # {{UnpackContent}} extracts individual FlowFiles
 # Files are enriched or transformed independently
 # Files are regrouped and repackaged

To support correct regrouping, all unpacked FlowFiles must share:
 * {{fragment.identifier}} → groups entries from the same archive
 * {{fragment.index}} → preserves ordering
 * {{fragment.count}} → total number of entries

Currently, this requires custom logic or is inconsistent depending on format.
----
h2. *Proposed Enhancement*

Add the following optional properties to {{UnpackContent}}:
h3. *1. Add Fragment Attributes*
 * *Property Name:* {{Add Fragment Attributes}}
 * *Description:* When enabled, assigns {{fragment.identifier}}, 
{{fragment.index}}, and {{fragment.count}} to all unpacked FlowFiles.
 * *Allowable Values:* {{true}} / {{false}}
 * *Default:* {{false}} (no change to existing behaviour)

----
h3. *2. Fragment Identifier Value*
 * *Property Name:* {{Fragment Identifier Value}}
 * *Description:*
Specifies the value used for {{fragment.identifier}}.

Supports Expression Language evaluated against the incoming (packed) FlowFile.
The expression is evaluated *once per source FlowFile*, and the resulting 
value is applied to all unpacked FlowFiles derived from that source.
 * *Default Value:* {{${uuid()}}}
 * *Examples:*
 ** {{${uuid()}}} → unique grouping per archive (default)
 ** {{${filename}}} → stable grouping based on original filename
 ** {{${archive.filename}}} → explicit archive attribute (if present)

----
h2. *Behaviour Details*

When *enabled*:
 * All FlowFiles produced from a single archive share the same 
{{fragment.identifier}}
 * {{fragment.index}} is assigned based on entry order within the archive
 * {{fragment.count}} is set to the total number of entries extracted
 * The identifier expression is evaluated once per parent FlowFile

When *disabled*:
 * No change to current {{UnpackContent}} behaviour

----
h2. *Compatibility & Scope*
 * Fully backward compatible (feature is opt-in)
 * No changes required to {{MergeContent}}
 ** {{MergeContent}} already supports Defragment mode using:
 *** {{fragment.identifier}}
 *** {{fragment.index}}
 *** {{fragment.count}}
 * Applies consistently across all supported archive formats (ZIP, TAR, etc.)

----
h2. *Benefits*
 * Enables standard unpack → enrich → repack workflows
 * Eliminates need for custom scripting or attribute tracking
 * Provides consistent behaviour across formats
 * Aligns with existing NiFi fragment-based processing patterns
 * Keeps configuration simple by leveraging Expression Language instead of 
strategy modes

----
h2. *Implementation Notes*
 * Ensure consistent ordering across archive formats when assigning 
{{fragment.index}}
 * Avoid overwriting existing fragment attributes unless explicitly enabled
 * Expression for {{fragment.identifier}} must be evaluated once per parent 
FlowFile
 * Avoid full in-memory buffering where possible when determining 
{{fragment.count}}

----
h2. *Example*

*Input:*
{{archive.zip}} containing 3 files

*Output (with feature enabled and {{${filename}}} as identifier):*
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|


> Enhance UnpackContent to Optionally Add Fragment Attributes for Repackaging 
> Use Case
> ------------------------------------------------------------------------------------
>
>                 Key: NIFI-15758
>                 URL: https://issues.apache.org/jira/browse/NIFI-15758
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Richard Scott
>            Assignee: Richard Scott
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
