[ 
https://issues.apache.org/jira/browse/NIFI-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Scott updated NIFI-15758:
---------------------------------
    Description: 
h2. Summary

Currently, the *UnpackContent* processor extracts FlowFiles from archive 
formats such as ZIP, TAR, and FlowFile Package, but it does not provide a 
built-in mechanism to assign the fragment attributes required for downstream 
reassembly using *MergeContent* in *Defragment* mode.

The relevant attributes are:
 * *fragment.identifier*
 * *fragment.index*
 * *fragment.count*

This makes it difficult to support a common dataflow pattern in which content is 
packed to optimise transport, unpacked for enrichment or processing, and then 
repacked into the original or an equivalent archive structure.

Without fragment attributes, users must implement custom logic to track 
grouping and ordering, which introduces unnecessary complexity and 
inconsistency.

In addition, once *MergeContent* completes defragmentation, the fragment 
attributes used for reassembly are no longer needed on the merged FlowFile. 
These attributes should be removed from the merged result so the output 
reflects the completed package rather than the intermediate fragmentation state.

h2. Use Case

A common dataflow pattern is:
 # Data is packaged (for example ZIP, TAR, or FlowFile Package) to optimise 
transport
 # *UnpackContent* extracts individual FlowFiles
 # Files are enriched or transformed independently
 # Files are regrouped and repackaged using *MergeContent*

To support correct regrouping, all unpacked FlowFiles must share:
 * *fragment.identifier* — groups entries from the same archive
 * *fragment.index* — preserves ordering
 * *fragment.count* — total number of entries

Currently, achieving this either requires custom logic or produces inconsistent 
results depending on the archive format being unpacked.
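
To illustrate why all three attributes are needed, the following minimal Python sketch models how a downstream step could bin fragments and decide when a group is complete. It is not NiFi code: FlowFiles are modelled as plain dictionaries, the helper name complete_groups is hypothetical, and zero-based indexes are assumed to match the example table in this ticket.

```python
from collections import defaultdict


def complete_groups(flowfiles):
    """Return {identifier: [attrs ordered by index]} for complete groups only.

    Hypothetical helper: each FlowFile is modelled as a dict of attributes.
    """
    bins = defaultdict(list)
    for attrs in flowfiles:
        bins[attrs["fragment.identifier"]].append(attrs)

    ready = {}
    for identifier, members in bins.items():
        expected = int(members[0]["fragment.count"])
        indexes = sorted(int(m["fragment.index"]) for m in members)
        # Assumption: indexes are zero-based, so a complete group
        # contains every index 0..count-1 exactly once.
        if indexes == list(range(expected)):
            ready[identifier] = sorted(
                members, key=lambda m: int(m["fragment.index"])
            )
    return ready


flowfiles = [
    {"filename": "file3.txt", "fragment.identifier": "archive.zip",
     "fragment.index": "2", "fragment.count": "3"},
    {"filename": "file1.txt", "fragment.identifier": "archive.zip",
     "fragment.index": "0", "fragment.count": "3"},
    {"filename": "x.txt", "fragment.identifier": "other.zip",
     "fragment.index": "0", "fragment.count": "2"},
    {"filename": "file2.txt", "fragment.identifier": "archive.zip",
     "fragment.index": "1", "fragment.count": "3"},
]

groups = complete_groups(flowfiles)
for identifier, members in groups.items():
    print(identifier, [m["filename"] for m in members])
```

In this model the other.zip group is held back because only one of its two expected fragments has arrived. Without fragment.count there would be no way to tell a complete group from a partial one.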

h2. Proposed Enhancement

h3. 1. Add support in UnpackContent for optional fragment attribute generation

Add the following optional properties to {*}UnpackContent{*}:

*Add Fragment Attributes* 
*Property Name:* Add Fragment Attributes 
*Description:* When enabled, assigns {*}fragment.identifier{*}, 
{*}fragment.index{*}, and *fragment.count* to all unpacked FlowFiles. 
*Allowable Values:* true / false 
*Default:* false

*Fragment Identifier Value* 
*Property Name:* Fragment Identifier Value 
*Description:* Specifies the value used for {*}fragment.identifier{*}.

Supports Expression Language evaluated against the incoming packed FlowFile. 
The expression is evaluated once per source FlowFile, and the resulting value 
is applied to all unpacked FlowFiles derived from that source.

*Default Value:* ${UUID()}

*Examples:*
 * ${UUID()}: unique grouping per archive
 * ${filename}: stable grouping based on the original filename
 * ${archive.filename}: explicit archive attribute, if present

h3. UnpackContent Behaviour

When enabled:
 * All FlowFiles produced from a single archive share the same 
*fragment.identifier*
 * *fragment.index* is assigned based on entry order within the archive
 * *fragment.count* is set to the total number of entries extracted
 * The identifier expression is evaluated once per parent FlowFile

When disabled:
 * No change to current *UnpackContent* behaviour
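
The behaviour listed above could be modelled roughly as follows. This is a minimal Python sketch limited to ZIP and built on the standard zipfile module, not processor code: unpack_zip and its parameters are hypothetical names, the identifier_value callable stands in for the Expression Language property, and every entry is buffered in memory for brevity.

```python
import io
import uuid
import zipfile


def unpack_zip(parent_attrs, archive_bytes,
               add_fragment_attributes=False, identifier_value=None):
    """Return a list of (attributes, content) pairs, one per archive entry."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # infolist() preserves the order of entries within the archive.
        entries = [(info.filename, zf.read(info))
                   for info in zf.infolist() if not info.is_dir()]

    identifier = None
    if add_fragment_attributes:
        # Evaluated once per parent, then shared by every child.
        if identifier_value is not None:
            identifier = identifier_value(parent_attrs)
        else:
            identifier = str(uuid.uuid4())

    children = []
    for index, (name, content) in enumerate(entries):
        attrs = {"filename": name}
        if add_fragment_attributes:
            attrs["fragment.identifier"] = identifier
            attrs["fragment.index"] = str(index)  # zero-based, as in this ticket
            attrs["fragment.count"] = str(len(entries))
        children.append((attrs, content))
    return children


# Build a small in-memory archive to exercise the sketch.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ("file1.txt", "file2.txt", "file3.txt"):
        zf.writestr(name, name.upper())

parent = {"filename": "archive.zip"}
children = unpack_zip(parent, buf.getvalue(), True, lambda a: a["filename"])
unchanged = unpack_zip(parent, buf.getvalue())

for attrs, _ in children:
    print(attrs)
```

Resolving the identifier before the loop is what guarantees that a random default still yields a single shared value per archive. A real implementation would presumably stream entries rather than buffer them, which makes determining fragment.count the harder part.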

h3. 2. Update MergeContent to remove fragment attributes after defragmentation

*MergeContent* already supports *Defragment* mode using:
 * *fragment.identifier*
 * *fragment.index*
 * *fragment.count*

However, after a successful defragmentation, these attributes are no longer 
needed on the merged FlowFile and should be removed from the output.

Proposed *MergeContent* behaviour:
 * When operating in *Defragment* mode, after the merged FlowFile is created, 
remove:
 ** *fragment.identifier*
 ** *fragment.index*
 ** *fragment.count*

This ensures the merged FlowFile represents the final repackaged artifact 
rather than retaining temporary grouping metadata from the unpack/repack 
workflow.
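
The proposed clean-up can be sketched in Python as below. This is an illustrative model only: FlowFiles are (attributes, content) tuples, defragment is a hypothetical function, and the merged attributes are assumed to start from those shared by every fragment. Under that assumption fragment.identifier and fragment.count would survive the merge, which is why the sketch removes them explicitly.

```python
FRAGMENT_ATTRIBUTES = ("fragment.identifier", "fragment.index", "fragment.count")


def defragment(fragments, merged_filename):
    """Merge (attrs, content) fragments and drop fragment.* from the result.

    Raises ValueError for an incomplete set, in which case nothing is
    merged and the input fragments keep their attributes.
    """
    if not fragments:
        raise ValueError("no fragments to merge")

    expected = int(fragments[0][0]["fragment.count"])
    ordered = sorted(fragments, key=lambda f: int(f[0]["fragment.index"]))
    if [int(a["fragment.index"]) for a, _ in ordered] != list(range(expected)):
        raise ValueError("incomplete fragment set")

    content = b"".join(c for _, c in ordered)

    # Assumption: keep only attributes with the same value on every fragment.
    first = ordered[0][0]
    common = {k: v for k, v in first.items()
              if all(a.get(k) == v for a, _ in ordered)}

    # Only after a successful merge are the fragment attributes removed.
    merged_attrs = {k: v for k, v in common.items()
                    if k not in FRAGMENT_ATTRIBUTES}
    merged_attrs["filename"] = merged_filename
    return merged_attrs, content


def fragment(name, index, content):
    """Build one example fragment with a shared, unrelated 'source' attribute."""
    return ({"filename": name, "source": "sftp",
             "fragment.identifier": "archive.zip",
             "fragment.index": str(index), "fragment.count": "3"}, content)


fragments = [fragment("file2.txt", 1, b"b"),
             fragment("file1.txt", 0, b"a"),
             fragment("file3.txt", 2, b"c")]

merged_attrs, merged_content = defragment(fragments, "archive.zip")
print(merged_attrs, merged_content)
```

The unrelated source attribute is retained while all three fragment attributes are gone. The incomplete case raises before any attribute is touched, so removal happens only after a successful defragmentation.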

h2. Example

*Input:* 
archive.zip containing 3 files

*UnpackContent output* (with *Add Fragment Attributes* enabled and *${filename}* as the *Fragment Identifier Value*):
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|

After processing and successful defragmentation in {*}MergeContent{*}, the 
merged FlowFile would no longer retain:
 * *fragment.identifier*
 * *fragment.index*
 * *fragment.count*

This leaves the merged FlowFile as the completed repackaged output, without 
temporary fragmentation metadata.

  was:
h2. Summary

Currently, the *UnpackContent* processor extracts FlowFiles from archive 
formats such as ZIP, TAR, and FlowFile Package, but it does not provide a 
built-in mechanism to assign the fragment attributes required for downstream 
reassembly using *MergeContent* in *Defragment* mode.

The relevant attributes are:

* *fragment.identifier*
* *fragment.index*
* *fragment.count*

This makes it difficult to support a common dataflow pattern where content is 
packed to optimise transport, unpacked for enrichment or processing, and then 
repacked back into the original or equivalent archive structure.

Without fragment attributes, users must implement custom logic to track 
grouping and ordering, which introduces unnecessary complexity and 
inconsistency.

In addition, once *MergeContent* completes defragmentation, the fragment 
attributes used for reassembly are no longer needed on the merged FlowFile. 
These attributes should be removed from the merged result so the output 
reflects the completed package rather than the intermediate fragmentation state.

h2. Use Case

A common dataflow pattern is:

# Data is packaged (for example ZIP, TAR, or FlowFile Package) to optimise 
transport
# *UnpackContent* extracts individual FlowFiles
# Files are enriched or transformed independently
# Files are regrouped and repackaged using *MergeContent*

To support correct regrouping, all unpacked FlowFiles must share:

* *fragment.identifier* — groups entries from the same archive
* *fragment.index* — preserves ordering
* *fragment.count* — total number of entries

Currently, this either requires custom logic or behaves inconsistently 
depending on the format being unpacked.

h2. Proposed Enhancement

h3. 1. Add support in UnpackContent for optional fragment attribute generation

Add the following optional properties to *UnpackContent*:

*Add Fragment Attributes*  
*Property Name:* Add Fragment Attributes  
*Description:* When enabled, assigns *fragment.identifier*, *fragment.index*, 
and *fragment.count* to all unpacked FlowFiles.  
*Allowable Values:* true / false  
*Default:* false

*Fragment Identifier Value*  
*Property Name:* Fragment Identifier Value  
*Description:* Specifies the value used for *fragment.identifier*.

Supports Expression Language evaluated against the incoming packed FlowFile. 
The expression is evaluated once per source FlowFile, and the resulting value 
is applied to all unpacked FlowFiles derived from that source.

*Default Value:* ${uuid()}

*Examples:*
* ${uuid()} — unique grouping per archive
* ${filename} — stable grouping based on original filename
* ${archive.filename} — explicit archive attribute, if present

h3. UnpackContent Behaviour

When enabled:

* All FlowFiles produced from a single archive share the same 
*fragment.identifier*
* *fragment.index* is assigned based on entry order within the archive
* *fragment.count* is set to the total number of entries extracted
* The identifier expression is evaluated once per parent FlowFile

When disabled:

* No change to current *UnpackContent* behaviour

h3. 2. Update MergeContent to remove fragment attributes after defragmentation

*MergeContent* already supports *Defragment* mode using:

* *fragment.identifier*
* *fragment.index*
* *fragment.count*

However, after a successful defragmentation, these attributes are no longer 
needed on the merged FlowFile and should be removed from the output.

Proposed *MergeContent* behaviour:

* When operating in *Defragment* mode, after the merged FlowFile is created, 
remove:
** *fragment.identifier*
** *fragment.index*
** *fragment.count*

This ensures the merged FlowFile represents the final repackaged artifact 
rather than retaining temporary grouping metadata from the unpack/repack 
workflow.

h2. Compatibility and Scope

* Fully backward compatible
* Fragment attribute generation in *UnpackContent* is opt-in
* Existing flows remain unchanged unless the new property is enabled
* The *MergeContent* change only affects the output of successful *Defragment* 
operations
* Applies consistently across supported unpack/archive formats such as ZIP, 
TAR, and FlowFile Package

h2. Benefits

* Enables standard unpack → enrich → repack workflows
* Eliminates the need for custom scripting or manual attribute tracking
* Provides consistent fragment-based behaviour across formats
* Aligns with existing NiFi fragment processing patterns
* Ensures final merged FlowFiles do not retain intermediate fragmentation 
metadata
* Keeps configuration simple by leveraging Expression Language instead of 
introducing additional strategy modes

h2. Implementation Notes

* Ensure consistent ordering across archive formats when assigning 
*fragment.index*
* Avoid overwriting existing fragment attributes unless the feature is 
explicitly enabled
* Evaluate the *fragment.identifier* expression once per parent FlowFile
* Avoid full in-memory buffering where possible when determining 
*fragment.count*
* Ensure *MergeContent* removes fragment attributes only after a successful 
defragmentation, not during partial or failed merge scenarios

h2. Example

*Input:*  
archive.zip containing 3 files

*UnpackContent output* (with feature enabled and *${filename}* as identifier):

|| filename || fragment.identifier || fragment.index || fragment.count ||
| file1.txt | archive.zip | 0 | 3 |
| file2.txt | archive.zip | 1 | 3 |
| file3.txt | archive.zip | 2 | 3 |

After processing and successful defragmentation in *MergeContent*, the merged 
FlowFile would no longer retain:

* *fragment.identifier*
* *fragment.index*
* *fragment.count*

This leaves the merged FlowFile as the completed repackaged output, without 
temporary fragmentation metadata.



> Enhance UnpackContent to Optionally Add Fragment Attributes for Repackaging 
> Use Case
> ------------------------------------------------------------------------------------
>
>                 Key: NIFI-15758
>                 URL: https://issues.apache.org/jira/browse/NIFI-15758
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Richard Scott
>            Assignee: Richard Scott
>            Priority: Minor


