[ https://issues.apache.org/jira/browse/NIFI-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Scott updated NIFI-15758:
---------------------------------
    Description: 
h2. Summary

Currently, the *UnpackContent* processor extracts FlowFiles from archive 
formats such as ZIP, TAR, and FlowFile Package, but it does not provide a 
built-in mechanism to assign the fragment attributes required for downstream 
reassembly using *MergeContent* in *Defragment* mode.

The relevant attributes are:
 * *fragment.identifier*
 * *fragment.index*
 * *fragment.count*
 * *segment.original.filename*

This makes it difficult to support a common dataflow pattern where content is 
packed to optimise transport, unpacked for enrichment or processing, and then 
repacked into the original or an equivalent archive structure.

Without fragment attributes, users must implement custom logic to track 
grouping and ordering, which introduces unnecessary complexity and 
inconsistency.

In addition, once *MergeContent* completes defragmentation, the fragment 
attributes used for reassembly are no longer needed on the merged FlowFile. 
These attributes should be removed from the merged result so the output 
reflects the completed package rather than the intermediate fragmentation state.
h2. Use Case

A common dataflow pattern is:
 # Data is packaged (for example ZIP, TAR, or FlowFile Package) to optimise 
transport
 # *UnpackContent* extracts individual FlowFiles
 # Files are enriched or transformed independently
 # Files are regrouped and repackaged using *MergeContent*

To support correct regrouping, each unpacked FlowFile must carry:
 * *fragment.identifier* — groups entries from the same archive
 * *fragment.index* — preserves ordering
 * *fragment.count* — total number of entries

Currently, this either requires custom logic or behaves inconsistently 
depending on the format being unpacked.
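
The regrouping that these three attributes make possible can be sketched as 
follows. This is an illustrative Python model only, not the MergeContent 
implementation: FlowFiles are represented as plain attribute dictionaries with 
string values, and the function name regroup is hypothetical.

```python
from collections import defaultdict


def regroup(flowfiles):
    """Group attribute dicts by fragment.identifier.

    Returns only the groups whose size matches fragment.count, with the
    members of each group ordered by fragment.index.
    """
    groups = defaultdict(list)
    for attrs in flowfiles:
        groups[attrs["fragment.identifier"]].append(attrs)

    complete = {}
    for identifier, members in groups.items():
        expected = int(members[0]["fragment.count"])
        if len(members) != expected:
            continue  # still waiting for the remaining fragments
        complete[identifier] = sorted(
            members, key=lambda a: int(a["fragment.index"])
        )
    return complete
```

A group that is missing fragments is left out of the result, which mirrors the 
idea that a bundle is only merged once every fragment has arrived.
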
h2. Proposed Enhancement
h3. 1. Add support in UnpackContent for optional fragment attribute generation

Add the following optional properties to {*}UnpackContent{*}:

*Add Fragment Attributes* 
*Property Name:* Add Fragment Attributes 
*Description:* When enabled, assigns {*}fragment.identifier{*}, 
{*}fragment.index{*}, and *fragment.count* to all unpacked FlowFiles. 
*Allowable Values:* true / false 
*Default:* false

*Fragment Identifier Value* 
*Property Name:* Fragment Identifier Value 
*Description:* Specifies the value used for {*}fragment.identifier{*}.

Supports Expression Language evaluated against the incoming packed FlowFile. 
The expression is evaluated once per source FlowFile, and the resulting value 
is applied to all unpacked FlowFiles derived from that source.

*Default Value:* ${uuid()}

*Examples:*
 * ${uuid()} — unique grouping per archive
 * ${filename} — stable grouping based on original filename
 * ${archive.filename} — explicit archive attribute, if present

h3. UnpackContent Behaviour

When enabled:
 * All FlowFiles produced from a single archive share the same 
*fragment.identifier*
 * *fragment.index* is assigned based on entry order within the archive
 * *fragment.count* is set to the total number of entries extracted
 * The identifier expression is evaluated once per parent FlowFile

When disabled:
 * No change to current *UnpackContent* behaviour
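
The enabled behaviour described above could look roughly like this. It is a 
minimal Python sketch under stated assumptions, not NiFi code: the parent 
FlowFile and the unpacked entries are modelled as plain dictionaries and 
names, add_fragment_attributes and resolve_identifier are hypothetical, and 
the index starts at 0 to match the example table in this description.

```python
import uuid


def add_fragment_attributes(parent_attrs, entry_names, resolve_identifier=None):
    """Return one attribute dict per unpacked entry.

    resolve_identifier stands in for the Fragment Identifier Value
    expression; it is called exactly once per parent FlowFile.
    """
    if resolve_identifier is None:
        # Stand-in for the proposed default of ${uuid()}.
        resolve_identifier = lambda attrs: str(uuid.uuid4())

    identifier = resolve_identifier(parent_attrs)  # evaluated once, then reused
    count = str(len(entry_names))

    unpacked = []
    for index, name in enumerate(entry_names):
        unpacked.append({
            "filename": name,
            "fragment.identifier": identifier,
            "fragment.index": str(index),  # entry order within the archive
            "fragment.count": count,
        })
    return unpacked
```

Passing lambda attrs: attrs["filename"] as resolve_identifier corresponds to 
configuring the property as ${filename}.
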

h3. 2. Update MergeContent to remove fragment attributes after defragmentation

*MergeContent* already supports *Defragment* mode using:
 * *fragment.identifier*
 * *fragment.index*
 * *fragment.count*

However, after a successful defragmentation, these attributes are no longer 
needed on the merged FlowFile and should be removed from the output.

Proposed *MergeContent* behaviour:
 * When operating in *Defragment* mode, after the merged FlowFile is created, 
remove:
 ** *fragment.identifier*
 ** *fragment.index*
 ** *fragment.count*
 ** *segment.original.filename*

This ensures the merged FlowFile represents the final repackaged artifact 
rather than retaining temporary grouping metadata from the unpack/repack 
workflow.
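
As a rough illustration of the proposed cleanup, the following Python sketch 
drops the four attributes listed above from a merged FlowFile's attribute map. 
The dictionary model and the name strip_fragment_attributes are assumptions 
made for this example; they are not part of the NiFi API.

```python
FRAGMENT_ATTRIBUTES = (
    "fragment.identifier",
    "fragment.index",
    "fragment.count",
    "segment.original.filename",
)


def strip_fragment_attributes(merged_attrs):
    """Return a copy of the merged attributes without fragment metadata."""
    return {
        key: value
        for key, value in merged_attrs.items()
        if key not in FRAGMENT_ATTRIBUTES
    }
```

All other attributes on the merged FlowFile pass through unchanged.
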
h2. Example

*Input:* 
archive.zip containing 3 files

*UnpackContent output* (with feature enabled and *${filename}* as identifier):
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|

After processing and successful defragmentation in {*}MergeContent{*}, the 
merged FlowFile would no longer retain:
 * *fragment.identifier*
 * *fragment.index*
 * *fragment.count*
 * *segment.original.filename*

This leaves the merged FlowFile as the completed repackaged output, without 
temporary fragmentation metadata.

  was:
h2. Summary

Currently, the *UnpackContent* processor extracts FlowFiles from archive 
formats such as ZIP, TAR, and FlowFile Package, but it does not provide a 
built-in mechanism to assign the fragment attributes required for downstream 
reassembly using *MergeContent* in *Defragment* mode.

The relevant attributes are:
 * *fragment.identifier*
 * *fragment.index*
 * *fragment.count*

This makes it difficult to support a common dataflow pattern where content is 
packed to optimise transport, unpacked for enrichment or processing, and then 
repacked back into the original or equivalent archive structure.

Without fragment attributes, users must implement custom logic to track 
grouping and ordering, which introduces unnecessary complexity and 
inconsistency.

In addition, once *MergeContent* completes defragmentation, the fragment 
attributes used for reassembly are no longer needed on the merged FlowFile. 
These attributes should be removed from the merged result so the output 
reflects the completed package rather than the intermediate fragmentation state.
h2. Use Case

A common dataflow pattern is:
 # Data is packaged (for example ZIP, TAR, or FlowFile Package) to optimise 
transport
 # *UnpackContent* extracts individual FlowFiles
 # Files are enriched or transformed independently
 # Files are regrouped and repackaged using *MergeContent*

To support correct regrouping, all unpacked FlowFiles must share:
 * *fragment.identifier* — groups entries from the same archive
 * *fragment.index* — preserves ordering
 * *fragment.count* — total number of entries

Currently, this either requires custom logic or behaves inconsistently 
depending on the format being unpacked.
h2. Proposed Enhancement
h3. 1. Add support in UnpackContent for optional fragment attribute generation

Add the following optional properties to {*}UnpackContent{*}:

*Add Fragment Attributes* 
*Property Name:* Add Fragment Attributes 
*Description:* When enabled, assigns {*}fragment.identifier{*}, 
{*}fragment.index{*}, and *fragment.count* to all unpacked FlowFiles. 
*Allowable Values:* true / false 
*Default:* false

*Fragment Identifier Value* 
*Property Name:* Fragment Identifier Value 
*Description:* Specifies the value used for {*}fragment.identifier{*}.

Supports Expression Language evaluated against the incoming packed FlowFile. 
The expression is evaluated once per source FlowFile, and the resulting value 
is applied to all unpacked FlowFiles derived from that source.

*Default Value:* ${uuid()}

*Examples:*
 * ${uuid()} — unique grouping per archive
 * ${filename} — stable grouping based on original filename
 * ${archive.filename} — explicit archive attribute, if present

h3. UnpackContent Behaviour

When enabled:
 * All FlowFiles produced from a single archive share the same 
*fragment.identifier*
 * *fragment.index* is assigned based on entry order within the archive
 * *fragment.count* is set to the total number of entries extracted
 * The identifier expression is evaluated once per parent FlowFile

When disabled:
 * No change to current *UnpackContent* behaviour

h3. 2. Update MergeContent to remove fragment attributes after defragmentation

*MergeContent* already supports *Defragment* mode using:
 * *fragment.identifier*
 * *fragment.index*
 * *fragment.count*

However, after a successful defragmentation, these attributes are no longer 
needed on the merged FlowFile and should be removed from the output.

Proposed *MergeContent* behaviour:
 * When operating in *Defragment* mode, after the merged FlowFile is created, 
remove:
 ** *fragment.identifier*
 ** *fragment.index*
 ** *fragment.count*

This ensures the merged FlowFile represents the final repackaged artifact 
rather than retaining temporary grouping metadata from the unpack/repack 
workflow.
h2. Example

*Input:* 
archive.zip containing 3 files

*UnpackContent output* (with feature enabled and *${filename}* as identifier):
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|

After processing and successful defragmentation in {*}MergeContent{*}, the 
merged FlowFile would no longer retain:
 * *fragment.identifier*
 * *fragment.index*
 * *fragment.count*

This leaves the merged FlowFile as the completed repackaged output, without 
temporary fragmentation metadata.


> Add fragment attributes in UnpackContent to support reassembly .PKG and 
> remove them after MergeContent defragmentation
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-15758
>                 URL: https://issues.apache.org/jira/browse/NIFI-15758
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Richard Scott
>            Assignee: Richard Scott
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
