[ 
https://issues.apache.org/jira/browse/NIFI-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Scott updated NIFI-15758:
---------------------------------
    Summary: Add fragment attributes in UnpackContent to support reassembly 
.PKG and remove them after MergeContent defragmentation  (was: Enhance 
UnpackContent to Optionally Add Fragment Attributes for Repackaging Use Case)

> Add fragment attributes in UnpackContent to support reassembly .PKG and 
> remove them after MergeContent defragmentation
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-15758
>                 URL: https://issues.apache.org/jira/browse/NIFI-15758
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Richard Scott
>            Assignee: Richard Scott
>            Priority: Minor
>
> h2. Summary
> Currently, the *UnpackContent* processor unpacks archive formats such as ZIP, 
> TAR, and FlowFile Package into individual FlowFiles, but it does not provide a 
> built-in mechanism to assign the fragment attributes required for downstream 
> reassembly using *MergeContent* in *Defragment* mode.
> The relevant attributes are:
>  * *fragment.identifier*
>  * *fragment.index*
>  * *fragment.count*
> This makes it difficult to support a common dataflow pattern where content is 
> packed to optimise transport, unpacked for enrichment or processing, and then 
> repacked back into the original or equivalent archive structure.
> Without fragment attributes, users must implement custom logic to track 
> grouping and ordering, which introduces unnecessary complexity and 
> inconsistency.
> In addition, once *MergeContent* completes defragmentation, the fragment 
> attributes used for reassembly are no longer needed on the merged FlowFile. 
> These attributes should be removed from the merged result so the output 
> reflects the completed package rather than the intermediate fragmentation 
> state.
> h2. Use Case
> A common dataflow pattern is:
>  # Data is packaged (for example ZIP, TAR, or FlowFile Package) to optimise 
> transport
>  # *UnpackContent* extracts individual FlowFiles
>  # Files are enriched or transformed independently
>  # Files are regrouped and repackaged using *MergeContent*
> To support correct regrouping, each unpacked FlowFile must carry:
>  * *fragment.identifier* — groups entries from the same archive
>  * *fragment.index* — preserves ordering
>  * *fragment.count* — total number of entries
> Currently, producing these attributes either requires custom logic or behaves 
> inconsistently depending on the format being unpacked.
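
As an inline illustration of the four-step pattern above, here is a minimal Python sketch (not NiFi code) that round-trips a ZIP archive in memory. The helper names `unpack` and `repack`, the tuple layout, and the choice of the archive name as the identifier are assumptions made for this example only.

```python
import io
import zipfile


def unpack(archive_name, archive_bytes):
    """Return (attributes, content) pairs, one per file entry in the ZIP."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        names = [info.filename for info in zf.infolist() if not info.is_dir()]
        return [
            (
                {
                    "filename": name,
                    "fragment.identifier": archive_name,  # assumed identifier
                    "fragment.index": str(index),
                    "fragment.count": str(len(names)),
                },
                zf.read(name),
            )
            for index, name in enumerate(names)
        ]


def repack(fragments):
    """Rebuild a ZIP, restoring the original entry order via fragment.index."""
    ordered = sorted(fragments, key=lambda f: int(f[0]["fragment.index"]))
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for attrs, content in ordered:
            zf.writestr(attrs["filename"], content)
    return buffer.getvalue()


# Step 1: package three files for transport.
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as zf:
    for name in ("file1.txt", "file2.txt", "file3.txt"):
        zf.writestr(name, name.encode())

# Step 2: unpack. Step 3: enrich each entry independently.
fragments = unpack("archive.zip", source.getvalue())
enriched = [(attrs, content.upper()) for attrs, content in fragments]
enriched.reverse()  # simulate fragments arriving out of order

# Step 4: regroup and repackage.
rebuilt = repack(enriched)
```

Because each entry carries its own index, the repack step can restore the original order even when the enriched entries arrive out of order.
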
> h2. Proposed Enhancement
> h3. 1. Add support in UnpackContent for optional fragment attribute generation
> Add the following optional properties to {*}UnpackContent{*}:
> *Add Fragment Attributes* 
> *Property Name:* Add Fragment Attributes 
> *Description:* When enabled, assigns {*}fragment.identifier{*}, 
> {*}fragment.index{*}, and *fragment.count* to all unpacked FlowFiles. 
> *Allowable Values:* true / false 
> *Default:* false
> *Fragment Identifier Value* 
> *Property Name:* Fragment Identifier Value 
> *Description:* Specifies the value used for {*}fragment.identifier{*}.
> Supports Expression Language evaluated against the incoming packed FlowFile. 
> The expression is evaluated once per source FlowFile, and the resulting value 
> is applied to all unpacked FlowFiles derived from that source.
> *Default Value:* ${UUID()}
> *Examples:*
>  * ${UUID()}: unique grouping per archive
>  * ${filename}: stable grouping based on original filename
>  * ${archive.filename}: explicit archive attribute, if present
> h3. UnpackContent Behaviour
> When enabled:
>  * All FlowFiles produced from a single archive share the same 
> *fragment.identifier*
>  * *fragment.index* is assigned based on entry order within the archive
>  * *fragment.count* is set to the total number of entries extracted
>  * The identifier expression is evaluated once per parent FlowFile
> When disabled:
>  * No change to current *UnpackContent* behaviour
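
The enabled behaviour listed above could look roughly like the following Python sketch. It is not the processor implementation: `add_fragment_attributes` is a hypothetical helper, and `identifier_fn` is a stand-in for the Expression Language evaluation of Fragment Identifier Value. Indexes start at 0 to match the example table in this issue.

```python
import uuid


def add_fragment_attributes(parent_attrs, entry_names, identifier_fn=None):
    """Build one attribute map per unpacked entry.

    identifier_fn is a hypothetical stand-in for the Expression Language
    expression; it receives the parent (packed) FlowFile's attributes.
    """
    # Evaluate the identifier exactly once per parent FlowFile so that
    # every child shares the same value, even for a random UUID.
    if identifier_fn is None:
        fragment_id = str(uuid.uuid4())
    else:
        fragment_id = identifier_fn(parent_attrs)

    count = str(len(entry_names))
    return [
        {
            "filename": name,
            "fragment.identifier": fragment_id,
            "fragment.index": str(index),  # entry order within the archive
            "fragment.count": count,
        }
        for index, name in enumerate(entry_names)
    ]


# Identifier taken from the parent's filename attribute.
by_filename = add_fragment_attributes(
    {"filename": "archive.zip"},
    ["file1.txt", "file2.txt", "file3.txt"],
    identifier_fn=lambda attrs: attrs["filename"],
)

# Default: one random UUID shared by all children of the same parent.
by_uuid = add_fragment_attributes({"filename": "archive.zip"}, ["a", "b"])
```

Evaluating the expression per child instead would give each FlowFile a different random identifier, and the children could then never be regrouped.
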
> h3. 2. Update MergeContent to remove fragment attributes after defragmentation
> *MergeContent* already supports *Defragment* mode using:
>  * *fragment.identifier*
>  * *fragment.index*
>  * *fragment.count*
> However, after a successful defragmentation, these attributes are no longer 
> needed on the merged FlowFile and should be removed from the output.
> Proposed *MergeContent* behaviour:
>  * When operating in *Defragment* mode, after the merged FlowFile is created, 
> remove:
>  ** *fragment.identifier*
>  ** *fragment.index*
>  ** *fragment.count*
> This ensures the merged FlowFile represents the final repackaged artifact 
> rather than retaining temporary grouping metadata from the unpack/repack 
> workflow.
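
A rough Python sketch of the proposed cleanup follows, under stated assumptions: `defragment` is a hypothetical function, content is merged by simple concatenation, and the merged FlowFile keeps only the attributes that all fragments share. None of this is taken from the MergeContent source.

```python
FRAGMENT_ATTRIBUTES = ("fragment.identifier", "fragment.index", "fragment.count")


def defragment(fragments):
    """Merge (attributes, content) pairs and drop the fragment attributes."""
    identifiers = {attrs["fragment.identifier"] for attrs, _ in fragments}
    if len(identifiers) != 1:
        raise ValueError("expected fragments from exactly one group")

    expected = int(fragments[0][0]["fragment.count"])
    if len(fragments) != expected:
        raise ValueError(f"expected {expected} fragments, got {len(fragments)}")

    # Restore the original order before merging.
    ordered = sorted(fragments, key=lambda f: int(f[0]["fragment.index"]))
    merged_content = b"".join(content for _, content in ordered)

    # Assumption for this sketch: keep only attributes common to all fragments.
    first_attrs = ordered[0][0]
    merged_attrs = {
        key: value
        for key, value in first_attrs.items()
        if all(attrs.get(key) == value for attrs, _ in ordered)
    }

    # Proposed behaviour: remove the temporary grouping metadata.
    for name in FRAGMENT_ATTRIBUTES:
        merged_attrs.pop(name, None)
    return merged_attrs, merged_content


fragments = [
    ({"fragment.identifier": "archive.zip", "fragment.index": "1",
      "fragment.count": "2", "source": "sftp"}, b"B"),
    ({"fragment.identifier": "archive.zip", "fragment.index": "0",
      "fragment.count": "2", "source": "sftp"}, b"A"),
]
merged_attrs, merged_content = defragment(fragments)
```

In this sketch *fragment.identifier* and *fragment.count* are identical on every fragment, so they would survive a common-attributes merge unless they are removed explicitly.
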
> h2. Example
> *Input:* 
> archive.zip containing 3 files
> *UnpackContent output* (with feature enabled and *${filename}* as identifier):
> ||filename||fragment.identifier||fragment.index||fragment.count||
> |file1.txt|archive.zip|0|3|
> |file2.txt|archive.zip|1|3|
> |file3.txt|archive.zip|2|3|
> After processing and successful defragmentation in {*}MergeContent{*}, the 
> merged FlowFile would no longer retain:
>  * *fragment.identifier*
>  * *fragment.index*
>  * *fragment.count*
> This leaves the merged FlowFile as the completed repackaged output, without 
> temporary fragmentation metadata.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
