[
https://issues.apache.org/jira/browse/NIFI-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Richard Scott updated NIFI-15758:
---------------------------------
Summary: Add fragment attributes in UnpackContent to support reassembly
.PKG and remove them after MergeContent defragmentation (was: Enhance
UnpackContent to Optionally Add Fragment Attributes for Repackaging Use Case)
> Add fragment attributes in UnpackContent to support reassembly .PKG and
> remove them after MergeContent defragmentation
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-15758
> URL: https://issues.apache.org/jira/browse/NIFI-15758
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Richard Scott
> Assignee: Richard Scott
> Priority: Minor
>
> h2. Summary
> Currently, the *UnpackContent* processor extracts FlowFiles from archive
> formats such as ZIP, TAR, and FlowFile Package, but it does not provide a
> built-in mechanism to assign the fragment attributes required for downstream
> reassembly using *MergeContent* in *Defragment* mode.
> The relevant attributes are:
> * *fragment.identifier*
> * *fragment.index*
> * *fragment.count*
> This makes it difficult to support a common dataflow pattern where content is
> packed to optimise transport, unpacked for enrichment or processing, and then
> repacked back into the original or equivalent archive structure.
> Without fragment attributes, users must implement custom logic to track
> grouping and ordering, which introduces unnecessary complexity and
> inconsistency.
> In addition, once *MergeContent* completes defragmentation, the fragment
> attributes used for reassembly are no longer needed on the merged FlowFile.
> These attributes should be removed from the merged result so the output
> reflects the completed package rather than the intermediate fragmentation
> state.
> h2. Use Case
> A common dataflow pattern is:
> # Data is packaged (for example ZIP, TAR, or FlowFile Package) to optimise
> transport
> # *UnpackContent* extracts individual FlowFiles
> # Files are enriched or transformed independently
> # Files are regrouped and repackaged using *MergeContent*
> To support correct regrouping, all unpacked FlowFiles must share:
> * *fragment.identifier* — groups entries from the same archive
> * *fragment.index* — preserves ordering
> * *fragment.count* — total number of entries
> Currently, this either requires custom logic or behaves inconsistently
> depending on the format being unpacked.
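
To illustrate why all three attributes are needed, here is a minimal Python sketch of defragment-style regrouping. It is not NiFi's MergeContent code: the `defragment` function and the dict layout used for a FlowFile are assumptions made purely for illustration. Fragments are binned by fragment.identifier, a bin is released only once it holds fragment.count entries, and its contents are ordered by fragment.index.

```python
from collections import defaultdict


def defragment(flowfiles):
    """Return {identifier: [content, ...]} for every complete fragment group.

    Each flowfile is a hypothetical dict: {"attributes": {...}, "content": bytes}.
    """
    bins = defaultdict(list)
    for ff in flowfiles:
        # fragment.identifier decides which group a fragment belongs to
        bins[ff["attributes"]["fragment.identifier"]].append(ff)

    complete = {}
    for identifier, fragments in bins.items():
        # fragment.count tells us when the group is complete
        expected = int(fragments[0]["attributes"]["fragment.count"])
        if len(fragments) != expected:
            continue  # incomplete group: keep waiting
        # fragment.index restores the original entry order
        fragments.sort(key=lambda ff: int(ff["attributes"]["fragment.index"]))
        complete[identifier] = [ff["content"] for ff in fragments]
    return complete
```

If any one of the three attributes is missing, the sketch cannot group, cannot detect completeness, or cannot restore order, which is the gap the use case describes.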
> h2. Proposed Enhancement
> h3. 1. Add support in UnpackContent for optional fragment attribute generation
> Add the following optional properties to *UnpackContent*:
> *Add Fragment Attributes*
> *Property Name:* Add Fragment Attributes
> *Description:* When enabled, assigns *fragment.identifier*,
> *fragment.index*, and *fragment.count* to all unpacked FlowFiles.
> *Allowable Values:* true / false
> *Default:* false
> *Fragment Identifier Value*
> *Property Name:* Fragment Identifier Value
> *Description:* Specifies the value used for *fragment.identifier*.
> Supports Expression Language evaluated against the incoming packed FlowFile.
> The expression is evaluated once per source FlowFile, and the resulting value
> is applied to all unpacked FlowFiles derived from that source.
> *Default Value:* ${UUID()}
> *Examples:*
> * ${UUID()} — unique grouping per archive
> * ${filename} — stable grouping based on original filename
> * ${archive.filename} — explicit archive attribute, if present
> h3. UnpackContent Behaviour
> When enabled:
> * All FlowFiles produced from a single archive share the same
> *fragment.identifier*
> * *fragment.index* is assigned based on entry order within the archive
> * *fragment.count* is set to the total number of entries extracted
> * The identifier expression is evaluated once per parent FlowFile
> When disabled:
> * No change to current *UnpackContent* behaviour
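
The enabled behaviour could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the proposed processor code: `add_fragment_attributes` is a hypothetical helper, `identifier_fn` stands in for the Expression Language property, and the zero-based index simply mirrors the example table in this issue.

```python
import uuid


def add_fragment_attributes(parent_attributes, entries, identifier_fn=None):
    """Return one attribute dict per unpacked entry.

    parent_attributes: attributes of the packed source FlowFile (hypothetical dict).
    entries: entry names in the order they appear in the archive.
    identifier_fn: stand-in for the identifier expression; receives the parent's attributes.
    """
    if identifier_fn is None:
        # Assumed default: a random identifier per archive
        identifier_fn = lambda attrs: str(uuid.uuid4())

    # Evaluated once per parent, then shared by every child
    identifier = identifier_fn(parent_attributes)
    count = len(entries)

    return [
        {
            "filename": name,
            "fragment.identifier": identifier,
            "fragment.index": str(index),  # entry order within the archive
            "fragment.count": str(count),  # total entries extracted
        }
        for index, name in enumerate(entries)
    ]


unpacked = add_fragment_attributes(
    {"filename": "archive.zip"},
    ["file1.txt", "file2.txt", "file3.txt"],
    identifier_fn=lambda attrs: attrs["filename"],
)
```

Evaluating the identifier before the loop is the important detail: with a random default, evaluating it per entry would give every child a different identifier and make regrouping impossible.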
> h3. 2. Update MergeContent to remove fragment attributes after defragmentation
> *MergeContent* already supports *Defragment* mode using:
> * *fragment.identifier*
> * *fragment.index*
> * *fragment.count*
> However, after a successful defragmentation, these attributes are no longer
> needed on the merged FlowFile and should be removed from the output.
> Proposed *MergeContent* behaviour:
> * When operating in *Defragment* mode, after the merged FlowFile is created,
> remove:
> ** *fragment.identifier*
> ** *fragment.index*
> ** *fragment.count*
> This ensures the merged FlowFile represents the final repackaged artifact
> rather than retaining temporary grouping metadata from the unpack/repack
> workflow.
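
The clean-up step amounts to dropping three keys from the merged FlowFile's attributes. A minimal Python sketch, assuming attributes are held in a plain dict and using a hypothetical `strip_fragment_attributes` helper (not an existing NiFi method):

```python
FRAGMENT_ATTRIBUTES = ("fragment.identifier", "fragment.index", "fragment.count")


def strip_fragment_attributes(merged_attributes):
    """Return a copy of the merged attributes without the fragment.* grouping keys."""
    return {
        key: value
        for key, value in merged_attributes.items()
        if key not in FRAGMENT_ATTRIBUTES
    }


merged = strip_fragment_attributes(
    {
        "filename": "archive.zip",
        "fragment.identifier": "archive.zip",
        "fragment.index": "0",
        "fragment.count": "3",
    }
)
```

Only the three named attributes are removed; everything else on the merged FlowFile is left untouched, and the input dict is not modified.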
> h2. Example
> *Input:*
> archive.zip containing 3 files
> *UnpackContent output* (with feature enabled and *${filename}* as identifier):
> ||filename||fragment.identifier||fragment.index||fragment.count||
> |file1.txt|archive.zip|0|3|
> |file2.txt|archive.zip|1|3|
> |file3.txt|archive.zip|2|3|
> After processing and successful defragmentation in *MergeContent*, the
> merged FlowFile would no longer retain:
> * *fragment.identifier*
> * *fragment.index*
> * *fragment.count*
> This leaves the merged FlowFile as the completed repackaged output, without
> temporary fragmentation metadata.