[ https://issues.apache.org/jira/browse/NIFI-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Scott updated NIFI-15758:
---------------------------------
Description:
Currently, the {{UnpackContent}} processor extracts FlowFiles from archive
formats (e.g. ZIP, TAR) but does not provide a built-in mechanism to
assign the fragment attributes ({{fragment.identifier}},
{{fragment.index}}, {{fragment.count}}) required for downstream
reassembly using {{MergeContent}} for formats like FlowFile Package.
This makes it difficult to support a common dataflow pattern where content is:
# Packed to optimise transport
# Unpacked for enrichment or processing of individual entries
# Repacked back into the original (or equivalent) archive structure
Without fragment attributes, users must implement custom logic to track
grouping and ordering, which introduces complexity and inconsistency.
h2. *Use Case*
A common dataflow pattern:
# Data is packaged (e.g. ZIP/TAR/FFv3) to optimise transport
# {{UnpackContent}} extracts individual FlowFiles
# Files are enriched or transformed independently
# Files are regrouped and repackaged
To support correct regrouping, all unpacked FlowFiles must share:
* {{fragment.identifier}} → groups entries from the same archive
* {{fragment.index}} → preserves ordering
* {{fragment.count}} → total number of entries
Currently, this either requires custom logic or behaves inconsistently depending on the archive format.
----
h2. *Proposed Enhancement*
Add the following optional properties to {{UnpackContent}}:
h3. *1. Add Fragment Attributes*
* *Property Name:* {{Add Fragment Attributes}}
* *Description:* When enabled, assigns {{fragment.identifier}},
{{fragment.index}}, and {{fragment.count}} to all unpacked FlowFiles.
* *Allowable Values:* {{true}} / {{false}}
* *Default:* {{false}} (no change to existing behaviour)
----
h3. *2. Fragment Identifier Value*
* *Property Name:* {{Fragment Identifier Value}}
* *Description:*
Specifies the value used for {{fragment.identifier}}.
Supports Expression Language evaluated against the incoming (packed) FlowFile.
The expression is evaluated {*}once per source FlowFile{*}, and the resulting
value is applied to all unpacked FlowFiles derived from that source.
* *Default Value:* {{${uuid()}}}
* *Examples:*
** {{${uuid()}}} → unique grouping per archive (default)
** {{${filename}}} → stable grouping based on original filename
** {{${archive.filename}}} → explicit archive attribute (if present)
----
h2. *Behaviour Details*
When {*}enabled{*}:
* All FlowFiles produced from a single archive share the same
{{fragment.identifier}}
* {{fragment.index}} is assigned based on entry order within the archive
* {{fragment.count}} is set to the total number of entries extracted
* The identifier expression is evaluated once per parent FlowFile
When {*}disabled{*}:
* No change to current {{UnpackContent}} behaviour
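The enabled behaviour described above can be illustrated with a minimal Python sketch. It is not NiFi code: {{unpack_with_fragments}} and {{build_zip}} are hypothetical helpers, plain dictionaries stand in for FlowFile attributes, and the index is assumed to start at 0 as in the example table in this issue.

```python
import io
import uuid
import zipfile


def unpack_with_fragments(archive_bytes, identifier=None):
    """Return one attribute dict per ZIP file entry, in archive order."""
    # The identifier is resolved once per source archive, never per entry.
    fragment_id = identifier if identifier is not None else str(uuid.uuid4())
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # Directory entries are skipped; only files become fragments.
        names = [info.filename for info in zf.infolist() if not info.is_dir()]
    count = len(names)
    return [
        {
            "filename": name,
            "fragment.identifier": fragment_id,
            "fragment.index": str(index),
            "fragment.count": str(count),
        }
        for index, name in enumerate(names)
    ]


def build_zip(names):
    """Build a small in-memory ZIP archive for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in names:
            zf.writestr(name, "content of " + name)
    return buffer.getvalue()


fragments = unpack_with_fragments(
    build_zip(["file1.txt", "file2.txt", "file3.txt"]), identifier="archive.zip"
)
for attrs in fragments:
    print(attrs)
```

Passing no {{identifier}} models the proposed {{${uuid()}}} default: one random value is generated per archive and shared by every entry.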
----
h2. *Compatibility & Scope*
* Fully backward compatible (feature is opt-in)
* No changes required to {{MergeContent}}
** {{MergeContent}} already supports Defragment mode using:
*** {{fragment.identifier}}
*** {{fragment.index}}
*** {{fragment.count}}
* Applies consistently across all supported archive formats (ZIP, TAR, etc.)
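To show why the three attributes are sufficient for regrouping, here is a simplified Python model of defragment-style reassembly. It is only a sketch of the idea, not the {{MergeContent}} implementation: {{defragment}} is a hypothetical function, attributes are plain dictionaries, and a 0-based {{fragment.index}} is assumed to match the example in this issue.

```python
from collections import defaultdict


def defragment(flowfiles):
    """Group attribute dicts by fragment.identifier.

    Returns a mapping of identifier to filenames in index order,
    containing only groups whose fragments have all arrived.
    """
    bins = defaultdict(list)
    for attrs in flowfiles:
        bins[attrs["fragment.identifier"]].append(attrs)

    complete = {}
    for identifier, members in bins.items():
        expected = int(members[0]["fragment.count"])
        indexes = sorted(int(m["fragment.index"]) for m in members)
        # Complete only when every index 0..count-1 is present exactly once.
        if indexes == list(range(expected)):
            ordered = sorted(members, key=lambda m: int(m["fragment.index"]))
            complete[identifier] = [m["filename"] for m in ordered]
    return complete


def fragment(filename, identifier, index, count):
    """Build the attribute dict for one unpacked entry."""
    return {
        "filename": filename,
        "fragment.identifier": identifier,
        "fragment.index": str(index),
        "fragment.count": str(count),
    }


# Fragments arrive out of order; the second archive is still missing an entry.
arrivals = [
    fragment("file3.txt", "archive.zip", 2, 3),
    fragment("other1.txt", "other.zip", 0, 2),
    fragment("file1.txt", "archive.zip", 0, 3),
    fragment("file2.txt", "archive.zip", 1, 3),
]
merged = defragment(arrivals)
print(merged)
```

In this model the incomplete group is simply held back, which is why an accurate {{fragment.count}} matters as much as the shared identifier.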
----
h2. *Benefits*
* Enables standard unpack → enrich → repack workflows
* Eliminates need for custom scripting or attribute tracking
* Provides consistent behaviour across formats
* Aligns with existing NiFi fragment-based processing patterns
* Keeps configuration simple by leveraging Expression Language instead of
strategy modes
----
h2. *Implementation Notes*
* Ensure consistent ordering across archive formats when assigning
{{fragment.index}}
* Avoid overwriting existing fragment attributes unless the feature is explicitly enabled
* Expression for {{fragment.identifier}} must be evaluated once per parent
FlowFile
* Avoid full in-memory buffering where possible when determining
{{fragment.count}}
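One way to reconcile the last note with streaming extraction is to stamp {{fragment.count}} only after the final entry has been read, while retaining nothing but the small attribute maps. The Python sketch below illustrates that idea under stated assumptions: {{stream_unpack}} and {{build_tar}} are hypothetical helpers, the {{bytes.streamed}} key is invented for the demonstration, and counting bytes stands in for writing each entry to its own FlowFile.

```python
import io
import tarfile


def stream_unpack(fileobj, fragment_id, chunk_size=8192):
    """Read a TAR stream entry by entry and return per-entry attributes."""
    pending = []
    # Mode "r|" reads the archive strictly sequentially, without seeking.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            reader = tar.extractfile(member)
            streamed = 0
            # Consume the entry in chunks; its content is never held in full.
            while True:
                chunk = reader.read(chunk_size)
                if not chunk:
                    break
                streamed += len(chunk)
            pending.append({
                "filename": member.name,
                "fragment.identifier": fragment_id,
                "fragment.index": str(len(pending)),
                "bytes.streamed": str(streamed),  # hypothetical, demo only
            })
    # The total is known only now, so the count is applied in a final pass.
    count = len(pending)
    for attrs in pending:
        attrs["fragment.count"] = str(count)
    return pending


def build_tar(names):
    """Build a small in-memory TAR archive for demonstration."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in names:
            data = ("content of " + name).encode()
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


results = stream_unpack(build_tar(["a.txt", "b.txt"]), "archive.tar")
for attrs in results:
    print(attrs)
```

The trade-off in this sketch is that no unpacked entry can be released downstream until the whole archive has been read, since the count is not known earlier.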
----
h2. *Example*
*Input:*
{{archive.zip}} containing 3 files
*Output (with feature enabled and {{${filename}}} as identifier):*
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|
was:
Currently, the {{UnpackContent}} processor extracts FlowFiles from archive
formats (e.g. ZIP, TAR, etc.) but does not provide a built-in mechanism to
assign fragment attributes ({{fragment.identifier}},
{{fragment.index}}, {{fragment.count}}) required for downstream
reassembly using {{MergeContent}}.
This makes it difficult to support a common dataflow pattern where content is:
# Packed to optimise transport
# Unpacked for enrichment or processing of individual entries
# Repacked back into the original (or equivalent) archive structure
Without fragment attributes, users must implement custom logic to track
grouping and ordering, which introduces complexity and inconsistency.
h2. *Use Case*
A common dataflow pattern:
# Data is packaged (e.g. ZIP/TAR/FFv3) to optimise transport
# {{UnpackContent}} extracts individual FlowFiles
# Files are enriched or transformed independently
# Files are regrouped and repackaged
To support correct regrouping, all unpacked FlowFiles must share:
* {{fragment.identifier}} → groups entries from the same archive
* {{fragment.index}} → preserves ordering
* {{fragment.count}} → total number of entries
Currently, this requires custom logic or is inconsistent depending on format.
----
h2. *Proposed Enhancement*
Add the following optional properties to {{UnpackContent}}:
h3. *1. Add Fragment Attributes*
* *Property Name:* {{Add Fragment Attributes}}
* *Description:* When enabled, assigns {{fragment.identifier}},
{{fragment.index}}, and {{fragment.count}} to all unpacked FlowFiles.
* *Allowable Values:* {{true}} / {{false}}
* *Default:* {{false}} (no change to existing behaviour)
----
h3. *2. Fragment Identifier Value*
* *Property Name:* {{Fragment Identifier Value}}
* *Description:*
Specifies the value used for {{fragment.identifier}}.
Supports Expression Language evaluated against the incoming (packed) FlowFile.
The expression is evaluated {*}once per source FlowFile{*}, and the resulting
value is applied to all unpacked FlowFiles derived from that source.
* *Default Value:* {{${uuid()}}}
* *Examples:*
** {{${uuid()}}} → unique grouping per archive (default)
** {{${filename}}} → stable grouping based on original filename
** {{${archive.filename}}} → explicit archive attribute (if present)
----
h2. *Behaviour Details*
When {*}enabled{*}:
* All FlowFiles produced from a single archive share the same
{{fragment.identifier}}
* {{fragment.index}} is assigned based on entry order within the archive
* {{fragment.count}} is set to the total number of entries extracted
* The identifier expression is evaluated once per parent FlowFile
When {*}disabled{*}:
* No change to current {{UnpackContent}} behaviour
----
h2. *Compatibility & Scope*
* Fully backward compatible (feature is opt-in)
* No changes required to {{MergeContent}}
** {{MergeContent}} already supports Defragment mode using:
*** {{fragment.identifier}}
*** {{fragment.index}}
*** {{fragment.count}}
* Applies consistently across all supported archive formats (ZIP, TAR, etc.)
----
h2. *Benefits*
* Enables standard unpack → enrich → repack workflows
* Eliminates need for custom scripting or attribute tracking
* Provides consistent behaviour across formats
* Aligns with existing NiFi fragment-based processing patterns
* Keeps configuration simple by leveraging Expression Language instead of
strategy modes
----
h2. *Implementation Notes*
* Ensure consistent ordering across archive formats when assigning
{{fragment.index}}
* Avoid overwriting existing fragment attributes unless explicitly enabled
* Expression for {{fragment.identifier}} must be evaluated once per parent
FlowFile
* Avoid full in-memory buffering where possible when determining
{{fragment.count}}
----
h2. *Example*
*Input:*
{{archive.zip}} containing 3 files
*Output (with feature enabled and {{${filename}}} as identifier):*
||filename||fragment.identifier||fragment.index||fragment.count||
|file1.txt|archive.zip|0|3|
|file2.txt|archive.zip|1|3|
|file3.txt|archive.zip|2|3|
> Enhance UnpackContent to Optionally Add Fragment Attributes for Repackaging
> Use Cases
> -------------------------------------------------------------------------------------
>
> Key: NIFI-15758
> URL: https://issues.apache.org/jira/browse/NIFI-15758
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Richard Scott
> Assignee: Richard Scott
> Priority: Minor