This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch TIKA-4630-on-main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 1de7cf7547eee9932eb30dd5e01387f12b3fb499
Author: tallison <[email protected]>
AuthorDate: Fri Jan 23 15:47:41 2026 -0500

    TIKA-4630 -- further cleanups
---
 .../ROOT/pages/advanced/embedded-documents.adoc    | 145 +++++++++++++--------
 .../ooxml/OOXMLContainerExtractionTest.java        |   2 +-
 .../tika/parser/RecursiveParserWrapperTest.java    |   2 +-
 .../org/apache/tika/parser/pkg/ZipParserTest.java  |  20 +--
 4 files changed, 103 insertions(+), 66 deletions(-)

diff --git a/docs/modules/ROOT/pages/advanced/embedded-documents.adoc b/docs/modules/ROOT/pages/advanced/embedded-documents.adoc
index ce57a47de1..bd6afbc41d 100644
--- a/docs/modules/ROOT/pages/advanced/embedded-documents.adoc
+++ b/docs/modules/ROOT/pages/advanced/embedded-documents.adoc
@@ -24,75 +24,95 @@ resources.
 
 == Overview
 
-Embedded document metadata falls into two categories:
+Understanding embedded document metadata requires distinguishing between two fundamentally
+different types of information:
 
-* *Tika-Generated Metadata* - Fields that Tika calculates during parsing to help you
-  understand the document structure
-* *Internal File Metadata* - Fields that come directly from the container file's own
-  metadata storage
+* *Containment Structure (Tika-Generated)* - Metadata that Tika generates to track _how documents
+  are nested within each other_. This answers questions like: "Which file contained this
+  attachment?" and "What is the nesting depth?"
 
-== Tika-Generated Metadata
+* *Container Metadata (From the File)* - Metadata that comes from _the container file itself_,
+  describing what the container knows about its contents. This answers questions like: "What
+  path was this file stored at inside the archive?" and "What was the original filename?"
 
-These fields are generated by Tika during parsing and reflect the structure of embedded
-resources as Tika encounters them. All fields below are defined in `TikaCoreProperties`.
+The distinction matters because containers often store embedded files in internal directory
+structures that are independent of how deeply nested the embedding is. A ZIP file preserves
+its original folder hierarchy; an OOXML document stores media in `xl/media/` or `ppt/media/`;
+a PST file organizes emails by folder. This internal organization is separate from the
+question of containment.
 
-=== Structure Tracking
+== Containment Structure (Tika-Generated)
+
+These fields are generated by Tika during parsing to track the _nesting relationships_ between
+documents. They answer: "Which document contained this one?" All fields are defined in
+`TikaCoreProperties`.
+
+=== Nesting Identifiers
 
 `TikaCoreProperties.EMBEDDED_ID` (`X-TIKA:embedded_id`)::
-A 1-indexed integer assigned by Tika to each embedded document during parsing. These IDs
-are assigned in the order documents are encountered by the `RecursiveParserWrapper`.
+A 1-indexed integer assigned by Tika to each embedded document during parsing. IDs are
+assigned in the order documents are encountered by the `RecursiveParserWrapper`. This ID
+uniquely identifies each embedded document within a single parse operation.
 
 `TikaCoreProperties.EMBEDDED_ID_PATH` (`X-TIKA:embedded_id_path`)::
-A path-like representation of the embedded document's position in the hierarchy, built
-from `EMBEDDED_ID` values. For example, `/1/3` indicates that the file with `EMBEDDED_ID=3`
-was an attachment within the file with `EMBEDDED_ID=1`. This is the most robust path for
-tracking document structure.
+A path showing the containment hierarchy using `EMBEDDED_ID` values. For example, `/1/3`
+indicates that the file with `EMBEDDED_ID=3` was contained within the file with
+`EMBEDDED_ID=1`. This is the most reliable field for tracking containment relationships.
++
+NOTE: This is purely about _which document contains which_ - it tells you nothing about
+folder structures or original paths within the containers themselves.
 
-=== Path Synthesis
+=== Synthetic Paths
 
 `TikaCoreProperties.EMBEDDED_RESOURCE_PATH` (`X-TIKA:embedded_resource_path`)::
-A synthetic path generated by concatenating file names (from `RESOURCE_NAME_KEY`) for each
-level of embedding. This provides a human-readable path through the document hierarchy.
+A synthetic path built by concatenating file names (from `RESOURCE_NAME_KEY`) at each
+nesting level. This provides a human-readable path through the containment hierarchy.
 +
 WARNING: Do not use this field for creating directory structures to write out attachments.
-There may be path collisions, illegal characters, or zip slip vulnerabilities. Use `EMBEDDED_ID_PATH`
-for reliable path tracking.
+There may be path collisions, illegal characters, or zip slip vulnerabilities. Use
+`EMBEDDED_ID_PATH` for reliable containment tracking.
 
 `TikaCoreProperties.FINAL_EMBEDDED_RESOURCE_PATH` (`X-TIKA:final_embedded_resource_path`)::
-Similar to `EMBEDDED_RESOURCE_PATH`, but calculated at the end of the full parse rather
-than during parsing. For some parsers, an embedded file's name isn't known until after its
-child files have been parsed. This field may have fewer "unknown" file names than
-`EMBEDDED_RESOURCE_PATH`, but the synthetic names (e.g., `embedded-1`) are not correlated
-between the two fields.
+Similar to `EMBEDDED_RESOURCE_PATH`, but calculated at the end of the full parse. For some
+parsers, an embedded file's name isn't known until after its child files have been parsed.
+This field may have fewer "unknown" file names than `EMBEDDED_RESOURCE_PATH`.
 
 === Resource Naming
 
 `TikaCoreProperties.RESOURCE_NAME_KEY` (`X-TIKA:resourceName`)::
-The "name" of the resource. Tika makes a best effort to determine a meaningful name for
-each embedded resource. When a name cannot be determined from the container file's
-metadata, Tika falls back to synthetic names such as `embedded-1.jpeg`.
+The file name (not path) of the resource. Tika makes a best effort to determine a meaningful
+name from the container's metadata. When unavailable, Tika falls back to synthetic names
+such as `embedded-1.jpeg`.
 +
-NOTE: In Tika 3.x, this field may or may not include path information depending on the
-parser. In 4.x, use `INTERNAL_PATH` for the full path as stored in the container.
+NOTE: In Tika 4.x, this field contains only the file name. Use `INTERNAL_PATH` for the
+full path as stored in the container.
 
-== Internal File Metadata
+== Container Metadata (From the File)
 
-These fields contain metadata that is stored within the container file itself, not
-generated by Tika. All fields below are defined in `TikaCoreProperties`.
+These fields contain metadata that is stored _within the container file itself_. This is
+information the container preserves about its contents, independent of how Tika traverses
+the nesting structure. All fields below are defined in `TikaCoreProperties`.
 
-=== Path and Location
+=== Internal Paths
 
 `TikaCoreProperties.INTERNAL_PATH` (`X-TIKA:internalPath`)::
-The path including file name as literally stored within the container file (e.g., in a
-TAR, ZIP, or PST file). This is distinct from `EMBEDDED_RESOURCE_PATH` in two ways:
+The path (including file name) as literally stored within the container. This is what the
+container knows about where the file lives in its internal structure:
++
+* In a ZIP: the entry path (e.g., `reports/Q1/sales.xlsx`)
+* In a PST: the folder path plus message name (e.g., `Inbox/Important/Meeting notes.msg`)
+* In an OOXML document: the part name (e.g., `xl/media/image1.png`)
++
+This differs fundamentally from `EMBEDDED_RESOURCE_PATH`:
 +
-1. It is the actual metadata from the container, not synthetically generated
-2. It may include folder/directory information that the container preserves
+* `INTERNAL_PATH` is what the _container stores_ about the file's location within itself
+* `EMBEDDED_RESOURCE_PATH` is what _Tika synthesizes_ from the nesting structure
 
 `TikaCoreProperties.ORIGINAL_RESOURCE_NAME` (`X-TIKA:origResourceName`)::
-For some file formats, this contains the original path where the file was stored on the
-creator's system before being embedded. For example, older `.doc` files and `.xlsx` files
-may store the original file system path from the author's computer.
+For some file formats, the file path where the document was last saved on the creator's
+system. For example, an `.xlsx` file named `budget.xlsx` may include a metadata property
+storing where it was last saved: `C:\Users\Alice\budget.xlsx`. This is not specific to
+embedded files - it's a property that certain file formats preserve about themselves.
 
 == Microsoft-Specific Metadata
 
@@ -115,39 +135,39 @@ embedded. Defined in the `Office` metadata class.
 
 |`EMBEDDED_ID`
 |`X-TIKA:embedded_id`
-|Tika-generated
+|Containment
 
 |`EMBEDDED_ID_PATH`
 |`X-TIKA:embedded_id_path`
-|Tika-generated
+|Containment
 
 |`EMBEDDED_RESOURCE_PATH`
 |`X-TIKA:embedded_resource_path`
-|Tika-generated
+|Containment
 
 |`FINAL_EMBEDDED_RESOURCE_PATH`
 |`X-TIKA:final_embedded_resource_path`
-|Tika-generated
+|Containment
 
 |`RESOURCE_NAME_KEY`
 |`X-TIKA:resourceName`
-|Tika-generated
+|Containment
 
 |`INTERNAL_PATH`
 |`X-TIKA:internalPath`
-|From container
+|Container
 
 |`ORIGINAL_RESOURCE_NAME`
 |`X-TIKA:origResourceName`
-|From container
+|Container
 
 |`EMBEDDED_RELATIONSHIP_ID`
 |`X-TIKA:embeddedRelationshipId`
-|From container (MS)
+|Container (MS)
 
 |`Office.EMBEDDED_STORAGE_CLASS_ID`
 |`msoffice:embeddedStorageClassId`
-|From container (MS)
+|Container (MS)
 |===
 
 == Example: Understanding the Difference
@@ -208,8 +228,25 @@ itself contains an embedded image:
 |`/sales.xlsx/image1.png`
 |===
 
-Key observations:
+== Key Observations
+
+The table above illustrates the fundamental distinction between containment tracking and
+container metadata:
+
+*Containment structure (Tika-generated):*
+
+* `EMBEDDED_ID_PATH` `/1/2` tells you that the image (ID=2) was found _inside_ the
+  spreadsheet (ID=1). It answers: "What contains what?"
+* `EMBEDDED_RESOURCE_PATH` `/sales.xlsx/image1.png` is synthesized from file names at each
+  nesting level. It provides a human-readable path through the containment hierarchy.
+
+*Container metadata (from the file):*
+
+* `INTERNAL_PATH` for the spreadsheet (`reports/Q1/sales.xlsx`) is what the _ZIP file knows_
+  about where that entry was stored - its internal folder structure.
+* `INTERNAL_PATH` for the image (`xl/media/image1.png`) is what the _XLSX file knows_ about
+  where that media file lives - its internal OOXML part name.
 
-* `INTERNAL_PATH` preserves the full directory structure as it was stored from each container
-* `EMBEDDED_RESOURCE_PATH` is built only from file names at each level
-* `EMBEDDED_ID_PATH` `/1/2` shows that the image (ID=2) was found inside the spreadsheet (ID=1)
+Notice that `INTERNAL_PATH` resets at each container boundary. The image's internal path
+doesn't include `reports/Q1/` because that path information belongs to the ZIP container,
+not the XLSX container. Each container only knows about its own internal organization.
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
index dfe86f2040..6dfda6e553 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
@@ -262,7 +262,7 @@ public class OOXMLContainerExtractionTest extends AbstractPOIContainerExtraction
         assertEquals("Microsoft_Office_Excel_Worksheet1.xlsx", handler.filenames.get(6));
         assertEquals("Microsoft_Office_Word_Document2.docx", handler.filenames.get(7));
         assertEquals("Microsoft_Office_Word_97_-_2003_Document1.doc", handler.filenames.get(8));
-        assertEquals("/docProps/thumbnail.jpeg", handler.filenames.get(9));
+        assertEquals("thumbnail.jpeg", handler.filenames.get(9));
 
         // But we do know their types
         assertEquals(TYPE_EMF, handler.mediaTypes.get(0));  // Icon of embedded office doc
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
index 30dab1331e..9b054d4ad4 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
@@ -176,7 +176,7 @@ public class RecursiveParserWrapperTest extends TikaTest {
     public void testCharLimitNoThrowOnWriteLimit() throws Exception {
         ParseContext context = new ParseContext();
         Metadata metadata = new Metadata();
-        int writeLimit = 500;
+        int writeLimit = 510;
         RecursiveParserWrapper wrapper = new RecursiveParserWrapper(AUTO_DETECT_PARSER);
         RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT,
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
index 04a00c5028..8a22855bd8 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
@@ -77,17 +77,17 @@ public class ZipParserTest extends AbstractPkgTest {
         assertContains("<div class=\"embedded\" id=\"test1.txt\" />", xml);
         assertContains("<div class=\"embedded\" id=\"test2.txt\" />", xml);
 
-        // Also make sure EMBEDDED_RELATIONSHIP_ID was
+        // Also make sure INTERNAL_PATH was
         // passed when parsing the embedded docs:
         ParseContext context = new ParseContext();
-        GatherRelIDsDocumentExtractor relIDs = new GatherRelIDsDocumentExtractor();
-        context.set(EmbeddedDocumentExtractor.class, relIDs);
+        GatherInternalPathsDocumentExtractor extractor = new GatherInternalPathsDocumentExtractor();
+        context.set(EmbeddedDocumentExtractor.class, extractor);
         try (TikaInputStream tis = getResourceAsStream("/test-documents/testEmbedded.zip")) {
             AUTO_DETECT_PARSER.parse(tis, new BodyContentHandler(), new Metadata(), context);
         }
 
-        assertTrue(relIDs.allRelIDs.contains("test1.txt"));
-        assertTrue(relIDs.allRelIDs.contains("test2.txt"));
+        assertTrue(extractor.allInternalPaths.contains("test1.txt"));
+        assertTrue(extractor.allInternalPaths.contains("test2.txt"));
     }
 
     @Test
@@ -123,13 +123,13 @@ public class ZipParserTest extends AbstractPkgTest {
                 results.get(4).get("X-TIKA:EXCEPTION:embedded_exception"));
     }
 
-    private static class GatherRelIDsDocumentExtractor implements EmbeddedDocumentExtractor {
-        public Set<String> allRelIDs = new HashSet<>();
+    private static class GatherInternalPathsDocumentExtractor implements EmbeddedDocumentExtractor {
+        public Set<String> allInternalPaths = new HashSet<>();
 
         public boolean shouldParseEmbedded(Metadata metadata) {
-            String relID = metadata.get(TikaCoreProperties.EMBEDDED_RELATIONSHIP_ID);
-            if (relID != null) {
-                allRelIDs.add(relID);
+            String internalPath = metadata.get(TikaCoreProperties.INTERNAL_PATH);
+            if (internalPath != null) {
+                allInternalPaths.add(internalPath);
             }
             return false;
         }
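
Reader's note (not part of the commit): the documentation change above recommends `EMBEDDED_ID_PATH` over `EMBEDDED_RESOURCE_PATH` when writing attachments out. The following is a minimal sketch of how a consumer might act on that advice, using plain Java and no Tika classes. The class `EmbeddedIdPaths` and its methods are hypothetical names invented for illustration, and the sketch assumes the value has the `/1/3` shape shown in the documentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: interprets an X-TIKA:embedded_id_path
// value such as "/1/3" (assumed format: slash-separated integer IDs).
class EmbeddedIdPaths {

    /** Splits "/1/3" into [1, 3], skipping the empty segment before the leading slash. */
    static List<Integer> ids(String embeddedIdPath) {
        List<Integer> out = new ArrayList<>();
        for (String part : embeddedIdPath.split("/")) {
            if (!part.isEmpty()) {
                out.add(Integer.parseInt(part));
            }
        }
        return out;
    }

    /** Returns the ID of the direct container, or -1 when the path has fewer than two IDs. */
    static int parentId(String embeddedIdPath) {
        List<Integer> ids = ids(embeddedIdPath);
        return ids.size() < 2 ? -1 : ids.get(ids.size() - 2);
    }

    /**
     * Builds a flat output file name from the IDs only, so no container-supplied
     * text (which could hold "../" or illegal characters) reaches the file system.
     * The "embedded-" prefix and ".bin" suffix are arbitrary choices for this sketch.
     */
    static String safeFileName(String embeddedIdPath) {
        StringBuilder sb = new StringBuilder("embedded");
        for (int id : ids(embeddedIdPath)) {
            sb.append('-').append(id);
        }
        return sb.append(".bin").toString();
    }
}
```

Because the name is derived from integers alone, two attachments that share a resource name cannot collide, and a hostile entry name never becomes part of the output path.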
