This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch TIKA-4630-on-main
in repository https://gitbox.apache.org/repos/asf/tika.git
commit 1de7cf7547eee9932eb30dd5e01387f12b3fb499
Author: tallison <[email protected]>
AuthorDate: Fri Jan 23 15:47:41 2026 -0500

    TIKA-4630 -- further cleanups
---
 .../ROOT/pages/advanced/embedded-documents.adoc    | 145 +++++++++++++--------
 .../ooxml/OOXMLContainerExtractionTest.java        |   2 +-
 .../tika/parser/RecursiveParserWrapperTest.java    |   2 +-
 .../org/apache/tika/parser/pkg/ZipParserTest.java  |  20 +--
 4 files changed, 103 insertions(+), 66 deletions(-)

diff --git a/docs/modules/ROOT/pages/advanced/embedded-documents.adoc b/docs/modules/ROOT/pages/advanced/embedded-documents.adoc
index ce57a47de1..bd6afbc41d 100644
--- a/docs/modules/ROOT/pages/advanced/embedded-documents.adoc
+++ b/docs/modules/ROOT/pages/advanced/embedded-documents.adoc
@@ -24,75 +24,95 @@ resources.

 == Overview

-Embedded document metadata falls into two categories:
+Understanding embedded document metadata requires distinguishing between two fundamentally
+different types of information:

-* *Tika-Generated Metadata* - Fields that Tika calculates during parsing to help you
-  understand the document structure
-* *Internal File Metadata* - Fields that come directly from the container file's own
-  metadata storage
+* *Containment Structure (Tika-Generated)* - Metadata that Tika generates to track _how documents
+  are nested within each other_. This answers questions like: "Which file contained this
+  attachment?" and "What is the nesting depth?"

-== Tika-Generated Metadata
+* *Container Metadata (From the File)* - Metadata that comes from _the container file itself_,
+  describing what the container knows about its contents. This answers questions like: "What
+  path was this file stored at inside the archive?" and "What was the original filename?"

-These fields are generated by Tika during parsing and reflect the structure of embedded
-resources as Tika encounters them. All fields below are defined in `TikaCoreProperties`.
+The distinction matters because containers often store embedded files in internal directory
+structures that are independent of how deeply nested the embedding is. A ZIP file preserves
+its original folder hierarchy; an OOXML document stores media in `xl/media/` or `ppt/media/`;
+a PST file organizes emails by folder. This internal organization is separate from the
+question of containment.

-=== Structure Tracking
+== Containment Structure (Tika-Generated)
+
+These fields are generated by Tika during parsing to track the _nesting relationships_ between
+documents. They answer: "Which document contained this one?" All fields are defined in
+`TikaCoreProperties`.
+
+=== Nesting Identifiers

 `TikaCoreProperties.EMBEDDED_ID` (`X-TIKA:embedded_id`)::
-A 1-indexed integer assigned by Tika to each embedded document during parsing. These IDs
-are assigned in the order documents are encountered by the `RecursiveParserWrapper`.
+A 1-indexed integer assigned by Tika to each embedded document during parsing. IDs are
+assigned in the order documents are encountered by the `RecursiveParserWrapper`. This ID
+uniquely identifies each embedded document within a single parse operation.

 `TikaCoreProperties.EMBEDDED_ID_PATH` (`X-TIKA:embedded_id_path`)::
-A path-like representation of the embedded document's position in the hierarchy, built
-from `EMBEDDED_ID` values. For example, `/1/3` indicates that the file with `EMBEDDED_ID=3`
-was an attachment within the file with `EMBEDDED_ID=1`. This is the most robust path for
-tracking document structure.
+A path showing the containment hierarchy using `EMBEDDED_ID` values. For example, `/1/3`
+indicates that the file with `EMBEDDED_ID=3` was contained within the file with
+`EMBEDDED_ID=1`. This is the most reliable field for tracking containment relationships.
++
+NOTE: This is purely about _which document contains which_ - it tells you nothing about
+folder structures or original paths within the containers themselves.

-=== Path Synthesis
+=== Synthetic Paths

 `TikaCoreProperties.EMBEDDED_RESOURCE_PATH` (`X-TIKA:embedded_resource_path`)::
-A synthetic path generated by concatenating file names (from `RESOURCE_NAME_KEY`) for each
-level of embedding. This provides a human-readable path through the document hierarchy.
+A synthetic path built by concatenating file names (from `RESOURCE_NAME_KEY`) at each
+nesting level. This provides a human-readable path through the containment hierarchy.
 +
 WARNING: Do not use this field for creating directory structures to write out attachments.
-There may be path collisions, illegal characters, or zip slip vulnerabilities. Use `EMBEDDED_ID_PATH`
-for reliable path tracking.
+There may be path collisions, illegal characters, or zip slip vulnerabilities. Use
+`EMBEDDED_ID_PATH` for reliable containment tracking.

 `TikaCoreProperties.FINAL_EMBEDDED_RESOURCE_PATH` (`X-TIKA:final_embedded_resource_path`)::
-Similar to `EMBEDDED_RESOURCE_PATH`, but calculated at the end of the full parse rather
-than during parsing. For some parsers, an embedded file's name isn't known until after its
-child files have been parsed. This field may have fewer "unknown" file names than
-`EMBEDDED_RESOURCE_PATH`, but the synthetic names (e.g., `embedded-1`) are not correlated
-between the two fields.
+Similar to `EMBEDDED_RESOURCE_PATH`, but calculated at the end of the full parse. For some
+parsers, an embedded file's name isn't known until after its child files have been parsed.
+This field may have fewer "unknown" file names than `EMBEDDED_RESOURCE_PATH`.

 === Resource Naming

 `TikaCoreProperties.RESOURCE_NAME_KEY` (`X-TIKA:resourceName`)::
-The "name" of the resource. Tika makes a best effort to determine a meaningful name for
-each embedded resource. When a name cannot be determined from the container file's
-metadata, Tika falls back to synthetic names such as `embedded-1.jpeg`.
+The file name (not path) of the resource. Tika makes a best effort to determine a meaningful
+name from the container's metadata. When unavailable, Tika falls back to synthetic names
+such as `embedded-1.jpeg`.
 +
-NOTE: In Tika 3.x, this field may or may not include path information depending on the
-parser. In 4.x, use `INTERNAL_PATH` for the full path as stored in the container.
+NOTE: In Tika 4.x, this field contains only the file name. Use `INTERNAL_PATH` for the
+full path as stored in the container.

-== Internal File Metadata
+== Container Metadata (From the File)

-These fields contain metadata that is stored within the container file itself, not
-generated by Tika. All fields below are defined in `TikaCoreProperties`.
+These fields contain metadata that is stored _within the container file itself_. This is
+information the container preserves about its contents, independent of how Tika traverses
+the nesting structure. All fields below are defined in `TikaCoreProperties`.

-=== Path and Location
+=== Internal Paths

 `TikaCoreProperties.INTERNAL_PATH` (`X-TIKA:internalPath`)::
-The path including file name as literally stored within the container file (e.g., in a
-TAR, ZIP, or PST file). This is distinct from `EMBEDDED_RESOURCE_PATH` in two ways:
+The path (including file name) as literally stored within the container. This is what the
+container knows about where the file lives in its internal structure:
++
+* In a ZIP: the entry path (e.g., `reports/Q1/sales.xlsx`)
+* In a PST: the folder path plus message name (e.g., `Inbox/Important/Meeting notes.msg`)
+* In an OOXML document: the part name (e.g., `xl/media/image1.png`)
++
+This differs fundamentally from `EMBEDDED_RESOURCE_PATH`:
 +
-1. It is the actual metadata from the container, not synthetically generated
-2. It may include folder/directory information that the container preserves
+* `INTERNAL_PATH` is what the _container stores_ about the file's location within itself
+* `EMBEDDED_RESOURCE_PATH` is what _Tika synthesizes_ from the nesting structure

 `TikaCoreProperties.ORIGINAL_RESOURCE_NAME` (`X-TIKA:origResourceName`)::
-For some file formats, this contains the original path where the file was stored on the
-creator's system before being embedded. For example, older `.doc` files and `.xlsx` files
-may store the original file system path from the author's computer.
+For some file formats, the file path where the document was last saved on the creator's
+system. For example, an `.xlsx` file named `budget.xlsx` may include a metadata property
+storing where it was last saved: `C:\Users\Alice\budget.xlsx`. This is not specific to
+embedded files - it's a property that certain file formats preserve about themselves.

 == Microsoft-Specific Metadata

@@ -115,39 +135,39 @@ embedded. Defined in the `Office` metadata class.

 |`EMBEDDED_ID`
 |`X-TIKA:embedded_id`
-|Tika-generated
+|Containment

 |`EMBEDDED_ID_PATH`
 |`X-TIKA:embedded_id_path`
-|Tika-generated
+|Containment

 |`EMBEDDED_RESOURCE_PATH`
 |`X-TIKA:embedded_resource_path`
-|Tika-generated
+|Containment

 |`FINAL_EMBEDDED_RESOURCE_PATH`
 |`X-TIKA:final_embedded_resource_path`
-|Tika-generated
+|Containment

 |`RESOURCE_NAME_KEY`
 |`X-TIKA:resourceName`
-|Tika-generated
+|Containment

 |`INTERNAL_PATH`
 |`X-TIKA:internalPath`
-|From container
+|Container

 |`ORIGINAL_RESOURCE_NAME`
 |`X-TIKA:origResourceName`
-|From container
+|Container

 |`EMBEDDED_RELATIONSHIP_ID`
 |`X-TIKA:embeddedRelationshipId`
-|From container (MS)
+|Container (MS)

 |`Office.EMBEDDED_STORAGE_CLASS_ID`
 |`msoffice:embeddedStorageClassId`
-|From container (MS)
+|Container (MS)
 |===

 == Example: Understanding the Difference
@@ -208,8 +228,25 @@ itself contains an embedded image:
 |`/sales.xlsx/image1.png`
 |===

-Key observations:
+== Key Observations
+
+The table above illustrates the fundamental distinction between containment tracking and
+container metadata:
+
+*Containment structure (Tika-generated):*
+
+* `EMBEDDED_ID_PATH` `/1/2` tells you that the image (ID=2) was found _inside_ the
+  spreadsheet (ID=1). It answers: "What contains what?"
+* `EMBEDDED_RESOURCE_PATH` `/sales.xlsx/image1.png` is synthesized from file names at each
+  nesting level. It provides a human-readable path through the containment hierarchy.
+
+*Container metadata (from the file):*
+
+* `INTERNAL_PATH` for the spreadsheet (`reports/Q1/sales.xlsx`) is what the _ZIP file knows_
+  about where that entry was stored - its internal folder structure.
+* `INTERNAL_PATH` for the image (`xl/media/image1.png`) is what the _XLSX file knows_ about
+  where that media file lives - its internal OOXML part name.

-* `INTERNAL_PATH` preserves the full directory structure as it was stored from each container
-* `EMBEDDED_RESOURCE_PATH` is built only from file names at each level
-* `EMBEDDED_ID_PATH` `/1/2` shows that the image (ID=2) was found inside the spreadsheet (ID=1)
+Notice that `INTERNAL_PATH` resets at each container boundary. The image's internal path
+doesn't include `reports/Q1/` because that path information belongs to the ZIP container,
+not the XLSX container. Each container only knows about its own internal organization.
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
index dfe86f2040..6dfda6e553 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
@@ -262,7 +262,7 @@ public class OOXMLContainerExtractionTest extends AbstractPOIContainerExtraction
         assertEquals("Microsoft_Office_Excel_Worksheet1.xlsx", handler.filenames.get(6));
         assertEquals("Microsoft_Office_Word_Document2.docx", handler.filenames.get(7));
         assertEquals("Microsoft_Office_Word_97_-_2003_Document1.doc", handler.filenames.get(8));
-        assertEquals("/docProps/thumbnail.jpeg", handler.filenames.get(9));
+        assertEquals("thumbnail.jpeg", handler.filenames.get(9));

         // But we do know their types
         assertEquals(TYPE_EMF, handler.mediaTypes.get(0)); // Icon of embedded office doc
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
index 30dab1331e..9b054d4ad4 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
@@ -176,7 +176,7 @@ public class RecursiveParserWrapperTest extends TikaTest {
     public void testCharLimitNoThrowOnWriteLimit() throws Exception {
         ParseContext context = new ParseContext();
         Metadata metadata = new Metadata();
-        int writeLimit = 500;
+        int writeLimit = 510;
         RecursiveParserWrapper wrapper = new RecursiveParserWrapper(AUTO_DETECT_PARSER);
         RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT,
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
index 04a00c5028..8a22855bd8 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
@@ -77,17 +77,17 @@ public class ZipParserTest extends AbstractPkgTest {
         assertContains("<div class=\"embedded\" id=\"test1.txt\" />", xml);
         assertContains("<div class=\"embedded\" id=\"test2.txt\" />", xml);

-        // Also make sure EMBEDDED_RELATIONSHIP_ID was
+        // Also make sure INTERNAL_PATH was
         // passed when parsing the embedded docs:
         ParseContext context = new ParseContext();
-        GatherRelIDsDocumentExtractor relIDs = new GatherRelIDsDocumentExtractor();
-        context.set(EmbeddedDocumentExtractor.class, relIDs);
+        GatherInternalPathsDocumentExtractor extractor = new GatherInternalPathsDocumentExtractor();
+        context.set(EmbeddedDocumentExtractor.class, extractor);
         try (TikaInputStream tis = getResourceAsStream("/test-documents/testEmbedded.zip")) {
             AUTO_DETECT_PARSER.parse(tis, new BodyContentHandler(), new Metadata(), context);
         }
-        assertTrue(relIDs.allRelIDs.contains("test1.txt"));
-        assertTrue(relIDs.allRelIDs.contains("test2.txt"));
+        assertTrue(extractor.allInternalPaths.contains("test1.txt"));
+        assertTrue(extractor.allInternalPaths.contains("test2.txt"));
     }

     @Test
@@ -123,13 +123,13 @@ public class ZipParserTest extends AbstractPkgTest {
                 results.get(4).get("X-TIKA:EXCEPTION:embedded_exception"));
     }

-    private static class GatherRelIDsDocumentExtractor implements EmbeddedDocumentExtractor {
-        public Set<String> allRelIDs = new HashSet<>();
+    private static class GatherInternalPathsDocumentExtractor implements EmbeddedDocumentExtractor {
+        public Set<String> allInternalPaths = new HashSet<>();

         public boolean shouldParseEmbedded(Metadata metadata) {
-            String relID = metadata.get(TikaCoreProperties.EMBEDDED_RELATIONSHIP_ID);
-            if (relID != null) {
-                allRelIDs.add(relID);
+            String internalPath = metadata.get(TikaCoreProperties.INTERNAL_PATH);
+            if (internalPath != null) {
+                allInternalPaths.add(internalPath);
             }
             return false;
         }
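
Illustration (not part of the commit): the revised documentation separates containment paths, which are synthesized from the nesting structure, from internal paths, which each container stores about its own entries. The sketch below models that distinction in plain Java for the ZIP / `sales.xlsx` / `image1.png` example from the docs. It is a minimal sketch under stated assumptions: `EmbeddedPathSketch`, `Doc`, and all of their members are hypothetical names invented here, and none of this is Tika's actual implementation. In Tika these values are metadata fields defined in `TikaCoreProperties`.

```java
public class EmbeddedPathSketch {

    /** Hypothetical stand-in for one parsed document; this is not a Tika class. */
    public static final class Doc {
        public final int embeddedId;      // 1-indexed for embedded docs; 0 here for the outer container
        public final String internalPath; // path as stored by the direct parent container (null for the root)
        public final Doc parent;          // containing document, or null for the root

        public Doc(int embeddedId, String internalPath, Doc parent) {
            this.embeddedId = embeddedId;
            this.internalPath = internalPath;
            this.parent = parent;
        }

        /** File name only (no folders): the last segment of the internal path. */
        public String resourceName() {
            int slash = internalPath.lastIndexOf('/');
            return slash < 0 ? internalPath : internalPath.substring(slash + 1);
        }

        /** Containment path built from IDs, e.g. "/1/2". */
        public String embeddedIdPath() {
            return parent == null ? "" : parent.embeddedIdPath() + "/" + embeddedId;
        }

        /** Synthetic path built from file names at each nesting level. */
        public String embeddedResourcePath() {
            return parent == null ? "" : parent.embeddedResourcePath() + "/" + resourceName();
        }
    }

    public static void main(String[] args) {
        Doc zip = new Doc(0, null, null);
        Doc xlsx = new Doc(1, "reports/Q1/sales.xlsx", zip);
        Doc png = new Doc(2, "xl/media/image1.png", xlsx);

        for (Doc d : new Doc[] {xlsx, png}) {
            // The internal path is reported as stored; it does not accumulate across containers.
            System.out.println(d.embeddedIdPath() + " | "
                    + d.embeddedResourcePath() + " | " + d.internalPath);
        }
    }
}
```

In this model the image's containment paths grow with nesting depth (`/1/2`, `/sales.xlsx/image1.png`), while its internal path stays `xl/media/image1.png`, matching the documented behavior that `INTERNAL_PATH` resets at each container boundary.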
