[
https://issues.apache.org/jira/browse/TIKA-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763949#comment-17763949
]
Hudson commented on TIKA-4124:
------------------------------
SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1237 (See
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1237/])
TIKA-4124 -- add unit tests (#1324) (github:
[https://github.com/apache/tika/commit/4d94d20ffe599cc4a7746733d94151ea347bf250])
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testAltChunkHTML.docx
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testAltChunkMHT.docx
> embedded html of type
> http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk
> is not parsed
> ---------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-4124
> URL: https://issues.apache.org/jira/browse/TIKA-4124
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Tim Barrett
> Priority: Minor
> Fix For: 2.9.1
>
>
> Word documents that may have been created with third-party programs such as
> docx4j sometimes contain embedded HTML, which Tika does not parse. The
> embedded HTML file usually resides in the main folder of the docx
> internal structure.
> Changing the code in
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart()
> as follows handles this (see the final else if):
>
> if (POIXMLDocument.OLE_OBJECT_REL_TYPE.equals(type)
>         && TYPE_OLE_OBJECT.equals(target.getContentType())) {
>     handleEmbeddedOLE(target, xhtml, sourceDesc + rel.getId(), parentMetadata);
>     if (targetURI != null) {
>         handledTarget.add(targetURI.toString());
>     }
> } else if (RELATION_MEDIA.equals(type)
>         || RELATION_VIDEO.equals(type)
>         || RELATION_AUDIO.equals(type)
>         || PackageRelationshipTypes.IMAGE_PART.equals(type)
>         || POIXMLDocument.PACK_OBJECT_REL_TYPE.equals(type)
>         || POIXMLDocument.OLE_OBJECT_REL_TYPE.equals(type)) {
>     handleEmbeddedFile(target, xhtml, sourceDesc + rel.getId());
>     if (targetURI != null) {
>         handledTarget.add(targetURI.toString());
>     }
> } else if (XSSFRelation.VBA_MACROS.getRelation().equals(type)) {
>     handleMacros(target, xhtml);
>     if (targetURI != null) {
>         handledTarget.add(targetURI.toString());
>     }
> } else if (type.endsWith("aFChunk")) {
>     handleEmbeddedFile(target, xhtml, sourceDesc + rel.getId());
> }
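
To check whether a given docx carries such a part, its relationships XML can be inspected directly. The following is a minimal sketch, not code from Tika or POI: the class name AltChunkRels and the method altChunkTargets are hypothetical, and it assumes the relationships part (for example word/_rels/document.xml.rels) has already been read out of the package. It uses only the JDK XML parser and applies the same endsWith("aFChunk") test as the proposed change above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Tika or POI.
class AltChunkRels {

    // Suffix of the relationship type quoted in the issue title.
    static final String ALT_CHUNK_SUFFIX = "aFChunk";

    /** Returns the Target of every Relationship whose Type ends with "aFChunk". */
    static List<String> altChunkTargets(InputStream relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the input may be untrusted.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(relsXml);

        // "*" matches Relationship elements in any namespace.
        NodeList rels = doc.getElementsByTagNameNS("*", "Relationship");
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            if (rel.getAttribute("Type").endsWith(ALT_CHUNK_SUFFIX)) {
                targets.add(rel.getAttribute("Target"));
            }
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        // Made-up relationships part with one ordinary and one aFChunk entry.
        String xml =
            "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
            + "<Relationship Id=\"rId1\" Target=\"styles.xml\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles\"/>"
            + "<Relationship Id=\"rId2\" Target=\"chunk.html\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk\"/>"
            + "</Relationships>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(altChunkTargets(in));
    }
}
```

The sample XML in main is invented for the demonstration; real documents will have their own relationship ids and target names.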
--
This message was sent by Atlassian Jira
(v8.20.10#820010)