[ https://issues.apache.org/jira/browse/TIKA-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132737#comment-17132737 ]
Kenneth William Krugler commented on TIKA-3109:
-----------------------------------------------
I think we have to treat the {{srcdoc}} content as an embedded document. It is
basically a full HTML document (an optional {{DOCTYPE}} followed by the
{{<html>}} element), so we'll want a full HTML parse, with all of the
complicated logic that tries to handle edge cases. See
[https://www.w3.org/TR/2010/WD-html5-20101019/author/the-iframe-element.html#attr-iframe-srcdoc]
for more details.
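
To illustrate the idea (this is not Tika's implementation, only a rough sketch):
the snippet below uses Python's standard-library {{html.parser}} as a stand-in
for a full HTML parser. It collects the value of each iframe {{srcdoc}}
attribute and runs a complete, separate parse over it, the way an embedded
document would be handled. The names {{TextAndSrcdocCollector}},
{{extract_text}} and {{max_depth}} are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser


class TextAndSrcdocCollector(HTMLParser):
    """Collects visible text and the srcdoc value of each iframe."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.srcdocs = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "iframe":
            # html.parser has already unescaped entities in attribute values,
            # so srcdoc holds the markup of the nested document.
            srcdoc = dict(attrs).get("srcdoc")
            if srcdoc:
                self.srcdocs.append(srcdoc)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_text(html, max_depth=5):
    """Return the text of html followed by the text of nested srcdoc documents.

    max_depth is an assumed safety limit on how deeply iframes may nest.
    """
    parser = TextAndSrcdocCollector()
    parser.feed(html)
    parser.close()
    parts = list(parser.text_parts)
    if max_depth > 0:
        for embedded in parser.srcdocs:
            # Each srcdoc gets its own full parse, like an embedded document.
            parts.append(extract_text(embedded, max_depth - 1))
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    sample = (
        "<html><body><p>Outer text</p>"
        '<iframe srcdoc="&lt;!DOCTYPE html&gt;&lt;html&gt;&lt;body&gt;'
        '&lt;p&gt;logarithmic scale&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;">'
        "</iframe></body></html>"
    )
    print(extract_text(sample))
```

Running a separate parse per {{srcdoc}} value keeps the nested document's text
apart from the outer page, and the depth limit guards against iframes nested
inside iframes.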
> Ingest attachment: failed to extract text from iframe
> -----------------------------------------------------
>
> Key: TIKA-3109
> URL: https://issues.apache.org/jira/browse/TIKA-3109
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.22
> Environment: * Apache Tika 1.22
> * {{Java}}
> {{java 13.0.2 2020-01-14}}
> * {{Ubuntu 18.04.1 LTS}}
> {{Linux XXXXX 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020
> x86_64 x86_64 x86_64 GNU/Linux}}
> Reporter: Younes
> Priority: Major
>
> This standalone
> [HTML|https://github.com/elastic/elasticsearch/files/4757855/c0711285-8ab7-46c3-b730-7c0639466537.html.zip]
> page has all of its CSS, JS, and images embedded.
> After indexing it with Elasticsearch, we searched for the keyword
> *logarithmic*, which exists in the page, but could not find it.
> [~dadoonet] was able to reproduce the issue, which is fully described in
> [elasticsearch|https://github.com/elastic/elasticsearch/issues/57924].