Xiaohong Yang created TIKA-3106:
-----------------------------------
Summary: Tika Fails to detect some EML files if extension is not .eml
Key: TIKA-3106
URL: https://issues.apache.org/jira/browse/TIKA-3106
Project: Tika
Issue Type: Bug
Components: metadata, mime
Affects Versions: 1.24
Reporter: Xiaohong Yang
Attachments: EmlFile.txt
I have an EML file that is detected as message/rfc822 only if the file
extension is .eml; otherwise it is detected as text/plain. The following is
the code that I use to detect the file type and extension.
TikaConfig config = TikaConfigFactory.getTikaConfig();
Detector detector = config.getDetector();
Metadata metadata = new Metadata();
TikaInputStream stream = TikaInputStream.get(fis = new FileInputStream(filePath));
metadata.add(Metadata.RESOURCE_NAME_KEY, filePath);
MediaType mediaType = detector.detect(stream, metadata);
MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
String tikaExtension = mimeType.getExtension();
When the sample file has the .eml extension, mimeType is message/rfc822 and
tikaExtension is .eml. When I change the extension to .txt, mimeType is
text/plain and tikaExtension is .txt.
The same mimeType and tikaExtension should be detected regardless of the file
extension.
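
To illustrate what extension-independent detection could look like, here is a
minimal sketch that inspects only the leading bytes of the content. It is not
Tika's actual detection logic: the class Rfc822Sniffer, its methods, the list
of header names, and the threshold of two distinct headers are all
hypothetical and chosen for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika: guesses message/rfc822 from content alone.
final class Rfc822Sniffer {

    // Assumed list of header names that commonly open an e-mail message.
    private static final Set<String> KNOWN_HEADERS = new HashSet<>(Arrays.asList(
            "from", "to", "subject", "date", "received", "message-id",
            "mime-version", "return-path", "delivered-to", "content-type"));

    // Returns true if at least minDistinct different known header fields
    // appear before the first blank line of the given text.
    static boolean looksLikeRfc822(String head, int minDistinct) {
        Set<String> seen = new HashSet<>();
        for (String line : head.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // a blank line ends the header section
            }
            if (line.charAt(0) == ' ' || line.charAt(0) == '\t') {
                continue; // folded continuation of the previous header
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                return false; // not a "Name: value" line, so not a header block
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (KNOWN_HEADERS.contains(name)) {
                seen.add(name);
            }
        }
        return seen.size() >= minDistinct;
    }

    // Maps the heuristic to a media type string; the file name is never consulted.
    static String guessType(String head) {
        return looksLikeRfc822(head, 2) ? "message/rfc822" : "text/plain";
    }

    public static void main(String[] args) {
        String mail = "From: a@example.com\r\nTo: b@example.com\r\nSubject: hi\r\n\r\nBody";
        System.out.println(guessType(mail));                               // message/rfc822
        System.out.println(guessType("Just some notes.\nNothing else.")); // text/plain
    }
}
```

Because the sketch never looks at the file name, renaming the sample from .eml
to .txt would not change its answer, which is the behavior requested above.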
--
This message was sent by Atlassian Jira
(v8.3.4#803005)