Xiaohong Yang created TIKA-3106:
-----------------------------------

             Summary: Tika Fails to detect some EML files if extension is not .eml
                 Key: TIKA-3106
                 URL: https://issues.apache.org/jira/browse/TIKA-3106
             Project: Tika
          Issue Type: Bug
          Components: metadata, mime
    Affects Versions: 1.24
            Reporter: Xiaohong Yang
         Attachments: EmlFile.txt

I have an EML file that is detected as message/rfc822 only if its file
extension is .eml; otherwise it is detected as text/plain. The following is
the code that I use to detect the file type and extension.

       // TikaConfigFactory is presumably an application-side helper (not shown)
       // that returns a TikaConfig.
       TikaConfig config = TikaConfigFactory.getTikaConfig();
       Detector detector = config.getDetector();

       // Pass the file name as a hint alongside the content.
       Metadata metadata = new Metadata();
       metadata.add(Metadata.RESOURCE_NAME_KEY, filePath);

       MimeType mimeType;
       try (TikaInputStream stream =
                TikaInputStream.get(new FileInputStream(filePath))) {
           MediaType mediaType = detector.detect(stream, metadata);
           mimeType = config.getMimeRepository().forName(mediaType.toString());
       }
       String tikaExtension = mimeType.getExtension();

When the sample file has the .eml extension, mimeType is message/rfc822 and
tikaExtension is .eml. When I change the extension to .txt, mimeType is
text/plain and tikaExtension is .txt.

The same mimeType and tikaExtension should be detected regardless of the file
extension.
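
To illustrate what extension-independent detection means here, the following
is a minimal sketch of a content-only check that looks for RFC 822 style
header lines at the start of the file. It is not Tika's implementation: the
class name Rfc822Sniffer, the header list, and the two-match threshold are all
assumptions made up for this example, and Tika's real magic rules differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

final class Rfc822Sniffer {
    // Illustrative header names only; chosen for this sketch, not taken from Tika.
    private static final List<String> HEADER_PREFIXES = List.of(
            "Received:", "Return-Path:", "Message-ID:", "From:",
            "Date:", "Subject:", "MIME-Version:");

    /** Guesses a media type from the leading bytes; the file name is never consulted. */
    static String sniff(byte[] head) {
        String text = new String(head, StandardCharsets.US_ASCII);
        int matches = 0;
        for (String line : text.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            for (String prefix : HEADER_PREFIXES) {
                // Case-insensitive comparison of the start of the line.
                if (line.regionMatches(true, 0, prefix, 0, prefix.length())) {
                    matches++;
                    break;
                }
            }
        }
        // Assumed threshold: two known headers before the first blank line.
        return matches >= 2 ? "message/rfc822" : "text/plain";
    }

    public static void main(String[] args) {
        byte[] mail = "From: a@example.com\r\nSubject: hi\r\n\r\nbody"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] note = "Just a plain note.\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(mail));
        System.out.println(sniff(note));
    }
}
```

Because sniff() only sees bytes, renaming the file from .eml to .txt cannot
change its answer, which is the behavior requested above.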


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
