[jira] [Commented] (TIKA-3115) Detect parquet files

Nick Burch (Jira) Tue, 07 Jul 2020 22:08:22 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153235#comment-17153235
 ]


Nick Burch commented on TIKA-3115:
----------------------------------

The Avro metadata files seem to be JSON, so there's not much hope of detecting 
them without the filename. The metadata+data sample files I've located seem to 
start with {{Obj(hex)avro.schema}}. I need to dig out the source code for the 
file format to double check what's allowed, especially the hex bit near the 
start.
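
As a rough sketch of what the Avro check might look like: the Avro 
specification describes the object container header as the ASCII bytes 
{{Obj}} followed by the byte 1, so the code below assumes exactly that. The 
names {{AVRO_MAGIC}} and {{looks_like_avro_container}} are made up for 
illustration, and the writer source should still be checked before this turns 
into real magic.

```python
# Assumption: the header is ASCII "Obj" followed by the single byte 0x01,
# as described for Avro object container files. Not yet verified against
# the writer source code.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro_container(data: bytes) -> bool:
    """Return True if the buffer starts with the assumed Avro container magic."""
    return data[:len(AVRO_MAGIC)] == AVRO_MAGIC
```

A JSON-only schema file would fail this check, which matches the point above 
that those can't be told apart from other JSON without the filename.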

ORC files seem to start with {{ORC}}, followed by some binary index header 
data. Again, I need to find the writer code to double check what's 
allowed/expected before writing the magic.
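
A minimal sketch of the ORC check, assuming the three ASCII bytes {{ORC}} at 
offset 0 are all we can rely on for now. The names {{ORC_MAGIC}} and 
{{looks_like_orc}} are hypothetical, and nothing here inspects the binary data 
that follows.

```python
# Assumption: the only reliable marker is the ASCII string "ORC" at offset 0.
# What is allowed after it has not been checked against the writer code.
ORC_MAGIC = b"ORC"


def looks_like_orc(data: bytes) -> bool:
    """Return True if the buffer starts with the assumed ORC magic."""
    return data.startswith(ORC_MAGIC)
```

A three-byte prefix is a weak signal on its own, so this would probably want a 
low priority until the rest of the header is understood.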

If there's a hadoop-using volunteer, it'd be great if we could create a tiny 
few-row few-column sample dataset, and save it in all the various file formats. 
We can then use those for unit testing the detection, knowing they're small + 
complete + suitably licensed.

(The samples I checked all came from 
https://github.com/Teradata/kylo/tree/master/samples/sample-data/ but I don't 
think we can use those for unit tests.)

> Detect parquet files
> --------------------
>
>                 Key: TIKA-3115
>                 URL: https://issues.apache.org/jira/browse/TIKA-3115
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Example file on https://issues.apache.org/jira/browse/TIKA-3110
> File starts with 'PAR1' and ends with 'PAR1'...anyone happen to know the 
> actual mime magic for parquet or anything more specific than starts with 
> 'PAR1'?
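
For the Parquet case described in the issue, a minimal sketch of a 
starts-and-ends-with check might look like the following. It assumes only what 
the description says, namely that {{PAR1}} appears as the first and last four 
bytes. The names {{PARQUET_MAGIC}} and {{looks_like_parquet}} are hypothetical.

```python
import io

# Assumption taken from the issue description: the file both starts and
# ends with the four ASCII bytes "PAR1".
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(stream) -> bool:
    """Check a seekable binary stream for PAR1 at both the head and the tail."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    # Require at least 8 bytes so the two markers cannot overlap.
    if size < 2 * len(PARQUET_MAGIC):
        return False
    stream.seek(0)
    head = stream.read(len(PARQUET_MAGIC))
    stream.seek(-len(PARQUET_MAGIC), io.SEEK_END)
    tail = stream.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Checking the tail needs a seekable stream, so plain offset-0 mime magic can 
only cover the leading {{PAR1}}.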



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
