[
https://issues.apache.org/jira/browse/TIKA-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153235#comment-17153235
]
Nick Burch commented on TIKA-3115:
----------------------------------
The Avro metadata files seem to be JSON, so there's not much hope of detecting
those without the filename. The metadata+data sample files I've located seem to
start with {{Obj(hex)avro.schema}}. I need to dig out the source code for the
file format to double check what's allowed, especially the hex bit near the
start.
ORC files seem to start with {{ORC}}, then some binary index header stuff.
Again, I need to find the writer code to double check what's allowed/expected
before writing the magic.
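To make the two candidate signatures above concrete, here is a rough Python sketch of a prefix check. It is only an illustration, not Tika code: the function name and labels are made up, and the "hex bit" after {{Obj}} is assumed to be a single 0x01 version byte, which still needs confirming against the Avro writer code.

```python
from typing import Optional

# Assumed leading signatures; both should be verified against the writer code.
AVRO_MAGIC = b"Obj\x01"  # "Obj" plus one byte, assumed here to be 0x01
ORC_MAGIC = b"ORC"       # three ASCII bytes at offset 0


def sniff_prefix(head: bytes) -> Optional[str]:
    """Return a hypothetical format label from the leading bytes, or None."""
    if head.startswith(AVRO_MAGIC):
        return "avro"
    if head.startswith(ORC_MAGIC):
        return "orc"
    # JSON-only Avro schema files have no binary signature, so they land here.
    return None
```

Passing the first few bytes of a file is enough, e.g. {{sniff_prefix(open(path, "rb").read(8))}}.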
If there's a hadoop-using volunteer, it'd be great if we could create a tiny
few-row few-column sample dataset, and save it in all the various file formats.
We can then use those for unit testing the detection, knowing they're small +
complete + suitably licensed.
(The samples I checked all came from
https://github.com/Teradata/kylo/tree/master/samples/sample-data/ but I don't
think we can use those for unit tests.)
> Detect parquet files
> --------------------
>
> Key: TIKA-3115
> URL: https://issues.apache.org/jira/browse/TIKA-3115
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Example file on https://issues.apache.org/jira/browse/TIKA-3110
> File starts with 'PAR1' and ends with 'PAR1'...anyone happen to know the
> actual mime magic for parquet or anything more specific than starts with
> 'PAR1'?
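The "starts with 'PAR1' and ends with 'PAR1'" observation in the issue can be sketched as the Python check below. This is a minimal illustration under the assumption that nothing more specific than the two four-byte markers is known; the function name is hypothetical, and the trailer test needs a seekable stream, so it may not map directly onto a prefix-only magic rule.

```python
import io
from typing import BinaryIO

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(stream: BinaryIO) -> bool:
    """Check that a seekable binary stream both starts and ends with 'PAR1'."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    # Two separate 4-byte markers cannot fit in fewer than 8 bytes.
    if size < 2 * len(PARQUET_MAGIC):
        return False
    stream.seek(0)
    head = stream.read(len(PARQUET_MAGIC))
    stream.seek(-len(PARQUET_MAGIC), io.SEEK_END)
    tail = stream.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A matching header and trailer is only a heuristic; it does not validate anything in between.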