[
https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639839#comment-16639839
]
Tim Allison commented on TIKA-2750:
-----------------------------------
{{/data1}} includes: the original zips from Common Crawl contributed by
[~jnioche], the zips I downloaded from {{govdocs1}} and scientific data from
[~chrismattmann]. I propose {{rm -r}} on the original Common Crawl zips and
the {{govdocs1}} zips to free up space for sloshing data around and/or fuzzing.
> Update regression corpus
> ------------------------
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I think we've had great success with the current data on our regression
> corpus. I'd like to refresh some data from Common Crawl with three primary
> goals:
> 1) include more interesting documents (e.g. downsample English UTF-8
> text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely
> more OOXML)
> 3) identify and re-fetch truncated documents from the original site --
> Common Crawl truncates docs at 1 MB. I think some truncated documents have
> been quite useful, similar to fuzzing, for identifying serious problems with
> some of our parsers. However, it would be useful to have more complete
> files, esp. for PDFs. In short, we should keep some truncated documents, but
> I'd also like to get more complete docs.
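
As a rough illustration of goal 3, here is a minimal sketch of how likely-truncated files might be flagged by size before re-fetching. It assumes the 1 MB cutoff means 1 MiB (1,048,576 bytes) and that a file at or above that size is a truncation candidate; the function name {{likely_truncated}} and the constant are hypothetical and not part of Tika or the existing corpus tooling.

```python
from pathlib import Path

# Assumption: the 1 MB cutoff is treated as 1 MiB; adjust if the crawl used another value.
TRUNCATION_LIMIT_BYTES = 1024 * 1024


def likely_truncated(root, limit=TRUNCATION_LIMIT_BYTES):
    """Return sorted paths of files under root whose size is at or above limit."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_size >= limit
    )
```

Size alone only yields candidates: a file that hits the limit would still need to be re-fetched from the original site and compared before it is treated as truncated.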
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)