[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671472#comment-16671472 ]
Tim Allison commented on TIKA-2750:
-----------------------------------
I'd like to remove "boring" and/or basically duplicative documents from our
regression corpus. We already effectively remove exact dupes because we store
files by their hashes.
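As a rough illustration of why hash-based storage removes exact duplicates, here is a minimal sketch. The function name, the flat directory layout, and the choice of SHA-256 are assumptions made for the example; they do not describe how the regression corpus is actually laid out.

```python
import hashlib
from pathlib import Path


def store_by_hash(data: bytes, corpus_dir: Path) -> Path:
    """Store data under its SHA-256 hex digest (hypothetical layout).

    Byte-identical documents produce the same digest, so they map to the
    same file name and only one copy is kept.
    """
    digest = hashlib.sha256(data).hexdigest()
    target = corpus_dir / digest
    if not target.exists():  # an exact duplicate is already stored
        corpus_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target
```

This only catches byte-for-byte duplicates; two documents that differ in a single byte still get separate entries, which is why near-duplicates need a different approach.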
Unfortunately, I don't have a great definition of "boring" (aside from
ASCII/UTF-8 English text files), and I recognize that what is "boring" today may
not be "boring" tomorrow if a given document contains a feature that our parsers
ignore at the moment.
Ideally, if two documents exercise the same lines of code, I'd want to remove
one of them.
Could we use JaCoCo or something similar to identify documents that exercise
similar code paths? Or, more generally, can we measure coverage of our code
base for a given set of documents fairly easily?
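To make the idea concrete, here is a minimal sketch of coverage-based pruning. It assumes we can already obtain, for each document, the set of lines it exercises. How that data would be collected (for example, one coverage session per document) is left open, and the `"Class:line"` identifiers and the function name are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical input: document id -> set of covered "Class:line" identifiers.
Coverage = Dict[str, FrozenSet[str]]


def select_by_coverage(coverage: Coverage) -> List[str]:
    """Greedy pass: keep a document only if it covers at least one new line."""
    kept: List[str] = []
    seen: Set[str] = set()
    # Visit documents with the most coverage first; ties are broken by id
    # so that the result is deterministic.
    for doc_id in sorted(coverage, key=lambda d: (-len(coverage[d]), d)):
        new_lines = coverage[doc_id] - seen
        if new_lines:
            kept.append(doc_id)
            seen |= new_lines
    return kept


coverage = {
    "a.pdf": frozenset({"PDFParser:10", "PDFParser:11"}),
    "b.pdf": frozenset({"PDFParser:10", "PDFParser:11"}),  # same lines as a.pdf
    "c.pdf": frozenset({"PDFParser:10", "PDFParser:42"}),
    "d.txt": frozenset({"TXTParser:5"}),
}
print(select_by_coverage(coverage))  # ['a.pdf', 'c.pdf', 'd.txt']
```

Note that this greedy pass is more aggressive than "same lines of code": it also drops a document whose lines are all covered by the union of documents kept before it. Line coverage is also only a proxy, since two documents can hit the same lines with very different data.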
I have zero experience in this realm and welcome input!
> Update regression corpus
> ------------------------
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: CC-MAIN-2018-39-mimes-charsets-by-tld.zip
>
>
> I think we've had great success with the current data in our regression
> corpus. I'd like to refresh some of the data from Common Crawl with three
> primary goals:
> 1) include more interesting documents (e.g. downsample English UTF-8
> text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely
> more OOXML)
> 3) identify and re-fetch truncated documents from the original site --
> Common Crawl truncates docs at 1 MB. I think some truncated documents have
> been quite useful, similar to fuzzing, for identifying serious problems with
> some of our parsers. However, it would be useful to have more complete
> files, especially for PDFs. In short, we should keep some truncated
> documents, but I'd also like to get more complete docs.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)