[ 
https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671472#comment-16671472
 ] 

Tim Allison commented on TIKA-2750:
-----------------------------------

I'd like to remove "boring" or near-duplicate documents from our 
regression corpus.  We already effectively remove exact duplicates because we 
store files by their hashes.
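
For illustration, a minimal sketch of why storing files by hash removes exact duplicates. The class name `HashStore`, the choice of SHA-256, and the flat directory layout are assumptions made for this example; the comment does not say which hash or layout the corpus actually uses.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Tika: stores each file under the hex digest of its bytes. */
class HashStore {

    /** Returns the lowercase hex SHA-256 digest of the given bytes (hash choice is an assumption). */
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Writes the content to dir/<digest> unless a file with that name already exists. */
    static Path store(Path dir, byte[] content) throws IOException {
        Path target = dir.resolve(sha256Hex(content));
        if (!Files.exists(target)) {
            Files.write(target, content);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("corpus");
        byte[] doc = "same bytes".getBytes(StandardCharsets.UTF_8);
        store(dir, doc);
        store(dir, doc); // second copy maps to the same file name, so nothing new is written
        try (Stream<Path> files = Files.list(dir)) {
            System.out.println("files stored: " + files.count());
        }
    }
}
```

Byte-identical files collapse to one entry, but two files that differ by a single byte get unrelated names, which is why this does nothing for near-duplicates.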

Unfortunately, I don't have a great definition of boring (aside from 
ASCII/UTF-8 English text files), and I recognize that "boring" today may not be 
"boring" tomorrow if a given document contains a feature that our parsers 
ignore at the moment.

Ideally, if two documents exercise the same lines of code, I'd want to remove 
one of them.

Could we use JaCoCo or something similar to identify documents that exercise 
similar code paths?  Or, more generally, can we measure coverage of our code 
base for a given set of documents fairly easily?
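
A minimal sketch of the selection step, assuming per-document coverage has already been collected somehow (for example, by running each document through the parsers under a coverage agent and dumping the covered lines). The class `CoverageDedup`, the `"Class:line"` string format, and the similarity threshold are all hypothetical; nothing here calls the JaCoCo API, and collecting the coverage is the part this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of Tika or JaCoCo. */
class CoverageDedup {

    /** Jaccard similarity between two sets of covered-line identifiers. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two documents that cover nothing are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /**
     * Greedy pass in map iteration order: a document is kept unless its coverage is
     * at least `threshold` similar to a document that was already kept.
     * A threshold of 1.0 drops only documents with exactly the same covered lines.
     */
    static List<String> selectRepresentatives(Map<String, Set<String>> coverageByDoc,
                                              double threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Set<String>> candidate : coverageByDoc.entrySet()) {
            boolean redundant = false;
            for (String keptDoc : kept) {
                if (jaccard(candidate.getValue(), coverageByDoc.get(keptDoc)) >= threshold) {
                    redundant = true;
                    break;
                }
            }
            if (!redundant) {
                kept.add(candidate.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up coverage data, purely for illustration.
        Map<String, Set<String>> coverage = new LinkedHashMap<>();
        coverage.put("a.pdf", Set.of("PDFParser:10", "PDFParser:11"));
        coverage.put("b.pdf", Set.of("PDFParser:10", "PDFParser:11"));
        coverage.put("c.docx", Set.of("OOXMLParser:5"));
        System.out.println(selectRepresentatives(coverage, 1.0));
    }
}
```

The greedy pass compares each candidate against every kept document, so it is quadratic in the worst case and its result depends on iteration order. Line-level coverage also says nothing about which branches or data values were exercised, so two documents with identical covered lines can still trigger different behavior.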

I have zero experience in this realm and welcome input!

> Update regression corpus
> ------------------------
>
>                 Key: TIKA-2750
>                 URL: https://issues.apache.org/jira/browse/TIKA-2750
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: CC-MAIN-2018-39-mimes-charsets-by-tld.zip
>
>
> I think we've had great success with the current data on our regression 
> corpus.  I'd like to refresh some data from Common Crawl with three primary 
> goals:
> 1) include more interesting documents (e.g. downsample English UTF-8 
> text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely 
> more OOXML)
> 3) identify and re-fetch truncated documents from the original site -- 
> Common Crawl truncates docs at 1 MB.  I think some truncated documents have 
> been quite useful, similar to fuzzing, for identifying serious problems with 
> some of our parsers.  However, it would be useful to have more complete 
> files, especially for PDFs.  In short, we should keep some truncated 
> documents, but I'd also like to get more complete docs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
