Tim Allison created TIKA-2750:
---------------------------------

             Summary: Update regression corpus
                 Key: TIKA-2750
                 URL: https://issues.apache.org/jira/browse/TIKA-2750
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


I think we've had great success with the current data in our regression corpus.
I'd like to refresh some of that data from Common Crawl, with three primary goals:

1) include more interesting documents (e.g. downsample English UTF-8 text/html)
2) include more recent documents (perhaps newer features in PDFs? definitely
more OOXML)
3) identify truncated documents and re-fetch them from the original site --
Common Crawl truncates docs at 1 MB.  I think some truncated documents have been
quite useful, similar to fuzzing, for identifying serious problems with some of
our parsers.  However, it would be useful to have more complete files, especially
for PDFs.  In short, we should keep some truncated documents, but I'd also like
to get more complete docs.
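
As a rough illustration of goals 1 and 3, here is a minimal Python sketch of how
candidate records might be filtered.  Everything in it is an assumption made for
illustration: the CrawlRecord fields, the keep_rate parameter, and the reading
of "1 MB" as 1 MiB (1,048,576 bytes) are hypothetical and not taken from any
existing Tika or Common Crawl tooling.  A size at or above the limit is only a
heuristic hint that a document was truncated.

```python
import random
from dataclasses import dataclass

# Assumption: "1 MB" is read as 1 MiB; adjust if the crawl used another limit.
TRUNCATION_LIMIT = 1024 * 1024


@dataclass(frozen=True)
class CrawlRecord:
    """Hypothetical per-document metadata extracted from a crawl index."""
    url: str
    mime: str
    charset: str
    lang: str
    length: int  # payload size in bytes as stored in the crawl


def is_likely_truncated(rec, limit=TRUNCATION_LIMIT):
    # Heuristic: a payload that reaches the size limit was probably cut off.
    return rec.length >= limit


def is_overrepresented(rec):
    # Goal 1: English UTF-8 text/html is the class to downsample.
    return (rec.mime == "text/html"
            and rec.charset.lower() == "utf-8"
            and rec.lang == "en")


def select_records(records, keep_rate=0.01, seed=0):
    """Return (kept records, URLs to re-fetch from the original site)."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    kept, refetch = [], []
    for rec in records:
        # Keep only a fraction (keep_rate) of the overrepresented class.
        if is_overrepresented(rec) and rng.random() >= keep_rate:
            continue
        kept.append(rec)  # truncated copies stay in the corpus as well
        if is_likely_truncated(rec):
            refetch.append(rec.url)  # goal 3: also try to get the full file
    return kept, refetch
```

Truncated records are both kept and queued for re-fetching, which matches the
idea of retaining some truncated documents while also collecting complete ones.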



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
