d) is not a problem.  It was caused by a bit of idiocy in my random file
selection code that allowed for duplicate files...so the list did have 500k
file names, but only ~270k unique file names.
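
A minimal sketch of the kind of bug described above, not the actual selection code: drawing each name independently allows repeats, while sampling without replacement does not. The function names and the toy file list are hypothetical, chosen only for illustration.

```python
import random


def pick_with_duplicates(paths, k, seed=0):
    """Draw k names independently; the same name can be drawn more than once."""
    rng = random.Random(seed)
    return [rng.choice(paths) for _ in range(k)]


def pick_unique(paths, k, seed=0):
    """Draw k distinct names by sampling without replacement."""
    rng = random.Random(seed)
    # De-duplicate the candidates first so repeated input entries cannot leak through.
    candidates = sorted(set(paths))
    return rng.sample(candidates, k)


# Toy stand-in for the real corpus listing (hypothetical names).
paths = [f"file_{i}.bin" for i in range(1000)]

dupes = pick_with_duplicates(paths, 500)
unique = pick_unique(paths, 500)

# The first list has 500 entries but fewer distinct names; the second has 500 of each.
print(len(dupes), len(set(dupes)))
print(len(unique), len(set(unique)))
```

Comparing `len(selection)` with `len(set(selection))` before a run starts is a cheap way to catch this kind of mismatch early.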

On Mon, Nov 25, 2019 at 10:08 AM Tim Allison <[email protected]> wrote:

> All,
>   I finished the regression tests, and the reports are available here:
> http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz
>   My takeaways:
>   a) we need to fix the new code in the PDFParser that sets whether or
> not there is a digital signature.  That value should be set, not added.
>   b) we are getting a few new exceptions on going over the safety maximum
> for byte array allocation in POI.  We can make this configurable at the
> Tika level.
>   c) there are a few new problems with EMF parsing, but these won't harm
> parsing the rest of the file.
>   d) both runs (1.22 and 1.23-pre-rc1) only processed ~250k files, but
> there were ~500k in the list...I need to figure out what went wrong.
>
>   If I find nothing concerning on d), are we ready to roll 1.23-rc1?
>
>               Cheers,
>
>                            Tim
>
> On Fri, Nov 22, 2019 at 8:25 AM Tim Allison <[email protected]> wrote:
>
>> All,
>>   I started the regression tests on a random set of 500k files.  I found
>> this morning that it was _still_ going.  It turns out I had accidentally
>> configured image extraction for PDFs, which adds to the processing time and
>> leads to more OOMs.
>>   I restarted the regression tests this morning with that feature turned
>> off.
>>
>>        Best,
>>
>>                    Tim
>>
>
