All,
I finished the regression tests, and the reports are available here:
http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz
My takeaways:
a) we need to fix the new code in the PDFParser that sets whether or not
there is a digital signature. That call should be set, not add.
b) we are getting a few new exceptions from exceeding the safety maximum
for byte array allocation in POI. We can make this limit configurable at
the Tika level.
c) there are a few new problems with EMF parsing, but these won't harm
parsing the rest of the file.
d) both runs (1.22 and 1.23-pre-rc1) only processed ~250k files, but
there were ~500k in the list. I need to figure out what went wrong.
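
To illustrate the set-vs-add distinction in a), here is a minimal sketch. It
does not use Tika's Metadata class; MultiValuedMetadata and the key name
"hasSignature" are hypothetical stand-ins, and the assumption is only that
add appends a value under a key while set replaces what is there.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata store (not Tika's Metadata).
class MultiValuedMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // add: appends another value under the key, keeping any earlier ones.
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // set: replaces whatever was stored under the key with a single value.
    void set(String key, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(key, single);
    }

    // Returns all values stored under the key, or an empty list if none.
    List<String> getValues(String key) {
        return values.getOrDefault(key, new ArrayList<>());
    }

    public static void main(String[] args) {
        String key = "hasSignature"; // hypothetical key name

        // With add, a second write leaves two values behind.
        MultiValuedMetadata added = new MultiValuedMetadata();
        added.add(key, "false");
        added.add(key, "true");
        System.out.println(added.getValues(key)); // prints [false, true]

        // With set, the second write overwrites the first.
        MultiValuedMetadata replaced = new MultiValuedMetadata();
        replaced.set(key, "false");
        replaced.set(key, "true");
        System.out.println(replaced.getValues(key)); // prints [true]
    }
}
```

For a yes/no flag like "is there a digital signature", a single value per
document is what a consumer would expect, which is why set looks like the
right call there.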
If I find nothing concerning on d), are we ready to roll 1.23-rc1?
Cheers,
Tim
On Fri, Nov 22, 2019 at 8:25 AM Tim Allison <[email protected]> wrote:
> All,
> I started the regression tests on a random set of 500k files. I found
> this morning that it was _still_ going. It turns out I had accidentally
> configured extract images for PDFs, which adds to the processing time and
> leads to more OOMs.
> I restarted the regression tests this morning with that feature turned
> off.
>
> Best,
>
> Tim
>