All,
New reports are here:
http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz
I ran these with the most recent 1.23-SNAPSHOT on the full 500k sample.
There are a few things to look into, but nothing that leaps out to me.
Unless there are objections, I'll roll rc1 shortly.
Cheers,
Tim
On Mon, Nov 25, 2019 at 10:57 AM Tim Allison <[email protected]> wrote:
> d) is not a problem. It was caused by a bit of idiocy in my random file
> selection code that allowed for duplicate files...so the list did have 500k
> file names, but only ~270k unique file names.
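
For anyone who wants to avoid the same trap, here is a minimal sketch of drawing a random sample without duplicates. It is not the actual selection code used for the regression corpus; the class name UniqueSampler, the seed, and the file names are made up for illustration. The idea is to de-duplicate the candidate list, shuffle it, and take a prefix, so no name can be picked twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Tika or the regression tooling.
final class UniqueSampler {

    /** Returns up to n distinct entries from candidates, chosen at random. */
    static List<String> sample(List<String> candidates, int n, long seed) {
        // Drop duplicate names first so the pool itself is unique.
        List<String> pool = new ArrayList<>(new LinkedHashSet<>(candidates));
        // Shuffle, then take a prefix: this is sampling without replacement.
        Collections.shuffle(pool, new Random(seed));
        return new ArrayList<>(pool.subList(0, Math.min(n, pool.size())));
    }

    public static void main(String[] args) {
        // "a.pdf" appears twice on purpose.
        List<String> files = Arrays.asList("a.pdf", "b.doc", "a.pdf", "c.xls", "d.ppt");
        List<String> picked = sample(files, 3, 42L);
        System.out.println(picked.size() + " unique files: " + picked);
    }
}
```

Picking n random indexes independently (sampling with replacement) is what allows repeats; with 500k draws the number of unique names can fall well short of 500k.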
>
> On Mon, Nov 25, 2019 at 10:08 AM Tim Allison <[email protected]> wrote:
>
>> All,
>> I finished the regression tests, and the reports are available here:
>> http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz
>> My takeaways:
>> a) we need to fix the new code in the PDFParser that sets whether or
>> not there is a digital signature. That should use set, not add.
>> b) we are getting a few new exceptions on going over the safety maximum
>> for byte array allocation in POI. We can make this configurable at the
>> Tika level.
>> c) there are a few new problems with EMF parsing, but these won't harm
>> parsing the rest of the file.
>> d) both runs (1.22 and 1.23-pre-rc1) only processed ~250k files, but
>> there were ~500k in the list...I need to figure out what went wrong.
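
To illustrate a), here is a toy sketch of the difference between set and add on a multi-valued metadata store. ToyMetadata is a stand-in written for this example, not Tika's Metadata class, and the "has-signature" key is a made-up name. The point is only that add appends another value each time it is called, while set leaves exactly one value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a multi-valued metadata object (hypothetical).
final class ToyMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    /** Appends a value, keeping any values already stored under name. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces whatever is stored under name with a single value. */
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    List<String> getValues(String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        ToyMetadata added = new ToyMetadata();
        added.add("has-signature", "false");
        added.add("has-signature", "true");   // both values are kept

        ToyMetadata set = new ToyMetadata();
        set.set("has-signature", "false");
        set.set("has-signature", "true");     // only the last value is kept

        System.out.println("add: " + added.getValues("has-signature"));
        System.out.println("set: " + set.getValues("has-signature"));
    }
}
```

For a yes/no flag, a single value is what consumers expect, which is why set is the better fit here.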
>>
>> If I find nothing concerning on d), are we ready to roll 1.23-rc1?
>>
>> Cheers,
>>
>> Tim
>>
>> On Fri, Nov 22, 2019 at 8:25 AM Tim Allison <[email protected]> wrote:
>>
>>> All,
>>> I started the regression tests on a random set of 500k files. I found
>>> this morning that they were _still_ going. It turns out I had accidentally
>>> configured image extraction for PDFs, which adds to the processing time
>>> and leads to more OOMs.
>>> I restarted the regression tests this morning with that feature turned
>>> off.
>>>
>>> Best,
>>>
>>> Tim
>>>
>>