All,

New reports are here:
http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz

I ran these with the most recent 1.23-SNAPSHOT on the full 500k sample.
There are a few things to look into, but nothing that leaps out to me.

Unless there are objections, I'll roll rc1 shortly.

Cheers,

      Tim

On Mon, Nov 25, 2019 at 10:57 AM Tim Allison <[email protected]> wrote:

> d) is not a problem.  It was caused by a bit of idiocy in my random file
> selection code that allowed for duplicate files...so the list did have 500k
> file names, but only ~270k unique file names.
>
> On Mon, Nov 25, 2019 at 10:08 AM Tim Allison <[email protected]> wrote:
>
>> All,
>>   I finished the regression tests, and the reports are available here:
>> http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz
>>   My takeaways:
>>   a) we need to fix the new code in the PDFParser that sets whether or
>> not there is a digital signature.  That should be a set, not an add.
>>   b) we are getting a few new exceptions on going over the safety maximum
>> for byte array allocation in POI.  We can make this configurable at the
>> Tika level.
>>   c) there are a few new problems with EMF parsing, but these won't harm
>> parsing the rest of the file.
>>   d) both runs (1.22 and 1.23-pre-rc1) only processed ~250k files, but
>> there were ~500k in the list...I need to figure out what went wrong.
>>
>>   If I find nothing concerning on d), are we ready to roll 1.23-rc1?
>>
>>               Cheers,
>>
>>                            Tim
>>
>> On Fri, Nov 22, 2019 at 8:25 AM Tim Allison <[email protected]> wrote:
>>
>>> All,
>>>   I started the regression tests on a random set of 500k files.  I found
>>> this morning that it was _still_ going.  It turns out I had accidentally
>>> enabled image extraction for PDFs, which adds to the processing time and
>>> leads to more OOMs.
>>>   I restarted the regression tests this morning with that feature turned
>>> off.
>>>
>>>        Best,
>>>
>>>                    Tim
>>>
>>
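
For reference, the duplicate-file issue described under d) in the thread above (500k names in the list, but only ~270k unique) is what you get when files are drawn with replacement. Below is a minimal sketch of drawing a sample without duplicates. The actual selection code is not shown in this thread, so the function name `sample_unique`, its parameters, and the example file names are all hypothetical.

```python
import random


def sample_unique(paths, k, seed=None):
    """Return k distinct paths chosen uniformly at random (hypothetical helper)."""
    # Drop any duplicates already present in the input list; sorting makes
    # the result reproducible for a given seed.
    unique = sorted(set(paths))
    if k > len(unique):
        raise ValueError(
            "requested %d files but only %d unique paths exist" % (k, len(unique))
        )
    rng = random.Random(seed)
    # random.sample draws without replacement, so no path can appear twice.
    return rng.sample(unique, k)


if __name__ == "__main__":
    # Made-up corpus listing that contains some repeated entries.
    corpus = ["file_%d.pdf" % i for i in range(1000)] + ["file_0.pdf"] * 10
    picked = sample_unique(corpus, 500, seed=42)
    # Both counts are equal when the sample has no duplicates.
    print(len(picked), len(set(picked)))
```

Comparing `len(picked)` with `len(set(picked))` before starting a long run is a cheap way to catch this kind of problem early.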
