[ https://issues.apache.org/jira/browse/TIKA-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112067#comment-17112067 ]

Tim Allison commented on TIKA-3103:
-----------------------------------

In reverse order: files like 
{{apache-tika-12291680472524021463.tmp}} look like temp files that Tika 
creates.  Can you tell which file types these are?  Is there a specific parser 
that is failing to clean up after itself?
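
If it helps with that question, you can run Tika's own detection against the 
leftover files; detection sniffs the magic bytes, so the {{.tmp}} extension 
doesn't matter.  A rough sketch with the Tika facade (the {{/tmp}} path and 
class name are just examples):

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;

public class DetectLeftoverTempFiles {
    public static void main(String[] args) throws IOException {
        Tika tika = new Tika();
        File tmpDir = new File("/tmp");
        File[] leftovers = tmpDir.listFiles((dir, name) -> name.startsWith("apache-tika-"));
        if (leftovers == null) {
            return;
        }
        for (File f : leftovers) {
            // detect() reads the bytes, so it sees through the .tmp suffix
            System.out.println(f.getName() + " -> " + tika.detect(f));
        }
    }
}
{code}

That should tell us which parser(s) to look at.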

{{TIKA_streamstore_11144988934311367241.tmp}} does not look like a temp file 
from Tika.  Are you sure your application is cleaning these up?
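
For what it's worth, if those come from your own spooling code, the pattern we 
use on our side is to tie the temp file's lifetime to the stream, so closing 
the stream deletes the file.  A minimal sketch with {{TikaInputStream}} (the 
input file name is just an example):

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.io.TikaInputStream;

public class SpoolAndCleanUp {
    public static void main(String[] args) throws Exception {
        // try-with-resources guarantees the spooled temp file is deleted on close
        try (InputStream raw = Files.newInputStream(Paths.get("input.bin"));
             TikaInputStream tis = TikaInputStream.get(raw)) {
            // getFile() spools the stream to an apache-tika-*.tmp file on demand
            System.out.println("spooled to: " + tis.getFile());
        } // the temp file is removed here, when tis is closed
    }
}
{code}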

On the zombie processes: looking at the source code, we do have a timeout on 
tesseract at least, but, right, that won't trigger if the Java process dies.  
We do not have timeouts on the Python processing.  Do you know if you have the 
Python image fixers installed?  Are you triggering those, by chance?
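
For reference, the tesseract timeout is configurable.  When using Tika as a 
library, something like this should work (a sketch; 60 seconds is an arbitrary 
example, the default is 120):

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;

public class OcrTimeoutSketch {
    public static void main(String[] args) throws Exception {
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setTimeout(60); // kill tesseract after 60 seconds (default is 120)

        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, ocrConfig);

        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream is = Files.newInputStream(Paths.get("scanned.pdf"))) {
            parser.parse(is, new BodyContentHandler(-1), new Metadata(), context);
        }
    }
}
{code}

With tika-server, I believe you can set the same values per request via the 
{{X-Tika-OCR}}* headers, e.g. {{X-Tika-OCRTimeout: 60}}.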

How often are you getting restarts of the child process?
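
Also, with {{-spawnChild}} the parent process watches the child and restarts 
it if a task or ping times out; you can tune those thresholds on the command 
line, e.g. something like:

{code}
java -jar tika-server-1.24.1.jar -p 9998 -spawnChild \
    -taskTimeoutMillis 120000 -pingTimeoutMillis 30000 -JXmx500m
{code}

(Flag names are from memory; the {{-h}} output has the full list.)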

> Tesseract fails to respect timeouts and clean up after itself
> -------------------------------------------------------------
>
>                 Key: TIKA-3103
>                 URL: https://issues.apache.org/jira/browse/TIKA-3103
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>    Affects Versions: 1.24.1
>            Reporter: Radim Rehurek
>            Priority: Critical
>
> We're using the Tika Server with OCR:
> _java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_
>  
> Two undesirable things happen:
> h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests 
> should have finished.
> These processes show in _top_ as "tesseract" (OCR) and consume all CPU cores 
> at 100%.
> They eventually die (or finish?), but the machine is unusable in the meantime.
> *Expected behaviour:* Tika cleans up spawned processes after itself, at the 
> latest after its timeout limit (which I believe is 2 minutes?).
> h3. 2. The temp directory is full of files like:
> {{root@38acd588ee22:/# ll /tmp/}}
>  {{total 197320}}
>  {{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
>  {{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
>  {{-rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
>  {{-rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
>  {{-rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
>  {{-rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
>  {{-rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
>  {{…}}
>  {{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
>  {{-rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
>  {{-rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
>  {{-rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}
> {{…}}
>  
> slowly filling up the disk.
> *Expected behaviour*: Tika cleans up disk space after itself.
>  
> These bugs are critical for us. What's the best way to avoid these issues?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
