[
https://issues.apache.org/jira/browse/TIKA-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112067#comment-17112067
]
Tim Allison edited comment on TIKA-3103 at 5/20/20, 11:13 AM:
--------------------------------------------------------------
In reverse order,
{{apache-tika-12291680472524021463.tmp}} looks like one of the temp files that Tika
creates. Can you tell which file type these are? Is there a specific parser
that is failing to clean up after itself?
{{TIKA_streamstore_11144988934311367241.tmp}} does not look like a temp file
from Tika. Are you sure your application is not failing to clean these up?
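To help separate genuine leftovers from files that are still in use, here is a minimal sketch that lists {{.tmp}} files with a given prefix that are older than some threshold. The class name {{StaleTempFiles}}, the method {{findStale}} and the age threshold are hypothetical names chosen for illustration; this is not part of Tika.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: reports temp files that have
// outlived a given age and are therefore unlikely to be in active use.
public class StaleTempFiles {

    /**
     * Returns regular files in dir named prefix*.tmp whose last-modified
     * time is older than minAge.
     */
    public static List<Path> findStale(Path dir, String prefix, Duration minAge)
            throws IOException {
        Instant cutoff = Instant.now().minus(minAge);
        List<Path> stale = new ArrayList<>();
        // The second argument is a glob pattern matched against file names.
        try (DirectoryStream<Path> stream =
                     Files.newDirectoryStream(dir, prefix + "*.tmp")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)
                        && Files.getLastModifiedTime(p).toInstant().isBefore(cutoff)) {
                    stale.add(p);
                }
            }
        }
        return stale;
    }
}
```

Calling {{findStale}} once with the prefix {{apache-tika-}} and once with {{TIKA_streamstore_}}, using a threshold comfortably above your longest request, should show which family of files is actually accumulating.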
On the zombie processes, in looking at the source code, we do have a timeout on
tesseract at least, but, right, that won't trigger if the java process dies.
We do not have timeouts on the Python processing. Do you know if you have the
python image fixers installed? Are you triggering these, by chance? Wait, no,
you said that tesseract is the process that is orphaned...
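To illustrate why a timeout of this kind cannot help once the JVM is gone, here is a minimal sketch of the general pattern. It is not Tika's actual implementation; the class {{TimedProcess}} and the method {{runWithTimeout}} are hypothetical names, and {{ProcessBuilder.Redirect.DISCARD}} assumes Java 9 or later.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Tika's code: a parent-enforced timeout on an
// external command such as an OCR binary.
public class TimedProcess {

    /**
     * Runs the command and waits up to timeoutMillis for it to exit.
     * Returns true if it exited in time, false if it had to be killed.
     */
    public static boolean runWithTimeout(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        // Discard output so the child cannot block on a full pipe buffer.
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process = pb.start();
        try {
            // The timeout lives entirely in this JVM. If the JVM dies here,
            // nothing is left to kill the child and it keeps running.
            return process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            if (process.isAlive()) {
                process.destroyForcibly();
                process.waitFor(); // wait for the killed child to exit
            }
        }
    }
}
```

Because the kill happens in the parent's {{finally}} block, a parent that is itself killed or crashes never reaches it. A JVM shutdown hook would cover an orderly exit, but not a {{kill -9}} or a hard crash.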
How often are you getting restarts of the child process?
Finally, to confirm, you do want tesseract to run, right?
was (Author: [email protected]):
In reverse order,
{{apache-tika-12291680472524021463.tmp}} look like temp files that Tika
creates. Can you tell which file type these are? Is there a specific parser
that is failing to clean up after itself?
{{TIKA_streamstore_11144988934311367241.tmp}} does not look like a temp file
from Tika. Are you sure your application is cleaning these up?
On the zombie processes, in looking at the source code, we do have a timeout on
tesseract at least, but, right, that won't trigger if the java process dies.
We do not have timeouts on the Python processing. Do you know if you have the
python image fixers installed? Are you triggering these, by chance?
How often are you getting restarts of the child process?
> Tesseract fails to respect timeouts and clean up after itself
> -------------------------------------------------------------
>
> Key: TIKA-3103
> URL: https://issues.apache.org/jira/browse/TIKA-3103
> Project: Tika
> Issue Type: Bug
> Components: ocr
> Affects Versions: 1.24.1
> Reporter: Radim Rehurek
> Priority: Critical
>
> We're using the Tika Server with OCR:
> _java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_
>
> Two undesirable things happen:
> h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished.
> These processes show in _top_ as "tesseract" (OCR) and consume all CPU cores
> at 100%.
> They eventually die (or finish?) but the machine is unusable in the meantime.
> *Expected behaviour:* Tika cleans up spawned processes after itself: at most
> after its timeout limit (which is 2 minutes I believe?)
> h3. 2. The temp is full of files like:
> {{root@38acd588ee22:/# ll /tmp/}}
> {{total 197320}}
> {{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
> {{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
> {{-rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
> {{-rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
> {{-rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
> {{-rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
> {{-rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
> {{…}}
> {{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
> {{-rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
> {{-rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
> {{-rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}
> {{…}}
>
> slowly filling up the disk.
> *Expected behaviour*: Tika cleans up disk space after itself.
>
> These bugs are critical for us. What's the best way to avoid these issues?
>