[ https://issues.apache.org/jira/browse/TIKA-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Radim Rehurek updated TIKA-3103:
--------------------------------
Description:
We're using the Tika Server with OCR:
_java -jar /pii_tools/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_
This used to work fine in previous versions (1.22, without _-spawnChild_).
But after upgrading to 1.24.1 with _-spawnChild_, two undesirable things
happen:
h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished.
These (zombie?) processes show in _top_ as Tesseract and consume all CPU cores
at 100%.
They eventually die, but the machine is unusable in the meantime.
*Expected behaviour:* Tika cleans up spawned processes after itself: at most
after its timeout limit (which is 2 minutes I believe?)
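
To make the expected behaviour concrete, here is a minimal Python sketch of a hard timeout around an external command. It is only an illustration, not how Tika launches Tesseract: the helper name `run_with_timeout` and its `timeout_s` parameter are hypothetical, and the process-group handling assumes a POSIX system.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it had to be killed."""
    # start_new_session=True gives the child its own process group (POSIX only),
    # so anything it spawns can be signalled together with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # Kill the whole group, not just the direct child.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited on its own in the meantime
        proc.wait()  # reap the child so it does not linger as a zombie
        return None
```

With a wrapper of this kind, a stuck OCR process would be gone shortly after the timeout instead of occupying the CPU for many minutes.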
h3. 2. The temp is full of files like:
_-rw------- 1 root root 0 May 20 08:47 /tmp/apache-tika-6183308518561170276.tmp_
_-rw-r--r-- 1 root root 140 May 20 08:48 /tmp/apache-tika-6183308518561170276.tmp.txt_
_-rw-r--r-- 1 root root 208416 May 20 08:53 /tmp/apache-tika-6262109250322677208.tmp_
_-rw-r--r-- 1 root root 399550 May 20 08:49 /tmp/apache-tika-6358810719289028940.tmp_
_-rw------- 1 root root 0 May 20 08:55 /tmp/apache-tika-6452032540225217628.tmp_
_-rw-r--r-- 1 root root 368 May 20 09:02 /tmp/apache-tika-6452032540225217628.tmp.txt_
_-rw------- 1 root root 0 May 20 08:46 /tmp/apache-tika-6874839592996549275.tmp_
_-rw-r--r-- 1 root root 3700 May 20 08:48 /tmp/apache-tika-6874839592996549275.tmp.txt_
slowly filling up the disk.
*Expected behaviour*: Tika cleans up disk space after itself.
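
Until the leak is fixed, a periodic sweep of the temp directory might serve as a stop-gap. The following Python sketch is an assumption-laden illustration, not part of Tika: the function name `remove_stale_tika_temp_files`, the one-hour default age, and the `apache-tika-*.tmp*` glob (inferred from the file names listed above) are all hypothetical choices.

```python
import glob
import os
import time


def remove_stale_tika_temp_files(tmp_dir="/tmp", max_age_s=3600):
    """Delete apache-tika-*.tmp* files older than max_age_s; return their paths."""
    now = time.time()
    removed = []
    # The pattern matches both the .tmp files and their .tmp.txt companions.
    for path in sorted(glob.glob(os.path.join(tmp_dir, "apache-tika-*.tmp*"))):
        try:
            # Only touch files that have not been modified recently, so files
            # belonging to requests still in flight are left alone.
            if now - os.path.getmtime(path) > max_age_s:
                os.remove(path)
                removed.append(path)
        except OSError:
            continue  # file vanished or is not removable; skip it
    return removed
```

The age threshold should comfortably exceed the longest expected request, otherwise the sweep could delete files that are still in use.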
These bugs are critical for us, so we had to revert to 1.22. What's the best
way to avoid these issues?
was:
We're using the Tika Server with OCR:
_java -jar /pii_tools/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_
This used to work fine in previous versions (1.22, without _-spawnChild_).
But after upgrading to 1.24.1 with _-spawnChild_, two undesirable things
happen:
# The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished. The processes show in _top_ as Tesseract.
They eventually die, but the machine is unusable in the meantime.
*Expected behaviour:* Tika cleans up after itself: at most after its timeout
limit (which is 2 minutes I believe?)
# The temp is full of files like:
_-rw------- 1 root root 0 May 20 08:47 /tmp/apache-tika-6183308518561170276.tmp_
_-rw-r--r-- 1 root root 140 May 20 08:48 /tmp/apache-tika-6183308518561170276.tmp.txt_
_-rw-r--r-- 1 root root 208416 May 20 08:53 /tmp/apache-tika-6262109250322677208.tmp_
_-rw-r--r-- 1 root root 399550 May 20 08:49 /tmp/apache-tika-6358810719289028940.tmp_
_-rw------- 1 root root 0 May 20 08:55 /tmp/apache-tika-6452032540225217628.tmp_
_-rw-r--r-- 1 root root 368 May 20 09:02 /tmp/apache-tika-6452032540225217628.tmp.txt_
_-rw------- 1 root root 0 May 20 08:46 /tmp/apache-tika-6874839592996549275.tmp_
_-rw-r--r-- 1 root root 3700 May 20 08:48 /tmp/apache-tika-6874839592996549275.tmp.txt_
slowly filling up the disk.
*Expected behaviour*: Tika cleans up after itself.
These bugs are critical for us, so we had to revert to 1.22. What's the best
way to avoid these issues?
> Tesseract fails to respect timeouts and clean up after itself
> -------------------------------------------------------------
>
> Key: TIKA-3103
> URL: https://issues.apache.org/jira/browse/TIKA-3103
> Project: Tika
> Issue Type: Bug
> Components: ocr
> Affects Versions: 1.24.1
> Reporter: Radim Rehurek
> Priority: Critical
--
This message was sent by Atlassian Jira
(v8.3.4#803005)