[ https://issues.apache.org/jira/browse/TIKA-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Radim Rehurek updated TIKA-3103:
--------------------------------
Description:
We're using the Tika Server with OCR:
_java -jar /pii_tools/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_
This used to work fine in previous versions (1.22, without _-spawnChild_).
But after upgrading to 1.24.1 with _-spawnChild_, two undesirable things
happen:
h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished.
These (zombie?) processes show in _top_ as Tesseract and consume all CPU cores
at 100%.
They eventually die, but the machine is unusable in the meantime.
*Expected behaviour:* Tika cleans up spawned processes after itself, at the
latest once its timeout limit expires (which is 2 minutes I believe?)
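To make the expected behaviour concrete, here is a minimal Python sketch of a parent that enforces a timeout on a child process. It is an illustration only, not Tika's implementation; the helper name {{run_with_timeout}} is hypothetical.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it hit the timeout.

    Hypothetical helper for illustration; this is not how Tika is written.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the direct child and waits for it before
        # raising, so that process is gone by the time we get here.
        return None


if __name__ == "__main__":
    # A child that would run for 30 seconds is stopped after about 1 second.
    slow_child = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow_child, timeout_s=1))
```

Note that this only terminates the direct child; any processes the child itself started would need separate handling.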
h3. 2. The temp directory is full of files like:
{{root@38acd588ee22:/# ll /tmp/}}
{{total 197320}}
{{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
{{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
{{-rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
{{-rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
{{-rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
{{-rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
{{-rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
{{-rw-r--r-- 1 root root 8847936 May 20 08:57 TIKA_streamstore_1480509612453630180.tmp}}
{{-rw-r--r-- 1 root root 4612672 May 20 08:57 TIKA_streamstore_15069413591682978216.tmp}}
{{-rw-r--r-- 1 root root 9486912 May 20 08:57 TIKA_streamstore_15221713181998716407.tmp}}
{{-rw-r--r-- 1 root root 5341760 May 20 08:57 TIKA_streamstore_1625697673397832661.tmp}}
{{-rw-r--r-- 1 root root 4637248 May 20 08:57 TIKA_streamstore_16818171974807595017.tmp}}
{{-rw-r--r-- 1 root root 9486912 May 20 08:57 TIKA_streamstore_17417982601345062665.tmp}}
{{-rw-r--r-- 1 root root 10584640 May 20 08:56 TIKA_streamstore_2032295370426928403.tmp}}
{{-rw-r--r-- 1 root root 7930432 May 20 08:56 TIKA_streamstore_2397616717844251306.tmp}}
{{…}}
{{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
{{-rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
{{-rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
{{-rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}
{{-rw-r--r-- 1 root root 242756 May 20 08:49 apache-tika-12949538006801506982.tmp}}
{{-rw-r--r-- 1 root root 237290 May 20 08:49 apache-tika-13079688841505150289.tmp}}
{{-rw-r--r-- 1 root root 36232 May 20 08:46 apache-tika-14415716489394502082.tmp}}
{{-rw------- 1 root root 0 May 20 08:52 apache-tika-14763602384771268526.tmp}}
{{-rw-r--r-- 1 root root 317 May 20 09:09 apache-tika-14763602384771268526.tmp.txt}}
{{-rw------- 1 root root 0 May 20 08:54 apache-tika-15290421001014637244.tmp}}
{{-rw-r--r-- 1 root root 1912 May 20 09:13 apache-tika-15290421001014637244.tmp.txt}}
{{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-16361958133359282808.tmp}}
{{-rw-r--r-- 1 root root 6851 May 20 08:52 apache-tika-16442252641151531142.tmp}}
{{-rw------- 1 root root 0 May 20 08:52 apache-tika-16625923825737504853.tmp}}
{{-rw-r--r-- 1 root root 377 May 20 09:00 apache-tika-16625923825737504853.tmp.txt}}
{{-rw-r--r-- 1 root root 42924 May 20 08:50 apache-tika-16723588295792292246.tmp}}
slowly filling up the disk.
*Expected behaviour*: Tika cleans up disk space after itself.
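As a stopgap, the leftover files could be removed by an external job. The following Python sketch is a workaround under stated assumptions, not a Tika feature: the helper name {{remove_stale_temp_files}} is hypothetical, the glob patterns are taken from the listing above, and the age threshold is a guess that must be longer than any request still in flight.

```python
import time
from pathlib import Path

# File name patterns taken from the /tmp listing in this report.
PATTERNS = (
    "TIKA_streamstore_*.tmp",
    "apache-tika-*.tmp",
    "apache-tika-*.tmp.txt",
)


def remove_stale_temp_files(tmp_dir, max_age_s, now=None):
    """Delete matching files whose mtime is older than max_age_s seconds.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for pattern in PATTERNS:
        for path in Path(tmp_dir).glob(pattern):
            try:
                if path.is_file() and now - path.stat().st_mtime > max_age_s:
                    path.unlink()
                    removed.append(path.name)
            except FileNotFoundError:
                # The file disappeared between listing and deletion.
                pass
    return sorted(removed)


if __name__ == "__main__":
    # Assumed threshold: anything untouched for an hour is considered stale.
    print(remove_stale_temp_files("/tmp", max_age_s=3600))
```

Deleting a file that a running parse still needs would break that request, so the threshold should be chosen conservatively.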
These bugs are critical for us, so we had to revert to 1.22. What's the best
way to avoid these issues?
> Tesseract fails to respect timeouts and clean up after itself
> -------------------------------------------------------------
>
> Key: TIKA-3103
> URL: https://issues.apache.org/jira/browse/TIKA-3103
> Project: Tika
> Issue Type: Bug
> Components: ocr
> Affects Versions: 1.24.1
> Reporter: Radim Rehurek
> Priority: Critical
>
> We're using the Tika Server with OCR:
> _java -jar /pii_tools/tika/tika-server-1.24.1.jar -p 9998 -spawnChild
> -JXmx500m_
>
> This used to work fine in previous versions (1.22, without _-spawnChild_).
> But after upgrading to 1.24.1 with _-spawnChild_, two undesirable things
> happen:
> h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests
> should have finished.
> These (zombie?) processes show in _top_ as Tesseract and consume all CPU
> cores at 100%.
> They eventually die, but the machine is unusable in the meantime.
> *Expected behaviour:* Tika cleans up spawned processes after itself: at most
> after its timeout limit (which is 2 minutes I believe?)
> h3. 2. The temp directory is full of files like:
> {{root@38acd588ee22:/# ll /tmp/}}
> {{total 197320}}
> {{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
> {{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
> {{-rw-r--r-- 1 root root 9273920 May 20 08:56
> TIKA_streamstore_11144988934311367241.tmp}}
> {{-rw-r--r-- 1 root root 8938048 May 20 08:57
> TIKA_streamstore_11649337406504198407.tmp}}
> {{-rw-r--r-- 1 root root 9478720 May 20 08:56
> TIKA_streamstore_13551529918743702933.tmp}}
> {{-rw-r--r-- 1 root root 9151040 May 20 08:57
> TIKA_streamstore_13568226047805501311.tmp}}
> {{-rw-r--r-- 1 root root 7701056 May 20 08:56
> TIKA_streamstore_13908373602714189455.tmp}}
> {{-rw-r--r-- 1 root root 8847936 May 20 08:57
> TIKA_streamstore_1480509612453630180.tmp}}
> {{-rw-r--r-- 1 root root 4612672 May 20 08:57
> TIKA_streamstore_15069413591682978216.tmp}}
> {{-rw-r--r-- 1 root root 9486912 May 20 08:57
> TIKA_streamstore_15221713181998716407.tmp}}
> {{-rw-r--r-- 1 root root 5341760 May 20 08:57
> TIKA_streamstore_1625697673397832661.tmp}}
> {{-rw-r--r-- 1 root root 4637248 May 20 08:57
> TIKA_streamstore_16818171974807595017.tmp}}
> {{-rw-r--r-- 1 root root 9486912 May 20 08:57
> TIKA_streamstore_17417982601345062665.tmp}}
> {{-rw-r--r-- 1 root root 10584640 May 20 08:56
> TIKA_streamstore_2032295370426928403.tmp}}
> {{-rw-r--r-- 1 root root 7930432 May 20 08:56
> TIKA_streamstore_2397616717844251306.tmp}}
> {{…}}
> {{-rw-r--r-- 1 root root 33367 May 20 08:55
> apache-tika-11167866320029165062.tmp}}
> {{-rw-r--r-- 1 root root 44353 May 20 08:54
> apache-tika-1152515137515755865.tmp}}
> {{-rw-r--r-- 1 root root 245279 May 20 08:52
> apache-tika-12106368488659105236.tmp}}
> {{-rw-r--r-- 1 root root 1759 May 20 08:47
> apache-tika-12291680472524021463.tmp}}
> {{-rw-r--r-- 1 root root 242756 May 20 08:49
> apache-tika-12949538006801506982.tmp}}
> {{-rw-r--r-- 1 root root 237290 May 20 08:49
> apache-tika-13079688841505150289.tmp}}
> {{-rw-r--r-- 1 root root 36232 May 20 08:46
> apache-tika-14415716489394502082.tmp}}
> {{-rw------- 1 root root 0 May 20 08:52
> apache-tika-14763602384771268526.tmp}}
> {{-rw-r--r-- 1 root root 317 May 20 09:09
> apache-tika-14763602384771268526.tmp.txt}}
> {{-rw------- 1 root root 0 May 20 08:54
> apache-tika-15290421001014637244.tmp}}
> {{-rw-r--r-- 1 root root 1912 May 20 09:13
> apache-tika-15290421001014637244.tmp.txt}}
> {{-rw-r--r-- 1 root root 33367 May 20 08:55
> apache-tika-16361958133359282808.tmp}}
> {{-rw-r--r-- 1 root root 6851 May 20 08:52
> apache-tika-16442252641151531142.tmp}}
> {{-rw------- 1 root root 0 May 20 08:52
> apache-tika-16625923825737504853.tmp}}
> {{-rw-r--r-- 1 root root 377 May 20 09:00
> apache-tika-16625923825737504853.tmp.txt}}
> {{-rw-r--r-- 1 root root 42924 May 20 08:50
> apache-tika-16723588295792292246.tmp}}
>
> slowly filling up the disk.
> *Expected behaviour*: Tika cleans up disk space after itself.
>
> These bugs are critical for us, so we had to revert to 1.22. What's the
> best way to avoid these issues?
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)