revisiting a prior, never resolved issue
tika 2.4.1 'Text extraction failed' errors when dovecot+fts 2.3.19.1
passes embedded *.eml (message/rfc822) files ;
org.apache.tika.parser.mail.RFC822Parser or dovecot ?
https://dovecot.org/list/dovecot/2022-August/125085.html
i now run
dovecot --version
2.3.20 (80a5ac675d)
with
dovecot23_fts_flatcurve
built 20230812 from git.HEAD
on
lsb_release -rd
Description: Fedora release 38 (Thirty Eight)
Release: 38
apache/tika 2.8.0 runs on another box
dovecot passes mail to tika, via config
fts_tika = http://192.0.2.20:9998/tika/
indexing & search work with simple mail passed, and with attachments ... except
with .eml attachments.
sending this message (plain txt email, with .eml as attachment)
Content-Type: multipart/mixed;
boundary="------------O6z0p865JVYqnvxPyzc0aSLA"
Message-ID: <[email protected]>
Date: Sat, 12 Aug 2023 19:37:44 -0400
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Reply-To: [email protected]
From: sender <[email protected]>
Content-Language: en-US
To: [email protected]
Subject: tika test
This is a multi-part message in MIME format.
--------------O6z0p865JVYqnvxPyzc0aSLA
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
 tika test
--------------O6z0p865JVYqnvxPyzc0aSLA
Content-Type: message/rfc822; name="t2.eml"
Content-Disposition: attachment; filename="t2.eml"
Content-Transfer-Encoding: 8bit
Message-ID: <[email protected]>
Date: Sat, 12 Aug 2023 14:32:48 -0400
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Reply-To: [email protected]
From: sender <[email protected]>
Content-Language: en-US
To: [email protected]
Subject: t2
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
 t2
--------------O6z0p865JVYqnvxPyzc0aSLA--
from an off-site server to local postfix+dovecot instance
tika & dovecot logs,
journalctl -f -u tika
Aug 12 19:52:36 svr04tika[2035]: INFO [qtp1472012329-175] 19:52:36,918
org.apache.tika.server.core.TikaLoggingFilter Request URI:
http://192.0.2.20:9998/tika/
Aug 12 19:52:36 svr04tika[2035]: INFO [qtp1472012329-175] 19:52:36,918
org.apache.tika.server.core.resource.TikaResource /tika (message/rfc822)
Aug 12 19:52:36 svr04tika[2035]: WARN [qtp1472012329-175] 19:52:36,918
org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed
(t2.eml)
Aug 12 19:52:36 svr04tika[2035]:
org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:185)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:57)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:357)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:507)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1651)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.Server.handle(Server.java:516)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
~[tika-server-standard-2.8.0.jar:2.8.0]
Aug 12 19:52:36 svr04tika[2035]: at
java.lang.Thread.run(Thread.java:1623) [?:?]
Aug 12 19:52:36 svr04tika[2035]: Aug 12, 2023 7:52:36 PM
org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
Aug 12 19:52:36 svr04tika[2035]: SEVERE: Problem with writing the data,
class
org.apache.tika.server.core.resource.TikaResource$$Lambda$387/0x00000008012a9300,
ContentType: text/plain
tail -f /var/log/dovecot/*log | grep http-client
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: host 192.0.2.20: Host created
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: host 192.0.2.20: Host session created
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998 (shared): Peer created
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Peer pool created
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Peer created
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: queue http://192.0.2.20:9998: Setting up connection to
192.0.2.20:9998 (1 requests pending)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Linked queue http://192.0.2.20:9998 (1
queues linked)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: queue http://192.0.2.20:9998: Started new connection to
192.0.2.20:9998
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://192.0.2.20/tika/]: Submitted (requests
left=1)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://192.0.2.20/tika/]: Waiting for request
to finish
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Creating 1 new connections to handle
requests (already 0 usable, connecting to 0, closing 0)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Making new connection 1 of 1 (0
connections exist, 0 pending)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Connecting
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Waiting for connect (fd=28) to finish
for max 0 msecs
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: HTTP connection created (1 parallel
connections exist)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Client connected (fd=28)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Connected
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Ready for requests
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Successfully connected (1 connections
exist, 0 pending)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Successfully connected (1 connections
exist, 0 pending)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Using 1 idle connections to handle 1
requests (1 total connections ready)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: queue http://192.0.2.20:9998: Connection to peer 192.0.2.20:9998
claimed request [Req1: PUT http://192.0.2.20:9998/tika/]
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Claimed request [Req1: PUT
http://192.0.2.20:9998/tika/]
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://192.0.2.20/tika/]: Sent header
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: No more requests to service for this peer
(1 connections exist, 0 pending)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Got 422 response for request [Req1:
PUT http://192.0.2.20:9998/tika/]: Unprocessable Entity (took 3 ms + 106 ms in queue)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://192.0.2.20/tika/]: Finished
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: queue http://192.0.2.20:9998: Dropping request [Req1: PUT
http://192.0.2.20:9998/tika/]
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: host 192.0.2.20: Host is idle (timeout = 100 msecs)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://192.0.2.20/tika/]: Free (requests
left=1)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: No requests to service for this peer (1
connections exist, 0 pending)
2023-08-12 19:52:36
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: No more requests queued; going idle
(timeout = 100 msecs)
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: host 192.0.2.20: Idle host timed out
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: host 192.0.2.20: Host destroy
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: host 192.0.2.20: Host session destroy
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: queue http://192.0.2.20:9998: Destroy
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Unlinked queue http://192.0.2.20:9998 (0
queues linked)
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Idle connection timed out
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Connection close
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Connection disconnect
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Disconnected: Connection closed (fd=28)
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Detached peer
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Lost a connection (0 queues linked, 0
connections left, 0 connections pending, 0 requests pending, 0 requests urgent)
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: conn 192.0.2.20:9998 [1]: Connection destroy
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Dropping peer now
2023-08-12 19:52:37
indexer-worker([email protected])<lrVsHUQb2GR1dgAA+IOfAw:xKZLMEQb2GR3dgAA+IOfAw>:
Debug: http-client: peer 192.0.2.20:9998: Peer close
unclear 2 me whether this is a bug in dovecot or in tika, or a matter of
(mis)config.
??
_______________________________________________
dovecot mailing list -- [email protected]
To unsubscribe send an email to [email protected]
_______________________________________________
dovecot mailing list -- [email protected]
To unsubscribe send an email to [email protected]