rmuir commented on issue #14454: URL: https://github.com/apache/lucene/issues/14454#issuecomment-2791055271
The test framework had clearly pointed directly at the suspect all along. Here it is:

```
  2> Apr 08, 2025 9:14:26 AM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
  2> SEVERE: There are still zombie threads that couldn't be terminated:
  2>    1) Thread[id=50, name=Searcher node=R0 tcpPort=33091, state=RUNNABLE, group=TGRP-TestStressNRTReplication]
  2>         at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
  2>         at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47)
  2>         at java.base/sun.nio.ch.NioSocketImpl.tryRead(NioSocketImpl.java:256)
  2>         at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:307)
  2>         at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:346)
  2>         at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:796)
  2>         at java.base/java.net.Socket$SocketInputStream.implRead(Socket.java:1108)
  2>         at java.base/java.net.Socket$SocketInputStream.read(Socket.java:1095)
  2>         at java.base/java.net.Socket$SocketInputStream.read(Socket.java:1089)
  2>         at org.apache.lucene.core@10.2.0-SNAPSHOT/org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:34)
  2>         at org.apache.lucene.core@10.2.0-SNAPSHOT/org.apache.lucene.store.DataInput.readVLong(DataInput.java:199)
  2>         at org.apache.lucene.replicator.nrt.TestStressNRTReplication$SearchThread.run(TestStressNRTReplication.java:1079)
```

This is the hung searcher that can't be interrupt()'d. It is blocked on a socket read, which is not good and causes the hang. Here is the relevant code:

```java
while (c.sockIn.available() == 0) {
  if (stop.get()) {
    break;
  }
  if (node.isOpen == false) {
    throw new IOException("node closed");
  }
  Thread.sleep(1);
}
version = c.in.readVLong(); // <-- this is the blocking read that hangs forever
```

The loop on `available()` was the "suspicious stuff": the javadoc Impl Spec for `InputStream.available()` states "The available method of InputStream always returns 0".
In general, things are getting crashed in this test, and maybe there are bugs in the test itself, but I'd rather we have failures instead of hangs. As a first step, I recommend calling `setSoTimeout()` on the sockets with a reasonable value such as 30s so that reads won't block forever. LockVerifyServer does this, and it never hangs indefinitely: `s.setSoTimeout(30000);`
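To illustrate the effect of a read timeout, here is a minimal standalone sketch. It is not the test's code: the class name `SoTimeoutSketch`, the helper `readTimesOut`, and the silent loopback server are all hypothetical, chosen only to show that a read on a socket with `setSoTimeout()` set throws `SocketTimeoutException` instead of blocking forever.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutSketch {

  /**
   * Hypothetical helper: connects to a local server that never writes,
   * then reports whether the read timed out instead of blocking forever.
   */
  static boolean readTimesOut(int timeoutMillis) throws IOException {
    try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
        Socket accepted = server.accept()) {
      // Bound every blocking read on this socket.
      client.setSoTimeout(timeoutMillis);
      InputStream in = client.getInputStream();
      try {
        in.read(); // the peer stays silent, so without a timeout this would hang
        return false;
      } catch (SocketTimeoutException e) {
        // The read gave up after timeoutMillis; a test can now fail instead of hang.
        return true;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("timed out: " + readTimesOut(200));
  }
}
```

In a test, the `SocketTimeoutException` would normally be left to propagate so that the run fails with a stack trace pointing at the stuck read, rather than leaving a zombie thread behind.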