rmuir commented on issue #14454:
URL: https://github.com/apache/lucene/issues/14454#issuecomment-2791055271

   The test framework had clearly pointed directly at the suspect all along. 
Here it is:
   ```
     2> Apr 08, 2025 9:14:26 AM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
     2> SEVERE: There are still zombie threads that couldn't be terminated:
     2>    1) Thread[id=50, name=Searcher node=R0 tcpPort=33091, state=RUNNABLE, group=TGRP-TestStressNRTReplication]
     2>         at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
     2>         at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47)
     2>         at java.base/sun.nio.ch.NioSocketImpl.tryRead(NioSocketImpl.java:256)
     2>         at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:307)
     2>         at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:346)
     2>         at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:796)
     2>         at java.base/java.net.Socket$SocketInputStream.implRead(Socket.java:1108)
     2>         at java.base/java.net.Socket$SocketInputStream.read(Socket.java:1095)
     2>         at java.base/java.net.Socket$SocketInputStream.read(Socket.java:1089)
     2>         at org.apache.lucene.core@10.2.0-SNAPSHOT/org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:34)
     2>         at org.apache.lucene.core@10.2.0-SNAPSHOT/org.apache.lucene.store.DataInput.readVLong(DataInput.java:199)
     2>         at org.apache.lucene.replicator.nrt.TestStressNRTReplication$SearchThread.run(TestStressNRTReplication.java:1079)
   ```
   
   This is the hung searcher that can't be interrupt()'d. It is blocked in a 
socket read that never returns, which causes the hang. Here is the relevant code:
   ```java
   while (c.sockIn.available() == 0) {
     if (stop.get()) {
       break;
     }
     if (node.isOpen == false) {
       throw new IOException("node closed");
     }
     Thread.sleep(1);
   }
   version = c.in.readVLong();  // <-- this is the blocking read that hangs forever
   ```
   
   Looping on `available()` was the "suspicious stuff": the javadoc implSpec 
for `InputStream.available()` states "The available method of InputStream always returns 0".
   
   In general, things are getting crashed here and maybe there are bugs in the 
test, but I'd rather we have failures than hangs.
   
   As a first step, I recommend calling `setSoTimeout()` on the sockets with a 
reasonable value such as 30s, so that reads won't block forever. 
LockVerifyServer does this, and it never hangs infinitely: 
`s.setSoTimeout(30000);`
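
   To illustrate the effect, here is a minimal standalone sketch, not the 
test's actual code. The class and method names (`SoTimeoutDemo`, 
`readTimesOut`) are hypothetical, and the 200 ms timeout is only there to keep 
the demo fast. It connects to a local peer that never writes anything and shows 
that, with `setSoTimeout()` set, the read fails with `SocketTimeoutException` 
instead of blocking forever:
   ```java
   import java.io.IOException;
   import java.io.InputStream;
   import java.net.InetAddress;
   import java.net.ServerSocket;
   import java.net.Socket;
   import java.net.SocketTimeoutException;

   // Hypothetical demo class, not part of Lucene.
   class SoTimeoutDemo {

     /**
      * Connects to a loopback server that accepts the connection but never
      * writes, then attempts a blocking read. Returns true if the read failed
      * with SocketTimeoutException rather than returning data.
      */
     static boolean readTimesOut(int timeoutMillis) throws IOException {
       try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
           Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
           Socket accepted = server.accept()) {
         // Without this call the read below would block indefinitely.
         client.setSoTimeout(timeoutMillis);
         InputStream in = client.getInputStream();
         try {
           in.read(); // the peer stays silent, so no byte ever arrives
           return false;
         } catch (SocketTimeoutException e) {
           return true; // a failure we can report, instead of a hang
         }
       }
     }

     public static void main(String[] args) throws IOException {
       System.out.println("read timed out: " + readTimesOut(200));
     }
   }
   ```
   In the test, the same call would presumably go on the searcher's socket 
right after it is opened, so that a stuck `readVLong()` surfaces as an 
exception the test can fail on.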


