satish-mittal opened a new issue, #10861: URL: https://github.com/apache/pinot/issues/10861
Observing that Pinot servers are periodically crashing with SIGSEGV after we enabled a Lucene-based text index on one column. Here is one such instance:

```
2023/06/05 10:57:24.157 INFO [MutableSegmentImpl_viewName__1__9363__20230605T0929Z_view-name] [HelixTaskExecutor-message_handle_thread_30] Trying to close RealtimeSegmentImpl : viewName__1__9363__20230605T0929Z
2023/06/05 10:57:24.157 INFO [MutableSegmentImpl_viewName__1__9363__20230605T0929Z_view-name] [HelixTaskExecutor-message_handle_thread_30] Segment used 4547528251 bytes of memory for 2525493 rows consumed in 5275 seconds
2023/06/05 10:57:24.219 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Closing buffer org.apache.pinot.segment.spi.memory.PinotByteBuffer@51be6026
2023/06/05 10:57:24.237 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Closing buffer org.apache.pinot.segment.spi.memory.PinotByteBuffer@243b600e
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f5aeb45d676, pid=1, tid=5080
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.18+10 (11.0.18+10) (build 11.0.18+10)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.18+10 (11.0.18+10, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 8576 c2 java.nio.DirectByteBuffer.getInt(I)I java.base@11.0.18 (28 bytes) @ 0x00007f5aeb45d676 [0x00007f5aeb45d620+0x0000000000000056]
#
# Core dump will be written.
# Default location: /core.%e.1.%t
#
# An error report file with more information is saved as:
# /opt/pinot/hs_err_pid1.log
2023/06/05 10:57:24.247 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Closing buffer org.apache.pinot.segment.spi.memory.PinotByteBuffer@5b5aed94
2023/06/05 10:57:24.255 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Closing buffer org.apache.pinot.segment.spi.memory.PinotByteBuffer@79c70989
2023/06/05 10:57:24.260 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Closing buffer org.apache.pinot.segment.spi.memory.PinotByteBuffer@15e05ae4
2023/06/05 10:57:24.263 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Closing buffer org.apache.pinot.segment.spi.memory.PinotByteBuffer@fbfbe40
2023/06/05 10:57:24.281 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Closing buffer org.apache.pinot.segment.spi.memory.PinotByteBuffer@68f9d3aa
2023/06/05 10:57:24.288 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Closing buffer org.apache.pinot.segment.spi.memory.PinotByteBuffer@74c98f35
2023/06/05 10:57:24.289 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Closing buffer org.apache.pinot.segment.spi.memory.PinotByteBuffer@283a45ea
2023/06/05 10:57:24.331 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Deleted file /var/pinot/server/data/index/viewName_REALTIME/consumers/viewName__1__9363__20230605T0929Z.0
2023/06/05 10:57:24.390 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Deleted file /var/pinot/server/data/index/viewName_REALTIME/consumers/viewName__1__9363__20230605T0929Z.1
2023/06/05 10:57:24.426 INFO [MmapMemoryManager] [HelixTaskExecutor-message_handle_thread_30] Deleted file /var/pinot/server/data/index/viewName_REALTIME/consumers/viewName__1__9363__20230605T0929Z.2
Compiled method (c2) 7597812 28400 4 org.apache.pinot.segment.local.realtime.impl.dictionary.StringOffHeapMutableDictionary::get (6 bytes)
 total in heap  [0x00007f5aec681090,0x00007f5aec682058] = 4040
 relocation     [0x00007f5aec681208,0x00007f5aec681290] = 136
 main code      [0x00007f5aec6812a0,0x00007f5aec6819c0] = 1824
 stub code      [0x00007f5aec6819c0,0x00007f5aec6819f8] = 56
 oops           [0x00007f5aec6819f8,0x00007f5aec681a00] = 8
 metadata       [0x00007f5aec681a00,0x00007f5aec681a90] = 144
 scopes data    [0x00007f5aec681a90,0x00007f5aec681d40] = 688
 scopes pcs     [0x00007f5aec681d40,0x00007f5aec681f20] = 480
 dependencies   [0x00007f5aec681f20,0x00007f5aec681f30] = 16
 handler table  [0x00007f5aec681f30,0x00007f5aec681fc0] = 144
 nul chk table  [0x00007f5aec681fc0,0x00007f5aec682058] = 152
Compiled method (c2) 7597812 35340 4 org.apache.pinot.segment.local.segment.creator.impl.SegmentColumnarIndexCreator::indexRow (2005 bytes)
 total in heap  [0x00007f5aecefd910,0x00007f5aecf0fab0] = 74144
 relocation     [0x00007f5aecefda88,0x00007f5aecefe318] = 2192
 main code      [0x00007f5aecefe320,0x00007f5aecf09e80] = 47968
 stub code      [0x00007f5aecf09e80,0x00007f5aecf0a228] = 936
 oops           [0x00007f5aecf0a228,0x00007f5aecf0a278] = 80
 metadata       [0x00007f5aecf0a278,0x00007f5aecf0a5b0] = 824
 scopes data    [0x00007f5aecf0a5b0,0x00007f5aecf0d6d8] = 12584
 scopes pcs     [0x00007f5aecf0d6d8,0x00007f5aecf0edb8] = 5856
 dependencies   [0x00007f5aecf0edb8,0x00007f5aecf0edf8] = 64
 handler table  [0x00007f5aecf0edf8,0x00007f5aecf0f818] = 2592
 nul chk table  [0x00007f5aecf0f818,0x00007f5aecf0fab0] = 664
Compiled method (c2) 7597813 28753 4 org.apache.pinot.segment.spi.index.mutable.MutableForwardIndex::getDictId (8 bytes)
 total in heap  [0x00007f5aec081510,0x00007f5aec081c78] = 1896
 relocation     [0x00007f5aec081688,0x00007f5aec0816c8] = 64
 main code      [0x00007f5aec0816e0,0x00007f5aec081940] = 608
 stub code      [0x00007f5aec081940,0x00007f5aec081958] = 24
 oops           [0x00007f5aec081958,0x00007f5aec081960] = 8
 metadata       [0x00007f5aec081960,0x00007f5aec0819d0] = 112
 scopes data    [0x00007f5aec0819d0,0x00007f5aec081af8] = 296
 scopes pcs     [0x00007f5aec081af8,0x00007f5aec081c18] = 288
 dependencies   [0x00007f5aec081c18,0x00007f5aec081c28] = 16
 handler table  [0x00007f5aec081c28,0x00007f5aec081c40] = 24
 nul chk table  [0x00007f5aec081c40,0x00007f5aec081c78] = 56
```

Attached is the timeline log of one segment for which the SIGSEGV happened (in this case viewName__3__11117__20230606T1148Z), to capture its lifecycle. As per my interpretation, the pattern seems to be (same as earlier):

- The server tells the controller that the segment is consumed, but it keeps getting a HOLD response.
- Eventually the server gets a KEEP response from the controller, meaning it is not the committer.
- The partition consumer then changes the state to RETAINING and calls `buildSegmentAndReplace()`. However, this ends up taking a long time (more than 10 minutes).
- At the same time, the server also receives a Helix message to transition from CONSUMING to ONLINE. This path waits 10 minutes for the consumer to close, but it fails to do so. It then goes ahead and downloads and replaces the segment ("Downloading to replace segment while in state RETAINING").
- The replace-segment flow in the Helix handler ends up deleting the segment, while the partition consumer is still busy indexing rows.

[sigsegv-segment.log](https://github.com/apache/pinot/files/11674294/sigsegv-segment.log)
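If that interpretation is right, the crash is a use-after-free race between the two threads. A minimal sketch of the lifecycle as I understand it (the class and method names below are hypothetical, not Pinot's actual API; the real crash dereferences an unmapped `DirectByteBuffer`, which I model here as an explicit exception):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the race: the Helix state-transition handler destroys
// a segment that the partition consumer is still indexing into. In the real
// crash the consumer's read lands on freed native memory and the JVM dies
// with SIGSEGV; this sketch surfaces the use-after-close as an exception.
public class SegmentRaceSketch {

  static class MutableSegmentModel {
    private final AtomicBoolean destroyed = new AtomicBoolean(false);

    // Stand-in for the dictionary/forward-index writes done per consumed row.
    int indexRow(int value) {
      if (destroyed.get()) {
        // The real code has no such guard: it reads the unmapped buffer.
        throw new IllegalStateException("segment buffers already released");
      }
      return value * 2; // placeholder for actual indexing work
    }

    // Stand-in for the Helix handler's replace path: close buffers, delete files.
    void destroy() {
      destroyed.set(true);
    }
  }

  public static void main(String[] args) {
    MutableSegmentModel segment = new MutableSegmentModel();
    System.out.println(segment.indexRow(21)); // consumer thread still indexing
    segment.destroy();                        // Helix handler replaces the segment
    try {
      segment.indexRow(22);                   // in-flight indexing after release
    } catch (IllegalStateException e) {
      System.out.println("use-after-close: " + e.getMessage());
    }
  }
}
```

A plausible fix direction (my speculation, not a confirmed design) would be for the replace path to fence out or wait on in-flight indexing, e.g. via reference counting on the segment, instead of releasing the buffers after a fixed timeout while the consumer still holds them.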