satish-mittal opened a new issue, #10861:
URL: https://github.com/apache/pinot/issues/10861

   Observing that pinot servers are periodically crashing with SIGSEGV after we 
enabled lucene based text index on one column. Here is one such instance:
   ```
   2023/06/05 10:57:24.157 INFO 
[MutableSegmentImpl_viewName__1__9363__20230605T0929Z_view-name] 
[HelixTaskExecutor-message_handle_thread_30] Trying to close 
RealtimeSegmentImpl : viewName__1__9363__20230605T0929Z
   2023/06/05 10:57:24.157 INFO 
[MutableSegmentImpl_viewName__1__9363__20230605T0929Z_view-name] 
[HelixTaskExecutor-message_handle_thread_30] Segment used 4547528251 bytes of 
memory for 2525493 rows consumed in 5275 seconds
   2023/06/05 10:57:24.219 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Closing buffer 
org.apache.pinot.segment.spi.memory.PinotByteBuffer@51be6026
   2023/06/05 10:57:24.237 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Closing buffer 
org.apache.pinot.segment.spi.memory.PinotByteBuffer@243b600e
   #
   # A fatal error has been detected by the Java Runtime Environment:
   #
   #  SIGSEGV (0xb) at pc=0x00007f5aeb45d676, pid=1, tid=5080
   #
   # JRE version: OpenJDK Runtime Environment Temurin-11.0.18+10 (11.0.18+10) 
(build 11.0.18+10)
   # Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.18+10 (11.0.18+10, mixed 
mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
   # Problematic frame:
   # J 8576 c2 java.nio.DirectByteBuffer.getInt(I)I java.base@11.0.18 (28 
bytes) @ 0x00007f5aeb45d676 [0x00007f5aeb45d620+0x0000000000000056]
   #
   # Core dump will be written. Default location: /core.%e.1.%t
   #
   # An error report file with more information is saved as:
   # /opt/pinot/hs_err_pid1.log
   2023/06/05 10:57:24.247 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Closing buffer 
org.apache.pinot.segment.spi.memory.PinotByteBuffer@5b5aed94
   2023/06/05 10:57:24.255 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Closing buffer 
org.apache.pinot.segment.spi.memory.PinotByteBuffer@79c70989
   2023/06/05 10:57:24.260 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Closing buffer 
org.apache.pinot.segment.spi.memory.PinotByteBuffer@15e05ae4
   2023/06/05 10:57:24.263 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Closing buffer 
org.apache.pinot.segment.spi.memory.PinotByteBuffer@fbfbe40
   2023/06/05 10:57:24.281 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Closing buffer 
org.apache.pinot.segment.spi.memory.PinotByteBuffer@68f9d3aa
   2023/06/05 10:57:24.288 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Closing buffer 
org.apache.pinot.segment.spi.memory.PinotByteBuffer@74c98f35
   2023/06/05 10:57:24.289 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Closing buffer 
org.apache.pinot.segment.spi.memory.PinotByteBuffer@283a45ea
   2023/06/05 10:57:24.331 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Deleted file 
/var/pinot/server/data/index/viewName_REALTIME/consumers/viewName__1__9363__20230605T0929Z.0
   2023/06/05 10:57:24.390 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Deleted file 
/var/pinot/server/data/index/viewName_REALTIME/consumers/viewName__1__9363__20230605T0929Z.1
   2023/06/05 10:57:24.426 INFO [MmapMemoryManager] 
[HelixTaskExecutor-message_handle_thread_30] Deleted file 
/var/pinot/server/data/index/viewName_REALTIME/consumers/viewName__1__9363__20230605T0929Z.2
   Compiled method (c2) 7597812 28400       4       
org.apache.pinot.segment.local.realtime.impl.dictionary.StringOffHeapMutableDictionary::get
 (6 bytes)
    total in heap  [0x00007f5aec681090,0x00007f5aec682058] = 4040
    relocation     [0x00007f5aec681208,0x00007f5aec681290] = 136
    main code      [0x00007f5aec6812a0,0x00007f5aec6819c0] = 1824
    stub code      [0x00007f5aec6819c0,0x00007f5aec6819f8] = 56
    oops           [0x00007f5aec6819f8,0x00007f5aec681a00] = 8
    metadata       [0x00007f5aec681a00,0x00007f5aec681a90] = 144
    scopes data    [0x00007f5aec681a90,0x00007f5aec681d40] = 688
    scopes pcs     [0x00007f5aec681d40,0x00007f5aec681f20] = 480
    dependencies   [0x00007f5aec681f20,0x00007f5aec681f30] = 16
    handler table  [0x00007f5aec681f30,0x00007f5aec681fc0] = 144
    nul chk table  [0x00007f5aec681fc0,0x00007f5aec682058] = 152
   Compiled method (c2) 7597812 35340       4       
org.apache.pinot.segment.local.segment.creator.impl.SegmentColumnarIndexCreator::indexRow
 (2005 bytes)
    total in heap  [0x00007f5aecefd910,0x00007f5aecf0fab0] = 74144
    relocation     [0x00007f5aecefda88,0x00007f5aecefe318] = 2192
    main code      [0x00007f5aecefe320,0x00007f5aecf09e80] = 47968
    stub code      [0x00007f5aecf09e80,0x00007f5aecf0a228] = 936
    oops           [0x00007f5aecf0a228,0x00007f5aecf0a278] = 80
    metadata       [0x00007f5aecf0a278,0x00007f5aecf0a5b0] = 824
    scopes data    [0x00007f5aecf0a5b0,0x00007f5aecf0d6d8] = 12584
    scopes pcs     [0x00007f5aecf0d6d8,0x00007f5aecf0edb8] = 5856
    dependencies   [0x00007f5aecf0edb8,0x00007f5aecf0edf8] = 64
    handler table  [0x00007f5aecf0edf8,0x00007f5aecf0f818] = 2592
    nul chk table  [0x00007f5aecf0f818,0x00007f5aecf0fab0] = 664
   Compiled method (c2) 7597813 28753       4       
org.apache.pinot.segment.spi.index.mutable.MutableForwardIndex::getDictId (8 
bytes)
    total in heap  [0x00007f5aec081510,0x00007f5aec081c78] = 1896
    relocation     [0x00007f5aec081688,0x00007f5aec0816c8] = 64
    main code      [0x00007f5aec0816e0,0x00007f5aec081940] = 608
    stub code      [0x00007f5aec081940,0x00007f5aec081958] = 24
    oops           [0x00007f5aec081958,0x00007f5aec081960] = 8
    metadata       [0x00007f5aec081960,0x00007f5aec0819d0] = 112
    scopes data    [0x00007f5aec0819d0,0x00007f5aec081af8] = 296
    scopes pcs     [0x00007f5aec081af8,0x00007f5aec081c18] = 288
    dependencies   [0x00007f5aec081c18,0x00007f5aec081c28] = 16
    handler table  [0x00007f5aec081c28,0x00007f5aec081c40] = 24
    nul chk table  [0x00007f5aec081c40,0x00007f5aec081c78] = 56
   ```
   
   Attached timeline logs of one segment for which SIGSEGV happened (in this 
case viewName__3__11117__20230606T1148Z) to capture the lifecycle.
   
   As per my interpretation, the pattern seems to be (same as earlier):
   - Server tells controller that segment is consumed, but it continues to get 
HOLD response.
   - Eventually server gets KEEP response from controller, meaning it is not 
the committer.
   - PartitionConsumer then changes the state to RETAINING and calls 
buildSegmentAndReplace(). However this ends up taking a long time (more than 10 
minutes).
   - At the same time, server also receives Helix message to transition from 
CONSUMING to ONLINE. This path waits for 10 minutes to close the consumer, but 
fails to do so. Then it goes ahead to download and replace the segment 
(Downloading to replace segment while in state RETAINING).
   - The replace segment flow in Helix handler ends up deleting the segment, 
whereas partition consumer is still busy indexing rows.
   
   
[sigsegv-segment.log](https://github.com/apache/pinot/files/11674294/sigsegv-segment.log)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to