robot19310 opened a new issue, #57373:
URL: https://github.com/apache/doris/issues/57373

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Version
   
   3.0.4 
   
   ### What's Wrong?
   
   The meta service crashed with a segmentation fault (core dump) when a transaction timed out. Part of the core stack:
   ```
   Program terminated with signal 11, Segmentation fault.
   #0  0x00005559c62fd3e4 in 
doris::cloud::InstanceRecycler::scan_and_recycle(std::__cxx11::basic_string<char,
 std::char_traits<char>, std::allocator<char> >, std::basic_string_view<char, 
std::char_traits<char> >, std::function<int (std::basic_string_view<char, 
std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> 
>)>, std::function<int ()>) ()
   (gdb) quit
   ```
   The relevant entries in /var/log/messages:
   ```
   audit: ANOM_ABEND auid=0 uid=1000 gid=1001 ses=13296 pid=72223 
comm="doris_cloud" exe="/ssd1/doris/doris/ms/lib/doris_cloud" sig=11 res=1
   kernel: doris_cloud[72346]: segfault at 0 ip 00005559c62fd3e4 sp 
00007f0080908380 error 4 in doris_cloud[5559c606b000+105a000]
   kernel: audit: type=1701 audit(1761512714.831:168664): auid=0 uid=1000 
gid=1001 ses=13296 pid=72223 comm="doris_cloud" 
exe="/ssd1/doris/doris/ms/lib/doris_cloud" sig=11 res=1
   kernel: doris_cloud[72332]: segfault at 0 ip 00005559c62fd3e4 sp 
00007f00895163a0 error 4 in doris_cloud[5559c606b000+105a000]
   systemd: Started Process Core Dump (PID 59163/UID 0).
   audit: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 
msg='unit=systemd-coredump@4-59163-0 comm="systemd" 
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
   kernel: audit: type=1130 audit(1761512714.858:168665): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@4-59163-0 
comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? 
res=success'
   systemd-coredump: Core file was truncated to 2147483648 bytes.
   systemd-coredump: Process 72223 (doris_cloud) of user 1000 dumped 
core.#012#012Stack trace of thread 72346:#012#0  0x00005559c62fd3e4 
_ZN5doris5cloud16InstanceRecycler16scan_and_recycleENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt17basic_string_viewIcS5_ESt8functionIFiS9_S9_EESA_IFivEE
 (/ssd1/doris/doris/ms/lib/doris_cloud)#012#012Stack trace of thread 
72233:#012#0  0x00007f00d1e6bde2 n/a (n/a)
   audit: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 
msg='unit=systemd-coredump@4-59163-0 comm="systemd" 
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
   ```
   The corresponding entries in metaservice.INFO:
   ```
   W20251027 05:05:14.332929 72346 txn_kv.cpp:432] Operation aborted because the transaction timed out
   W20251027 05:05:14.332942 72332 txn_kv.cpp:432] Operation aborted because the transaction timed out
   W20251027 05:05:14.332965 72346 recycler.cpp:2290] failed to get kv, range=[011072656379636c65000110323938323133363338000110696e6465780001120000000000000000,011072656379636c65000110323938323133363338000110696e6465780001127fffffffffffffff) num_scanned=0 txn_get_ret=-1 get_range_retried=0
   W20251027 05:05:14.333004 72332 recycler.cpp:2290] failed to get kv, range=[011072656379636c6500011032393832313336333800011073746167650001100001,011072656379636c650001103239383231333633380001107374616765000110ff0001) num_scanned=0 txn_get_ret=-1 get_range_retried=0
   ```
   
   The issue is located in the `scan_and_recycle` function in 
`cloud/src/recycler/recycler.cpp`. The problematic code pattern is:
   
   ```cpp
   std::unique_ptr<RangeGetIterator> it;
   do {
       int get_ret = txn_get(txn_kv_.get(), begin, end, it);
       if (get_ret != 0) {
           // Error handling, it may remain nullptr
           ++get_range_retried;
           std::this_thread::sleep_for(std::chrono::milliseconds(500));
           continue; // Jumps to loop condition check
       }
       // Process data...
   } while (it->more() && !stopped()); // CRASH: it may be nullptr
   ```
   The crash occurs through the following sequence:
   1. `txn_get` returns non-zero (for example, when the transaction times out)
   2. `it` remains nullptr or points to an invalid object
   3. `continue` jumps to the loop condition
   4. `it->more()` is called on a null pointer, causing a segfault at address 0
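
One possible way to avoid the null dereference is sketched below. This is a minimal, self-contained illustration and not the actual recycler code: `FakeIterator`, `FakeTxnGet`, `scan_with_guard`, and the `max_retries` bound are all hypothetical names introduced for the example, and the 500 ms sleep from the original loop is omitted. The idea is to check the iterator for null in the loop condition and to cap the number of retries so that a persistent `txn_get` failure cannot loop forever.

```cpp
#include <memory>

// Hypothetical stand-in for RangeGetIterator; only more() is modeled.
struct FakeIterator {
    bool has_more;
    bool more() const { return has_more; }
};

// Hypothetical stand-in for txn_get(): fails `failures` times, leaving `it`
// untouched (nullptr), then succeeds with an iterator that has no more data.
struct FakeTxnGet {
    int failures;
    int operator()(std::unique_ptr<FakeIterator>& it) {
        if (failures > 0) {
            --failures;
            return -1;
        }
        it = std::make_unique<FakeIterator>(FakeIterator{false});
        return 0;
    }
};

// Returns 0 once a scan succeeds, -1 if the retry budget is exhausted.
int scan_with_guard(FakeTxnGet get, int max_retries) {
    std::unique_ptr<FakeIterator> it;
    int retried = 0;
    do {
        it.reset();  // never carry a stale iterator into the next attempt
        int get_ret = get(it);
        if (get_ret != 0 || it == nullptr) {
            if (++retried > max_retries) {
                return -1;  // give up instead of retrying forever
            }
            continue;  // jumps to the condition, which now tolerates nullptr
        }
        // Process data...
    } while (it == nullptr || it->more());
    return 0;
}
```

With this shape, the `continue` path re-enters the loop because `it == nullptr` short-circuits before `it->more()` is evaluated. Whether the real fix should retry or return an error immediately is a design decision for the maintainers.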
   
   
   ### What You Expected?
   
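   `scan_and_recycle` should handle a failed `txn_get` without dereferencing a null iterator, so that a transaction timeout does not crash the meta service.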
   
   ### How to Reproduce?
   
   This bug occurs intermittently. I'm not sure how to force `txn_get()` to 
return a non-zero value.
   
   ### Anything Else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

