xujunwei916 opened a new issue, #33819: URL: https://github.com/apache/doris/issues/33819
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

2.1.2

### What's Wrong?

When Hive data is queried through an external catalog and inserted into Doris, the BE crashes with some probability. The larger the data volume, the more likely the crash.

## Environment

- Doris: three physical nodes, each a 64-core, 512 GB machine.
- Hive: CDH 6.3.2 (Hive 2.1.1-cdh6.3.2) with Kerberos authentication enabled.

## Data volume

One date partition of the Hive table holds about 160 million rows across roughly 1,300 sub-partitions, and takes about 25 GB of storage.
With the Doris table set to 6 buckets, the BE crashes very easily. After changing it to 24 buckets, the BE still crashes occasionally, but far less often.

## Steps

1. Create the catalog

```sql
-- The opening line was missing from the report; the catalog name `hive`
-- is inferred from the INSERT statement in step 3.
CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://**:9083',
    'hive.metastore.sasl.enabled' = 'true',
    'hive.metastore.kerberos.principal' = 'hive/_HOST@**',
    'dfs.nameservices'='nameservice1',
    'dfs.ha.namenodes.nameservice1'='namenode152,namenode154',
    'dfs.namenode.rpc-address.nameservice1.namenode152'='**:8020',
    'dfs.namenode.rpc-address.nameservice1.namenode154'='**:8020',
    'dfs.client.failover.proxy.provider.nameservice1'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
    'hadoop.security.authentication' = 'kerberos',
    'hadoop.kerberos.keytab' = '/opt/apache-doris/keytab/**.keytab',
    'hadoop.kerberos.principal' = '**@**',
    'yarn.resourcemanager.principal' = 'yarn/_HOST@**',
    'file.meta.cache.ttl-second' = '0',
    'hive.version' = '2.1.1'
);
```

2. Create the Doris table. Hive has a corresponding table with the same columns, only in a different order:

```sql
create table if not exists ${DB}.`xx_user` (
     `dt` int NOT NULL
    ,`tenant_code` VARCHAR(32) NOT NULL
    ,`project_id` VARCHAR(50) NOT NULL
    ,`user_id` VARCHAR(512) NOT NULL
    ,`event_scene` VARCHAR(10) NOT NULL
    -- ... about 180 columns here
    ,`xx` STRING
)
UNIQUE KEY(
     `dt`
    ,`tenant_code`
    ,`project_id`
    ,`user_id`
    ,`event_scene`
)
PARTITION BY LIST(`dt`) (
    PARTITION `p19700101` VALUES IN (19700101)
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 24
PROPERTIES (
     "replication_allocation" = "tag.location.default: 2"
    ,"enable_unique_key_merge_on_write" = "true"
    ,"colocate_with" = "colocate_with_group_24"
);
```

3. 
Run the SQL that inserts into Doris (partition p20230703 was added beforehand)

```sql
INSERT INTO ${DB}.`xx_user` partition (p20230703)
SELECT
    `dt`,
    `tenant_code`,
    `project_id`,
    `user_id`,
    `event_scene`,
    ...
FROM hive.${DB}.`xx_user`
where dt='20230703'
```

## Error logs

be.out

```text
*** Query id: c3b5c1bdf7e34938-882f8be86b1f367c ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1713409234 (unix time) try "date -d @1713409234" if you are using GNU date ***
*** Current BE git commitID: b130df2488 ***
*** SIGSEGV unknown detail explain (@0x0) received by PID 60981 (TID 57603 OR 0x7f74a87f6700) from PID 0; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:421
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/lib/jvm/java-1.8.0-openjdk/jre/lib/amd64/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/lib/jvm/java-1.8.0-openjdk/jre/lib/amd64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /usr/lib/jvm/java-1.8.0-openjdk/jre/lib/amd64/server/libjvm.so
 4# 0x00007F95F7803400 in /lib64/libc.so.6
 5# doris::RuntimeState::is_cancelled() const at /home/zcp/repo_center/doris_release/doris/be/src/runtime/runtime_state.cpp:360
 6# doris::DeltaWriterV2::write(doris::vectorized::Block const*, std::vector<unsigned int, std::allocator<unsigned int> > const&, bool) at /home/zcp/repo_center/doris_release/doris/be/src/olap/delta_writer_v2.cpp:164
 7# doris::vectorized::VTabletWriterV2::_write_memtable(std::shared_ptr<doris::vectorized::Block>, long, doris::vectorized::Rows const&, std::vector<std::shared_ptr<doris::LoadStreamStub>, std::allocator<std::shared_ptr<doris::LoadStreamStub> > > const&) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer_v2.cpp:461
 8# doris::vectorized::VTabletWriterV2::write(doris::vectorized::Block&) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer_v2.cpp:413
 9# 
_ZN5doris10vectorized15AsyncWriterSinkINS0_15VTabletWriterV2EXadsoKcL_ZNS0_19VOLAP_TABLE_SINK_V2EEEEE4sendEPNS_12RuntimeStateEPNS0_5BlockEb at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/async_writer_sink.h:89
10# doris::PlanFragmentExecutor::open_vectorized_internal() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/plan_fragment_executor.cpp:342
11# doris::PlanFragmentExecutor::open() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/plan_fragment_executor.cpp:274
12# doris::PlanFragmentExecutor::execute() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/plan_fragment_executor.cpp:405
13# doris::FragmentMgr::_exec_actual(std::shared_ptr<doris::PlanFragmentExecutor>, std::function<void (doris::RuntimeState*, doris::Status*)> const&) at /home/zcp/repo_center/doris_release/doris/be/src/runtime/fragment_mgr.cpp:459
14# std::_Function_handler<void (), doris::FragmentMgr::exec_plan_fragment(doris::TExecPlanFragmentParams const&, std::function<void (doris::RuntimeState*, doris::Status*)> const&)::$_0>::_M_invoke(std::_Any_data const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
15# doris::ThreadPool::dispatch_thread() in /opt/apache-doris/be/lib/doris_be
16# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:499
17# start_thread in /lib64/libpthread.so.0
18# clone in /lib64/libc.so.6
```

be.WARNING

```text
W20240418 10:49:42.835407 64553 status.h:380] meet error status: [INTERNAL_ERROR]Couldn't deserialize thrift msg: No more data to read.
 0# doris::Status doris::deserialize_thrift_msg<tparquet::PageHeader>(unsigned char const*, unsigned int*, bool, tparquet::PageHeader*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thrift_util.h:151
 1# doris::vectorized::PageReader::next_page_header() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
 2# doris::vectorized::ColumnChunkReader::next_page() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
 3# doris::vectorized::ScalarColumnReader::read_column_data(COW<doris::vectorized::IColumn>::immutable_ptr<doris::vectorized::IColumn>&, std::shared_ptr<doris::vectorized::IDataType const>&, doris::vectorized::ColumnSelectVector&, unsigned long, unsigned long*, bool*, bool) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
 4# doris::vectorized::RowGroupReader::_read_column_data(doris::vectorized::Block*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, unsigned long, unsigned long*, bool*, doris::vectorized::ColumnSelectVector&) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:414
 5# doris::vectorized::RowGroupReader::next_batch(doris::vectorized::Block*, unsigned long, unsigned long*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:314
 6# doris::vectorized::ParquetReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:444
 7# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
 8# doris::vectorized::VScanner::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) at 
/home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/vscanner.cpp:0
 9# doris::vectorized::VScanner::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
10# doris::vectorized::ScannerScheduler::_scanner_scan(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:345
11# std::_Function_handler<void (), doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_3>::_M_invoke(std::_Any_data const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
12# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_release/doris/be/src/util/threadpool.cpp:0
13# doris::Thread::supervise_thread(void*) at /var/local/ldb_toolchain/bin/../usr/include/pthread.h:562
14# start_thread
15# clone
```

### What You Expected?

If the problem is caused by the data volume, the SQL statement should fail; the Doris BE should not crash.

### How to Reproduce?

_No response_

### Anything Else?

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
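
The `be.out` header gives the abort time as a Unix timestamp (1713409234), while `be.WARNING` uses a glog-style local wall-clock prefix (`W20240418 10:49:42.835407`). To place both entries on one timeline, a minimal Python sketch could look like the following. It assumes the BE host logs in UTC+8, which the report does not state, and `parse_glog_time` is a hypothetical helper written for this example, not part of Doris.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the BE host logs in UTC+8; the report does not state its time zone.
BE_TZ = timezone(timedelta(hours=8))


def parse_glog_time(prefix: str, tz: timezone = BE_TZ) -> datetime:
    """Parse a glog prefix such as 'W20240418 10:49:42.835407' (hypothetical helper)."""
    # Drop the one-letter severity, then parse date, time and microseconds.
    return datetime.strptime(prefix[1:], "%Y%m%d %H:%M:%S.%f").replace(tzinfo=tz)


# Abort time from be.out: "*** Aborted at 1713409234 (unix time) ..."
crash = datetime.fromtimestamp(1713409234, tz=timezone.utc)
# Timestamp of the Parquet page-header warning from be.WARNING.
warning = parse_glog_time("W20240418 10:49:42.835407")

print("crash (UTC):     ", crash.isoformat())
print("crash (BE local):", crash.astimezone(BE_TZ).isoformat())
print("warning -> crash:", crash - warning)
```

The abort time is 2024-04-18 03:00:34 UTC. Under the UTC+8 assumption, the warning precedes the crash by roughly eleven minutes, so the two log entries may or may not be related.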