This is an automated email from the ASF dual-hosted git repository. yiguolei pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push: new 80baca2643 [Docs](memory) Admin-manual adds mem tracker, memory exceeds limit, OOM analysis (#14419) 80baca2643 is described below commit 80baca2643fcee50f9679cd91c97f08ce471ab03 Author: Xinyi Zou <zouxiny...@gmail.com> AuthorDate: Wed Nov 30 18:02:05 2022 +0800 [Docs](memory) Admin-manual adds mem tracker, memory exceeds limit, OOM analysis (#14419) --- .../memory-management/be-oom-analysis.md | 84 ++++++++++ .../memory-limit-exceeded-analysis.md | 179 +++++++++++++++++++++ .../memory-management/memory-tracker.md | 98 +++++++++++ .../memory-management/be-oom-analysis.md | 84 ++++++++++ .../memory-limit-exceeded-analysis.md | 179 +++++++++++++++++++++ .../memory-management/memory-tracker.md | 98 +++++++++++ 6 files changed, 722 insertions(+) diff --git a/docs/en/docs/admin-manual/memory-management/be-oom-analysis.md b/docs/en/docs/admin-manual/memory-management/be-oom-analysis.md new file mode 100644 index 0000000000..aa3061a727 --- /dev/null +++ b/docs/en/docs/admin-manual/memory-management/be-oom-analysis.md @@ -0,0 +1,84 @@ +--- +{ + "title": "BE OOM Analysis", + "language": "en" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
+--> + +# BE OOM Analysis + +<version since="1.2.0"> + +Ideally, as described in [Memory Limit Exceeded Analysis](../admin-manual/memory-management/memory-limit-exceeded-analysis.md), we regularly check the remaining available memory of the operating system and respond in time when memory is insufficient, for example by triggering memory GC to release caches or by cancelling queries that exceed the memory limit. However, because refreshing process memory statistics and memory GC both have a certain lag, and it is difficult for us to completely catch all large memory applic [...] + +## Solution +Refer to [BE Configuration Items](../admin-manual/config/be-config.md) to reduce `mem_limit` and increase `max_sys_mem_available_low_water_mark_bytes` in `be.conf`. + +## Memory analysis +If you want to further understand where the BE process used memory before OOM and reduce the memory usage of the process, you can analyze it with the following steps. + +1. Use `dmesg -T` to confirm the time of the OOM and the process memory at the time of the OOM. + +2. Check whether there is a `Memory Tracker Summary` log at the end of be/log/be.INFO. If there is, BE has detected the memory overrun, so go to step 3; otherwise go to step 8. 
+``` +Memory Tracker Summary: + Type=consistency, Used=0(0 B), Peak=0(0 B) + Type=batch_load, Used=0(0 B), Peak=0(0 B) + Type=clone, Used=0(0 B), Peak=0(0 B) + Type=schema_change, Used=0(0 B), Peak=0(0 B) + Type=compaction, Used=0(0 B), Peak=0(0 B) + Type=load, Used=0(0 B), Peak=0(0 B) + Type=query, Used=206.67 MB(216708729 B), Peak=565.26 MB(592723181 B) + Type=global, Used=930.42 MB(975614571 B), Peak=1017.42 MB(1066840223 B) + Type=tc/jemalloc_cache, Used=51.97 MB(54494616 B), Peak=-1.00 B(-1 B) + Type=process, Used=1.16 GB(1246817916 B), Peak=-1.00 B(-1 B) + MemTrackerLimiter Label=Orphan, Type=global, Limit=-1.00 B(-1 B), Used=474.20 MB(497233597 B), Peak=649.18 MB(680718208 B) + MemTracker Label=BufferAllocator, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTracker Label=LoadChannelMgr, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTracker Label=StorageEngine, Parent Label=Orphan, Used=320.56 MB(336132488 B), Peak=322.56 MB(338229824 B) + MemTracker Label=SegCompaction, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTracker Label=SegmentMeta, Parent Label=Orphan, Used=948.64 KB(971404 B), Peak=943.64 KB(966285 B) + MemTracker Label=TabletManager, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTrackerLimiter Label=DataPageCache, Type=global, Limit=-1.00 B(-1 B), Used=455.22 MB(477329882 B), Peak=454.18 MB(476244180 B) + MemTrackerLimiter Label=IndexPageCache, Type=global, Limit=-1.00 B(-1 B), Used=1.00 MB(1051092 B), Peak=0(0 B) + MemTrackerLimiter Label=SegmentCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B) + MemTrackerLimiter Label=DiskIO, Type=global, Limit=2.47 GB(2655423201 B), Used=0(0 B), Peak=0(0 B) + MemTrackerLimiter Label=ChunkAllocator, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B) + MemTrackerLimiter Label=LastestSuccessChannelCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B) + MemTrackerLimiter Label=DeleteBitmap AggCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B) 
+``` + +3. When the end of be/log/be.INFO before OOM contains the system memory exceeded log, use the log analysis method in [Memory Limit Exceeded Analysis](../admin-manual/memory-management/memory-limit-exceeded-analysis.md) to look at the memory usage of each category of the process. If the current `type=query` memory usage is high, continue to step 4 when the query before OOM is known, otherwise continue to step 5; if the current `type=load` memory usage is high, continue to step 6, if th [...] + +4. If `type=query` memory usage is high and the query before OOM is known, such as in a test cluster or a scheduled task, restart the BE node, refer to [Memory Tracker](../admin-manual/memory-management/memory-tracker.md) to view real-time memory tracker statistics, retry the query after `set global enable_profile=true`, observe the memory usage location of specific operators, confirm whether the query memory usage is reasonable, and further consider optimizing SQL memory usage, such as adjus [...] + +5. If `type=query` memory usage is high and the query before OOM is unknown, such as in an online cluster, search `be/log/be.INFO` from back to front for `Deregister query/load memory tracker, queryId` and `Register query/load memory tracker, query/load id`. If the same query id prints both of these log lines, the query or import finished successfully. If there is only Register but no Deregister, the query or import is still before OOM In this w [...] + +6. `type=load` import memory usage is high. + +7. When `type=global` memory usage stays high for a long time, continue to check the `type=global` detailed statistics in the second half of the `Memory Tracker Summary` log. When DataPageCache, IndexPageCache, SegmentCache, ChunkAllocator, LastestSuccessChannelCache, etc. 
use a lot of memory, refer to [BE Configuration Items](../admin-manual/config/be-config.md) and consider modifying the size of the cache; when Orphan memory usage is too large, continue the analysis as follows. + - If the sum of the tracker statistics of `Parent Label=Orphan` only accounts for a small part of the Orphan memory, a large amount of memory currently has no accurate statistics, such as the memory used by brpc. In this case, consider using the heap profile [Memory Tracker](../community/developer-guide/debug-tool.md) to further analyze memory locations. + - If the tracker statistics of `Parent Label=Orphan` account for most of Orphan’s memory: when `Label=TabletManager` uses a lot of memory, further check the number of tablets in the cluster, and if there are too many tablets, consider deleting tables or data that are no longer used; when `Label=StorageEngine` uses too much memory, further check the number of segment files in the cluster, and consider manually triggering compaction if the number of segment files is too large. + +8. 
If `be/log/be.INFO` does not print the `Memory Tracker Summary` log before OOM, BE did not detect the memory overrun in time. Observe Grafana memory monitoring to confirm the memory growth trend of BE before OOM. If the OOM is reproducible, consider adding `memory_debug=true` to `be.conf`; after the cluster is restarted, the cluster memory statistics are printed every second. Then observe the last `Memory Tracker Summary` log before OOM and continue to step 3 for analysis. + +</version> diff --git a/docs/en/docs/admin-manual/memory-management/memory-limit-exceeded-analysis.md b/docs/en/docs/admin-manual/memory-management/memory-limit-exceeded-analysis.md new file mode 100644 index 0000000000..fad526c37e --- /dev/null +++ b/docs/en/docs/admin-manual/memory-management/memory-limit-exceeded-analysis.md @@ -0,0 +1,179 @@ +--- +{ + "title": "Memory Limit Exceeded Analysis", + "language": "en" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
+--> + +# Memory Limit Exceeded Analysis + +<version since="1.2.0"> + +When the query or import error `Memory limit exceeded` is reported, the possible reasons are: the process memory exceeds the limit, the remaining available memory of the system is insufficient, and the memory limit for a single query execution is exceeded. +``` +ERROR 1105 (HY000): errCode = 2, detailMessage = Memory limit exceeded:<consuming tracker:<xxx>, xxx. backend 172.1.1.1 process memory used xxx GB, limit xxx GB. If query tracker exceeded, `set exec_mem_limit=8G ` to change limit, details mem usage see be. INFO. +``` + +## The process memory exceeds the limit OR the remaining available memory of the system is insufficient +When the following error is returned, it means that the process memory exceeds the limit, or the remaining available memory of the system is insufficient. The specific reason depends on the memory statistics. +``` +ERROR 1105 (HY000): errCode = 2, detailMessage = Memory limit exceeded:<consuming tracker:<Query#Id=3c88608cf35c461d-95fe88969aa6fc30>, process memory used 2.68 GB exceed limit 2.47 GB or sys mem available 50.95 GB less than low water mark 3.20 GB, failed alloc size 2.00 MB>, executing msg:<execute:<ExecNode:VAGGREGATION_NODE (id=7)>>. backend 172.1.1.1 process memory used 2.68 GB, limit 2.47 GB. If query tracker exceeded, `set exec_mem_limit =8G` to change limit, details mem usage see be.INFO +``` + +### Error message analysis +The error message is divided into three parts: +1. `Memory limit exceeded:<consuming tracker:<Query#Id=3c88608cf35c461d-95fe88969aa6fc30>`: It is found that the memory limit is exceeded during the memory application process of query `3c88608cf35c461d-95fe88969aa6fc30`. +2. 
`process memory used 2.68 GB exceed limit 2.47 GB or sys mem available 50.95 GB less than low water mark 3.20 GB, failed alloc size 2.00 MB`: The limit is exceeded because the 2.68 GB of memory used by the BE process exceeds the limit of 2.47 GB. The value of the limit comes from `mem_limit` in be.conf multiplied by the system MemTotal, which by default equals 80% of the total memory of the operating system. The remaining available memory of the current operating system is 50.95 GB, w [...] +3. `executing msg:<execute:<ExecNode:VAGGREGATION_NODE (id=7)>>, backend 172.1.1.1 process memory used 2.68 GB, limit 2.47 GB`: The location of this memory application is `ExecNode:VAGGREGATION_NODE (id=7)`, the IP of the current BE node is 172.1.1.1, and the memory statistics of the BE node are printed again. + +### Log Analysis +At the same time, you can find the following log in log/be.INFO to confirm whether the memory usage of the current process meets expectations. The log is also divided into three parts: +1. `Process Memory Summary`: process memory statistics. +2. `Alloc Stacktrace`: The stack that triggers the memory overrun detection, which is not necessarily the location of the large memory application. +3. `Memory Tracker Summary`: Process memory tracker statistics, refer to [Memory Tracker](../admin-manual/memory-management/memory-tracker.md) to analyze the location of memory usage. +Notice: +1. The printing interval of the process memory overrun log is 1s. After the process memory exceeds the limit, the memory applications in most locations of BE will detect it, try to run a predetermined callback method, and print the process memory overrun log. So if the value of Try Alloc in the log is small, you don’t need to pay attention to `Alloc Stacktrace`; just analyze `Memory Tracker Summary` directly. +2. When the process memory exceeds the limit, BE will trigger memory GC. 
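
The condition reported in the second part of the error message above can be sketched as follows. This is a minimal illustration, not the BE implementation: the function names, the strict comparisons, and the integer percentage parameter are assumptions made for the example, and only the 80% default and the numbers come from the text and the sample error message.

```python
GB = 1024 ** 3


def default_process_limit(mem_total_bytes, mem_limit_percent=80):
    """Process memory limit as a share of MemTotal (hypothetical helper).

    The text states the limit is mem_limit * MemTotal, 80% by default.
    """
    return mem_total_bytes * mem_limit_percent // 100


def process_memory_exceeded(used_bytes, limit_bytes,
                            sys_available_bytes, low_water_mark_bytes):
    """True when either condition named in the error message holds.

    Assumption: strict comparisons; the real boundary handling in BE
    is not described in this document.
    """
    over_process_limit = used_bytes > limit_bytes
    below_low_water_mark = sys_available_bytes < low_water_mark_bytes
    return over_process_limit or below_low_water_mark


# Values taken from the sample error message: 2.68 GB used against a
# 2.47 GB limit, 50.95 GB available against a 3.20 GB low water mark.
exceeded = process_memory_exceeded(2.68 * GB, 2.47 * GB, 50.95 * GB, 3.20 * GB)
print(exceeded)
```

In the sample message only the first condition is true: the system still has far more available memory than the low water mark, so the process limit is what triggered the error.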
+ +``` +W1127 17:23:16.372572 19896 mem_tracker_limiter.cpp:214] System Mem Exceed Limit Check Faild, Try Alloc: 1062688 +Process Memory Summary: + process memory used 2.68 GB limit 2.47 GB, sys mem available 50.95 GB min reserve 3.20 GB, tc/jemalloc allocator cache 51.97 MB +Alloc Stacktrace: + @ 0x50028e8 doris::MemTrackerLimiter::try_consume() + @ 0x50027c1 doris::ThreadMemTrackerMgr::flush_untracked_mem<>() + @ 0x595f234 malloc + @ 0xb888c18 operator new() + @ 0x8f316a2 google::LogMessage::Init() + @ 0x5813fef doris::FragmentExecState::coordinator_callback() + @ 0x58383dc doris::PlanFragmentExecutor::send_report() + @ 0x5837ea8 doris::PlanFragmentExecutor::update_status() + @ 0x58355b0 doris::PlanFragmentExecutor::open() + @ 0x5815244 doris::FragmentExecState::execute() + @ 0x5817965 doris::FragmentMgr::_exec_actual() + @ 0x581fffb std::_Function_handler<>::_M_invoke() + @ 0x5a6f2c1 doris::ThreadPool::dispatch_thread() + @ 0x5a6843f doris::Thread::supervise_thread() + @ 0x7feb54f931ca start_thread + @ 0x7feb5576add3 __GI___clone + @ (nil) (unknown) + +Memory Tracker Summary: + Type=consistency, Used=0(0 B), Peak=0(0 B) + Type=batch_load, Used=0(0 B), Peak=0(0 B) + Type=clone, Used=0(0 B), Peak=0(0 B) + Type=schema_change, Used=0(0 B), Peak=0(0 B) + Type=compaction, Used=0(0 B), Peak=0(0 B) + Type=load, Used=0(0 B), Peak=0(0 B) + Type=query, Used=206.67 MB(216708729 B), Peak=565.26 MB(592723181 B) + Type=global, Used=930.42 MB(975614571 B), Peak=1017.42 MB(1066840223 B) + Type=tc/jemalloc_cache, Used=51.97 MB(54494616 B), Peak=-1.00 B(-1 B) + Type=process, Used=1.16 GB(1246817916 B), Peak=-1.00 B(-1 B) + MemTrackerLimiter Label=Orphan, Type=global, Limit=-1.00 B(-1 B), Used=474.20 MB(497233597 B), Peak=649.18 MB(680718208 B) + MemTracker Label=BufferAllocator, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTracker Label=LoadChannelMgr, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTracker Label=StorageEngine, Parent Label=Orphan, Used=320.56 
MB(336132488 B), Peak=322.56 MB(338229824 B) + MemTracker Label=SegCompaction, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTracker Label=SegmentMeta, Parent Label=Orphan, Used=948.64 KB(971404 B), Peak=943.64 KB(966285 B) + MemTracker Label=TabletManager, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTrackerLimiter Label=DataPageCache, Type=global, Limit=-1.00 B(-1 B), Used=455.22 MB(477329882 B), Peak=454.18 MB(476244180 B) + MemTrackerLimiter Label=IndexPageCache, Type=global, Limit=-1.00 B(-1 B), Used=1.00 MB(1051092 B), Peak=0(0 B) + MemTrackerLimiter Label=SegmentCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B) + MemTrackerLimiter Label=DiskIO, Type=global, Limit=2.47 GB(2655423201 B), Used=0(0 B), Peak=0(0 B) + MemTrackerLimiter Label=ChunkAllocator, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B) + MemTrackerLimiter Label=LastestSuccessChannelCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B) + MemTrackerLimiter Label=DeleteBitmap AggCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B) +``` + +### System remaining available memory calculation +When the available memory of the system in the error message is less than the low water mark, it is also treated as the process memory exceeding the limit. The value of the available memory of the system comes from `MemAvailable` in `/proc/meminfo`. When `MemAvailable` is insufficient, continuing to apply for memory may return std::bad_alloc or cause OOM of the BE process. Because both refreshing process memory statistics and BE memory GC have a certain lag, a small part of the memory buffer [...] + +`MemAvailable` is the operating system's estimate of the total amount of memory it can provide to user processes while avoiding swap as far as possible, taking into account the current free memory, buffer, cache, memory fragmentation and other factors. 
A simplified formula is MemAvailable = MemFree - LowWaterMark + (PageCache - min(PageCache / 2, LowWaterMark)), which matches the `available` value shown by the `free` command. For details, refer to: +https://serverfault.com/questions/940196/why-is-memaavailable-a-lot-less-than-memfreebufferscached +https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34e431b0ae398fc54ea69ff85ec700722c9da773 + +The low water mark defaults to a maximum of 1.6G and is calculated from `MemTotal`, `vm/min_free_kbytes`, `config::mem_limit`, and `config::max_sys_mem_available_low_water_mark_bytes`, so as to avoid wasting too much memory. Among them, `MemTotal` is the total memory of the system, and the value also comes from `/proc/meminfo`; `vm/min_free_kbytes` is the buffer reserved by the operating system for the memory GC process, and the value is usually between 0.4% and 5%. `vm/min_free_kbytes` may be 5% on [...] + +## Query or import a single execution memory limit +When the following error is returned, it means that the memory limit of a single execution has been exceeded. +``` +ERROR 1105 (HY000): errCode = 2, detailMessage = Memory limit exceeded:<consuming tracker:<Query#Id=f78208b15e064527-a84c5c0b04c04fcf>, failed alloc size 1.03 MB, exceeded tracker:<Query#Id=f78208b15e064527-a84c5c0b04c04fcf>, limit 100.00 MB, peak used 99.29 MB, current used 99.25 MB>, executing msg:<execute:<ExecNode:VHASH_JOIN_NODE (id=4)>>. backend 172.24.47.117 process memory used 1.13 GB, limit 98.92 GB. If query tracker exceed, `set exec_mem_limit=8G` to change limit, details mem usage [...] +``` + +### Error message analysis +The error message is divided into three parts: +1. `Memory limit exceeded:<consuming tracker:<Query#Id=f78208b15e064527-a84c5c0b04c04fcf>`: It is found that the memory limit is exceeded during the memory application process of query `f78208b15e064527-a84c5c0b04c04fcf`. +2. 
`failed alloc size 1.03 MB, exceeded tracker:<Query#Id=f78208b15e064527-a84c5c0b04c04fcf>, limit 100.00 MB, peak used 99.29 MB, current used 99.25 MB`: The memory requested this time is 1.03 MB. The current consumption of the `f78208b15e064527-a84c5c0b04c04fcf` memory tracker is 99.25 MB, and adding 1.03 MB exceeds the limit of 100 MB. The value of the limit comes from `exec_mem_limit` in session variables, and the default is 4G. +3. `executing msg:<execute:<ExecNode:VHASH_JOIN_NODE (id=4)>>. backend 172.24.47.117 process memory used 1.13 GB, limit 98.92 GB. If query tracker exceeds, `set exec_mem_limit=8G` to change limit, details mem usage see be.INFO.`: The location of this memory application is `VHASH_JOIN_NODE (id=4)`, and it prompts that `set exec_mem_limit` can be used to increase the memory limit of a single query. + +### Log Analysis +After `set global enable_profile=true`, a log is printed in log/be.INFO when the memory of a single query exceeds the limit. +You can find the following logs in log/be.INFO to confirm whether the current query memory usage meets expectations. The logs are also divided into three parts: +1. `Process Memory Summary`: process memory statistics. +2. `Alloc Stacktrace`: The stack that triggers the memory overrun detection, which is not necessarily the location of the large memory application. +3. `Memory Tracker Summary`: The memory tracker statistics of the current query, where you can see the current memory and peak value used by each operator. For details, please refer to [Memory Tracker](../admin-manual/memory-management/memory-tracker.md). +Note: A query will only print the log once after the memory exceeds the limit. At this time, multiple threads of the query will detect it and try to wait for the memory to be released, or cancel the current query. 
If the value of Try Alloc in the log is small, there is no need to pay attention to `Alloc Stacktrace`; just analyze `Memory Tracker Summary` directly. + +``` +W1128 01:34:11.016165 357796 mem_tracker_limiter.cpp:191] Memory limit exceeded:<consuming tracker:<Query#Id=78208b15e064527-a84c5c0b04c04fcf>, failed alloc size 4.00 MB, exceeded tracker:<Query#Id=78208b15e064527-a84c5c0b04c04fcf>, limit 100.00 MB, peak used 98.59 MB, +current used 96.88 MB>, executing msg:<execute:<ExecNode:VHASH_JOIN_NODE (id=2)>>. backend 172.24.47.117 process memory used 1.13 GB, limit 98.92 GB. If query tracker exceed, `set exec_mem_limit=8G` to change limit, details mem usage see be.INFO. +Process Memory Summary: + process memory used 1.13 GB limit 98.92 GB, sys mem available 45.15 GB min reserve 3.20 GB, tc/jemalloc allocator cache 27.62 MB +Alloc Stacktrace: + @ 0x66cf73a doris::vectorized::HashJoinNode::_materialize_build_side() + @ 0x69cb1ee doris::vectorized::VJoinNodeBase::open() + @ 0x66ce27a doris::vectorized::HashJoinNode::open() + @ 0x5835dad doris::PlanFragmentExecutor::open_vectorized_internal() + @ 0x58351d2 doris::PlanFragmentExecutor::open() + @ 0x5815244 doris::FragmentExecState::execute() + @ 0x5817965 doris::FragmentMgr::_exec_actual() + @ 0x581fffb std::_Function_handler<>::_M_invoke() + @ 0x5a6f2c1 doris::ThreadPool::dispatch_thread() + @ 0x5a6843f doris::Thread::supervise_thread() + @ 0x7f6faa94a1ca start_thread + @ 0x7f6fab121dd3 __GI___clone + @ (nil) (unknown) + +Memory Tracker Summary: + MemTrackerLimiter Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Type=query, Limit=100.00 MB(104857600 B), Used=64.75 MB(67891182 B), Peak=104.70 MB(109786406 B) + MemTracker Label=Scanner#QueryId=78208b15e064527-a84c5c0b04c04fcf, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=RuntimeFilterMgr, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=2.09 KB(2144 B), Peak=0(0 B) + MemTracker Label=BufferedBlockMgr2, 
Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=ExecNode:VHASH_JOIN_NODE (id=2), Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=-61.44 MB(-64426656 B), Peak=290.33 KB(297296 B) + MemTracker Label=ExecNode:VEXCHANGE_NODE (id=9), Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=6.12 KB(6264 B), Peak=5.84 KB(5976 B) + MemTracker Label=VDataStreamRecvr:78208b15e064527-a84c5c0b04c04fd2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=6.12 KB(6264 B), Peak=5.84 KB(5976 B) + MemTracker Label=ExecNode:VEXCHANGE_NODE (id=10), Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=-41.20 MB(-43198024 B), Peak=1.46 MB(1535656 B) + MemTracker Label=VDataStreamRecvr:78208b15e064527-a84c5c0b04c04fd2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=-41.20 MB(-43198024 B), Peak=1.46 MB(1535656 B) + MemTracker Label=VDataStreamSender:78208b15e064527-a84c5c0b04c04fd2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=2.34 KB(2400 B), Peak=0(0 B) + MemTracker Label=Scanner#QueryId=78208b15e064527-a84c5c0b04c04fcf, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=58.12 MB(60942224 B), Peak=57.41 MB(60202848 B) + MemTracker Label=RuntimeFilterMgr, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=BufferedBlockMgr2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=ExecNode:VNewOlapScanNode(customer) (id=1), Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=9.55 MB(10013424 B), Peak=10.20 MB(10697136 B) + MemTracker Label=VDataStreamSender:78208b15e064527-a84c5c0b04c04fd1, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=59.80 MB(62701880 B), Peak=59.16 MB(62033048 B) + MemTracker Label=Scanner#QueryId=78208b15e064527-a84c5c0b04c04fcf, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker 
Label=RuntimeFilterMgr, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=13.62 KB(13952 B), Peak=0(0 B) + MemTracker Label=BufferedBlockMgr2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=ExecNode:VNewOlapScanNode(lineorder) (id=0), Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=6.03 MB(6318064 B), Peak=4.02 MB(4217664 B) + MemTracker Label=VDataStreamSender:78208b15e064527-a84c5c0b04c04fd0, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=2.34 KB(2400 B), Peak=0(0 B) +``` + +</version> diff --git a/docs/en/docs/admin-manual/memory-management/memory-tracker.md b/docs/en/docs/admin-manual/memory-management/memory-tracker.md new file mode 100644 index 0000000000..8fa8fa8d16 --- /dev/null +++ b/docs/en/docs/admin-manual/memory-management/memory-tracker.md @@ -0,0 +1,98 @@ +--- +{ + "title": "Memory Tracker", + "language": "en" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Memory Tracker + +<version since="1.2.0"> + +The Memory Tracker records the memory usage of the Doris BE process, including the memory used in the life cycle of tasks such as query, import, Compaction, and Schema Change, as well as various caches for memory control and analysis. 
+ +## Principle + +Each query, import, or other task in the system creates its own Memory Tracker when it is initialized and puts the Memory Tracker into TLS (Thread Local Storage) during execution. Each memory application and release of the BE process consumes the Memory Tracker in the Mem Hook, and the results are summarized and displayed at the end. + +For detailed design and implementation, please refer to: +https://cwiki.apache.org/confluence/display/DORIS/DSIP-002%3A+Refactor+memory+tracker+on+BE +https://shimo.im/docs/DT6JXDRkdTvdyV3G + +## View statistics + +The real-time memory statistics results can be viewed through Doris BE's Web page http://ip:http_port/mem_tracker. +For the memory statistics results of historical queries, you can view the `peakMemoryBytes` of each query in `fe/log/fe.audit.log`, or search `Deregister query/load memory tracker, queryId` in `be/log/be.INFO` to view the memory peak of each query on a single BE. + +### Home `/mem_tracker` + + +1. Type: The memory used by Doris BE is divided into the following categories +- process: The total memory of the process, the sum of all other types. +- global: Global Memory Trackers with the same life cycle as the process, such as each Cache, Tablet Manager, Storage Engine, etc. +- query: The sum of the memory of all queries. +- load: The sum of the memory of all imports. +- tc/jemalloc_cache: The cache of the general memory allocator TCMalloc or Jemalloc. You can view the original profile of the memory allocator in real time at http://ip:http_port/memz. +- compaction, schema_change, consistency, batch_load, clone: The sum of the memory of all Compaction, Schema Change, Consistency, Batch Load, and Clone tasks, respectively. + +2. Current Consumption(Bytes): The current memory value, in bytes. +3. Current Consumption(Normalize): .G.M.K formatted output of the current memory value. +4. 
Peak Consumption (Bytes): The peak memory value since the BE process started, in bytes; it is reset after the BE restarts. +5. Peak Consumption(Normalize): The .G.M.K formatted output of the peak memory value since the BE process started; it is reset after the BE restarts. + +### Global Type `/mem_tracker?type=global` + + +1. Label: Memory Tracker name +2. Parent Label: Indicates the parent-child relationship between two Memory Trackers. The memory recorded by the Child Tracker is a subset of the Parent Tracker. There may be intersections between the memories recorded by different Trackers with the same Parent. + +- Orphan: The Tracker consumed by default. Memory that does not specify a tracker will be recorded in Orphan by default. In addition to the Child Trackers subdivided below, Orphan also includes some memory that is inconvenient to accurately subdivide and count, including BRPC. + - LoadChannelMgr: The sum of the memory of all imported Load Channel stages, used to write the scanned data to the Segment file on disk, a subset of Orphan. + - StorageEngine: The memory consumed by the storage engine while loading the data directory, a subset of Orphan. + - SegCompaction: The memory sum of all SegCompaction tasks, a subset of Orphan. + - SegmentMeta: The memory used by segment metadata such as footer or index page, a subset of Orphan. + - TabletManager: The memory consumed by the storage engine to get, add, and delete Tablets, a subset of Orphan. + - BufferAllocator: Only used for memory multiplexing in the non-vectorized Partitioned Agg process, a subset of Orphan. + +- DataPageCache: Used to cache data Pages to speed up Scan. +- IndexPageCache: Used to cache the indexes of data Pages to speed up Scan. +- SegmentCache: Used to cache opened Segments, such as index information. +- DiskIO: Used to cache Disk IO data, only used in non-vectorization. 
+- ChunkAllocator: Used to cache power-of-2 memory blocks, and reuse memory at the application layer. +- LastestSuccessChannelCache: Used to cache the LoadChannel of the import receiver. +- DeleteBitmap AggCache: Gets aggregated delete_bitmap on rowset_id and version. + +### Query Type `/mem_tracker?type=query` + + +1. Limit: The upper limit of memory used by a single query. View `exec_mem_limit` with `show session variables` and modify it with `set exec_mem_limit`. +2. Label: The label naming rule of the Tracker for a single query is `Query#Id=xxx`. +3. Parent Label: Trackers whose Parent is `Query#Id=xxx` record the memory used by different operators during query execution. + +### Load Type `/mem_tracker?type=load` + + +1. Limit: The import is divided into two stages: Fragment Scan and Load Channel to write Segment to disk. The upper memory limit of the Scan phase can be viewed and modified through `show session variables`; the Segment write-to-disk phase does not have a separate memory upper limit for each import, but a total upper limit for all imports, corresponding to `load_process_max_memory_limit_percent` and `load_process_max_memory_limit_bytes` in be.conf. +2. Label: The label naming rule of a single import Scan stage Tracker is `Load#Id=xxx`; the Label naming rule of a single import Segment write-to-disk stage Tracker is `LoadChannel#senderIp=xxx#loadID=xxx`. +3. Parent Label: Trackers whose Parent is `Load#Id=xxx` record the memory used by different operators during the import Scan stage; Trackers whose Parent is `LoadChannelMgrTrackerSet` record the memory used by the Insert and Flush-to-disk processes, and are associated with the Segment write-to-disk stage Tracker by the `loadID` at the end of the Label. 
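
The Label naming rules listed above (`Query#Id=xxx`, `Load#Id=xxx`, `LoadChannel#senderIp=xxx#loadID=xxx`) can be used to group tracker labels when reading the `/mem_tracker` pages or the `Memory Tracker Summary` log. The sketch below is only an illustration: the function name, the category names, and the regular expressions are assumptions made for this example and are not part of Doris.

```python
import re

# Hypothetical category names mapped to the label naming rules from the text.
_LABEL_RULES = [
    ("query", re.compile(r"^Query#Id=(?P<id>.+)$")),
    ("load_scan", re.compile(r"^Load#Id=(?P<id>.+)$")),
    ("load_channel",
     re.compile(r"^LoadChannel#senderIp=(?P<sender>.+?)#loadID=(?P<id>.+)$")),
]


def classify_label(label):
    """Return (category, fields) for a tracker label.

    Labels that match none of the three documented rules, for example
    the global trackers such as Orphan, fall into the "other" category.
    """
    for category, pattern in _LABEL_RULES:
        match = pattern.match(label)
        if match:
            return category, match.groupdict()
    return "other", {}


print(classify_label("Query#Id=f78208b15e064527-a84c5c0b04c04fcf"))
print(classify_label("LoadChannel#senderIp=172.1.1.1#loadID=abc"))
print(classify_label("Orphan"))
```

The `loadID` field extracted from a `LoadChannel#...` label is the value the text says links the write-to-disk stage Tracker to its import.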
+
+</version>
diff --git a/docs/zh-CN/docs/admin-manual/memory-management/be-oom-analysis.md b/docs/zh-CN/docs/admin-manual/memory-management/be-oom-analysis.md
new file mode 100644
index 0000000000..257e93d95c
--- /dev/null
+++ b/docs/zh-CN/docs/admin-manual/memory-management/be-oom-analysis.md
@@ -0,0 +1,84 @@
+---
+{
+ "title": "BE OOM Analysis",
+ "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# BE OOM Analysis
+
+<version since="1.2.0">
+
+Ideally, as described in [Memory Limit Exceeded Analysis](../admin-manual/memory-management/memory-limit-exceeded-analysis.md), we periodically check the remaining available memory of the operating system and respond promptly when memory is insufficient, for example by triggering memory GC to release caches or by cancelling queries that exceed the memory limit. However, because refreshing the process memory statistics and memory GC both lag somewhat, and it is hard to catch every large memory allocation, there is still a risk of OOM when the cluster is under heavy pressure.
+
+## Solution
+Refer to [BE Configuration](../admin-manual/config/be-config.md) to decrease `mem_limit` and increase `max_sys_mem_available_low_water_mark_bytes` in `be.conf`.
+
+## Memory Analysis
+To further understand where the BE process was using memory before the OOM and to reduce process memory usage, follow the steps below.
+
+1. Run `dmesg -T` to confirm the time of the OOM and the process memory at the time of the OOM.
+
+2.
Check whether the end of be/log/be.INFO contains a `Memory Tracker Summary` log. If it does, BE has already detected that memory exceeded the limit, so continue to step 3; otherwise continue to step 8.
+```
+Memory Tracker Summary:
+ Type=consistency, Used=0(0 B), Peak=0(0 B)
+ Type=batch_load, Used=0(0 B), Peak=0(0 B)
+ Type=clone, Used=0(0 B), Peak=0(0 B)
+ Type=schema_change, Used=0(0 B), Peak=0(0 B)
+ Type=compaction, Used=0(0 B), Peak=0(0 B)
+ Type=load, Used=0(0 B), Peak=0(0 B)
+ Type=query, Used=206.67 MB(216708729 B), Peak=565.26 MB(592723181 B)
+ Type=global, Used=930.42 MB(975614571 B), Peak=1017.42 MB(1066840223 B)
+ Type=tc/jemalloc_cache, Used=51.97 MB(54494616 B), Peak=-1.00 B(-1 B)
+ Type=process, Used=1.16 GB(1246817916 B), Peak=-1.00 B(-1 B)
+ MemTrackerLimiter Label=Orphan, Type=global, Limit=-1.00 B(-1 B), Used=474.20 MB(497233597 B), Peak=649.18 MB(680718208 B)
+ MemTracker Label=BufferAllocator, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B)
+ MemTracker Label=LoadChannelMgr, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B)
+ MemTracker Label=StorageEngine, Parent Label=Orphan, Used=320.56 MB(336132488 B), Peak=322.56 MB(338229824 B)
+ MemTracker Label=SegCompaction, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B)
+ MemTracker Label=SegmentMeta, Parent Label=Orphan, Used=948.64 KB(971404 B), Peak=943.64 KB(966285 B)
+ MemTracker Label=TabletManager, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B)
+ MemTrackerLimiter Label=DataPageCache, Type=global, Limit=-1.00 B(-1 B), Used=455.22 MB(477329882 B), Peak=454.18 MB(476244180 B)
+ MemTrackerLimiter Label=IndexPageCache, Type=global, Limit=-1.00 B(-1 B), Used=1.00 MB(1051092 B), Peak=0(0 B)
+ MemTrackerLimiter Label=SegmentCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
+ MemTrackerLimiter Label=DiskIO, Type=global, Limit=2.47 GB(2655423201 B), Used=0(0 B), Peak=0(0 B)
+ MemTrackerLimiter Label=ChunkAllocator, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
+ MemTrackerLimiter Label=LastestSuccessChannelCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
+ MemTrackerLimiter
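Label=DeleteBitmap AggCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
+```
+
+3. When the end of be/log/be.INFO before the OOM contains a log showing that system memory exceeded the limit, use the log analysis method in [Memory Limit Exceeded Analysis](../admin-manual/memory-management/memory-limit-exceeded-analysis.md) to check the memory usage of each category in the process. If `type=query` uses the most memory, continue to step 4 when the queries running before the OOM are known, otherwise continue to step 5. If `type=load` uses the most memory, continue to step 6. If `type=global` uses the most memory, continue to step 7.
+
+4. When `type=query` uses a lot of memory and the queries before the OOM are known, for example in a test cluster or with scheduled tasks: restart the BE node, refer to [Memory Tracker](../admin-manual/memory-management/memory-tracker.md) to view the real-time memory tracker statistics, run `set global enable_profile=true` and retry the query, observe where specific operators use memory, confirm whether the query's memory usage is reasonable, and then consider optimizing the SQL's memory usage, for example by adjusting the join order.
+
+5. When `type=query` uses a lot of memory and the queries before the OOM are unknown, for example in a production cluster: search `be/log/be.INFO` from the end backwards for `Deregister query/load memory tracker, queryId` and `Register query/load memory tracker, query/load id`. If both log lines are printed for the same query id, the query or import has completed; if there is only a Register line and no Deregister line, the query or import was still running before the OOM. This yields all queries and imports that were running before the OOM. Analyze the memory usage of the suspicious large-memory queries using the method in step 4.
+
+6. When `type=load` uses a lot of memory.
+
+7. When `type=global` uses a lot of memory, look at the detailed `type=global` statistics printed in the second half of the `Memory Tracker Summary` log. When DataPageCache, IndexPageCache, SegmentCache, ChunkAllocator, LastestSuccessChannelCache, and so on use a lot of memory, refer to [BE Configuration](../admin-manual/config/be-config.md) and consider changing the cache sizes. When Orphan uses too much memory, continue the analysis as follows.
+    - If the sum of the trackers with `Parent Label=Orphan` accounts for only a small part of the Orphan memory, a large amount of memory is not being accurately tracked, such as memory used by brpc. In that case, consider using the heap profile method in [Memory Tracker](../community/developer-guide/debug-tool.md) to further analyze where the memory is used.
+    - If the sum of the trackers with `Parent Label=Orphan` accounts for most of the Orphan memory: when `Label=TabletManager` uses a lot of memory, check the number of Tablets in the cluster, and if there are too many, consider deleting outdated tables or data that will no longer be used; when `Label=StorageEngine` uses too much memory, check the number of Segment files in the cluster, and if there are too many, consider triggering compaction manually.
+
+8.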
If `be/log/be.INFO` did not print a `Memory Tracker Summary` log before the OOM, BE did not detect the memory overrun in time. Check the Grafana memory monitoring to confirm the memory growth trend of BE before the OOM. If the OOM is reproducible, consider adding `memory_debug=true` to `be.conf`; after the cluster restarts, the cluster memory statistics are printed every second. Look at the last `Memory Tracker Summary` log before the OOM and continue the analysis from step 3.
+
+</version>
diff --git a/docs/zh-CN/docs/admin-manual/memory-management/memory-limit-exceeded-analysis.md b/docs/zh-CN/docs/admin-manual/memory-management/memory-limit-exceeded-analysis.md
new file mode 100644
index 0000000000..432c3232b6
--- /dev/null
+++ b/docs/zh-CN/docs/admin-manual/memory-management/memory-limit-exceeded-analysis.md
@@ -0,0 +1,179 @@
+---
+{
+ "title": "Memory Limit Exceeded Analysis",
+ "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Memory Limit Exceeded Analysis
+
+<version since="1.2.0">
+
+When a query or import reports `Memory limit exceeded`, the possible causes are: the process memory exceeds its limit, the remaining available system memory is insufficient, or the memory limit of a single query execution is exceeded.
+```
+ERROR 1105 (HY000): errCode = 2, detailMessage = Memory limit exceeded:<consuming tracker:<xxx>, xxx. backend 172.1.1.1 process memory used xxx GB, limit xxx GB. If query tracker exceed, `set exec_mem_limit=8G` to change limit, details mem usage see be.INFO.
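+```
+
+## Process Memory Exceeds Limit OR Insufficient Remaining System Memory
+When the following error is returned, either the process memory exceeds its limit or the remaining available system memory is insufficient. The memory statistics in the message show which one applies.
+```
+ERROR 1105 (HY000): errCode = 2, detailMessage = Memory limit exceeded:<consuming tracker:<Query#Id=3c88608cf35c461d-95fe88969aa6fc30>, process memory used 2.68 GB exceed limit 2.47 GB or sys mem available 50.95 GB less than low water mark 3.20 GB, failed alloc size 2.00 MB>, executing msg:<execute:<ExecNode:VAGGREGATION_NODE (id=7)>>. backend 172.1.1.1 process memory used 2.68 GB, limit 2.47 GB. If query tracker exceed, `set exec_mem_limit=8G` to change limit, details mem usage see be.INFO
+```
+
+### Error Message Analysis
+The error message has three parts:
+1. `Memory limit exceeded:<consuming tracker:<Query#Id=3c88608cf35c461d-95fe88969aa6fc30>`: The limit was found to be exceeded during a memory allocation while executing query `3c88608cf35c461d-95fe88969aa6fc30`.
+2. `process memory used 2.68 GB exceed limit 2.47 GB or sys mem available 50.95 GB less than low water mark 3.20 GB, failed alloc size 2.00 MB`: The cause is that the 2.68 GB of memory used by the BE process exceeds the 2.47 GB limit. The limit comes from mem_limit * system MemTotal in be.conf, which by default equals 80% of the total memory of the operating system. The remaining available system memory, 50.95 GB, is still above the 3.20 GB low water mark. This attempt was to allocate 2 MB.
+3. `executing msg:<execute:<ExecNode:VAGGREGATION_NODE (id=7)>>. backend 172.1.1.1 process memory used 2.68 GB, limit 2.47 GB`: The allocation was made in `ExecNode:VAGGREGATION_NODE (id=7)`, the IP of the current BE node is 172.1.1.1, and the memory statistics of the BE node are printed again.
+
+### Log Analysis
+The following log can also be found in log/be.INFO to confirm whether the current process memory usage is as expected. The log also has three parts:
+1. `Process Memory Summary`: process memory statistics.
+2. `Alloc Stacktrace`: the stack that triggered the memory limit check. This is not necessarily where the large allocation was made.
+3. `Memory Tracker Summary`: process memory tracker statistics. Refer to [Memory Tracker](../admin-manual/memory-management/memory-tracker.md) to analyze where memory is used.
+Notes:
+1. The process-memory-exceeded log is printed at an interval of 1 second. After the process memory exceeds the limit, memory allocations at most places in BE detect it, try to run the predefined callback, and print the process-memory-exceeded log. So if the Try Alloc value in the log is small, there is no need to look at `Alloc Stacktrace`; analyze `Memory Tracker Summary` directly.
+2. After the process memory exceeds the limit, BE triggers memory GC.
+
+```
+W1127 17:23:16.372572 19896 mem_tracker_limiter.cpp:214] System Mem Exceed Limit Check Faild, Try Alloc: 1062688
+Process Memory Summary:
+ process memory used 2.68 GB limit 2.47 GB, sys mem available 50.95 GB min reserve 3.20 GB, tc/jemalloc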
allocator cache 51.97 MB +Alloc Stacktrace: + @ 0x50028e8 doris::MemTrackerLimiter::try_consume() + @ 0x50027c1 doris::ThreadMemTrackerMgr::flush_untracked_mem<>() + @ 0x595f234 malloc + @ 0xb888c18 operator new() + @ 0x8f316a2 google::LogMessage::Init() + @ 0x5813fef doris::FragmentExecState::coordinator_callback() + @ 0x58383dc doris::PlanFragmentExecutor::send_report() + @ 0x5837ea8 doris::PlanFragmentExecutor::update_status() + @ 0x58355b0 doris::PlanFragmentExecutor::open() + @ 0x5815244 doris::FragmentExecState::execute() + @ 0x5817965 doris::FragmentMgr::_exec_actual() + @ 0x581fffb std::_Function_handler<>::_M_invoke() + @ 0x5a6f2c1 doris::ThreadPool::dispatch_thread() + @ 0x5a6843f doris::Thread::supervise_thread() + @ 0x7feb54f931ca start_thread + @ 0x7feb5576add3 __GI___clone + @ (nil) (unknown) + +Memory Tracker Summary: + Type=consistency, Used=0(0 B), Peak=0(0 B) + Type=batch_load, Used=0(0 B), Peak=0(0 B) + Type=clone, Used=0(0 B), Peak=0(0 B) + Type=schema_change, Used=0(0 B), Peak=0(0 B) + Type=compaction, Used=0(0 B), Peak=0(0 B) + Type=load, Used=0(0 B), Peak=0(0 B) + Type=query, Used=206.67 MB(216708729 B), Peak=565.26 MB(592723181 B) + Type=global, Used=930.42 MB(975614571 B), Peak=1017.42 MB(1066840223 B) + Type=tc/jemalloc_cache, Used=51.97 MB(54494616 B), Peak=-1.00 B(-1 B) + Type=process, Used=1.16 GB(1246817916 B), Peak=-1.00 B(-1 B) + MemTrackerLimiter Label=Orphan, Type=global, Limit=-1.00 B(-1 B), Used=474.20 MB(497233597 B), Peak=649.18 MB(680718208 B) + MemTracker Label=BufferAllocator, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTracker Label=LoadChannelMgr, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTracker Label=StorageEngine, Parent Label=Orphan, Used=320.56 MB(336132488 B), Peak=322.56 MB(338229824 B) + MemTracker Label=SegCompaction, Parent Label=Orphan, Used=0(0 B), Peak=0(0 B) + MemTracker Label=SegmentMeta, Parent Label=Orphan, Used=948.64 KB(971404 B), Peak=943.64 KB(966285 B) + MemTracker Label=TabletManager, 
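Parent Label=Orphan, Used=0(0 B), Peak=0(0 B)
+ MemTrackerLimiter Label=DataPageCache, Type=global, Limit=-1.00 B(-1 B), Used=455.22 MB(477329882 B), Peak=454.18 MB(476244180 B)
+ MemTrackerLimiter Label=IndexPageCache, Type=global, Limit=-1.00 B(-1 B), Used=1.00 MB(1051092 B), Peak=0(0 B)
+ MemTrackerLimiter Label=SegmentCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
+ MemTrackerLimiter Label=DiskIO, Type=global, Limit=2.47 GB(2655423201 B), Used=0(0 B), Peak=0(0 B)
+ MemTrackerLimiter Label=ChunkAllocator, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
+ MemTrackerLimiter Label=LastestSuccessChannelCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
+ MemTrackerLimiter Label=DeleteBitmap AggCache, Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
+```
+
+### Calculating the Remaining Available System Memory
+When the available system memory in the error message is below the low water mark, it is also handled as a process memory overrun. The available system memory value comes from `MemAvailable` in `/proc/meminfo`. When `MemAvailable` is insufficient, further memory allocations may return std::bad_alloc or cause the BE process to OOM. Because refreshing the process memory statistics and BE memory GC both lag somewhat, a small memory buffer is reserved as the low water mark to avoid OOM as far as possible.
+
+`MemAvailable` is the total amount of memory that the operating system estimates it can provide to user processes while avoiding swap as far as possible, taking into account the currently free memory, buffers, caches, memory fragmentation, and other factors. A simplified formula is: MemAvailable = MemFree - LowWaterMark + (PageCache - min(PageCache / 2, LowWaterMark)). It is the same as the `available` value shown by the `free` command. For details, see:
+https://serverfault.com/questions/940196/why-is-memavailable-a-lot-less-than-memfreebufferscached
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34e431b0ae398fc54ea69ff85ec700722c9da773
+
+The low water mark is at most 1.6G by default. It is calculated from `MemTotal`, `vm/min_free_kbytes`, `config::mem_limit`, and `config::max_sys_mem_available_low_water_mark_bytes`, while avoiding wasting too much memory. `MemTotal` is the total system memory, also taken from `/proc/meminfo`. `vm/min_free_kbytes` is the buffer that the operating system reserves for the memory GC process, usually between 0.4% and 5%; on some cloud servers `vm/min_free_kbytes` may be 5%, which makes the available system memory appear smaller than it really is. Increasing `config::max_sys_mem_available_low_water_mark_bytes` reserves a larger memory buffer for Full GC on machines with more than 16G of memory; decreasing it uses memory as fully as possible.
+
+## A Single Query or Import Execution Exceeds the Memory Limit
+When the following error is returned, the memory limit of a single execution has been exceeded.
+```
+ERROR 1105 (HY000): errCode = 2, detailMessage = Memory limit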
exceeded:<consuming tracker:<Query#Id=f78208b15e064527-a84c5c0b04c04fcf>, failed alloc size 1.03 MB, exceeded tracker:<Query#Id=f78208b15e064527-a84c5c0b04c04fcf>, limit 100.00 MB, peak used 99.29 MB, current used 99.25 MB>, executing msg:<execute:<ExecNode:VHASH_JOIN_NODE (id=4)>>. backend 172.24.47.117 process memory used 1.13 GB, limit 98.92 GB. If query tracker exceed, `set exec_mem_limit=8G` to change limit, details mem u [...]
+```
+
+### Error Message Analysis
+The error message has three parts:
+1. `Memory limit exceeded:<consuming tracker:<Query#Id=f78208b15e064527-a84c5c0b04c04fcf>`: The limit was found to be exceeded during a memory allocation while executing query `f78208b15e064527-a84c5c0b04c04fcf`.
+2. `failed alloc size 1.03 MB, exceeded tracker:<Query#Id=f78208b15e064527-a84c5c0b04c04fcf>, limit 100.00 MB, peak used 99.29 MB, current used 99.25 MB`: This attempt was to allocate 1.03 MB, but the current consumption of the memory tracker of query `f78208b15e064527-a84c5c0b04c04fcf`, 99.25 MB, plus 1.03 MB exceeds the 100 MB limit. The limit comes from `exec_mem_limit` in the session variables, 4G by default.
+3. `executing msg:<execute:<ExecNode:VHASH_JOIN_NODE (id=4)>>. backend 172.24.47.117 process memory used 1.13 GB, limit 98.92 GB.
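If query tracker exceed, `set exec_mem_limit=8G` to change limit, details mem usage see be.INFO.`: The allocation was made in `VHASH_JOIN_NODE (id=4)`, and the message suggests raising the memory limit of a single query with `set exec_mem_limit`.
+
+### Log Analysis
+After `set global enable_profile=true`, a log is printed in log/be.INFO when a single query exceeds its memory limit, which can be used to confirm whether the memory usage of the current query is as expected.
+The log found in log/be.INFO looks like the following and also has three parts:
+1. `Process Memory Summary`: process memory statistics.
+2. `Alloc Stacktrace`: the stack that triggered the memory limit check. This is not necessarily where the large allocation was made.
+3. `Memory Tracker Summary`: memory tracker statistics of the current query, showing the current and peak memory used by each operator of the query. For details, refer to [Memory Tracker](../admin-manual/memory-management/memory-tracker.md).
+Note: A query prints the log only once after it exceeds the memory limit. At that point, multiple threads of the query detect this and either wait for memory to be released or cancel the current query. If the Try Alloc value in the log is small, there is no need to look at `Alloc Stacktrace`; analyze `Memory Tracker Summary` directly.
+
+```
+W1128 01:34:11.016165 357796 mem_tracker_limiter.cpp:191] Memory limit exceeded:<consuming tracker:<Query#Id=78208b15e064527-a84c5c0b04c04fcf>, failed alloc size 4.00 MB, exceeded tracker:<Query#Id=78208b15e064527-a84c5c0b04c04fcf>, limit 100.00 MB, peak used 98.59 MB,
+current used 96.88 MB>, executing msg:<execute:<ExecNode:VHASH_JOIN_NODE (id=2)>>. backend 172.24.47.117 process memory used 1.13 GB, limit 98.92 GB. If query tracker exceed, `set exec_mem_limit=8G` to change limit, details mem usage see be.INFO.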
+Process Memory Summary: + process memory used 1.13 GB limit 98.92 GB, sys mem available 45.15 GB min reserve 3.20 GB, tc/jemalloc allocator cache 27.62 MB +Alloc Stacktrace: + @ 0x66cf73a doris::vectorized::HashJoinNode::_materialize_build_side() + @ 0x69cb1ee doris::vectorized::VJoinNodeBase::open() + @ 0x66ce27a doris::vectorized::HashJoinNode::open() + @ 0x5835dad doris::PlanFragmentExecutor::open_vectorized_internal() + @ 0x58351d2 doris::PlanFragmentExecutor::open() + @ 0x5815244 doris::FragmentExecState::execute() + @ 0x5817965 doris::FragmentMgr::_exec_actual() + @ 0x581fffb std::_Function_handler<>::_M_invoke() + @ 0x5a6f2c1 doris::ThreadPool::dispatch_thread() + @ 0x5a6843f doris::Thread::supervise_thread() + @ 0x7f6faa94a1ca start_thread + @ 0x7f6fab121dd3 __GI___clone + @ (nil) (unknown) + +Memory Tracker Summary: + MemTrackerLimiter Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Type=query, Limit=100.00 MB(104857600 B), Used=64.75 MB(67891182 B), Peak=104.70 MB(109786406 B) + MemTracker Label=Scanner#QueryId=78208b15e064527-a84c5c0b04c04fcf, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=RuntimeFilterMgr, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=2.09 KB(2144 B), Peak=0(0 B) + MemTracker Label=BufferedBlockMgr2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=ExecNode:VHASH_JOIN_NODE (id=2), Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=-61.44 MB(-64426656 B), Peak=290.33 KB(297296 B) + MemTracker Label=ExecNode:VEXCHANGE_NODE (id=9), Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=6.12 KB(6264 B), Peak=5.84 KB(5976 B) + MemTracker Label=VDataStreamRecvr:78208b15e064527-a84c5c0b04c04fd2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=6.12 KB(6264 B), Peak=5.84 KB(5976 B) + MemTracker Label=ExecNode:VEXCHANGE_NODE (id=10), Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, 
Used=-41.20 MB(-43198024 B), Peak=1.46 MB(1535656 B) + MemTracker Label=VDataStreamRecvr:78208b15e064527-a84c5c0b04c04fd2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=-41.20 MB(-43198024 B), Peak=1.46 MB(1535656 B) + MemTracker Label=VDataStreamSender:78208b15e064527-a84c5c0b04c04fd2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=2.34 KB(2400 B), Peak=0(0 B) + MemTracker Label=Scanner#QueryId=78208b15e064527-a84c5c0b04c04fcf, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=58.12 MB(60942224 B), Peak=57.41 MB(60202848 B) + MemTracker Label=RuntimeFilterMgr, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=BufferedBlockMgr2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=ExecNode:VNewOlapScanNode(customer) (id=1), Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=9.55 MB(10013424 B), Peak=10.20 MB(10697136 B) + MemTracker Label=VDataStreamSender:78208b15e064527-a84c5c0b04c04fd1, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=59.80 MB(62701880 B), Peak=59.16 MB(62033048 B) + MemTracker Label=Scanner#QueryId=78208b15e064527-a84c5c0b04c04fcf, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=RuntimeFilterMgr, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=13.62 KB(13952 B), Peak=0(0 B) + MemTracker Label=BufferedBlockMgr2, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=0(0 B), Peak=0(0 B) + MemTracker Label=ExecNode:VNewOlapScanNode(lineorder) (id=0), Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=6.03 MB(6318064 B), Peak=4.02 MB(4217664 B) + MemTracker Label=VDataStreamSender:78208b15e064527-a84c5c0b04c04fd0, Parent Label=Query#Id=78208b15e064527-a84c5c0b04c04fcf, Used=2.34 KB(2400 B), Peak=0(0 B) +``` + +</version> diff --git a/docs/zh-CN/docs/admin-manual/memory-management/memory-tracker.md 
b/docs/zh-CN/docs/admin-manual/memory-management/memory-tracker.md
new file mode 100644
index 0000000000..befc9d743d
--- /dev/null
+++ b/docs/zh-CN/docs/admin-manual/memory-management/memory-tracker.md
@@ -0,0 +1,98 @@
+---
+{
+ "title": "Memory Tracker",
+ "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Memory Tracker
+
+The Memory Tracker records the memory usage of the Doris BE process, including the memory used during the lifecycle of tasks such as query, import, Compaction, and Schema Change, as well as the various caches. It is used for memory control and analysis.
+
+<version since="1.2.0">
+
+## Principle
+
+Each query, import, or other task in the system creates its own Memory Tracker when it is initialized and places it in TLS (Thread Local Storage) during execution. Every memory allocation and release of the BE process is consumed against the Memory Tracker in the Mem Hook, and the results are finally aggregated and displayed.
+
+For the detailed design and implementation, see:
+https://cwiki.apache.org/confluence/display/DORIS/DSIP-002%3A+Refactor+memory+tracker+on+BE
+https://shimo.im/docs/DT6JXDRkdTvdyV3G
+
+## View Statistics
+
+Real-time memory statistics can be viewed on the Doris BE web page at http://ip:http_port/mem_tracker.
+For the memory statistics of historical queries, check the `peakMemoryBytes` of each query in `fe/log/fe.audit.log`, or search for `Deregister query/load memory tracker, queryId` in `be/log/be.INFO` to see the peak memory of each query on a single BE.
+
+### Home `/mem_tracker`
+
+
+1.
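Type: The memory used by Doris BE is divided into the following categories:
+- process: The total memory of the process, the sum of all other types.
+- global: Global Memory Trackers with the same lifecycle as the process, such as the Caches, Tablet Manager, and Storage Engine.
+- query: The total memory of all queries.
+- load: The total memory of all imports.
+- tc/jemalloc_cache: The cache of the general-purpose memory allocator TCMalloc or Jemalloc. The original profile of the memory allocator can be viewed in real time at http://ip:http_port/memz.
+- compaction, schema_change, consistency, batch_load, clone: The total memory of all Compaction, Schema Change, Consistency, Batch Load, and Clone tasks, respectively.
+
+2. Current Consumption(Bytes): The current memory value, in bytes.
+3. Current Consumption(Normalize): The current memory value formatted in .G.M.K units.
+4. Peak Consumption(Bytes): The peak memory value since the BE process started, in bytes. It is reset when the BE restarts.
+5. Peak Consumption(Normalize): The peak memory value since the BE process started, formatted in .G.M.K units. It is reset when the BE restarts.
+
+### Global Type `/mem_tracker?type=global`
+
+
+1. Label: The name of the Memory Tracker.
+2. Parent Label: Indicates the parent-child relationship between two Memory Trackers. The memory recorded by a Child Tracker is a subset of the memory recorded by its Parent Tracker. The memory recorded by different Trackers with the same Parent may overlap.
+
+- Orphan: The default Tracker. Memory allocated without a specified Tracker is recorded in Orphan. Besides the Child Trackers listed below, Orphan also includes memory that is hard to attribute precisely, such as memory used by BRPC.
+    - LoadChannelMgr: The total memory of the Load Channel stage of all imports, which writes the scanned data to Segment files on disk. A subset of Orphan.
+    - StorageEngine: The memory consumed by the storage engine while loading the data directory. A subset of Orphan.
+    - SegCompaction: The total memory of all SegCompaction tasks. A subset of Orphan.
+    - SegmentMeta: The memory used by segment metadata such as the footer or index pages. A subset of Orphan.
+    - TabletManager: The memory consumed by the storage engine when getting, adding, and deleting Tablets. A subset of Orphan.
+    - BufferAllocator: Used only for memory reuse in the non-vectorized Partitioned Agg process. A subset of Orphan.
+
+- DataPageCache: Caches data Pages to speed up Scan.
+- IndexPageCache: Caches the indexes of data Pages to speed up Scan.
+- SegmentCache: Caches opened Segments, such as their index information.
+- DiskIO: Caches Disk IO data. Used only in non-vectorized execution.
+- ChunkAllocator: Caches memory blocks whose sizes are powers of 2, so that memory can be reused at the application layer.
+- LastestSuccessChannelCache: Caches the LoadChannel of the import receiver.
+- DeleteBitmap AggCache: Caches the delete_bitmap aggregated on rowset_id and version.
+
+### Query Type `/mem_tracker?type=query`
+
+
+1. Limit: The upper limit of memory used by a single query. Use `show session variables` to view `exec_mem_limit`, which can also be modified.
+2. Label: The Tracker of a single query is labeled `Query#Id=xxx`.
+3.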
Parent Label: Trackers whose Parent is `Query#Id=xxx` record the memory used by the different operators during query execution.
+
+### Load Type `/mem_tracker?type=load`
+
+
+1. Limit: An import has two stages: Fragment Scan, and Load Channel writing Segments to disk. The memory limit of the Scan stage is `load_mem_limit`, which can be viewed and modified through `show session variables`. The Segment write stage has no separate limit for each import; instead there is a total limit for all imports, set by `load_process_max_memory_limit_percent` and `load_process_max_memory_limit_bytes` in be.conf.
+2. Label: The Scan-stage Tracker of a single import is labeled `Load#Id=xxx`; the Segment write-stage Tracker of a single import is labeled `LoadChannel#senderIp=xxx#loadID=xxx`.
+3. Parent Label: Trackers whose Parent is `Load#Id=xxx` record the memory used by the different operators during the Scan stage of the import. Trackers whose Parent is `LoadChannelMgrTrackerSet` record the memory used by the Insert and Flush-to-disk process of each intermediate MemTable in the Segment write stage; the `loadID` at the end of the Label associates them with the Segment write-stage Tracker.
+
+</version>
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org