kangkaisen opened a new issue #2714: Critical bug: Doris BE is not high available
URL: https://github.com/apache/incubator-doris/issues/2714

**Describe the bug**
If there are many stream load jobs and one BE goes down for a long time and then restarts, many queries fail.

```
W0108 20:37:14.658995 192765 olap_scanner.cpp:122] fail to init reader.res=-214
W0108 20:37:14.659014 192765 olap_scanner.cpp:63] OlapScanner preapre failed, status:failed to initialize storage reader. tablet=516580.1990635423.784b48ecc429e107-79027b76e46f3191, res=-214, backend=10.26.45.28
W0108 20:37:14.671918 192759 rowset_graph.cpp:194] fail to find path in version_graph. spec_version: 0-520
W0108 20:37:14.672857 192759 tablet.cpp:489] status:-214, tablet:262415.864251184.9d4129a7bc52c626-9485230b00cf3791, missed version for version:0-520
W0108 20:37:14.672900 192759 tablet.cpp:982] 262415.864251184.9d4129a7bc52c626-9485230b00cf3791 has 1 missed version:520-520,
W0108 20:37:14.672914 192759 olap_sca
```

**To Reproduce**
1. Keep loading data into one table
2. Keep querying this table
3. Take a BE down for a long time (10 minutes)
4. Restart the BE
5. Some queries will fail

**Additional context**
```
2020-01-09 10:18:08,899 WARN 86 [TabletInvertedIndex.tabletReport():162] replica 173673758 of tablet 173673756 on backend 126581199 need recovery. replica in FE: [replicaId=173673758, BackendId=126581199, version=721476, versionHash=2865055093552332621, dataSize=2710247, rowCount=63684, lastFailedVersion=-1, lastFailedVersionHash=0, lastSuccessVersion=721476, lastSuccessVersionHash=2865055093552332621, lastFailedTimestamp=-1, schemaHash=997644236, state=NORMAL], report version 721411-3302667986958222485, report schema hash: 997644236, is bad: unknown, is version missing: true
2020-01-09 10:18:09,670 WARN 86 [ReportHandler.handleRecoverTablet():738] find 187 tablets with report version less than version in meta, or is set bad, on backend 126581199, they need clone or force recovery
2020-01-09 10:18:09,670 WARN 86 [ReportHandler.handleRecoverTablet():744] force recovery is disable. try reset the tablets' version or set it as bad, and waiting clone
2020-01-09 10:18:09,672 WARN 86 [Replica.updateVersionInfoForRecovery():262] update replica 173673795 on backend 126581199's version for recovery. version: 721476-2865055093552332621:721411--1. last failed version: -1-0:721412--1, last success version: 721476-2865055093552332621:721476-2865055093552332621
```

The reason is: **If a BE is down for a long time and then restarts, the tablets on that BE will be missing some versions. We don't handle the replicas with missing versions on the restarted BE, and queries are still sent to that BE.**
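To make the failure mode concrete, here is a minimal Python sketch of the kind of check that appears to be missing: before routing a query, drop any replica that is flagged as missing versions or that lags behind the version the query needs. This is an illustration only, not Doris code. The `Replica` class, the `queryable_replicas` function, and backend ids 101 and 102 are hypothetical; the field names only loosely mirror the FE log above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    # Hypothetical model of a replica as seen by the query planner.
    # Field names loosely follow the FE log; this is not the Doris class.
    replica_id: int
    backend_id: int
    version: int                    # highest version the replica reports
    last_failed_version: int = -1   # -1 means no failed load is recorded
    version_missing: bool = False   # "is version missing: true" in the report


def queryable_replicas(replicas: List[Replica], visible_version: int) -> List[Replica]:
    """Keep only replicas that can serve a read at visible_version."""
    return [
        r for r in replicas
        if not r.version_missing
        and r.last_failed_version < 0
        and r.version >= visible_version
    ]


replicas = [
    Replica(replica_id=1, backend_id=101, version=721476),
    Replica(replica_id=2, backend_id=102, version=721476),
    # The restarted BE: it reports an older version and a gap in its versions.
    Replica(replica_id=173673758, backend_id=126581199,
            version=721411, version_missing=True),
]

candidates = queryable_replicas(replicas, visible_version=721476)
print([r.backend_id for r in candidates])  # prints [101, 102]
```

With a filter like this, the replica on the restarted BE would be skipped until clone or recovery brings it up to date, so queries would not hit the `res=-214` missed-version error on that backend.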