kangkaisen opened a new issue #2714: Critical bug: Doris BE is not high 
available
URL: https://github.com/apache/incubator-doris/issues/2714
 
 
   **Describe the bug**
   If there are many stream load jobs and one BE is down for a long time and 
then restarts, many queries will fail.
   ```
   W0108 20:37:14.658995 192765 olap_scanner.cpp:122] fail to init 
reader.res=-214
   W0108 20:37:14.659014 192765 olap_scanner.cpp:63] OlapScanner preapre 
failed, status:failed to initialize storage reader. 
tablet=516580.1990635423.784b48ecc429e107-79027b76e46f3191, res=-214, 
backend=10.26.45.28
   W0108 20:37:14.671918 192759 rowset_graph.cpp:194] fail to find path in 
version_graph. spec_version: 0-520
   W0108 20:37:14.672857 192759 tablet.cpp:489] status:-214, 
tablet:262415.864251184.9d4129a7bc52c626-9485230b00cf3791, missed version for 
version:0-520
   W0108 20:37:14.672900 192759 tablet.cpp:982] 
262415.864251184.9d4129a7bc52c626-9485230b00cf3791 has 1 missed version:520-520,
   W0108 20:37:14.672914 192759 olap_sca
   ```
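The `missed version:520-520` warning above suggests that the tablet cannot assemble a continuous chain of rowsets from version 0 up to 520. A minimal sketch of that check follows, assuming each rowset covers an inclusive version range. The function name and the example ranges are hypothetical, and this is a simplified coverage check, not the BE's actual `version_graph` path search:

```python
from typing import List, Tuple

# Inclusive (start, end) version range covered by one rowset (assumed model).
Version = Tuple[int, int]


def missed_versions(existing: List[Version], spec_end: int) -> List[Version]:
    """Return the gaps in [0, spec_end] that no existing range covers."""
    gaps: List[Version] = []
    next_needed = 0  # lowest version that is not covered yet
    for start, end in sorted(existing):
        if next_needed > spec_end:
            break
        if start > next_needed:
            # Nothing covers next_needed .. start-1 (capped at spec_end).
            gaps.append((next_needed, min(start - 1, spec_end)))
        next_needed = max(next_needed, end + 1)
    if next_needed <= spec_end:
        gaps.append((next_needed, spec_end))
    return gaps


# Hypothetical tablet state: the load for version 520 never reached this BE.
print(missed_versions([(0, 519), (521, 530)], 520))
```

With these made-up ranges, a read of `0-520` cannot be served because version 520 is absent, which is the situation the warnings describe.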
   
   **To Reproduce**
   
   1. Keep loading data into one table
   2. Keep querying this table
   3. Keep one BE down for a long time (10 minutes)
   4. Restart the BE
   5. Some queries will fail
   
   **Additional context**
   ```
   2020-01-09 10:18:08,899 WARN 86 [TabletInvertedIndex.tabletReport():162] 
replica 173673758 of tablet 173673756 on backend 126581199 need recovery. 
replica in FE: [replicaId=173673758, BackendId=126581199, version=721476, 
versionHash=2865055093552332621, dataSize=2710247, rowCount=63684, 
lastFailedVersion=-1, lastFailedVersionHash=0, lastSuccessVersion=721476, 
lastSuccessVersionHash=2865055093552332621, lastFailedTimestamp=-1, 
schemaHash=997644236, state=NORMAL], report version 721411-3302667986958222485, 
report schema hash: 997644236, is bad: unknown, is version missing: true
   
   2020-01-09 10:18:09,670 WARN 86 [ReportHandler.handleRecoverTablet():738] find 187 tablets with report version less than version in meta, or is set bad, on backend 126581199, they need clone or force recovery
   2020-01-09 10:18:09,670 WARN 86 [ReportHandler.handleRecoverTablet():744] force recovery is disable. try reset the tablets' version or set it as bad, and waiting clone
   
   
   2020-01-09 10:18:09,672 WARN 86 [Replica.updateVersionInfoForRecovery():262] update replica 173673795 on backend 126581199's version for recovery. version: 721476-2865055093552332621:721411--1. last failed version: -1-0:721412--1, last success version: 721476-2865055093552332621:721476-2865055093552332621
   ```
   
   The reason is:
   **If a BE is down for a long time and then restarts, the tablets on that BE 
will be missing some versions. We do not handle these version-missing replicas 
on the restarted BE, and queries are still sent to it.**
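
One way to picture the missing handling is a filter applied before a query picks a replica: skip any replica whose version is behind the version the query must read. The sketch below is only an illustration. The `Replica` class and `queryable_replicas` function are hypothetical (the field names loosely mirror the FE log above), backend ids 1 and 2 are made up, and this is not the FE's actual scheduling code:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    backend_id: int
    version: int                    # highest version this replica reports
    last_failed_version: int = -1   # -1 means no failed version is recorded


def queryable_replicas(replicas: List[Replica], visible_version: int) -> List[Replica]:
    """Keep only replicas that hold every version the query needs."""
    return [
        r for r in replicas
        if r.version >= visible_version and r.last_failed_version < 0
    ]


replicas = [
    Replica(backend_id=1, version=721476),          # hypothetical healthy BE
    Replica(backend_id=2, version=721476),          # hypothetical healthy BE
    Replica(backend_id=126581199, version=721411),  # restarted BE, behind
]
candidates = queryable_replicas(replicas, visible_version=721476)
print([r.backend_id for r in candidates])
```

Under this assumption the restarted BE would receive no queries for the tablet until a clone brings its replica back up to the visible version.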
   
   
   
