Quanlong Huang created IMPALA-14969:
---------------------------------------

             Summary: False completion in HdfsScanNode when scanner failed to 
open due to cancellation
                 Key: IMPALA-14969
                 URL: https://issues.apache.org/jira/browse/IMPALA-14969
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang


False completion in HdfsScanNode when scanner failed to open due to cancellation

When using MT_DOP=0, HdfsScanNode spawns scanner threads to fill its RowBatch 
queue asynchronously. The fragment instance thread invokes 
HdfsScanNode::GetNext() to get RowBatches from the queue. If it gets nothing 
back (the queue has been shut down), eos (EndOfStream) is set to true, meaning 
the execution of this node completes. In the normal case, HdfsScanNode::GetNext() 
returns an OK Status and sets eos=true. Setting eos=true also adds a "Last Batch 
Returned" event to the Node Lifecycle Event Timeline in the profile.

However, the RowBatch queue can also be shut down because a scanner aborts 
unexpectedly, e.g. when the fragment instance is cancelled.
Fragment instance cancellation code path:
  ControlService::CancelQueryFInstances -> QueryState::Cancel -> 
FragmentInstanceState::Cancel -> RuntimeState::Cancel -> sets is_cancelled_ to 
true. ScannerContext::cancelled() then returns true because of this.
Scanner thread cancellation code path example:
  HdfsScanNode::ScannerThread -> ProcessSplit -> CreateAndOpenScannerHelper -> 
HdfsParquetScanner::Open -> ProcessFooter -> ScannerContext::Stream::ReadBytes 
-> GetBytes -> GetBytesInternal -> GetNextBuffer -> finds 
ScannerContext::cancelled() is true and returns 
Status::CancelledInternal("ScannerContext").
  The non-OK Status returned from CreateAndOpenScannerHelper() is passed to 
HdfsScanNode::SetError -> SetDoneInternal -> thread_state_.Shutdown(), which 
shuts down the RowBatch queue (sketched below).
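
Continuing the same hypothetical sketch (stand-in types from above, not the 
real Impala code), the scanner thread's error path looks roughly like this: the 
open failure is recorded and the queue is shut down, which is exactly the 
signal GetNext() treats as end-of-stream:
{code:cpp}
// Hypothetical continuation of the earlier sketch: the scanner thread's error
// path when opening the scanner fails because the fragment instance was
// cancelled.
void ScannerThread(ScanNode* node) {
  // Stand-in for the non-OK status bubbling up from
  // HdfsParquetScanner::Open -> ... -> GetNextBuffer.
  Status open_status{"CANCELLED: ScannerContext"};
  if (!open_status.ok()) {
    node->error = open_status;  // models the HdfsScanNode::SetError() step
    node->queue.Shutdown();     // models SetDoneInternal() -> thread_state_.Shutdown()
    return;                     // note: no RowBatch was ever added to the queue
  }
  // ... the normal path would scan the split and Add() RowBatches here ...
}
{code}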

In this case, HdfsScanNode::GetNext() returns a non-OK status but still sets 
eos=true. The query profile then shows "Last Batch Returned" for this node, 
which is misleading, especially if nothing has actually been returned from the 
node. It's even more confusing when the cancellation is triggered internally, 
e.g. when the coordinator fragment reaches its limit and the coordinator 
cancels all the fragment instances. The query finishes successfully but some 
HdfsScanNodes have RowsRead=0 and RowsReturned=0, which makes it look like the 
table/partitions are empty or the files were skipped by runtime filters.

profile_scan_false_completion.txt is an example. In the ExecSummary, the scan 
node on tpch_nested_parquet.customer returns only 6 rows:
{noformat}
Operator                    #Hosts  #Inst   Avg Time  Max Time  #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
-------------------------------------------------------------------------------------------------------------------------------------------
F02:ROOT                         1      1   11.344us  11.344us                         4.01 MB        4.00 MB
06:EXCHANGE                      1      1    5.603us   5.603us     10          10  16.00 KB       16.00 KB  UNPARTITIONED
F01:EXCHANGE SENDER              3      3   34.231us  48.626us                     358.00 B       80.00 KB
00:UNION                         3      3    0.000ns   0.000ns     10          10  41.00 KB              0  CANCELLED
02:SUBPLAN                       3      3    0.000ns   0.000ns      0       1.50M  33.00 KB              0  CANCELLED
|--05:NESTED LOOP JOIN           3      3   57.657us  63.183us      0          10  24.00 KB        20.00 B  CROSS JOIN, CANCELLED
|  |--03:SINGULAR ROW SRC        3      3    0.000ns   0.000ns      0           1         0              0  CANCELLED
|  04:UNNEST                     3      3   10.494us  11.744us      0          10         0              0  c.c_orders arr, CANCELLED
01:SCAN HDFS                     3      3  785.984us   1.113ms      6     150.00K  16.01 MB      176.00 MB  tpch_nested_parquet.customer c
{noformat}

Here is one of the misleading node profiles:
{noformat}
        HDFS_SCAN_NODE (id=1):(Total: 1.113ms, non-child: 1.113ms, % non-child: 100.00%)
          Table Name: tpch_nested_parquet.customer
          Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:2/192.55 MB
          Hdfs Read Thread Concurrency Bucket: 0:0% 1:0% 2:0% 3:0% 4:0% 5:0% 6:0% 7:0% 8:0% 9:0% 10:0% 11:0% 12:0% 13:0% 14:0% 15:0% 16:0% 17:0% 18:0% 19:0% 20:0% 21:0% 22:0% 23:0% 24:0% 25:0% 26:0% 27:0% 28:0% 29:0% 30:0% 31:0% 32:0%
          File Formats: PARQUET/Unknown(Skipped):2
          ExecOption: Codegen enabled: 2 out of 2, PARQUET Codegen Enabled
          Node Lifecycle Event Timeline: 19.806ms
             - Open Started: 18.758ms (18.758ms)
             - Open Finished: 18.780ms (22.875us)
             - First Batch Requested: 18.788ms (7.723us)
             - First Batch Returned: 19.790ms (1.001ms)
             - Last Batch Returned: 19.790ms (69.000ns)
             - Closed: 19.806ms (16.062us)
           ...
           - RowsRead: 0 (0)
           - RowsReturned: 0 (0){noformat}

Though it shows "Last Batch Returned", RowsRead and RowsReturned are both 0. 
"File Formats: PARQUET/Unknown(Skipped):2" explains it a little, but there is 
nothing explaining why the files were skipped. Note that there are no runtime 
filters in this case.

It makes more sense to skip the "Last Batch Returned" event when GetNext() 
returns a non-OK status, which also helps HBO skip tracking the cardinality of 
such scan nodes (IMPALA-14843).
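
In terms of the hypothetical sketch earlier in this description, the change 
could look roughly like the following, i.e. only record the event when the 
queue drained cleanly (this is only an illustration of the idea, not the actual 
patch):
{code:cpp}
// Illustration of the idea only, using the stand-in types from the earlier
// sketch: mark "Last Batch Returned" only when the RowBatch queue drained
// cleanly, i.e. GetNext() is returning an OK status.
Status ScanNode::GetNext(RowBatch* row_batch, bool* eos) {
  std::unique_ptr<RowBatch> batch = queue.Get();
  if (batch == nullptr) {
    *eos = true;
    if (error.ok()) {
      // Only a clean shutdown of the queue counts as "Last Batch Returned".
      lifecycle_events.push_back("Last Batch Returned");
    }
    return error;
  }
  *row_batch = *batch;
  return Status::OK();
}
{code}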



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
