zclllyybb commented on issue #63917:
URL: https://github.com/apache/doris/issues/63917#issuecomment-4586105093

   Initial maintainer triage for `apache/doris#63917`.
   
   I checked the live issue metadata and the Doris `2.1.11-rc01` code path that 
matches the reported affected version. The issue currently has no labels. The 
report is credible, but the current material is not enough to produce a 
one-command reproducer yet.
   
   What the supplied log proves:
   
   - The stream load request was accepted by BE and a transaction was opened: 
`txn_id=11142908`, `query_id=ee4ad23d0593a0dd-b8e586767dbe51b2`.
   - The failure happens while the JSON scanner is opening the stream reader, 
before JSON parsing or tablet writing can proceed.
   - The preceding cumulative-compaction log line is very likely unrelated. The 
failing stack is in the stream-load file reader path.
   
   Code-path evidence in 2.1.11:
   
   - `StreamLoadAction::_process_put()` creates a `StreamLoadPipe`, stores it 
in `ctx->body_sink` / `ctx->pipe`, registers the context through 
`new_load_stream_mgr()->put(ctx->id, ctx)`, then starts 
`StreamLoadExecutor::execute_plan_fragment(ctx)` for streaming formats.
   - `NewJsonReader::_open_file_reader()` handles `TFileType::FILE_STREAM` by 
calling `FileFactory::create_pipe_reader(_range.load_id, ...)`.
   - `FileFactory::create_pipe_reader()` performs a single lookup that can 
produce this error: 
`ExecEnv::GetInstance()->new_load_stream_mgr()->get(load_id)`. If that returns 
null, it emits the exact error in the issue: 
`unknown stream load id: <query_id>`.
   - The normal removal paths for this same context include the stream-load 
fragment finish callback in `StreamLoadExecutor::execute_plan_fragment()` and 
HTTP request cleanup in `StreamLoadAction::free_handler_ctx()`.
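   
   The lookup-by-load-id pattern described above can be sketched as follows. 
This is a simplified model, not the Doris source: `LoadContext`, 
`LoadStreamRegistry` and `open_pipe_reader` are hypothetical names, and only 
the error text is taken from the issue.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for the per-load context; not the Doris class.
struct LoadContext {
    std::string load_id;
};

// Simplified registry keyed by load id (put / get / remove).
class LoadStreamRegistry {
public:
    void put(const std::string& id, std::shared_ptr<LoadContext> ctx) {
        assert(ctx != nullptr);  // registering a null context is a caller bug
        std::lock_guard<std::mutex> lock(mu_);
        ctxs_[id] = std::move(ctx);
    }

    // Returns nullptr when the id is absent: never registered or already removed.
    std::shared_ptr<LoadContext> get(const std::string& id) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = ctxs_.find(id);
        if (it == ctxs_.end()) {
            return nullptr;
        }
        return it->second;
    }

    void remove(const std::string& id) {
        std::lock_guard<std::mutex> lock(mu_);
        ctxs_.erase(id);
    }

private:
    std::mutex mu_;
    std::map<std::string, std::shared_ptr<LoadContext>> ctxs_;
};

// Models the scanner-side open: returns "" on success, else the error text.
std::string open_pipe_reader(LoadStreamRegistry& registry, const std::string& load_id) {
    if (registry.get(load_id) == nullptr) {
        return "unknown stream load id: " + load_id;
    }
    return "";
}
```

   In this model the open fails for exactly one reason: the id is no longer in 
the registry when the scanner asks for it. The request payload plays no part.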
   
   Maintainer judgment:
   
   This should be treated as a BE stream-load context lifetime/ordering issue 
unless the client disconnected before sending the request body. Invalid JSON 
content alone cannot cause this error. For this stack trace to occur, the scan 
side must hold a `FILE_STREAM` range with the correct load id while the 
in-memory stream-load context has already been removed from `NewLoadStreamMgr` 
by the time `NewJsonReader` opens the pipe reader.
   
   The highest-risk area to inspect is the race between:
   
   1. `StreamLoadAction::_process_put()` registering the context and 
asynchronously starting the pipeline fragment;
   2. `StreamLoadAction::free_handler_ctx()` cancelling/removing the context 
when the HTTP request is freed or the sender is gone;
   3. the finish callback in `StreamLoadExecutor::execute_plan_fragment()` 
removing the same load id before all scan-side open paths are done.
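   
   To make the ordering concern concrete, here is a small deterministic replay 
of the three actors listed above. It is only a toy model: the `Event` names and 
`scanner_open_succeeds` are hypothetical, and the real code runs these steps on 
different threads rather than from a fixed list.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical event names for the actors in the suspected race.
enum class Event {
    kRegister,        // request handler registers the context
    kScannerOpen,     // scanner looks the context up by load id
    kHandlerFree,     // HTTP handler cleanup removes the context
    kFragmentFinish,  // fragment finish callback removes the context
};

// Replays one interleaving against a toy registry. Returns true only if
// every scanner open in the sequence found the context.
bool scanner_open_succeeds(const std::vector<Event>& order, const std::string& load_id) {
    assert(!load_id.empty());  // the model needs a load id to track
    std::set<std::string> registry;
    bool ok = true;
    for (Event e : order) {
        switch (e) {
        case Event::kRegister:
            registry.insert(load_id);
            break;
        case Event::kScannerOpen:
            if (registry.count(load_id) == 0) {
                ok = false;  // corresponds to "unknown stream load id"
            }
            break;
        case Event::kHandlerFree:
        case Event::kFragmentFinish:
            registry.erase(load_id);  // either cleanup path drops the entry
            break;
        }
    }
    return ok;
}
```

   Any interleaving in which a removal event precedes a scanner open returns 
`false`, which is the condition the full BE log should confirm or rule out.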
   
   Recommended next steps:
   
   1. Ask the reporter for the complete BE log around this `query_id` / load 
id, especially lines before and after the shown snippet, and the final 
stream-load HTTP response body if one was returned.
   2. Ask for the exact stream-load request headers generated by DataX: 
`format`, `read_json_by_line`, `strip_outer_array`, `Content-Length` vs 
`Transfer-Encoding: chunked`, timeout, and whether the client connection closed 
or retried at that timestamp.
   3. Ask for the FE log around `streamLoadPut` for the same load id and the 
relevant FE/BE config values, especially pipeline-load related settings and 
stream-load timeout/body-size settings.
   4. For a code fix, focus on making the stream-load pipe registration 
lifetime cover all scanner initialization paths. One robust direction is to 
keep the `NewLoadStreamMgr` entry in place, rather than removing it from a 
per-fragment/per-request cleanup path, until the stream-load handle path has 
reached its final state. Another is to pass/hold a direct shared reference to 
the stream-load context/pipe so scanner initialization does not depend on a 
late global lookup by load id.
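   
   The "direct shared reference" direction from step 4 might look roughly like 
the sketch below. All names (`Pipe`, `ScanRange`, `Registry`, 
`make_scan_range`, `open_from_range`) are hypothetical, the model is 
single-threaded, and it does not address how such a reference would actually 
be threaded through the plan fragment in Doris.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the stream-load pipe; not the Doris class.
struct Pipe {
    std::string buffered;  // bytes received from the HTTP body so far
};

// Hypothetical scan range that carries the pipe itself, not just its id.
struct ScanRange {
    std::string load_id;
    std::shared_ptr<Pipe> pipe;  // keeps the pipe alive independently of the registry
};

// Toy registry; no locking because this sketch is single-threaded.
class Registry {
public:
    void put(const std::string& id, std::shared_ptr<Pipe> pipe) {
        assert(pipe != nullptr);  // registering a null pipe is a caller bug
        pipes_[id] = std::move(pipe);
    }
    std::shared_ptr<Pipe> get(const std::string& id) const {
        auto it = pipes_.find(id);
        if (it == pipes_.end()) {
            return nullptr;
        }
        return it->second;
    }
    void remove(const std::string& id) { pipes_.erase(id); }

private:
    std::unordered_map<std::string, std::shared_ptr<Pipe>> pipes_;
};

// Captures the shared reference while the entry is still registered, so a
// later remove() cannot invalidate the scanner's view of the pipe.
ScanRange make_scan_range(const Registry& registry, const std::string& load_id) {
    return ScanRange{load_id, registry.get(load_id)};
}

// Scanner-side open that uses the held reference instead of a late lookup.
bool open_from_range(const ScanRange& range) {
    return range.pipe != nullptr;
}
```

   The design choice here is that ownership, not registry membership, decides 
whether the scanner can open the pipe: removing the id from the registry only 
drops one reference.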
   
   Missing information for a definitive reproducer:
   
   - Minimal table DDL and stream-load command or DataX job config.
   - Whether the error is stable or intermittent.
   - Full BE/FE logs for the same `query_id`.
   - The HTTP response body returned to the client.
   - Whether the client uses chunked transfer or a fixed `Content-Length`, and 
whether it closes the connection early.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

