zclllyybb commented on issue #64483:
URL: https://github.com/apache/doris/issues/64483#issuecomment-4699555930

   Breakwater-GitHub-Analysis-Slot: slot_c4e837598f90
   
   Initial analysis: this looks like a real row-binlog metadata construction 
bug on current upstream/master. I checked the refreshed `upstream/master` ref 
at `4dda859061c`, and the proposed root cause is consistent with the code path.
   
   Evidence:
   - `testutil::enable_row_binlog()` builds the row-binlog schema by appending 
`__DORIS_BINLOG_LSN__`, `__DORIS_BINLOG_OP__`, and `__DORIS_BINLOG_TIMESTAMP__` 
after the normal columns. In the reported test case, the LSN column should 
therefore be column id 2, and the timestamp column should be column id 4.
   - `TabletMeta::init_schema_from_thrift()` converts `TTabletSchema` directly 
into `TabletSchemaPB` by adding `ColumnPB`s. It propagates special indexes that 
exist on `TTabletSchema`, such as `delete_sign_idx`, but there is no thrift 
field for the row-binlog LSN/timestamp indexes, and this function does not 
derive those indexes from the column names.
   - `TabletMeta::init_from_pb()` then calls `TabletSchema::init_from_pb()` for 
`row_binlog_schema`. That path assigns `_binlog_lsn_col_idx` and 
`_binlog_timestamp_col_idx` from the protobuf fields, whose defaults are `-1`.
   - `RowBinlogSegmentWriter::init()` hard-checks 
`_tablet_schema->binlog_lsn_col_idx() >= 0`, so it aborts even though the 
`__DORIS_BINLOG_LSN__` column exists in the schema's column list.
   
   Recommended fix:
   - In `TabletMeta::init_schema_from_thrift()`, while iterating 
`tablet_schema.columns`, set `tablet_schema_pb->set_binlog_lsn_col_idx(...)` 
and `tablet_schema_pb->set_binlog_timestamp_col_idx(...)` when 
`tcolumn.column_name` matches `BINLOG_LSN_COL` and `BINLOG_TIMESTAMP_COL`. 
Alternatively, route this conversion through the normal `TabletSchema` 
append/to-pb path so all special-column indexes are derived consistently.
   - Add a regression assertion that the created row-binlog tablet schema has 
`binlog_lsn_col_idx() == 2` and `binlog_timestamp_col_idx() == 4`, not only 
that `field_index("__DORIS_BINLOG_LSN__") >= 0`. `field_index` can pass while 
the persisted special-column index remains `-1`.
   - Keep `GroupRowsetWriterTest.sub_writer_rollback` as the end-to-end guard; 
after the metadata fix it should no longer abort in 
`RowBinlogSegmentWriter::init()`.
   
   Missing information:
   - For this BE unit-test failure, the issue already contains enough 
information to identify the bug.
   - If the same failure is observed outside BE UT, please also provide the 
exact Doris commit SHA, the row-binlog table DDL/binlog properties, and the BE 
fatal log/core stack from that environment. If any tablet meta was created by 
the broken code before the fix, maintainers should also consider whether 
`TabletSchema::init_from_pb()` needs a compatibility fallback that derives 
missing row-binlog indexes from column names when the protobuf fields are still 
`-1`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to