github-actions[bot] commented on code in PR #60374:
URL: https://github.com/apache/doris/pull/60374#discussion_r2893506686
##########
be/src/vec/exec/format/parquet/byte_array_dict_decoder.cpp:
##########
@@ -81,6 +81,13 @@ Status
ByteArrayDictDecoder::read_dict_values_to_column(MutableColumnPtr& doris_
MutableColumnPtr ByteArrayDictDecoder::convert_dict_column_to_string_column(
const ColumnInt32* dict_column) {
auto res = ColumnString::create();
+ if (_dict_items.empty()) {
+ if (dict_column->size() > 0) {
+ LOG(ERROR) << "Attempt to convert dict column with empty
dictionary, column size: "
Review Comment:
**Defensive programming concern (per AGENTS.md):** If `_dict_items.empty()`
and `dict_column->size() > 0`, this silently returns an empty column (0 rows),
dropping data. Per the coding standards, this should not be masked with an `if`
check + LOG:
- If this condition is truly unreachable (because all-null columns should
never produce non-null dict indices), then use `DORIS_CHECK(dict_column->size()
== 0)` instead of the `if` + LOG(ERROR).
- If it IS reachable, then returning 0 rows silently drops data, which is a
correctness issue.
Suggested:
```cpp
if (_dict_items.empty()) {
DORIS_CHECK(dict_column->size() == 0) << "Attempt to convert dict column
with empty dictionary, column size: " << dict_column->size();
return res;
}
```
##########
be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:
##########
@@ -417,12 +417,21 @@ Status ColumnChunkReader<IN_COLLECTION,
OFFSET_INDEX>::_decode_dict_page() {
if (!dict_loaded) {
// Load and decompress dictionary page from file
if (_block_compress_codec != nullptr) {
+ auto dict_num = header->dictionary_page_header.num_values;
+ if (dict_num == 0 && uncompressed_size != 0) {
+ return Status::IOError(
+ "Dictionary page's num_values is {} but
uncompressed_size is {}", dict_num,
+ uncompressed_size);
+ }
Review Comment:
`DCHECK` disappears in RELEASE builds (known high-risk pattern per coding
standards). Since this is an invariant assertion (if `dict_num != 0`,
`compressed_data` must have been populated by `get_page_data`), this should be
`DORIS_CHECK(!compressed_data.empty())` to provide protection in production.
##########
be/src/vec/exec/format/parquet/fix_length_dict_decoder.hpp:
##########
@@ -81,7 +81,7 @@ class FixLengthDictDecoder final : public BaseDictDecoder {
ColumnSelectVector& select_vector, bool
is_dict_filter) {
size_t non_null_size = select_vector.num_values() -
select_vector.num_nulls();
if (doris_column->is_column_dictionary() &&
- assert_cast<ColumnDictI32&>(*doris_column).dict_size() == 0) {
+ assert_cast<ColumnDictI32&>(*doris_column).dict_size() == 0 &&
!_dict_items.empty()) {
std::vector<StringRef> dict_items;
Review Comment:
Same concern as in
`ByteArrayDictDecoder::convert_dict_column_to_string_column` — this LOG(ERROR)
+ return-empty pattern is defensive programming that masks potential data
corruption. Should use `DORIS_CHECK(dict_column->size() == 0)` instead of the
`if` + LOG(ERROR) guard, to crash on invariant violation rather than silently
dropping rows.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]