github-actions[bot] commented on code in PR #60358:
URL: https://github.com/apache/doris/pull/60358#discussion_r2876590603
##########
be/src/olap/rowset/segment_v2/ann_index/ann_index_iterator.cpp:
##########
@@ -38,6 +49,8 @@ Status AnnIndexIterator::read_from_index(const IndexParam& param) {
     // _context may be unset in some test scenarios; pass nullptr IOContext in that case.
     io::IOContext* io_ctx = (_context != nullptr) ? _context->io_ctx : nullptr;
+    LOG_INFO("_context of ann index iterator is {}", (_context != nullptr) ? "not null" : "null");
Review Comment:
**Remove this LOG_INFO.** This fires on every ANN query call and provides no
diagnostic value in production. It was likely added for debugging.
##########
be/src/olap/rowset/segment_v2/index_file_reader.cpp:
##########
@@ -42,6 +43,10 @@ Status IndexFileReader::init(int32_t read_buffer_size, const io::IOContext* io_c
 Status IndexFileReader::_init_from(int32_t read_buffer_size, const io::IOContext* io_ctx) {
     auto index_file_full_path = InvertedIndexDescriptor::get_index_file_path_v2(_index_path_prefix);
+    LOG(INFO) << "[DEBUG] IndexFileReader::_init_from start, index_path_prefix: "
+              << _index_path_prefix << ", index_file_full_path: " << index_file_full_path
+              << ", read_buffer_size: " << read_buffer_size;
Review Comment:
**Critical: Remove debug logging before merge.** This file has ~15 `[DEBUG]
LOG(INFO)` statements that will fire on every index file read (not just ANN
index). This is the hot path for all inverted index operations and will flood
production logs. These should be removed entirely or downgraded to `VLOG_DEBUG`.
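For context on the suggested downgrade (a free-standing sketch, not the Doris macros): verbosity-gated logging keeps hot-path diagnostics in the binary but silent by default. `VLOG_SKETCH`, `g_verbosity`, and `g_log_sink` below are illustrative stand-ins for `VLOG_DEBUG` and glog's runtime `--v` level:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Production default: verbose logs off. glog would read this from --v.
static int g_verbosity = 0;
static std::ostringstream g_log_sink;  // stand-in for the log file

// Emits only when the runtime verbosity is >= the requested level,
// so hot-path statements cost one integer compare when disabled.
#define VLOG_SKETCH(level) \
    if (g_verbosity >= (level)) g_log_sink

std::string read_index_file(const std::string& path) {
    // Hot path: fires on every index read, so it must be gated.
    VLOG_SKETCH(1) << "[DEBUG] _init_from start, path=" << path << "\n";
    return path;  // stand-in for the real read
}
```

With `g_verbosity = 0` the statement compiles in but produces nothing, which is the behavior the review asks for on the inverted-index hot path.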
##########
be/src/olap/rowset/segment_v2/ann_index/ann_index_reader.cpp:
##########
@@ -87,23 +89,30 @@ Status AnnIndexReader::load_index(io::IOContext* io_ctx) {
         _vector_index->set_type(_index_type);
         RETURN_IF_ERROR(_vector_index->load(compound_dir->get()));
     } catch (CLuceneError& err) {
+        LOG_ERROR("Failed to load ann index: {}", err.what());
         return Status::Error<ErrorCode::INVERTED_INDEX_CLUCENE_ERROR>(
                 "CLuceneError occur when open ann idx file, error msg: {}", err.what());
     }
     return Status::OK();
 });
 }
-Status AnnIndexReader::query(io::IOContext* io_ctx, AnnTopNParam* param, AnnIndexStats* stats) {
+bool AnnIndexReader::try_load_index(io::IOContext* io_ctx) {
 #ifndef BE_TEST
-    {
-        SCOPED_TIMER(&(stats->load_index_costs_ns));
-        RETURN_IF_ERROR(load_index(io_ctx));
-        double load_costs_ms = static_cast<double>(stats->load_index_costs_ns.value()) / 1000.0;
-        DorisMetrics::instance()->ann_index_load_costs_ms->increment(
-                static_cast<int64_t>(load_costs_ms));
+    Status st = load_index(io_ctx);
Review Comment:
**Lost metrics tracking.** The original code had:
```cpp
SCOPED_TIMER(&(stats->load_index_costs_ns));
RETURN_IF_ERROR(load_index(io_ctx));
double load_costs_ms = ...
DorisMetrics::instance()->ann_index_load_costs_ms->increment(...);
```
By moving load to `try_load_index()` (which has no `stats` parameter), both
`load_index_costs_ns` timing and `ann_index_load_costs_ms` metrics are
completely lost. You should either pass the stats object to `try_load_index()`
or record the timing at the call site in `segment_iterator.cpp`.
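One way to keep the metric without changing the new bool return type is to thread the stats object through `try_load_index()`. A minimal free-standing sketch of that option; `LoadStats` and `ScopedTimerSketch` are illustrative stand-ins for Doris's `AnnIndexStats` and `SCOPED_TIMER`, not the real types:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Stand-in for the stats object that currently carries load_index_costs_ns.
struct LoadStats {
    int64_t load_index_costs_ns = 0;
};

// RAII timer: accumulates elapsed nanoseconds into *dst on scope exit,
// mirroring what SCOPED_TIMER does for the counter.
class ScopedTimerSketch {
public:
    explicit ScopedTimerSketch(int64_t* dst)
            : _dst(dst), _start(std::chrono::steady_clock::now()) {}
    ~ScopedTimerSketch() {
        *_dst += std::chrono::duration_cast<std::chrono::nanoseconds>(
                         std::chrono::steady_clock::now() - _start)
                         .count();
    }

private:
    int64_t* _dst;
    std::chrono::steady_clock::time_point _start;
};

// try_load_index with a stats parameter: timing survives the refactor,
// and the caller can still derive the ms metric from the ns counter.
bool try_load_index_sketch(LoadStats* stats) {
    ScopedTimerSketch timer(&stats->load_index_costs_ns);
    std::this_thread::sleep_for(std::chrono::milliseconds(1));  // stand-in for load_index()
    return true;
}
```

The alternative (timing at the call site in `segment_iterator.cpp`) is the same pattern with the `ScopedTimerSketch` scope wrapped around the call instead of inside it.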
##########
be/src/olap/rowset/segment_v2/ann_index/faiss_ann_index.cpp:
##########
@@ -291,6 +291,31 @@ doris::Status FaissVectorIndex::add(vectorized::Int64 n, const float* vec) {
     return doris::Status::OK();
 }
+vectorized::Int64 FaissVectorIndex::get_min_train_rows() const {
+    // For IVF indexes, the minimum number of training points should be at least
+    // equal to the number of clusters (nlist). FAISS requires this for k-means clustering.
+    vectorized::Int64 ivf_min = 0;
+    if (_params.index_type == FaissBuildParameter::IndexType::IVF) {
+        ivf_min = _params.ivf_nlist;
+    }
+
+    // Calculate minimum training rows required by the quantizer
+    vectorized::Int64 quantizer_min = 0;
+    if (_params.quantizer == FaissBuildParameter::Quantizer::PQ) {
+        // For PQ, we need to make sure each sub-quantizer has enough training vectors.
+        // See code from contrib/faiss/faiss/impl/ProductQuantizer.cpp::65
+        quantizer_min = (1LL << _params.pq_nbits) * 100;
+    } else if (_params.quantizer == FaissBuildParameter::Quantizer::SQ4 ||
+               _params.quantizer == FaissBuildParameter::Quantizer::SQ8) {
+        // For SQ, use a minimum of 20 training vectors, similar to IVF's nlist * 2 with nlist=10
+        quantizer_min = 1;
Review Comment:
**Comment/code mismatch.** Comment says "use a minimum of 20 training
vectors, similar to IVF's nlist * 2 with nlist=10" but the code sets
`quantizer_min = 1`. Should this be `quantizer_min = 20`? Or if 1 is correct,
the comment should be updated.
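For reference, a free-standing sketch of the rule the diff describes, with `quantizer_min = 20` matching the comment. The hunk is truncated before the final combine step, so folding the two minimums with `std::max` is an assumption, and the enums here are stand-ins reproduced outside the real `FaissBuildParameter` class:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Stand-ins for FaissBuildParameter's nested enums.
enum class IndexType { HNSW, IVF };
enum class Quantizer { FLAT, PQ, SQ4, SQ8 };

int64_t get_min_train_rows(IndexType type, int64_t ivf_nlist,
                           Quantizer q, int64_t pq_nbits) {
    // IVF: k-means needs at least nlist training points, one per centroid.
    int64_t ivf_min = (type == IndexType::IVF) ? ivf_nlist : 0;

    int64_t quantizer_min = 0;
    if (q == Quantizer::PQ) {
        // PQ trains one k-means per sub-quantizer over 2^nbits centroids;
        // FAISS's ProductQuantizer expects ~100 points per centroid.
        quantizer_min = (int64_t{1} << pq_nbits) * 100;
    } else if (q == Quantizer::SQ4 || q == Quantizer::SQ8) {
        quantizer_min = 20;  // as the comment states; the patch currently sets 1
    }

    // Assumed combine: the index must satisfy both constraints.
    return std::max(ivf_min, quantizer_min);
}
```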
##########
be/src/olap/rowset/segment_v2/ann_index/ann_index_writer.cpp:
##########
@@ -151,16 +152,50 @@ int64_t AnnIndexColumnWriter::size() const {
 }
 Status AnnIndexColumnWriter::finish() {
+    vectorized::Int64 min_train_rows = _vector_index->get_min_train_rows();
+
+    // Check if we have enough rows to train the index
     // train/add the remaining data
-    if (!_float_array.empty()) {
+    if (_float_array.empty()) {
+        if (_need_save_index) {
+            return _vector_index->save(_dir.get());
+        } else {
+            // No data was added at all. This can happen if the segment has 0 rows
+            // or all rows were filtered out. We need to delete the directory entry
+            // to avoid writing an empty/invalid index file.
+            LOG_INFO("No data to train/add for ANN index. Skipping index building.");
+            return _index_file_writer->delete_index(_index_meta);
+        }
+    } else {
         DCHECK(_float_array.size() % _vector_index->get_dimension() == 0);
         vectorized::Int64 num_rows = _float_array.size() / _vector_index->get_dimension();
-        RETURN_IF_ERROR(_vector_index->train(num_rows, _float_array.data()));
-        RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
-        _float_array.clear();
+
+        if (num_rows >= min_train_rows) {
+            RETURN_IF_ERROR(_vector_index->train(num_rows, _float_array.data()));
+            RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
+            return _vector_index->save(_dir.get());
+        } else {
+            // It happens to have not enough data to train.
+            // If we have data to add before, we still need to save the index.
+            if (_need_save_index) {
+                RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
+                return _vector_index->save(_dir.get());
+            } else {
+                // Not enough data to train and no data added before.
+                // Means this is a very small segment, we can skip the index building.
+                // We need to delete the directory entry from index_file_writer to avoid
+                // writing an empty/invalid index file which causes "IndexInput read past EOF" error.
+                LOG_INFO(
+                        "Remaining data size {} is less than minimum {} rows required for ANN "
+                        "index "
+                        "training. Skipping index building for this segment.",
+                        num_rows, min_train_rows);
+                return _index_file_writer->delete_index(_index_meta);
+            }
+        }
     }
-    return _vector_index->save(_dir.get());
+    return Status::OK();
Review Comment:
**Dead code.** This `return Status::OK()` is unreachable because every
branch in the `if (_float_array.empty()) ... else ...` block above already
returns. Consider removing it to avoid confusion.
##########
be/src/olap/rowset/segment_v2/ann_index/ann_index_writer.cpp:
##########
@@ -151,16 +152,50 @@ int64_t AnnIndexColumnWriter::size() const {
 }
 Status AnnIndexColumnWriter::finish() {
+    vectorized::Int64 min_train_rows = _vector_index->get_min_train_rows();
+
+    // Check if we have enough rows to train the index
     // train/add the remaining data
-    if (!_float_array.empty()) {
+    if (_float_array.empty()) {
+        if (_need_save_index) {
+            return _vector_index->save(_dir.get());
+        } else {
+            // No data was added at all. This can happen if the segment has 0 rows
+            // or all rows were filtered out. We need to delete the directory entry
+            // to avoid writing an empty/invalid index file.
+            LOG_INFO("No data to train/add for ANN index. Skipping index building.");
+            return _index_file_writer->delete_index(_index_meta);
+        }
+    } else {
         DCHECK(_float_array.size() % _vector_index->get_dimension() == 0);
         vectorized::Int64 num_rows = _float_array.size() / _vector_index->get_dimension();
-        RETURN_IF_ERROR(_vector_index->train(num_rows, _float_array.data()));
-        RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
-        _float_array.clear();
+
+        if (num_rows >= min_train_rows) {
+            RETURN_IF_ERROR(_vector_index->train(num_rows, _float_array.data()));
+            RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
+            return _vector_index->save(_dir.get());
+        } else {
+            // It happens to have not enough data to train.
+            // If we have data to add before, we still need to save the index.
+            if (_need_save_index) {
+                RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
Review Comment:
**Potential correctness concern:** When `_need_save_index == true` (previous
chunks were trained+added) but the remaining rows are fewer than
`min_train_rows`, you call `add()` without `train()`. For IVF indexes, this
means the remaining vectors are added to an index whose quantizer was trained
on previous batches. This is generally fine for FAISS since the quantizer is
already trained, but it means these vectors won't benefit from the last batch's
clustering. Worth a comment explaining this is intentional.
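The invariant at play mirrors FAISS's own contract: `Index::is_trained` must hold before `add()`, and once the coarse quantizer is trained, later `add()` calls simply assign vectors to the existing centroids. A stub sketch of that train-once/add-many contract; `StubIvfIndex` is hypothetical and uses no real FAISS types:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Stub modeling the contract the review describes. The is_trained flag
// mirrors faiss::Index::is_trained semantics but this is not the FAISS API.
struct StubIvfIndex {
    bool is_trained = false;
    int64_t ntotal = 0;

    void train(int64_t n) {
        // k-means over the training batch fixes the coarse quantizer.
        if (n <= 0) throw std::invalid_argument("need training data");
        is_trained = true;
    }

    void add(int64_t n) {
        // Legal only after training; later batches are assigned to centroids
        // learned from earlier batches (the behavior flagged in the review).
        if (!is_trained) throw std::logic_error("add() before train()");
        ntotal += n;
    }
};
```

Under this model the patch's small-tail path (`add()` with a quantizer trained on earlier chunks) is valid, so the requested comment would just record that the skipped retraining is a deliberate trade-off, not an oversight.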
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]