ioeric added a comment. Thanks for the patch!
In https://reviews.llvm.org/D50517#1194955, @kbobyrev wrote: > Complete the tests, finish the implementation. > > One thought about prefix match suggestion: we should either make it more > explicit for the index (e.g. introduce `prefixMatch` and dispatch > `fuzzyMatch` to prefix matching in case query only contains one "true" > symbol) or document this properly. While, as I wrote earlier, I totally > support the idea of prefix matching queries of length 1 it might not align > with some user expectations and it's also very implicit if we just generate > tokens this way and don't mention it anywhere in the `DexIndex` > implementation. > > @ioeric, @ilya-biryukov any thoughts? (copied my inline comment :) We should definitely add documentation about it. It should be pretty simple IMO. As the behavior should be easy to infer from samples, and it shouldn't be too surprising for users, I think it would be OK to consider it as implementation detail (like details in how exactly trigrams are generated) without exposing new interfaces for them. ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:26 -// FIXME(kbobyrev): Deal with short symbol symbol names. A viable approach would -// be generating unigrams and bigrams here, too. This would prevent symbol index -// from applying fuzzy matching on a tremendous number of symbols and allow -// supplementary retrieval for short queries. -// -// Short names (total segment length <3 characters) are currently ignored. +// FIXME(kbobyrev): Posting lists for incomplete trigrams (one/two symbols) are +// likely to be very dense and hence require special attention so that the index ---------------- It's a nice to optimization have when we run into oversized posting lists, but this is not necessarily restricted to unigram posting lists. I think the FIXME should live near the general posting list code. I think it's probably also ok to leave it out; it's hard to forget if we do run into problem in the future ;) ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:74 + // symbol of the identifier. + if (!FoundFirstSymbol) { + FoundFirstSymbol = true; ---------------- Could this be pulled out of the loop? I think what we want is just `LowercaseIdentifier[0]` right? I'd probably also pulled that into a function, as the function body is getting larger. ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:87 + + Chars = {{LowercaseIdentifier[I], LowercaseIdentifier[J], END_MARKER, 0}}; + const auto Bigram = Token(Token::Kind::Trigram, Chars.data()); ---------------- I think we could be more restrictive on bigram generation. I think a bigram prefix of identifier and a bigram prefix of the HEAD substring should work pretty well in practice. For example, for `StringStartsWith`, you would have `st$` and `ss$` (prefix of "SSW"). WDYT? ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:115 +// FIXME(kbobyrev): Correctly handle empty trigrams "$$$". std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) { // Apply fuzzy matching text segmentation. ---------------- It seems to me that what we need for short queries is simply: ``` if (Query.empty()) { // return empty token } if (Query.size() == 1) return {Query + "$$"}; if (Query.size() == 2) return {Query + "$"}; // Longer queries... ``` ? ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:39 +// member of Token even though it's Trigram-specific? +const auto END_MARKER = '$'; + ---------------- Any reason why this should be exposed? ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:62 /// belongs to more than one class it is only inserted once. +// FIXME(kbobyrev): Document somewhere in DexIndex that for queries of size 1 +// it will do prefix matching instead of fuzzy matching on the identifier names, ---------------- The behavior should be easy to infer from samples. As long as it's not totally expected, I think it would be OK to consider treat as implementation detail (like details in how trigrams are generated). ================ Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:74 +/// For short queries (Query contains less than 3 letters and digits) this +/// returns a single trigram with all valid symbols. std::vector<Token> generateQueryTrigrams(llvm::StringRef Query); ---------------- I'm not quite sure what this means. Could you elaborate? https://reviews.llvm.org/D50517 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits