omtcyfz added inline comments.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:53
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+ : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
+
----------------
ioeric wrote:
> ioeric wrote:
> > Should we also incorporate `Type` into hash?
> I'm wondering if we should use different hashes for different token types.
> For example, a trigram token "xyz" can be encoded in a 4 -byte int with
> `('x'<<16) & ('y'<<8) & 'z'`, which seems cheaper than `std::hash`.
Discussed internally: we probably shouldn't since there will be collisions even
inside a single token namespace when Data will be too long.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:44
+generateTrigrams(const std::vector<llvm::SmallString<10>> &Segments) {
+ llvm::DenseSet<SearchToken> UniqueTrigrams;
+ std::vector<SearchToken> Trigrams;
----------------
ioeric wrote:
> Could we replace `UniqaueTrigrams`+`Trigrams` with a dense map from hash to
> token?
I'm not sure how to iterate through the result then. Basically, having Trigrams
ensures that these trigrams can be iterated after the function execution and
inserted into the inverted index. Otherwise I should either expose a callback
to add the generated trigrams or do something similar. Should I do that instead?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:57
+ // this case.
+ for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
+ ++FirstSegment) {
----------------
ioeric wrote:
> nit: Maybe S1, S2, S3 instead of FirstSegment, ...?
I think it's way less explicit and slightly more confusing. I try not to have
one-letter names so I guess it's better to have full names instead.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:68
+ SearchToken Trigram(
+ (*FirstSegment + *SecondSegment + *ThirdSegment).str(),
+ SearchToken::Kind::Trigram);
----------------
ioeric wrote:
> ioeric wrote:
> > This seems wrong... wouldn't this give you a concatenation of three
> > segments?
> For trigrams, it might make sense to put 3 chars into a `SmallVector<3>` (can
> be reused) and std::move it into the constructor. Might be cheaper than
> creating a std::string
But `std::string` would be created anyway, wouldn't it be?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:87
+
+ for (size_t Position = 0; Position + 2 < Segment.size(); ++Position)
+ Trigrams.push_back(
----------------
ioeric wrote:
> nit: `Position < Segment.size() - 2` seems more commonly used.
If `Segment.size()` is 1 then this would be `1U - 2U`, which would be something
very large.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:138
+
+ for (size_t Index = SegmentStart; Index + 1 < SymbolName.size(); ++Index) {
+ const char CurrentSymbol = SymbolName[Index];
----------------
ioeric wrote:
> Maybe first split name on `_` and them run further upper-lower segmentation
> on the split segments?
Resolving both of these for now (though they're not really fixed) since I will
rethink the algorithm given Sam's patch.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:52
+
+ Custom,
+ };
----------------
sammccall wrote:
> what is this for?
Tags, for example. But yes, for now it's not very clear and probably not
needed. Dropped this one.
https://reviews.llvm.org/D49417
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits