ioeric added a comment.
Some high-level comments before jumping into details.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.cpp:23
+
+// FIXME(kbobyrev): Deal with short symbol symbol names.
+std::vector<SearchAtom> generateSearchAtoms(StringRef SymbolName) {
----------------
maybe also add how short symbol names are handled in the current algorithm and
what the potential solution would be.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.cpp:28
+ // Apply lowercase text normalization.
+ for (auto &Token : Tokens)
+ std::for_each(Token.begin(), Token.end(), ::tolower);
----------------
Any reason not to do this in segmentation?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:27
+
+/// \brief Hashable SearchAtom, which represents a search token primitive.
+///
----------------
Any reason to call this `Atom` instead of `Token`? `Token` seems to be a more
commonly used name for this (e.g.
https://yomguithereal.github.io/mnemonist/inverted-index#tokens).
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:45
+public:
+ enum class Namespace : short {
+ Trigram,
----------------
`Namespace` can be easily confused with namespaces in clang world. Maybe `Kind`
or `Type`?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:51
+
+ SearchAtom() = default;
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
----------------
What is an empty atom? Why do we need it?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:52
+ SearchAtom() = default;
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+ : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
----------------
Why default `Type` to `Trigram`?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:52
+ SearchAtom() = default;
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+ : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
----------------
ioeric wrote:
> Why default `Type` to `Trigram`?
Please add documentation. What is the semantic of `Data`?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:53
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+ : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
+
----------------
Should we also incorporate `Type` into hash?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:53
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+ : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
+
----------------
ioeric wrote:
> Should we also incorporate `Type` into hash?
I'm wondering if we should use different hashes for different token types. For
example, a trigram token "xyz" can be encoded in a 4 -byte int with `('x'<<16)
& ('y'<<8) & 'z'`, which seems cheaper than `std::hash`.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:62
+
+ const llvm::StringRef getData() const { return Data; }
+
----------------
nit:
s/getData/data/
s/getType/type/
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:76
+
+/// \brief Splits unqualified symbol name into tokens for trigram generation.
+///
----------------
nit: I'd avoid calling these `tokens`, as it they can be confused with tokens
for the invert index. How about `segments`? Similarly, `tokenize` should
probably be something like `segment` or `segmentIdentifier`.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:141
+/// trigram belongs to more than one class it is only inserted once.
+std::vector<SearchAtom> generateSearchAtoms(llvm::StringRef SymbolName);
+
----------------
`generateSearchAtoms` is too generalized a name for this. Maybe just
`trigram()`? I also think this could take a list of segments from `tokenize`.
https://reviews.llvm.org/D49417
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits