kbobyrev added inline comments.

================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.h:45
+  void build(std::shared_ptr<std::vector<const Symbol *>> Symbols,
+             llvm::ArrayRef<std::string> URISchemes);
 
----------------
sammccall wrote:
> ioeric wrote:
> > URI schemes are property of `Symbols`. We shouldn't need to pass specify 
> > the schemes again. Dex can collect all possible URI schemes when building 
> > the index.
> > 
> > I think we could generate proximity paths for index symbols without knowing 
> > about schemes (when building the index). For example, we could 
> > `canonicalize` the URI (as in `FileDistance.cpp`), for example, by 
> > converting `test:///x/y/z` to `/test:/x/y/z`.  For a query, we would first 
> > convert query proximity paths to URIs (of all possible schemes), perform 
> > the same canonicalization on URIs, and then use the canonicalized paths to 
> > compute proximity.
> We had an offline discussion about URIs which might be interesting.
> 
> Fetching posting lists for all URI schemes at query time seems wasteful, 
> @ilya-biryukov pointed out that in practice each file is going to exist in 
> some preferred URI space, and we should only compare that one.
> The only obstacle to doing that is that the preference order for URI schemes 
> is not global, each place we need it it's accepted as configuration.
> 
> Maybe we should change that: as part of registering the URI schemes, we 
> define a preferred order. And instead of the current operation `makeURI(File, 
> scheme)` we should just offer APIs like `makePreferredURI(File)` and 
> `makeFileURI(File)`.
I also thought about extracting schemes during the build stage earlier today, 
couldn't figure out whether it's a good thing: it's probably tiny overhead for 
the build stage (if we just pass schemes the server is aware of we avoid 
looking for new `scheme:` in the prefixes), but it might be worth considering 
the architecture (and the fact that even though the server might be aware of 
many schemes, some of them might not occur in the index and that would induce 
small overhead on the query side).

Could you please elaborate your suggestion about the `canonicalize`? I'm not 
sure I caught that: IIUC the current approach (which doesn't `canonicalize` 
URIs) should be sufficient for our needs. I'm not sure why we would want 
`/test:/x/y/z/` over `test:///x/y/z/` if the extracted tokens are `test:///`, 
`test:///x/`, `test:///x/y/`, `test:///x/y/z/` (which is the case right now).


================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.h:64
 
-private:
+  /// For fuzzyFind() Dex retrieves getRetrievalItemsMultiplier() more items
+  /// than requested via Req.MaxCandidateCount in the first stage of filtering.
----------------
ioeric wrote:
> Why are values of multipliers interesting to users? Could these be 
> implementation details in the cpp file?
Actually, my understanding is that users might want to have full access to the 
multipliers at some point to control the performance/quality ratio.

And it's also useful for the tests: otherwise the last one would have to 
hard-code number of generated symbols to ensure only boosted ones are in the 
returned list. It would have to be updated each time these internal multipliers 
are and we might update them often/make logic less clear (by allowing users to 
control these parameters).

What do you think?


https://reviews.llvm.org/D51481



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to