[PATCH] D53433: [clangd] auto-index stores symbols per-file instead of per-TU.

Sam McCall via Phabricator via cfe-commits Mon, 22 Oct 2018 09:59:33 -0700

sammccall added inline comments.


================
Comment at: clangd/index/Background.h:68
+  FileSymbols IndexedSymbols;
+  FileDigests IndexedFileDigests; // Keyed by file URIs.
 
----------------
Keying this with file URIs seems suspicious to me - why this change?

It seems to be motivated by the current implementation where the partitioning 
is done post-hoc by looking at strings that contain file URIs.
But this seems likely to change in the long term, and producing data that's 
sharded in the first place. At that point it'll be just as natural and 
semantically cleaner just to use file paths, I think.

If performance of converting URI -> path is a concern in the meantime, can we 
just add a cache to that loop?


================
Comment at: clangd/index/FileIndex.cpp:121
+  if (MergeSymbols) {
+    // Merge slabs into a single slab and only keep alive the merged.
+    SymbolSlab::Builder Merged;
----------------
I don't think you want to build a new slab and copy all the strings here.

This seems like it's saving data (you're not keeping everything alive!) but in 
practice that only allows GC of data that changes, and most data doesn't 
change. So mostly this just introduces another copy of the data.

So I think you just want a `vector<Symbol> MergedSymbols`, and either push all 
the pointers in there or use `concat(make_pointee_range(SymbolPointers), 
MergedSymbols)`


================
Comment at: clangd/index/FileIndex.h:49
+/// rename the existing FileSymbols to something else e.g. TUSymbols?
+class SymbolsGroupedByFiles {
+public:
----------------
ioeric wrote:
> sammccall wrote:
> > `FileSymbols` isn't actually that opinionated about the data it manages:
> >  - the keys ("filenames") just identify shards so we can tell whether a new 
> > shard is an addition or replaces an existing one.
> >  - the values (tu symbols/refs) are merged using pretty generic logic, we 
> > don't look at the details.
> > 
> > I think we should be able to use `FileSymbols` mostly unmodified, and store 
> > digests in parallel, probably in `BackgroundIndexer`. Is there a strong 
> > reason to store "main file" digests separately to header digests?
> > 
> > There are a couple of weak points:
> >  - The merging makes subtle assumptions: for symbols it picks one copy 
> > arbitrarily, assuming there are many copies (merging would be expensive) 
> > and they're mostly interchangeable duplicates. We could add a constructor 
> > or `buildIndex()` param to FileSymbols to control this, similar to the 
> > IndexType param. (For refs it assumes no duplicates and simply 
> > concatenates, which is also fine for our purposes).
> >  - `FileSymbols` locks internally to be threadsafe while keeping critical 
> > sections small. To keep digests in sync with FileSymbols contents we 
> > probably have to lock externally with a second mutex. This is a little ugly 
> > but not a real problem I think.
> Good idea! Thanks!
> 
> > FileSymbols locks internally to be threadsafe while keeping critical 
> > sections small. To keep digests in sync with FileSymbols contents we 
> > probably have to lock externally with a second mutex. This is a little ugly 
> > but not a real problem I think.
> This doesn't seem to be a problem as `Background::index()` currently assumes 
> single-thread? I can either make it thread-safe now or add a `FIXME`. WDYT?
Right, currently we're only reading/writing with a single thread. The mutex 
thing is more for the future. Don't bother with a FIXME, when we add multiple 
threads we'll need to work out what to lock.


================
Comment at: clangd/index/FileIndex.h:41
+using FileDigest = decltype(llvm::SHA1::hash({}));
+using FileDigests = llvm::StringMap<FileDigest>;
+
----------------
as noted elsewhere I think this is the wrong place for this decl.
The values are somewhat self-explanatory, but the keys need to be documented.


================
Comment at: clangd/index/FileIndex.h:68
   // The index keeps the symbols alive.
+  // If \p MergeSymbols is true, this will merge symbols from different files;
+  // otherwise, a random one is picked, which is less accurate but faster to
----------------
can we pull out something like `enum class DuplicateHandling { PickOne, Merge 
}`?

I think that's much clearer at the callsite than `/*MergeSymbols=*/true`, and 
not really longer.


================
Comment at: clangd/index/IndexAction.h:12
 #define LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_INDEX_ACTION_H
+#include "FileIndex.h"
 #include "SymbolCollector.h"
----------------
This looks like an inverted dependency. The FileDigests struct should be part 
of symbolcollector, no?


================
Comment at: clangd/index/IndexAction.h:33
+    std::function<void(RefSlab)> RefsCallback,
+    std::function<void(FileDigests)> FileDigestsCallback);
 
----------------
thinking about what we eventually want here:
 - the index action needs to tell the auto-indexer what the file content or 
digests are
 - the auto-indexer needs to tell the index action which files to index symbols 
from
 - this decision is in turn based on the digests

What about this design:
 - `createStaticIndexingAction` accepts a filter-function: "should we index 
symbols from file X"?
 - `SymbolCollector` invokes this once per file and caches the result.
 - the signature is something like `bool(SourceManager&, FileEntry*)`
 - auto-index passes a function that reads the contents from the FileEntry, 
checks whether the hash is up-to-date to return true/false, and **stores the 
resulting hashes** for later

That way the caching/hash stuff never actually leaks outside auto-index that 
needs it, and we get the interface we need to precisely avoid generating 
Symbols we don't want.

(That function could just always return true in this patch, if the filtering 
isn't trivial).

It would mean that auto-index would need to redundantly compute the URI once 
per file, I don't think that matters.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D53433



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D53433: [clangd] auto-index stores symbols per-file instead of per-TU.

Reply via email to