[PATCH] D40897: [clangd] Introduce a "Symbol" class.

Sam McCall via Phabricator via cfe-commits Wed, 13 Dec 2017 00:51:56 -0800

sammccall added inline comments.


================
Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.
----------------
malaperle wrote:
> sammccall wrote:
> > malaperle wrote:
> > > malaperle wrote:
> > > > sammccall wrote:
> > > > > hokein wrote:
> > > > > > malaperle wrote:
> > > > > > > sammccall wrote:
> > > > > > > > hokein wrote:
> > > > > > > > > malaperle wrote:
> > > > > > > > > > I think it would be nice to have methods as an interface to 
> > > > > > > > > > get this data instead of storing them directly. So that an 
> > > > > > > > > > index-on-disk could go fetch the data. Especially the 
> > > > > > > > > > occurrences which can take a lot of memory (I'm working on 
> > > > > > > > > > a branch that does that). But perhaps defining that 
> > > > > > > > > > interface is not within the scope of this patch and could 
> > > > > > > > > > be better discussed in D40548 ?
> > > > > > > > > I agree. We can't load all the symbol occurrences into the 
> > > > > > > > > memory since they are too large. We need to design interface 
> > > > > > > > > for the symbol occurrences. 
> > > > > > > > > 
> > > > > > > > > We could discuss the interface here, but CodeCompletion is 
> > > > > > > > > the main thing which this patch focuses on. 
> > > > > > > > > We can't load all the symbol occurrences into the memory 
> > > > > > > > > since they are too large
> > > > > > > > 
> > > > > > > > I've heard this often, but never backed up by data :-)
> > > > > > > > 
> > > > > > > > Naively an array of references for a symbol could be doc ID + 
> > > > > > > > offset + length, let's say 16 bytes.
> > > > > > > > 
> > > > > > > > If a source file consisted entirely of references to 
> > > > > > > > 1-character symbols separated by punctuation (1 reference per 2 
> > > > > > > > bytes) then the total size of these references would be 8x the 
> > > > > > > > size of the source file - in practice much less. That's not 
> > > > > > > > very big.
> > > > > > > > 
> > > > > > > > (Maybe there are edge cases with macros/templates, but we can 
> > > > > > > > keep them under control)
> > > > > > > I'd have to break down how much memory it used by what, I'll come 
> > > > > > > back to you on that. Indexing llvm with ClangdIndexDataStorage, 
> > > > > > > which is pretty packed is about 200MB. That's already a lot 
> > > > > > > considering we want to index code bases many times bigger. But 
> > > > > > > I'll try to come up with more precise numbers. I'm open to 
> > > > > > > different strategies.
> > > > > > > 
> > > > > > I can see two points here:
> > > > > > 
> > > > > > 1. For all symbol occurrences of a TU, it is not quite large, and 
> > > > > > we can keep them in memory.
> > > > > > 2. For all symbol occurrences of a whole project, it's not a good 
> > > > > > idea to load all of them into memory (For LLVM project, the size of 
> > > > > > YAML dataset is ~1.2G).  
> > > > > (This is still a sidebar - not asking for any changes)
> > > > > 
> > > > > The YAML dataset is not a good proxy for how big the data is (at 
> > > > > least without an effort to estimate constant factor).
> > > > > And "it's not a good idea" isn't an assertion that can hold without 
> > > > > reasons, assumptions, and data.
> > > > > If the size turns out to be, say, 120MB for LLVM, and we want to 
> > > > > scale to 10x that, and we're spending 500MB/file for ASTs, then it 
> > > > > might well be a good trade.
> > > > > The YAML dataset is not a good proxy for how big the data is (at 
> > > > > least without an effort to estimate constant factor).
> > > > 
> > > > Indeed. I'll try to come up with more realistic numbers. There are 
> > > > other things not accounted for in the 16 bytes mentioned above, like 
> > > > storing roles and relations.
> > > > 
> > > > > 500MB/file for ASTs
> > > > 
> > > > What do you mean? 500MB worth of occurrences per file? Or Preambles 
> > > > perhaps?
> > > > What do you mean? 500MB worth of occurrences per file? Or Preambles 
> > > > perhaps?
> > > 
> > > Oh I see, the AST must be in memory for fast reparse. I just tried 
> > > opening 3 files at the same time I it was already around 500MB. Hmm, 
> > > that's a bit alarming.
> > Right, just that we have to consider RAM usage for the index in the context 
> > of clangd's overall requirements - if other parts of clangd use 1G of ram 
> > for typical work on a large project, then we shouldn't rule out spending a 
> > couple of 100MB on the index if it adds a lot of value.
> Agreed we have to consider the overall requirements. I think over 1GB of RAM 
> is not good for our use case, whether is comes from the AST or index. I think 
> it's perfectly normal if we have different requirements but we can see 
> discuss how to design things so there are options to use either more RAM or 
> disk space. It seems the AST would be the most important factor for now so 
> perhaps it's something we should start investigating/discussing.
Agree, this is another thing we can discuss tonight.


================
Comment at: clangd/index/Index.h:52
+private:
+  friend class llvm::DenseMapInfo<clang::clangd::SymbolID>;
+
----------------
ioeric wrote:
> warning: class 'DenseMapInfo' was previously declared as a struct 
> [-Wmismatched-tags]
Fixed in rCTE320554


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897



_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D40897: [clangd] Introduce a "Symbol" class.

Reply via email to