[PATCH] D124422: [Serialization] Improve encoding of small SourceRanges.

Sam McCall via Phabricator via cfe-commits Tue, 26 Apr 2022 03:28:13 -0700

sammccall added a comment.

In D124422#3474110 <https://reviews.llvm.org/D124422#3474110>, @ilya-biryukov 
wrote:


> Yeah, that sounds promising. Various source locations inside the same nodes 
> are probably very close. Probably even across nodes.

Yeah. The reason to avoid compressing across nodes is it breaks lazy-loading. 
(Also I'm not sure whether related decls end up proximate in the output). Maybe 
there's a way around it, but this is much lower-hanging I think.

> If we are willing to take on a bit more complexity, we could probably get 
> even bigger wins by storing all source locations in a single array  having 
> indices into that array in the actual positions where we need the source 
> locations.

This is an interesting idea, I wonder whether it will be a win.

The delta-encodings are efficient as you note, but we have to represent 
SourceLocations twice. (value + index).
I think the #bits we save storing indices vs SourceLocations is similar to the 
#bits we need to encode the deltas.
If these cancel out then the savings we get are from deduplicating 
SourceLocations in the array - I suspect there aren't actually that many 
duplicates, we'd be lucky to halve the array size.

Rough example with made-up numbers:

- today we store raw SourceLocations, N * ~28 bits
- maybe simple delta-encoding maybe reduces this by 8 to N * 20 bits average
- sorted SourceLocation array (no delta-encoding, no deduplication) is N * 28, 
indices maybe N * 23 (assuming distance of 2^5 between sourcelocations)
- delta-encoding the array means it takes conservatively N * 7 bits (5 bits 
average delta + VBR overhead + variance)
- delta-encoding the indices yields the same 8 bits as before => N * 15
- deduplcating the halves the entries and reduces the size of indices by 1 bit
- this yields N/2*7 for the array + N * 14 for indices = N * 17.5

So we come out ahead (20->17.5) if deduplication halves #locations. If it 
reduces by a third, it's 20->19 instead.
This seems likely to be an improvement, over simple delta, but much smaller 
than delta vs status quo.

> I suspect there are requirements that will force us to store the array in 
> chunks and allow searching for a particular chunk.
>
> However, that's really adding quite a bit of complexity and I'm not sure 
> preamble sizes are such a big deal outside our environment. E.g. even 10% 
> savings are borderline for that kind of complexity.

Yeah, I don't plan to pursue this for now because the complexity/reward scares 
me, but if we get really squeezed for size we could revisit.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124422/new/

https://reviews.llvm.org/D124422

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D124422: [Serialization] Improve encoding of small SourceRanges.

Reply via email to