Thanks Toke for the answer, let me comment inline:

On 26 November 2015 at 08:32, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> On Wed, 2015-11-25 at 15:56 +0000, Alessandro Benedetti wrote:
> > I would like to have docValues because facets are going to be heavy on
> > those fields.
>
> > *Faceting approach*
> > *1)* Indexing the human readable field value
>
> Technically this will be a SORTED or SORTED_SET, which again means that
> a pool of terms is maintained for each segment. The mapping from
> documents to terms is done using ordinals, which are not comparable
> across segments.

Thank you very much, I missed that part in my initial analysis.
I overlooked the fact that a segment is a fully working Lucene index, and that segments are independent of each other (in their term dictionaries, for example).
So the ordinal resolution is definitely something to consider.

> > Facets will be returned readable, out of the box.
> > I can not see any cons in this approach, I would say it is the standard
> > one.
>
> With multiple segments, the terms from each segment must somehow be
> aligned, to avoid duplicate entries in the result. This can either be
> done by creating a segment_ordinal->global_ordinal map upon first
> faceting call (facet.method=fc) or by on-the-fly comparison of top-X
> terms from each segment (facet.method=fcs). Either way, there is a
> performance penalty.
>
> > - When calculating faceting, the ordinal of each term is used in
> > memory, which means we don't waste memory on the actual term, or
> > waste time looking up the value until the very end of the process,
> > after the counts are done.
>
> The segment_ordinal->global_ordinal map requires memory linear in the
> number of unique values in the field. If fcs is used, there will be
> more term lookups.
>
> > *2)* Correlate outside the search system each term to a custom ID. Index
> > the custom ID. After facets are calculated resolve the ID and show the
> > human readable labels.
>
> Assuming the ID is an integer (about the only thing that makes sense),
> this ensures that the IDs are comparable across segments, so no
> segment->global mapping is needed. This removes the performance penalty
> described above and is (as far as I understand) the principle behind
> Lucene faceting.

OK, so in the case of integer faceting, we don't do the ordinal resolution and we count the integer values directly, right?

> On the other hand, this approach is indeed more complicated and it
> introduces another hotspot both for indexing (as document construction
> requires a lookup in the term provider) and searching (for resolving the
> final terms).

I agree.

> If we had a hashing method String->long and guaranteed that there would
> be no collisions (or we accepted the occasional faulty result), then we
> could avoid the segment->global map as well as the centralized term
> server. To my knowledge, this has not yet been attempted.

Thank you very much!

> - Toke Eskildsen

-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
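
P.S. To make the two approaches discussed in this thread concrete, here is a minimal Python sketch of how I picture them. It is only an illustration of the idea, not how Lucene or Solr actually implement it: the function names and the toy data are made up for the example, each segment is modelled as a sorted list of terms (its private term dictionary) plus one ordinal per document, and real docValues are of course not Python lists.

```python
def build_global_ordinals(segment_terms):
    """Align per-segment ordinals to one global ordinal space.

    segment_terms: one sorted list of unique terms per segment
    (hypothetical stand-in for each segment's term dictionary).
    Returns (global_terms, seg_to_global), where seg_to_global[s][o]
    is the global ordinal for ordinal o of segment s.
    """
    # Memory here grows linearly with the number of unique terms.
    global_terms = sorted({t for terms in segment_terms for t in terms})
    index = {term: g for g, term in enumerate(global_terms)}
    seg_to_global = [[index[t] for t in terms] for terms in segment_terms]
    return global_terms, seg_to_global


def facet_counts(segment_terms, segment_doc_ords):
    """Approach 1: count by ordinal, resolve the terms only at the end."""
    global_terms, seg_to_global = build_global_ordinals(segment_terms)
    counts = [0] * len(global_terms)
    for seg, doc_ords in enumerate(segment_doc_ords):
        mapping = seg_to_global[seg]
        for ordinal in doc_ords:
            counts[mapping[ordinal]] += 1
    # Term lookup happens once per distinct value, after counting.
    return {global_terms[g]: c for g, c in enumerate(counts) if c}


def facet_counts_by_id(segment_doc_ids):
    """Approach 2: integer IDs are comparable across segments,
    so they can be counted directly with no mapping step."""
    counts = {}
    for doc_ids in segment_doc_ids:
        for doc_id in doc_ids:
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return counts


# Toy data: ordinal 1 means "cherry" in both segments, but ordinal 0
# means "apple" in the first segment and "banana" in the second.
segment_terms = [["apple", "cherry"], ["banana", "cherry"]]
segment_doc_ords = [[0, 1, 1], [1, 0, 1]]
print(facet_counts(segment_terms, segment_doc_ords))
print(facet_counts_by_id([[7, 9, 9], [9, 8, 9]]))
```

In the second function the labels would still have to be resolved from the IDs outside the search system, which is the extra lookup Toke mentions.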