Thanks for the answer, Toke. Let me comment inline:

On 26 November 2015 at 08:32, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> On Wed, 2015-11-25 at 15:56 +0000, Alessandro Benedetti wrote:
> > I would like to have docValues because facets are going to be heavy on
> > those fields.
>
> > *Faceting approach *
> > *1) *Indexing the human readable field value
>
> Technically this will be a SORTED or SORTED_SET, which again means that
> a pool of terms is maintained for each segment. The mapping from
> documents to terms is done using ordinals, which are not comparable
> across segments.
>

Thank you very much, I missed that part in my initial analysis.
I overlooked the fact that each segment is a fully working Lucene index and
that segments are independent of each other (in the term dictionary, for
example).
So the ordinal resolution is absolutely something to consider.
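
To make the point concrete for myself, here is a minimal sketch in Python
(not Lucene code) of why per-segment ordinals are not comparable. The helper
build_segment and the sample terms are made up for illustration; the only
assumption is that each segment numbers its own sorted pool of terms.

```python
def build_segment(doc_values):
    """Return (sorted term pool, per-document ordinals) for one segment."""
    terms = sorted(set(doc_values))
    ord_of = {term: i for i, term in enumerate(terms)}
    return terms, [ord_of[value] for value in doc_values]

# Two independent segments, each with its own term pool.
seg_a_terms, seg_a_ords = build_segment(["pear", "apple", "pear"])
seg_b_terms, seg_b_ords = build_segment(["banana", "cherry", "pear"])

# The same term gets a different ordinal in each segment, and the same
# ordinal points at a different term in each segment.
pear_in_a = seg_a_terms.index("pear")
pear_in_b = seg_b_terms.index("pear")
```

Here "pear" is ordinal 1 in the first segment and ordinal 2 in the second,
so counts keyed by ordinal cannot simply be added across segments.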


>
> > Facets will be returned readable, out of the box.
> > I can not see any cons in this approach, I would say it is the standard
> > one.
>
> With multiple segments, the terms from each segment must somehow be
> aligned, to avoid duplicate entries in the result. This can either be
> done by creating a segment_ordinal->global_ordinal map upon first
> faceting call (facet.method=fc) or by on-the-fly comparison of top-X
> terms from each segment (facet.method=fcs). Either way, there is a
> performance penalty.
>
> >    - When calculating faceting, the ordinal for each term is used in
> >    memory, which means we don't waste memory on the actual term or
> >    waste time looking up the value until the very end of the process,
> >    after the counts are done.
>
> The segment_ordinal->global_ordinal requires memory linear to the number
> of unique values in the field. If fcs is used, there will be more term
> lookups.
>
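
A rough sketch of how I understand the segment_ordinal->global_ordinal
approach, in plain Python rather than Solr's actual implementation. The
function names and sample data are hypothetical. It shows that the map holds
one entry per unique term per segment, which is where the memory cost linear
in the number of unique values comes from.

```python
def build_global_map(segment_terms):
    """segment_terms: one sorted term list per segment.

    Returns (global_terms, maps) where maps[s][segment_ord] is the
    global ordinal of that term.
    """
    global_terms = sorted(set().union(*segment_terms))
    global_ord = {term: i for i, term in enumerate(global_terms)}
    maps = [[global_ord[term] for term in terms] for terms in segment_terms]
    return global_terms, maps

def facet_counts(segment_ords, maps, num_global):
    """Count per global ordinal; terms are only resolved afterwards."""
    counts = [0] * num_global
    for ords, seg_map in zip(segment_ords, maps):
        for seg_ord in ords:
            counts[seg_map[seg_ord]] += 1
    return counts

# Segment A: terms [apple, pear], docs pear/apple/pear.
# Segment B: terms [banana, cherry, pear], docs banana/pear.
segment_terms = [["apple", "pear"], ["banana", "cherry", "pear"]]
segment_ords = [[1, 0, 1], [0, 2]]

global_terms, maps = build_global_map(segment_terms)
counts = facet_counts(segment_ords, maps, len(global_terms))
result = dict(zip(global_terms, counts))
```

With this data, "pear" ends up with a count of 3 under a single global
ordinal, instead of appearing once per segment.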
> > *2)* Correlate outside the search system each term to a custom ID. Index
> > the custom ID. After facets are calculated resolve the ID and show the
> > human readable labels.
>
> Assuming the ID is an integer (about the only thing that makes sense),
> this ensures that the IDs are comparable across segments, so no
> segment->global mapping is needed. This removes the performance penalty
> described above and is (as far as I understand) the principle behind
> Lucene faceting.
>

OK, so in the case of integer faceting, we don't do the ordinal resolution
and we count the integer values directly, right?
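
What I have in mind is something like the following sketch. It is only my
assumption of how such counting could work, not a description of what Solr
or Lucene does internally. ID_TO_LABEL stands in for the external ID-to-term
lookup and is purely hypothetical.

```python
from collections import Counter

# Hypothetical external mapping, maintained outside the search system.
ID_TO_LABEL = {1: "apple", 2: "banana", 3: "pear"}

def facet_by_id(segments):
    """Count integer IDs directly; the same ID means the same term in
    every segment, so no segment->global mapping is needed."""
    counts = Counter()
    for segment_ids in segments:
        counts.update(segment_ids)
    return counts

def resolve_labels(counts, id_to_label, top_n):
    """Resolve labels only for the top-N IDs, after counting is done."""
    return [(id_to_label[i], c) for i, c in counts.most_common(top_n)]

segments = [[3, 1, 3], [2, 3]]
id_counts = facet_by_id(segments)
top = resolve_labels(id_counts, ID_TO_LABEL, 1)
```

The label lookup happens once per returned facet value, which is the extra
hotspot at search time mentioned below.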

>
> On the other hand, this approach is indeed more complicated and it
> introduces another hotspot both for indexing (as document construction
> requires a lookup in the term provider) and searching (for resolving the
> final terms).
>
I agree.


>
>
> If we had a hashing method String->long and guaranteed that there would
> be no collisions (or we accepted the occasional faulty result), then we
> could avoid the segment->global map as well as the centralized term
> server. To my knowledge, this has not yet been attempted.
>
>
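
Just to picture the hashing idea, a small sketch under my own assumptions:
a 64-bit value taken from a BLAKE2b digest (any well-distributed hash would
do) and the usual birthday approximation for the chance of at least one
collision. The function names are made up, and nothing here is an existing
Solr feature.

```python
import hashlib
import math

def term_to_long(term):
    """Map a string to a signed 64-bit integer (the range of a Java long)."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)

def collision_probability(unique_terms, bits=64):
    """Birthday approximation: 1 - exp(-n(n-1) / (2 * 2^bits))."""
    x = unique_terms * (unique_terms - 1) / (2.0 * 2.0 ** bits)
    return -math.expm1(-x)

pear_id = term_to_long("pear")
p_billion = collision_probability(10**9)
```

The hashed value is the same on every node and in every segment, so no
central term server is needed for indexing, but going back from the long to
the readable term would still require storing the term somewhere.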
Thank you very much!


> - Toke Eskildsen
>
>
>


-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
