Toke,
You are absolutely right, concatenating terms is a possible solution. I
found faceting quite complicated in this case, but it was a hot fix
that we delivered to production.

Torben,
This problem arises quite often. Besides the two approaches discussed
in this thread, it is also possible to use SpanQueries and TermPositions; you
can read about our experience here:
http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html
http://vimeo.com/album/2012142/video/33817062
Our current approach is BlockJoin, which performs really well with batched
updates: http://blog.griddynamics.com/2012/08/block-join-query-performs.html.
The downside is that there is no open-source facet component for block join.
We have the code, but are not ready to share it yet.

On Mon, Oct 8, 2012 at 12:44 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> On Mon, 2012-10-08 at 08:42 +0200, Torben Honigbaum wrote:
> > sorry, my fault. This was one of my first ideas. My problem is, that
> > I've 1.000.000 documents, each with about 20 attributes. Additionally
> > each document has between 200 and 500 option-value pairs. So if I
> > denormalize the data, it means that I've 1.000.000 x 350 ((200 + 500) /
> > 2) = 350.000.000 documents, each with 20 attributes.
>
> If you have a few hundred or fewer distinct primary attributes (the A, B,
> C's in your example), you could create a new field for each of them:
>
> <doc>
>   <str name="id">3</str>
>   <str name="options">A B C D</str>
>   <str name="option_A">200</str>
>   <str name="option_B">400</str>
>   <str name="option_C">240</str>
>   <str name="option_D">310</str>
>   ...
>   ...
> </doc>
>
> Query for "options:A" and facet on field "option_A" to get facets for
> the specific field.
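
As a rough sketch of the request described above, the snippet below builds the query string for the per-option field layout. The parameters q, rows, facet, facet.field and wt are standard Solr request parameters; the helper name, the choice of rows=0 and the base URL are assumptions made only for illustration.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port, and core to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def per_option_facet_url(primary):
    """Build a select URL that matches one primary option and facets
    on the corresponding option_<primary> field."""
    params = [
        ("q", "options:" + primary),           # documents that carry the option
        ("rows", "0"),                         # assumption: only facet counts are needed
        ("facet", "true"),
        ("facet.field", "option_" + primary),  # e.g. option_A
        ("wt", "json"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(per_option_facet_url("A"))
```

With a few hundred primary attributes, the field name can be derived from the requested option like this, so no per-option configuration is needed on the client side.
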
>
> This normalization does increase the index size due to duplicated
> secondary values between the option-fields, but since our assumption is
> a relatively small amount of primary values, it should not be too much.
>
>
> Alternatively, if you have many distinct primary attributes, index the
> pairs as Jack suggests:
> <doc>
>   <str name="id">3</str>
>   <str name="options">A B C D</str>
>   <str name="option">A=200</str>
>   <str name="option">B=400</str>
>   <str name="option">C=240</str>
>   <str name="option">D=310</str>
>   ...
>   ...
> </doc>
>
> Query for "options:A" and facet on field "option" with
> facet.prefix="A=". Your result will be A=200 (2), A=450 (1)... so you'll
> have to strip "<whatever>=" before display.
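
The prefix stripping mentioned above could be done on the client roughly as follows. This is a minimal sketch that assumes the facet counts arrive as a flat [value, count, value, count, ...] list, which is how Solr's JSON response lays out facet_fields by default; the function name is made up for illustration.

```python
def strip_option_prefix(flat_counts, primary):
    """Turn ["A=200", 2, "A=450", 1] into {"200": 2, "450": 1}."""
    prefix = primary + "="
    result = {}
    # Walk the flat list in (value, count) pairs.
    for value, count in zip(flat_counts[0::2], flat_counts[1::2]):
        # Skip anything that does not belong to the requested primary option.
        if value.startswith(prefix):
            result[value[len(prefix):]] = count
    return result


# Sample values taken from the example result above.
print(strip_option_prefix(["A=200", 2, "A=450", 1], "A"))
```
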
>
> This normalization is potentially a lot heavier than the previous one,
> as we have distinct_primaries * distinct_secondaries distinct values.
>
> Worst case, where every document only contains distinct combinations of
> primary/secondary, we have 350M distinct option-values, which is quite
> heavy for a single box to facet on. Whether that is better or worse than
> 350M documents, I don't know.
>
> > Is denormalization the only way to handle this problem? I
>
> What you are trying to do does look quite a lot like hierarchical
> faceting, which Solr does not support directly. But even if you apply
> one of the experimental patches, it does not mitigate the potential
> combinatorial explosion of your primary & secondary values.
>
> So that leaves the question: How many distinct combinations of primary
> and secondary values do you have?
>
> Regards,
> Toke Eskildsen
>
>


-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>
