Toke,

You are absolutely right: concatenating terms is a possible solution. I found faceting quite complicated in this case, but it was a hot fix that we delivered to production.

Torben,

This problem arises quite often. Besides the two approaches discussed here, it is also possible to use SpanQueries and TermPositions; you can check our experience here:
http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html
http://vimeo.com/album/2012142/video/33817062

Our current approach is BlockJoin, which performs really well with batched updates:
http://blog.griddynamics.com/2012/08/block-join-query-performs.html

The bad thing is that there is no open-source facet component for block join. We have code for it, but we are not ready to share it yet.

On Mon, Oct 8, 2012 at 12:44 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> On Mon, 2012-10-08 at 08:42 +0200, Torben Honigbaum wrote:
> > sorry, my fault. This was one of my first ideas. My problem is, that
> > I've 1.000.000 documents, each with about 20 attributes. Additionally
> > each document has between 200 and 500 option-value pairs. So if I
> > denormalize the data, it means that I've 1.000.000 x 350 ((200 + 500) /
> > 2) = 350.000.000 documents, each with 20 attributes.
>
> If you have a few hundred or less distinct primary attributes (the A, B,
> C's in your example), you could create a new field for each of them:
>
> <doc>
> <str name="id">3</str>
> <str name="options">A B C D</str>
> <str name="option_A">200</str>
> <str name="option_B">400</str>
> <str name="option_C">240</str>
> <str name="option_D">310</str>
> ...
> ...
> </doc>
>
> Query for "options:A" and facet on field "option_A" to get facets for
> the specific field.
>
> This normalization does increase the index size due to duplicated
> secondary values between the option-fields, but since our assumption is
> a relatively small amount of primary values, it should not be too much.
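
A minimal sketch of the request for the one-field-per-primary layout quoted above, in case it helps to see the parameters side by side. It only builds the query string; the helper name `per_field_facet_params` is hypothetical, and `rows=0` is my own assumption for a facet-only request, not something from the thread.

```python
from urllib.parse import urlencode


def per_field_facet_params(primary: str) -> dict:
    """Build Solr select parameters for the option_<primary> field layout.

    Hypothetical helper: field names follow the quoted example
    ("options" holds the primaries, "option_A" holds the values for A).
    """
    return {
        "q": f"options:{primary}",           # restrict to docs that have this primary
        "rows": "0",                         # assumption: only facet counts are needed
        "facet": "true",
        "facet.field": f"option_{primary}",  # facet on the per-primary field
    }


params = per_field_facet_params("A")
query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to whatever select URL your Solr core exposes.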
>
>
> Alternatively, if you have many distinct primary attributes, index the
> pairs as Jack suggests:
> <doc>
> <str name="id">3</str>
> <str name="options">A B C D</str>
> <str name="option">A=200</str>
> <str name="option">B=400</str>
> <str name="option">C=240</str>
> <str name="option">D=310</str>
> ...
> ...
> </doc>
>
> Query for "options:A" and facet on field "option" with
> facet.prefix="A=". Your result will be A=200 (2), A=450 (1)... so you'll
> have to strip "<whatever>=" before display.
>
> This normalization is potentially a lot heavier than the previous one,
> as we have distinct_primaries * distinct_secondaries distinct values.
>
> Worst case, where every document only contains distinct combinations of
> primary/secondary, we have 350M distinct option-values, which is quite
> heavy for a single box to facet on. Whether that is better or worse than
> 350M documents, I don't know.
>
> > Is denormalization the only way to handle this problem? I
>
> What you are trying to do does look quite a lot like hierarchical
> faceting, which Solr does not support directly. But even if you apply
> one of the experimental patches, it does not mitigate the potential
> combinatorial explosion of your primary & secondary values.
>
> So that leaves the question: How many distinct combinations of primary
> and secondary values do you have?
>
> Regards,
> Toke Eskildsen
>
>

-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
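
For reference, a rough sketch of the prefix-faceting variant from the quoted message: build the request with `facet.prefix`, then strip the "A=" part before display. The helper names are hypothetical, `rows=0` is an assumption, and the facet counts are assumed to be already parsed into a plain value-to-count mapping (parsing the actual Solr response is left out).

```python
def prefix_facet_params(primary: str) -> dict:
    """Build Solr select parameters for the single 'option' field layout."""
    return {
        "q": f"options:{primary}",
        "rows": "0",                      # assumption: only facet counts are needed
        "facet": "true",
        "facet.field": "option",
        "facet.prefix": f"{primary}=",    # keep only values such as "A=200"
    }


def strip_primary(facet_counts: dict, primary: str) -> dict:
    """Drop the '<primary>=' prefix from each facet value before display."""
    prefix = f"{primary}="
    return {
        value[len(prefix):]: count
        for value, count in facet_counts.items()
        if value.startswith(prefix)
    }


# Sample counts taken from the example in the quoted message.
sample_counts = {"A=200": 2, "A=450": 1}
display_counts = strip_primary(sample_counts, "A")
print(display_counts)
```

The stripping happens on the client side, so the index cost Toke describes (distinct_primaries * distinct_secondaries values in one field) is unchanged.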