Re: Problem with relating values in two multi value fields

Toke Eskildsen Mon, 08 Oct 2012 01:44:58 -0700

On Mon, 2012-10-08 at 08:42 +0200, Torben Honigbaum wrote:
> sorry, my fault. This was one of my first ideas. My problem is, that
> I've 1.000.000 documents, each with about 20 attributes. Additionally
> each document has between 200 and 500 option-value pairs. So if I
> denormalize the data, it means that I've 1.000.000 x 350 (200 + 500 /
> 2) = 350.000.000 documents, each with 20 attributes.


If you have a few hundred or fewer distinct primary attributes (the A, B,
C's in your example), you could create a new field for each of them:

<doc>
  <str name="id">3</str>
  <str name="options">A B C D</str>
  <str name="option_A">200</str>
  <str name="option_B">400</str>
  <str name="option_C">240</str>
  <str name="option_D">310</str>
  ...
  ...
</doc>

Query for "options:A" and facet on field "option_A" to get facets for
the specific field.
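
As a rough sketch of what that request could look like, the snippet below
builds the query string for one primary option. The helper name is made up
for illustration, and rows=0 is my own assumption (only the facet counts are
wanted, not the documents); q, facet and facet.field are the standard Solr
request parameters.

```python
from urllib.parse import urlencode


def per_option_facet_params(primary):
    """Build a Solr query string that matches documents having the given
    primary option and facets on that option's dedicated field.

    Hypothetical helper; field names follow the option_<X> example above.
    """
    return urlencode([
        ("q", f"options:{primary}"),           # only documents that have this option
        ("rows", "0"),                         # assumption: we only want facet counts
        ("facet", "true"),
        ("facet.field", f"option_{primary}"),  # facet on the per-option field
    ])


print(per_option_facet_params("A"))
```

The resulting string would be appended to the select URL of whatever core
holds the documents.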

This normalization does increase the index size due to duplicated
secondary values between the option-fields, but since our assumption is
a relatively small number of primary values, it should not be too much.


Alternatively, if you have many distinct primary attributes, index the
pairs as Jack suggests:
<doc>
  <str name="id">3</str>
  <str name="options">A B C D</str>
  <str name="option">A=200</str>
  <str name="option">B=400</str>
  <str name="option">C=240</str>
  <str name="option">D=310</str>
  ...
  ...
</doc>

Query for "options:A" and facet on field "option" with
facet.prefix="A=". Your result will be A=200 (2), A=450 (1)... so you'll
have to strip "<whatever>=" before display.
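
The stripping step could be done client side along these lines. This is
only a sketch: the function name is hypothetical, and it assumes the facet
counts arrive as the flat value, count, value, count list that Solr's JSON
response writer produces by default.

```python
def strip_primary(flat_counts, prefix):
    """Turn ["A=200", 2, "A=450", 1] into [("200", 2), ("450", 1)].

    flat_counts: alternating facet values and counts (assumed layout).
    prefix: the "<primary>=" part to remove before display.
    """
    values = flat_counts[0::2]  # every even position is a facet value
    counts = flat_counts[1::2]  # every odd position is its count
    return [
        (value[len(prefix):], count)
        for value, count in zip(values, counts)
        if value.startswith(prefix)  # ignore anything for other primaries
    ]


print(strip_primary(["A=200", 2, "A=450", 1], "A="))
```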

This normalization is potentially a lot heavier than the previous one,
as we have distinct_primaries * distinct_secondaries distinct values. 

Worst case, where every document only contains distinct combinations of
primary/secondary, we have 350M distinct option-values, which is quite
heavy for a single box to facet on. Whether that is better or worse than
350M documents, I don't know.

> Is denormalization the only way to handle this problem? I 

What you are trying to do does look quite a lot like hierarchical
faceting, which Solr does not support directly. But even if you apply
one of the experimental patches, it does not mitigate the potential
combinatorial explosion of your primary & secondary values.

So that leaves the question: How many distinct combinations of primary
and secondary values do you have?
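
One way to get a feel for that number before indexing anything is to count
the distinct pairs in a sample of the source data. The sketch below assumes
each document is available as a dict with an "option" list of
"primary=secondary" strings, as in the second example; both the function
name and that data layout are assumptions for illustration.

```python
def count_distinct_pairs(docs):
    """Count distinct "primary=secondary" strings across all documents."""
    pairs = set()
    for doc in docs:
        pairs.update(doc.get("option", []))  # duplicates collapse in the set
    return len(pairs)


sample = [
    {"id": "1", "option": ["A=200", "B=400"]},
    {"id": "2", "option": ["A=200", "C=240"]},
    {"id": "3", "option": ["A=450", "B=400", "D=310"]},
]
print(count_distinct_pairs(sample))  # 5 distinct pairs among 7 in total
```

If the count grows roughly linearly with the sample size, you are close to
the worst case; if it flattens out quickly, the pair field should stay
manageable.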

Regards,
Toke Eskildsen
