Help with denormalizing issues

Eric Reeves Mon, 05 Oct 2009 18:20:12 -0700

Hi there,

I'm evaluating Solr as a replacement for our current search server, and am 
trying to determine what the best strategy would be to implement our business 
needs.  Our problem is that we have a catalog schema with products and skus, 
one to many.  The most relevant content being indexed is at the product level, 
in the name and description fields.  However we are interested in filtering by 
sku attributes, and in particular making multiple filters apply to a single 
sku.  For example, find a product that contains a sku that is both blue and on 
sale.  No approach I've tried at collapsing the sku data into the product 
document works for this.  If we put the data in separate fields, there's no way 
to apply multiple filters to the same sku.  And if we concatenate all of the 
relevant sku data into a single multivalued field then, as I understand it, this 
is just indexed as one large field with extra whitespace between the individual 
entries, so there's still no way to enforce that an AND filter query applies to 
the same sku.
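
To make the problem concrete, here is a minimal sketch in plain Python (not 
Solr code).  The catalog data and the function names flattened_match and 
per_sku_match are made up for illustration.  It assumes that filtering on 
independent multivalued fields behaves like checking each attribute against the 
set of values found across all of a product's skus.

```python
# Hypothetical catalog entry: one product with two skus.
product = {
    "name": "Canvas sneaker",
    "skus": [
        {"color": "blue", "on_sale": False},
        {"color": "red", "on_sale": True},
    ],
}


def flattened_match(product, color, on_sale):
    """Mimic filtering on separate multivalued fields of a product document.

    Each attribute is checked against the values of ALL skus, so the two
    conditions can be satisfied by two different skus.
    """
    colors = {sku["color"] for sku in product["skus"]}
    sale_flags = {sku["on_sale"] for sku in product["skus"]}
    return color in colors and on_sale in sale_flags


def per_sku_match(product, color, on_sale):
    """The intended semantics: both conditions must hold for the same sku."""
    return any(
        sku["color"] == color and sku["on_sale"] == on_sale
        for sku in product["skus"]
    )


# "Blue and on sale": no single sku qualifies, but the flattened check matches.
flattened_result = flattened_match(product, "blue", True)
per_sku_result = per_sku_match(product, "blue", True)
```

Here the flattened check reports a match because one sku is blue and a 
different sku is on sale, while the per-sku check correctly reports none.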


One approach I was considering was to create separate indexes for products and 
skus, and store the product IDs in the sku documents.  Then we could apply our 
own filters to the initially generated list, based on unique query parameters.  
I thought creating a component between query and facet would be a good place to 
add such a filter, but further research seems to indicate that this would break 
paging and sorting.  The only other thing I can think of would be to subclass 
QueryComponent itself, which looks rather daunting: the process() method has no 
hooks for this sort of thing, so it seems I would have to copy the entire 
existing implementation and add them myself, which looks to be a fair chunk of 
work and brittle to changes in the trunk code.  Ideally it would be nice to be 
able to handle certain fq parameters in a completely different way, perhaps 
using a custom query parser, but I haven't wrapped my head around how those 
work.  Does any of this sound remotely doable?  Any advice?
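
The paging concern can be sketched in plain Python (again, not Solr code).  
The ranked ids, the set of matching product ids, and both function names are 
hypothetical.  The sketch assumes that a filter placed after the query step 
only sees the page of results that step has already cut.

```python
# Hypothetical ranked product ids, as returned by the product index.
ranked = list(range(1, 21))

# Hypothetical product ids that own at least one sku passing the sku filters
# (this would come from the separate sku index).
matching = {2, 3, 11, 12, 13, 14, 19}


def page_then_filter(ranked, matching, start, rows):
    """Filter applied after paging: the page can come back short."""
    page = ranked[start:start + rows]
    return [pid for pid in page if pid in matching]


def filter_then_page(ranked, matching, start, rows):
    """Filter applied before paging: pages are full while results remain."""
    kept = [pid for pid in ranked if pid in matching]
    return kept[start:start + rows]


short_page = page_then_filter(ranked, matching, 0, 5)
full_page = filter_then_page(ranked, matching, 0, 5)
```

With these made-up numbers, filtering after paging leaves only two results on 
a page of five, and the total hit count would be off as well, which is why the 
filter would need to take part in the query itself.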

The other suggestion we are looking at was given to us by our current search 
provider, which is to index the skus themselves.  It looks as if we may be able 
to make this work using the field collapsing patch from SOLR-236.  I have some 
concerns about this approach though: 1) It will make for a much larger index 
and longer indexing times (products can have 10 or more skus in our catalog).  
2) Because the indexing will be copying the description and name from the 
product it will be indexing the same content more than once, and the number of 
times per product will vary based on the number of skus.  I'm concerned that 
this may skew the scoring algorithm, in particular the inverse document 
frequency part.  3) I'm not sure about the performance of the field collapsing 
patch; I've read contradictory reports on the web.
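
Concern 2 can be illustrated with a small calculation.  This is a sketch under 
assumptions: the two products and their terms are invented, and the idf formula 
used, 1 + ln(numDocs / (docFreq + 1)), is the one I believe Lucene's default 
similarity applies, so treat it as an assumption rather than a reference.

```python
import math


def idf(num_docs, doc_freq):
    """Assumed classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical catalog: each product's text contains one distinctive term.
products = [
    {"terms": {"waterproof"}, "sku_count": 10},
    {"terms": {"breathable"}, "sku_count": 1},
]


def idf_table(products, per_sku):
    """Compute idf per term when indexing one doc per product or per sku."""
    num_docs = 0
    doc_freqs = {}
    for product in products:
        # Indexing per sku copies the product text once for every sku.
        copies = product["sku_count"] if per_sku else 1
        num_docs += copies
        for term in product["terms"]:
            doc_freqs[term] = doc_freqs.get(term, 0) + copies
    return {term: idf(num_docs, df) for term, df in doc_freqs.items()}


product_level = idf_table(products, per_sku=False)
sku_level = idf_table(products, per_sku=True)
```

At the product level both terms get the same idf.  At the sku level the term 
from the 10-sku product appears in 10 documents and so gets a lower idf than 
the term from the 1-sku product, even though each is used by exactly one 
product.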

I apologize if this is a bit rambling.  If anyone has any advice for our 
situation it would be very helpful.

Thanks,
Eric
