Multi-Valued Faceting

J.J. Larrea Wed, 06 Dec 2006 11:54:28 -0800

Many thanks to Hoss, Yonik, et al. for their excellent efforts to bring faceted browsing to the masses! In most cases it works great!

But for some fields I have a need which is unfortunately not filled by the current faceting code. These fields (Author Name, for example) have too many discrete Term values to be handled by the cached-filter mechanism for facet counting, but require counting multiple Terms per document and so are not handled by the alternate facet mechanism based on FieldCache. So I think I need to dive into the SOLR code, but I first wanted to check to make sure nobody else is working on something like this, and secondly to get feedback on the best implementation approach.

(As background for those not familiar with the internals: for a single-valued, non-tokenized, non-Boolean field the current SimpleFacets implementation uses a FieldCache borrowed from the Lucene sorting mechanism to quickly retrieve the indexed Term value for each Document; otherwise a series of queries cached as bitmaps are done for each possible Term value, ANDed with a hitlist bitmap, and the resulting bits counted.)
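To make the two mechanisms described above concrete, here is a minimal sketch in plain Java. It is a toy model, not Solr's actual code: the class name FacetCountingSketch, both method names, and the use of java.util.BitSet for the bitmaps are assumptions made for illustration, and the sketch assumes every document has exactly one value in the single-valued case.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Toy model of the two facet-counting strategies; names are hypothetical. */
class FacetCountingSketch {

    /**
     * FieldCache-style counting: order[doc] is the index of the single term
     * held by that document, so one pass over the hits yields all counts.
     */
    static int[] countByFieldCache(int[] order, int numTerms, BitSet hits) {
        int[] counts = new int[numTerms];
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            counts[order[doc]]++;
        }
        return counts;
    }

    /**
     * Filter-style counting: one cached bitmap per term value, ANDed with
     * the hit bitmap, then the surviving bits are counted.
     */
    static int[] countByFilters(BitSet[] termFilters, BitSet hits) {
        int[] counts = new int[termFilters.length];
        for (int t = 0; t < termFilters.length; t++) {
            BitSet intersection = (BitSet) termFilters[t].clone(); // keep the cached filter intact
            intersection.and(hits);
            counts[t] = intersection.cardinality();
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(0);
        hits.set(2);
        hits.set(3);
        int[] order = {0, 1, 1, 2}; // four documents, three distinct terms
        System.out.println(Arrays.toString(countByFieldCache(order, 3, hits)));
    }
}
```

The first method costs one array lookup per hit regardless of how many terms exist; the second costs one bitmap intersection per term, which is why it becomes expensive for fields with very many distinct values.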

My thought was that the simplest approach would be to subclass FieldCacheImpl to introduce a getMultiStringIndex method derived from getStringIndex, defining and returning a MultiStringIndex class which stores order as int[][] rather than int[]; a variant of SimpleFacets.getFieldCacheCounts would simply need an inner loop to tally each of the Document's Term indexes for that field.
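A rough sketch of what that might look like, written against plain arrays rather than the real Lucene classes. MultiStringIndex is the proposed (not yet existing) class; the field names lookup and order mirror the idea of a per-document index into a term table, and getMultiValuedCounts is a hypothetical name for the counting variant. How the index would actually be populated from the Lucene index is left out.

```java
import java.util.BitSet;

/** Sketch of the proposed multi-valued index; not an existing Lucene/Solr API. */
class MultiStringIndexSketch {

    /** Proposed structure: like a string index, but with int[][] order. */
    static final class MultiStringIndex {
        final String[] lookup; // distinct term values for the field
        final int[][] order;   // order[doc] = indexes into lookup for that document

        MultiStringIndex(String[] lookup, int[][] order) {
            this.lookup = lookup;
            this.order = order;
        }
    }

    /** Counting variant: the inner loop tallies every term index of each hit. */
    static int[] getMultiValuedCounts(MultiStringIndex index, BitSet hits) {
        int[] counts = new int[index.lookup.length];
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            for (int termIdx : index.order[doc]) {
                counts[termIdx]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        MultiStringIndex authors = new MultiStringIndex(
                new String[] {"adams", "baker", "clark"},
                new int[][] {{0, 1}, {1}, {}, {0, 2}}); // document 2 has no author
        BitSet hits = new BitSet();
        hits.set(0);
        hits.set(1);
        hits.set(3);
        int[] counts = getMultiValuedCounts(authors, hits);
        for (int i = 0; i < counts.length; i++) {
            System.out.println(authors.lookup[i] + ": " + counts[i]);
        }
    }
}
```

The work per query is proportional to the total number of term occurrences across the hits, independent of how many distinct terms the field has; the trade-off is the memory held by the int[][] array.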

With multi-valuedness no longer being a useful criterion for automatically choosing between the filter-based and modified FieldCache-based mechanisms, there then would need to be an alternate criterion, either implicit or explicit. Does anyone have any ideas how best to do that? For example, is there a way to quickly determine the number of distinct Term values for a field without enumerating to the end, so the ratio of Terms to Documents can be used?
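Purely as an illustration of the ratio idea, the choice might reduce to something like the following. Everything here is an assumption: the class and enum names are made up, the threshold parameter is hypothetical with no recommended value, and the sketch takes the distinct-term count as given rather than answering how to obtain it cheaply.

```java
/** Hypothetical strategy chooser based on the Terms-to-Documents ratio. */
class FacetStrategyChooser {

    enum Strategy { CACHED_FILTERS, MULTI_FIELD_CACHE }

    /**
     * Picks cached filters when the field has few distinct terms relative to
     * the number of documents, and the multi-valued field cache otherwise.
     * maxTermsPerDoc is an assumed tuning knob, not an existing Solr setting.
     */
    static Strategy choose(long distinctTerms, long numDocs, double maxTermsPerDoc) {
        if (numDocs <= 0) {
            return Strategy.CACHED_FILTERS; // empty index: nothing to count either way
        }
        double ratio = (double) distinctTerms / numDocs;
        return ratio <= maxTermsPerDoc ? Strategy.CACHED_FILTERS : Strategy.MULTI_FIELD_CACHE;
    }

    public static void main(String[] args) {
        System.out.println(choose(10, 1000, 0.05));  // e.g. a small category field
        System.out.println(choose(500, 1000, 0.05)); // e.g. an author-name field
    }
}
```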

An entirely alternate approach (briefly suggested in a comment in SimpleFacets) for fields indexed with term vectors would be to simply call getTermFreqVector for each hit and store (term text, tally) in a HashTable, or (term text, index) in a HT which could be cached, with tallies generated per-query in an array as they are now; in the latter case this builds a field-cache dynamically based on actual query results. Does anyone have any insight on how efficient that may or may not be?
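The two variants can be sketched as follows. This is a toy model only: a String[][] of per-document terms stands in for whatever getTermFreqVector would return for each hit, and the class and method names are hypothetical. It says nothing about the real cost of fetching term vectors, which is the open question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Sketch of tallying facet counts from per-document term lists. */
class TermVectorTallySketch {

    /** Variant 1: a (term text -> tally) table rebuilt for every query. */
    static Map<String, Integer> tally(String[][] termsByDoc, int[] hitDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : hitDocs) {
            for (String term : termsByDoc[doc]) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Variant 2: a (term text -> index) table that is cached and grows across
     * queries, with the per-query tallies kept in a plain int array.
     */
    static int[] tallyWithCachedIndex(String[][] termsByDoc, int[] hitDocs,
                                      Map<String, Integer> termIndex) {
        int[] counts = new int[Math.max(16, termIndex.size())];
        for (int doc : hitDocs) {
            for (String term : termsByDoc[doc]) {
                Integer idx = termIndex.get(term);
                if (idx == null) {
                    idx = termIndex.size(); // next free slot for a term not seen before
                    termIndex.put(term, idx);
                }
                if (idx >= counts.length) {
                    counts = Arrays.copyOf(counts, Math.max(idx + 1, counts.length * 2));
                }
                counts[idx]++;
            }
        }
        return Arrays.copyOf(counts, termIndex.size()); // one slot per known term
    }

    public static void main(String[] args) {
        String[][] termsByDoc = {{"adams", "baker"}, {"baker"}, {"clark"}};
        System.out.println(tally(termsByDoc, new int[] {0, 1}));

        Map<String, Integer> cachedIndex = new HashMap<>();
        System.out.println(Arrays.toString(
                tallyWithCachedIndex(termsByDoc, new int[] {0, 1}, cachedIndex)));
    }
}
```

In the second variant only terms that actually occur in query results ever receive an index, so the cached table grows with usage rather than with the full term dictionary.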

And if I have gotten something dreadfully wrong in my understanding of the current implementation or the proposed enhancement, I would appreciate getting straightened out.


Thanks,
J.J. Larrea
