Many thanks to Hoss, Yonik, et al. for their excellent efforts to bring faceted browsing to the masses! In most cases it works great!

But for some fields I have a need which is unfortunately not filled by the current faceting code. These fields (Author Name, for example) have too many discrete Term values to be handled by the cached-filter mechanism for facet counting, but require counting multiple Terms per document and so are not handled by the alternate facet mechanism based on FieldCache. So I think I need to dive into the SOLR code, but I first wanted to check to make sure nobody else is working on something like this, and secondly to get feedback on the best implementation approach.

(As background for those not familiar with the internals: for a single-valued, non-tokenized, non-Boolean field, the current SimpleFacets implementation uses a FieldCache borrowed from the Lucene sorting mechanism to quickly retrieve the indexed Term value for each Document. Otherwise, a query is run for each possible Term value and cached as a bitmap; each bitmap is ANDed with a hitlist bitmap, and the resulting bits are counted.)
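
To make the filter-based path concrete, here is a minimal sketch using plain java.util.BitSet rather than the actual Solr classes. The class name FilterFacetSketch and the map of per-term bitmaps are illustrative assumptions, not the real SimpleFacets code.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the cached-filter facet path (not Solr code).
final class FilterFacetSketch {
    // termFilters: one cached bitmap of matching doc ids per Term value.
    // hits: bitmap of the documents matching the current query.
    static Map<String, Integer> count(Map<String, BitSet> termFilters, BitSet hits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : termFilters.entrySet()) {
            BitSet both = (BitSet) e.getValue().clone(); // leave the cached bitmap untouched
            both.and(hits);                              // AND with the hitlist
            counts.put(e.getKey(), both.cardinality());  // count the surviving bits
        }
        return counts;
    }
}
```

The cost is one bitmap (and one AND) per distinct Term, which is why this path struggles on a field like Author Name.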

My thought was that the simplest approach would be to subclass FieldCacheImpl to introduce a getMultiStringIndex method derived from getStringIndex, defining and returning a MultiStringIndex class which stores order as int[][] rather than int[]; a variant of SimpleFacets.getFieldCacheCounts would simply need an inner loop to tally each of the Document's Term indexes for that field.
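
Roughly what I have in mind, as a self-contained sketch that does not depend on Lucene. MultiStringIndexSketch and countFacets are hypothetical names; only the lookup/order layout is modeled on the existing StringIndex, with order widened to int[][].

```java
// Hypothetical multi-valued analogue of FieldCache.StringIndex (not Lucene code).
final class MultiStringIndexSketch {
    final String[] lookup; // distinct Term values; array position is the Term's index
    final int[][] order;   // order[doc] holds the index of every Term in that Document

    MultiStringIndexSketch(String[] lookup, int[][] order) {
        this.lookup = lookup;
        this.order = order;
    }

    // Variant of the FieldCache-based count: one tally slot per Term index.
    int[] countFacets(int[] hitDocIds) {
        int[] counts = new int[lookup.length];
        for (int doc : hitDocIds) {
            for (int ord : order[doc]) { // the extra inner loop for multiple values
                counts[ord]++;
            }
        }
        return counts;
    }
}
```

A Document with no value for the field simply has an empty inner array, so no sentinel index is needed in this layout.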

With multi-valuedness no longer being a useful criterion for automatically choosing between the filter-based and modified FieldCache-based mechanisms, there then would need to be an alternate criterion, either implicit or explicit. Does anyone have any ideas how best to do that? For example, is there a way to quickly determine the number of distinct Term values for a field without enumerating to the end, so the ratio of Terms to Documents can be used?
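
One possibility, offered only as a sketch: if the criterion is a threshold on the Terms-to-Documents ratio, the enumeration can stop as soon as the threshold is passed, so it never has to run to the end for the high-cardinality fields. FacetMethodChooser, exceedsTermRatio, and maxRatio are hypothetical names, and the Iterator stands in for whatever term enumeration is actually available.

```java
import java.util.Iterator;

// Hypothetical helper for picking a facet method (not part of Solr).
final class FacetMethodChooser {
    // Returns true once more than numDocs * maxRatio distinct Terms have been seen,
    // abandoning the enumeration at that point instead of reading it to the end.
    static boolean exceedsTermRatio(Iterator<String> terms, int numDocs, double maxRatio) {
        long limit = (long) (numDocs * maxRatio);
        long seen = 0;
        while (terms.hasNext()) {
            terms.next();
            if (++seen > limit) {
                return true; // too many Terms for the cached-filter approach
            }
        }
        return false;        // enumeration ended within the limit
    }
}
```

This still enumerates every Term for low-cardinality fields, but those are exactly the cases where a full enumeration is cheap.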

An entirely different approach (briefly suggested in a comment in SimpleFacets) for fields indexed with term vectors would be to simply call getTermFreqVector for each hit and store (term text, tally) in a hash table. Alternatively, one could store (term text, index) in a hash table that could be cached, with tallies generated per-query in an array as they are now; that variant would in effect build a field cache dynamically from actual query results. Does anyone have any insight into how efficient that may or may not be?
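
The first variant amounts to something like the sketch below. To keep it self-contained, a String[][] stands in for the per-document term vectors (that is, for the terms a getTermFreqVector call would return for the field); TermVectorTallySketch is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-hit tally over term vectors (stand-in data, not Lucene calls).
final class TermVectorTallySketch {
    // termsByDoc[doc] plays the role of the field's term vector for that Document.
    static Map<String, Integer> tally(String[][] termsByDoc, int[] hitDocIds) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : hitDocIds) {
            for (String term : termsByDoc[doc]) {
                counts.merge(term, 1, Integer::sum); // (term text, tally)
            }
        }
        return counts;
    }
}
```

The work here is proportional to the number of hits times the Terms per hit, independent of how many distinct Terms the field has overall. The sketch says nothing about the cost of retrieving each term vector, which is the part I am unsure about.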

And if I have gotten something dreadfully wrong in my understanding of the current implementation or the proposed enhancement, I would appreciate being set straight.

Thanks,
J.J. Larrea
