Many thanks to Hoss, Yonik, et al. for their excellent efforts to
bring faceted browsing to the masses! In most cases it works great!
But for some fields I have a need which is unfortunately not filled
by the current faceting code. These fields (Author Name, for
example) have too many discrete Term values to be handled by the
cached-filter mechanism for facet counting, but require counting
multiple Terms per document and so are not handled by the alternate
facet mechanism based on FieldCache. So I think I need to dive into
the Solr code, but first I wanted to check that nobody else is
working on something like this, and also to get feedback on the best
implementation approach.
(As background for those not familiar with the internals: for a
single-valued, non-tokenized, non-Boolean field, the current
SimpleFacets implementation uses a FieldCache, borrowed from the
Lucene sorting mechanism, to quickly retrieve the indexed Term value
for each Document. Otherwise, a query is run for each possible Term
value and cached as a bitmap; each bitmap is ANDed with the hit-list
bitmap and the resulting bits are counted.)
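
To make the two existing strategies concrete, here is a rough sketch
in plain Java. It is not the SimpleFacets code: the class and method
names are made up for illustration, java.util.BitSet stands in for
the cached filters and the hit list, and a bare int[] stands in for
the FieldCache ordinals.

```java
import java.util.BitSet;

/** Illustration only; hypothetical names, not Solr or Lucene code. */
final class CurrentFacetingSketch {

    /** Filter-style counting: one cached bitmap per term, ANDed with the hits. */
    static int[] countWithFilters(BitSet[] termDocs, BitSet hits) {
        int[] counts = new int[termDocs.length];
        for (int t = 0; t < termDocs.length; t++) {
            BitSet both = (BitSet) termDocs[t].clone(); // leave the cached bitmap untouched
            both.and(hits);
            counts[t] = both.cardinality();
        }
        return counts;
    }

    /** FieldCache-style counting: exactly one term ordinal per document. */
    static int[] countWithOrdinals(int[] order, int numTerms, BitSet hits) {
        int[] counts = new int[numTerms];
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            counts[order[doc]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four documents, two terms; docs 0 and 3 hold term 0, docs 1 and 2 hold term 1.
        int[] order = {0, 1, 1, 0};
        BitSet term0 = new BitSet();
        term0.set(0);
        term0.set(3);
        BitSet term1 = new BitSet();
        term1.set(1);
        term1.set(2);
        BitSet hits = new BitSet();
        hits.set(0, 3); // the query matched docs 0, 1 and 2

        int[] byFilter = countWithFilters(new BitSet[] {term0, term1}, hits);
        int[] byOrdinal = countWithOrdinals(order, 2, hits);
        System.out.println(byFilter[0] + "," + byFilter[1]);
        System.out.println(byOrdinal[0] + "," + byOrdinal[1]);
    }
}
```

The filter variant costs one bitmap intersection per distinct term,
which is why it breaks down for a field like Author Name; the ordinal
variant costs one array lookup per hit but can only record one term
per document.
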
My thought was that the simplest approach would be to subclass
FieldCacheImpl to introduce a getMultiStringIndex method derived from
getStringIndex, defining and returning a MultiStringIndex class
which stores order as int[][] rather than int[]; a variant of
SimpleFacets.getFieldCacheCounts would simply need an inner loop to
tally each of the Document's Term indexes for that field.
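
A minimal sketch of what I have in mind, in plain Java with no Lucene
dependency. MultiStringIndex and countFacets are the proposed
(hypothetical) names, not existing API, and how the int[][] would be
populated from the index is left out entirely.

```java
/** Sketch of the proposal; nothing here exists in Lucene or Solr today. */
final class MultiStringIndexSketch {

    /** Like a string index, but holding several term ordinals per document. */
    static final class MultiStringIndex {
        final String[] lookup; // ordinal -> term text
        final int[][] order;   // order[doc] = ordinals of every term in that doc

        MultiStringIndex(String[] lookup, int[][] order) {
            this.lookup = lookup;
            this.order = order;
        }
    }

    /** For each term ordinal, count how many of the hit documents contain it. */
    static int[] countFacets(MultiStringIndex index, int[] hitDocIds) {
        int[] counts = new int[index.lookup.length];
        for (int doc : hitDocIds) {
            for (int ord : index.order[doc]) { // the added inner loop
                counts[ord]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] authors = {"Adams", "Baker", "Clark"};
        int[][] order = {
            {0, 1},    // doc 0: Adams, Baker
            {1},       // doc 1: Baker
            {},        // doc 2: no author
            {0, 1, 2}  // doc 3: all three
        };
        MultiStringIndex index = new MultiStringIndex(authors, order);
        int[] counts = countFacets(index, new int[] {0, 1, 3});
        for (int i = 0; i < counts.length; i++) {
            System.out.println(index.lookup[i] + ": " + counts[i]);
        }
    }
}
```

For the hits {0, 1, 3} this tallies Adams 2, Baker 3, Clark 1. The
cost per query is proportional to the total number of term
occurrences in the hit documents rather than to the number of
distinct terms in the field.
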
With multi-valuedness no longer being a useful criterion for
automatically choosing between the filter-based and modified
FieldCache-based mechanisms, there then would need to be an alternate
criterion, either implicit or explicit. Does anyone have any ideas
how best to do that? For example, is there a way to quickly
determine the number of distinct Term values for a field without
enumerating to the end, so the ratio of Terms to Documents can be
used?
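
One possible implicit criterion, sketched under heavy assumptions: walk
the field's terms but give up as soon as the count passes a threshold
proportional to the number of documents, so a high-cardinality field
is never enumerated to the end. The Iterator stands in for whatever
term enumeration is used, and FacetMethodChooser, choose and maxRatio
are hypothetical names; I have no data on what a sensible ratio is.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical selection heuristic; not existing Solr behaviour. */
final class FacetMethodChooser {

    enum Method { FILTERS, FIELD_CACHE }

    /**
     * Picks FIELD_CACHE once more than maxRatio * numDocs distinct terms
     * have been seen, without reading the remaining terms.
     */
    static Method choose(Iterator<String> distinctTerms, int numDocs, double maxRatio) {
        long limit = (long) Math.ceil(maxRatio * numDocs);
        long seen = 0;
        while (distinctTerms.hasNext()) {
            distinctTerms.next();
            if (++seen > limit) {
                return Method.FIELD_CACHE; // too many terms for per-term filters
            }
        }
        return Method.FILTERS;
    }

    public static void main(String[] args) {
        Iterator<String> terms = Arrays.asList("a", "b", "c", "d", "e").iterator();
        System.out.println(choose(terms, 4, 0.5));
    }
}
```

This bounds the enumeration work for large fields, but a small field
is still read in full, so it only partly answers the question.
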
An entirely different approach (briefly suggested in a comment in
SimpleFacets), for fields indexed with term vectors, would be to call
getTermFreqVector for each hit and store (term text, tally) pairs in
a hash table. Alternatively, (term text, index) pairs could be stored
in a hash table that is cached, with tallies generated per query in
an array as they are now; in that case the field cache would be built
dynamically from actual query results. Does anyone have any insight
into how efficient that may or may not be?
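
A rough sketch of the first variant, the (term text, tally) table, in
plain Java. The String[][] is only a stand-in for the terms that a
per-document term vector lookup would return, and the class and
method names are made up; it says nothing about the real cost of
fetching term vectors, which is the part I am unsure about.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only; term vectors are modelled as plain string arrays. */
final class TermVectorFacetSketch {

    /**
     * termVectors[doc] holds the distinct terms of the facet field for that
     * document. Returns term text -> number of hit documents containing it.
     */
    static Map<String, Integer> tally(String[][] termVectors, int[] hitDocIds) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : hitDocIds) {
            for (String term : termVectors[doc]) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] termVectors = {
            {"Adams", "Baker"}, // doc 0
            {"Baker"},          // doc 1
            {"Clark"}           // doc 2
        };
        System.out.println(tally(termVectors, new int[] {0, 1}));
    }
}
```

Only terms that actually occur in the hits are ever touched, so no
up-front pass over the whole field is needed, at the price of one
term vector fetch per hit.
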
And if I have gotten something dreadfully wrong in my understanding
of the current implementation or the proposed enhancement, I would
appreciate being straightened out.
Thanks,
J.J. Larrea