Hello,

I'm experimenting with ways to add some degree of diversity to search results 
by re-ranking them. For example, I might want to take the top 100 docs (sorted 
by score), and rearrange them so that no more than 2 results share the same
value of an attribute x within any 20-result block. It's a best-effort
algorithm, since more than 10 of the 100 results may have the same x (2 per
block across 5 blocks). And if the original list already satisfies the
diversity goal, the ordering is left unchanged. So, two questions:

1. What's a good way to implement this?

The most obvious solution (at least for this particular example) might be field 
collapsing. But we do need faceting as well. And the two don't yet work 
together according to http://wiki.apache.org/solr/FieldCollapsing . It also 
wouldn't be applicable if the re-ranking function depended on things other than 
field values (like the score).

Custom sorting (FieldComparatorSource) doesn't seem to work either because the 
relative ordering of 2 docs depends not only on their field values but on what 
other docs match the query as well.

So right now I'm doing post-processing: sort by score, look up x for each top 
doc, then re-arrange if necessary.
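
To make the re-arranging step concrete, here is a minimal sketch of one way it
could work. It is an illustration, not the code in use. It assumes the blocks
are consecutive and non-overlapping (positions 0-19, 20-39, ...) and that x has
already been fetched for each doc. The names DiversityReranker, Hit and rerank
are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

final class DiversityReranker {

  /** Hypothetical holder: a doc id plus its already-fetched value of x. */
  static final class Hit {
    public final int docId;
    public final String x;
    Hit(int docId, String x) { this.docId = docId; this.x = x; }
  }

  /**
   * Greedy, best-effort re-rank. byScore must already be sorted by score.
   * Assumption: blocks are consecutive and non-overlapping.
   */
  static List<Hit> rerank(List<Hit> byScore, int blockSize, int maxPerBlock) {
    if (blockSize < 1) throw new IllegalArgumentException("blockSize must be >= 1");
    LinkedList<Hit> remaining = new LinkedList<>(byScore);
    List<Hit> out = new ArrayList<>(byScore.size());
    while (!remaining.isEmpty()) {
      Map<String, Integer> seen = new HashMap<>(); // x value -> count in this block
      int filled = 0;
      // Take docs in score order, skipping any whose x has reached the cap.
      Iterator<Hit> it = remaining.iterator();
      while (it.hasNext() && filled < blockSize) {
        Hit h = it.next();
        int n = seen.getOrDefault(h.x, 0);
        if (n < maxPerBlock) {
          seen.put(h.x, n + 1);
          out.add(h);
          it.remove();
          filled++;
        }
      }
      // Best effort: if the block is still short, pad with skipped docs in score order.
      while (filled < blockSize && !remaining.isEmpty()) {
        out.add(remaining.removeFirst());
        filled++;
      }
    }
    return out;
  }
}
```

Skipped docs stay at the front of the remaining list, so they are the first
candidates for the next block, and a list that already meets the goal comes
back in its original order. For the example above the call would be
rerank(top100, 20, 2).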

Is there a better way?

2. We need a fast way to fetch x for a large number of docs (100s).

It'd be great if the doc()/document() methods could automatically use the field 
cache - perhaps with something like 
https://issues.apache.org/jira/browse/SOLR-1961 . That hasn't been accepted, 
though. So I wrote this on top of the Solr API:

  private static void loadCachedFields(SolrDocument doc,
      SolrIndexSearcher searcher, int docId, final Set<String> cachedFields)
      throws IOException {
    // find leaf reader and doc id offset for this doc
    SolrIndexReader reader = searcher.getReader();
    int[] offsets = reader.getLeafOffsets();
    int idx = SolrIndexReader.readerIndex(docId, offsets);
    SolrIndexReader leafReader = reader.getLeafReaders()[idx];
    int offset = offsets[idx];

    IndexSchema schema = searcher.getSchema();
    
    for (String f : cachedFields) {
      Object val;
      if (schema.getField(f).getType() instanceof IntField) {
        val = FieldCache.DEFAULT.getInts(leafReader, f)[docId - offset];
      } else ...

      doc.addField(f, val);
    }
  }

(I borrowed the doc id offset code from QueryComponent.)
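
For readers unfamiliar with the offset step, the lookup maps a top-level doc id
to the leaf (segment) that holds it, plus the id local to that leaf. The sketch
below is a stand-in for illustration only. It is not Solr's
SolrIndexReader.readerIndex, and the names LeafLookup, leafIndex and localDocId
are hypothetical. It assumes offsets is sorted ascending and starts at 0.

```java
final class LeafLookup {

  /**
   * Returns the largest index i with offsets[i] <= docId, i.e. the leaf
   * whose doc id range contains docId.
   * Assumption: offsets is sorted ascending and offsets[0] == 0.
   */
  static int leafIndex(int docId, int[] offsets) {
    int lo = 0;
    int hi = offsets.length - 1;
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1; // round up so the loop always makes progress
      if (offsets[mid] <= docId) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    return lo;
  }

  /** Converts a top-level doc id into the id local to its leaf. */
  static int localDocId(int docId, int[] offsets) {
    return docId - offsets[leafIndex(docId, offsets)];
  }
}
```

The local id is what indexes into a per-leaf field cache array, which is why
the code above subtracts offset from docId.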

Does this look like a reasonable solution?

Thanks!

- zhi-da
