Hi --

I'm building a Solr index to replace an existing RDBMS-based system,
and I have one requirement that I'm not sure how to best satisfy.
Documents in our collection can have user-generated ratings associated
with them; these ratings are aggregated by source. (Sources are
basically business partners who use our public API (a) to publish
content on our system and (b) to let their users interact with
content in our system: rate it, comment on it, etc.) When
we query the index, it's important to be able to return documents
sorted by the aggregated ratings data for any source.

The simplest solution I could think of was to add some dynamic fields
to the schema:

  <dynamicField name="userRatingAverage_*" type="sfloat" indexed="true" stored="true" />
  <dynamicField name="userRatingCount_*" type="sint" indexed="true" stored="true" />
  <dynamicField name="userRatingSum_*" type="sfloat" indexed="true" stored="true" />

And when I'm indexing documents, I add one field for each source from
which users have contributed ratings, e.g.:

   <field name="userRatingAverage_sourceId1">3.3</field>
   <field name="userRatingCount_sourceId1">10</field>
   <field name="userRatingSum_sourceId1">33</field>
   <field name="userRatingAverage_sourceId2">2.8</field>
   <field name="userRatingCount_sourceId2">20</field>
   <field name="userRatingSum_sourceId2">56</field>
   etc...
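
To make the indexing step concrete, here is a rough sketch of how a
feeding script might generate these fields. It's illustrative only:
the function name and the input shape (a dict mapping source IDs to
lists of individual ratings) are assumptions, not the actual script.

```python
from xml.sax.saxutils import escape, quoteattr


def rating_fields(ratings_by_source):
    """Return Solr <field> elements for each source's aggregated ratings.

    ratings_by_source maps a source ID to a list of individual ratings
    (a hypothetical input shape, chosen for illustration).
    """
    lines = []
    for source_id, ratings in sorted(ratings_by_source.items()):
        if not ratings:
            continue  # emit no fields for sources without ratings
        total = sum(ratings)
        count = len(ratings)
        average = total / count
        for prefix, value in (
            ("userRatingAverage_", round(average, 2)),
            ("userRatingCount_", count),
            ("userRatingSum_", total),
        ):
            # quoteattr adds the surrounding quotes and escapes the name
            name = quoteattr(prefix + source_id)
            lines.append("<field name=%s>%s</field>" % (name, escape(str(value))))
    return lines


if __name__ == "__main__":
    for line in rating_fields({"sourceId1": [3, 4], "sourceId2": [2, 3, 3]}):
        print(line)
```

Sources with no ratings produce no fields at all, which matches the
case where most documents carry no external ratings data.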

So far this seems acceptable. Query performance seems fine when using
the dynamic fields to sort result sets; indexing performance also
seems fine*. That said, there are only 400K documents in the
collection I'm working with, and few external rating sources at the
moment (there are about a dozen, and most documents have no external
ratings data associated with them). But as these fields will be
created from user-generated data, there's nothing to stop those
numbers from ballooning.
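
For reference, the kind of sorted query I have in mind could be built
roughly as follows. This is a sketch under assumptions: the base URL
and function name are hypothetical, and it assumes a Solr version
that accepts the sort as a separate `sort` request parameter.

```python
from urllib.parse import urlencode


def sorted_query_url(base_url, query, source_id, rows=10):
    """Build a select URL that sorts by one source's average rating.

    base_url is a hypothetical Solr select endpoint; the sort field name
    follows the userRatingAverage_* dynamic field pattern.
    """
    params = {
        "q": query,
        # Highest-rated documents first for the requested source.
        "sort": "userRatingAverage_%s desc" % source_id,
        "rows": rows,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(sorted_query_url("http://localhost:8983/solr/select", "solr", "sourceId1"))
```

Each distinct source ID sorted on this way means sorting on a
different field, which is why the number of sources matters here.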

What I'm wondering is whether any of the Solr experts on this list
would endorse this solution, or caution against it? Are there any
things I need to know before I proceed with it?

Before this obvious solution occurred to me, I was thinking I would
need to create a custom FieldType of my own, and perhaps my own
SortComparatorSource, so that I could sort records based on query-time
parameters (i.e., the ID of the source whose ratings are to be used as
the sort key). I've got a copy of LIA (Lucene in Action), and although
the DistanceComparatorSource example from the start of chapter 6 seems
a bit out of date, it looked like it would serve me well enough. But then
this message made me think that maybe that wasn't going to be quite as
easy as I'd hoped:

  http://www.nabble.com/custom-sorting-tf4521989.html#a12951515

(It also made me think that I ought to take on the project proposed
there -- i.e., "the idea of being able to specify a raw function as a
sort" -- once I've got a better handle on Solr's internals.)

Thanks in advance for any advice you can give.

-Charlie

* I'm adding about 250 docs/sec, though because of how I'm feeding
documents, it's hard to say how much of that time is spent in Solr,
and how much is spent in the Python feeding script I'm using; in any
case, 250 docs/sec is perfectly adequate for now.
