Thanks for the responses.

Right now each document has a fairly small amount of indexed data, such
as title, author, language, subjects, and various media
characteristics. Indexing or reindexing a document is very fast;
updating a batch of 100 documents takes less than a tenth of a second.
What impact would a field with hundreds or thousands of unique
keywords have on indexing time?
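
One way to get a rough answer would be to time a batch update with and
without the big field. Here's a minimal sketch of building such a batch as
a Solr XML update payload; the field name `read_by` and the local Solr URL
in the comment are assumptions for illustration, not anything from our
actual schema:

```python
def build_add_xml(n_docs, n_values):
    """Build a Solr XML <add> payload: n_docs documents, each carrying a
    multi-valued 'read_by' field (hypothetical name) with n_values values."""
    docs = []
    for i in range(n_docs):
        fields = ['<field name="id">%d</field>' % i]
        fields += ['<field name="read_by">user%d</field>' % u
                   for u in range(n_values)]
        docs.append("<doc>" + "".join(fields) + "</doc>")
    return "<add>" + "".join(docs) + "</add>"

# Compare a baseline batch against one with 1000 values per document;
# the size difference hints at the extra transfer and analysis cost.
baseline = build_add_xml(100, 0)
heavy = build_add_xml(100, 1000)
print(len(baseline), len(heavy))

# Timing the real thing would mean POSTing each payload to
# http://localhost:8983/solr/update (assumed URL) and measuring the
# round trip, with and without a commit.
```

Timing both POSTs against a test core would show whether the keyword field
pushes the 100-document batch past the current tenth of a second.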

One issue with storing all of this information in Solr is the
complexity of adding detail or refining the data. For instance, some
users have read every title in a genre that they still enjoy, so what
they'll say is "find me something in that genre that I haven't read in
the last 5 years". The basic list of document IDs (ours, not Solr's)
is easy to pull from the database, and then layering all of Solr's
excellent faceting, filtering, and text searching features on top of
that makes for a nice user experience.
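
Concretely, that "genre minus recently read" case could be expressed as a
pair of filter queries, with the exclusion list pulled from the database.
A minimal sketch, assuming field names `genre` and `our_id` (both
hypothetical):

```python
def read_exclusion_fq(doc_ids, field="our_id"):
    """Build a negative Solr filter query that excludes the given document
    IDs, e.g. -our_id:(12 OR 34 OR 56). A long list may need the boolean
    clause limit raised, as mentioned in the thread."""
    return "-%s:(%s)" % (field, " OR ".join(str(d) for d in doc_ids))

def genre_minus_read(genre, read_ids):
    """Filter queries for 'something in this genre I haven't read':
    one positive fq on the genre, one negative fq on the read list."""
    return ["genre:%s" % genre, read_exclusion_fq(read_ids)]

print(genre_minus_read("mystery", [12, 34, 56]))
# e.g. ['genre:mystery', '-our_id:(12 OR 34 OR 56)']
```

The database stays the source of truth for history; Solr only ever sees a
throwaway fq built per request.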

Aside from user history, there are going to be other dynamic lists that
will be useful: something like a "what's hot" filter that has the top
X% of titles by popularity today, this week, and this month, or a
historical snapshot like what was popular this month 10 years ago. A
nice general solution would let us pull whatever we can think of from
the database without having to make schema changes in Solr or reindex
the entire collection.
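
For the popularity case specifically, Solr's ExternalFileField might fit:
per-document float values live in a flat `external_<fieldname>` file of
`docid=value` lines in the index data directory, and can be refreshed
without reindexing any documents. A sketch of generating that file; the
field name `popularity` and the score source are assumptions:

```python
def write_external_scores(path, scores):
    """Write an ExternalFileField data file: one 'docid=score' line per
    document. Solr rereads the file rather than requiring a reindex, so
    popularity can be refreshed on whatever schedule the database job runs."""
    with open(path, "w") as f:
        for doc_id, score in sorted(scores.items()):
            f.write("%s=%.4f\n" % (doc_id, score))

# Hypothetical nightly job: pull today's popularity scores from the
# database, then drop the file into Solr's index data directory.
write_external_scores("external_popularity", {"42": 0.97, "7": 0.31})
```

That covers numeric scores for sorting and function queries; arbitrary
named ID lists would still need the fq-from-database approach.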

On Wed, Jan 19, 2011 at 2:17 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> The problem is going to be 'near real time' indexing issues.  Solr 1.4, at
> least, does not do a very good job of handling very frequent commits. If you
> want to add to the user's history in the Solr index every time they click the
> button, and they click the button a lot, that naturally leads to very
> frequent commits to Solr (every minute, every second, multiple times a
> second), and you're going to have RAM and performance problems.
>
> I believe there are some things in trunk that handle this better. I don't
> know the details, but "near real time search" is the phrase to google or
> to ask about on this list.
>
> Or, if it's acceptable for your requirements, you could record all the "I've
> read this" clicks in an external store, and only add them to the Solr index
> nightly, or even hourly.  If you batch em and add em as frequently as you
> can get away with (every hour, sure; every 10 minutes, pushing it; every
> minute, no), you can get around that issue. Or for that matter you could ADD
> em to Solr but only 'commit' every hour or whatever, but I don't like that
> strategy, since if Solr crashes or otherwise restarts you pretty much lose
> those pending uncommitted documents; better to queue em up in an external
> store.
>
> On 1/19/2011 1:52 PM, Markus Jelsma wrote:
>>
>> Hi,
>>
>> I've never seen Solr's behaviour with a huge number of values in a
>> multi-valued field, but I think it should work alright. Then you can store
>> a list of user IDs along with each book document and use filter queries to
>> include or exclude the book from the result set.
>>
>> Cheers,
>>
>>> Hi,
>>>
>>> I'm looking for ideas on how to make an efficient facet query on a
>>> user's history with respect to the catalog of documents (something
>>> like "Read document already: yes / no"). The catalog is around 100k
>>> titles and there are several thousand users. Of course, each user has
>>> a different history, many having read fewer than 500 titles, but some
>>> heavy users having read perhaps 50k titles.
>>>
>>> Performance is not terribly important right now, so all I did was bump
>>> up the boolean query limit and put together a big string of document
>>> IDs that the user has read. The first query is slow, but once it's in
>>> the query cache it's fine. I would like to find a better way of doing
>>> it, though.
>>>
>>> What type of Solr plugin would be best suited to helping in this
>>> situation? I could make a function plugin that provides something like
>>> hasHadBefore() - true/false, but would that be efficient for faceting
>>> and filtering? Another idea is a QParserPlugin that looks for a field
>>> like hasHadBefore:userid and somehow substitutes in the list of docs.
>>> But I'm not sure how a new parser plugin would interact with the
>>> existing parser. Can Solr use a parser plugin to handle only one
>>> field, and leave all the other fields to the default parser?
>>>
>>> Thanks,
>>> Jon
>
