Re: Interest in Extending SOLR

Yonik Seeley Thu, 13 Apr 2006 11:52:56 -0700

Michael,

I'm not sure that objectType should be tied to which index something
is stored in.
If Solr does evolve multiple index support, one use case would be
partitioning data based on factors other than objectType
(documentType).

It would seem more flexible for clients (the direct updater or querier
of Solr) to identify which index should be used.  Of course each index
could have its own schema, but it shouldn't be mandatory... it seems
like a new index should be able to be created on-the-fly somehow,
perhaps using an existing index as a template.

On 4/12/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
> We did rough tests and found that creating multiple indexes performed
> better at run time, especially as the logic to determine what results
> should be presented to which customer became more complex.

I would expect searching a small index would be somewhat faster than
searching a large index with the small one embedded in it.  How much
faster though?  Is it really worth the effort to separate things out? 
When you did the benchmarks, did you make sure to discount the first
queries (because of first-use norm and FieldCache loading)?  All that
can be done in the background...
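
As a rough illustration of discounting the first queries, the sketch
below times a list of queries and throws away the first few
measurements.  It is only an outline: `run_query` is a hypothetical
stand-in for whatever actually issues the search, and the `warmup`
count is an arbitrary assumption, not a recommended value.

```python
import time


def benchmark(run_query, queries, warmup=5):
    """Time each query, discarding the first `warmup` measurements.

    `run_query` is a hypothetical callable that executes one query.
    The early runs still execute (so caches get loaded), but their
    timings are not reported.
    """
    timings = []
    for i, query in enumerate(queries):
        start = time.perf_counter()
        run_query(query)
        elapsed = time.perf_counter() - start
        if i >= warmup:
            timings.append(elapsed)
    return timings
```

Comparing the small-index and large-index setups on the retained
timings only keeps one-time loading costs out of the comparison.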

I'm not arguing against extending Solr to support multiple indices,
but wondering if you could start using it as-is until such support is
well hashed out.  Seems so, since it appears to be an issue of
performance (an optimization) and not functionality, right?

Another easy optimization you might be able to make external to Solr
is to segment your site data into different Solr collections (on
different boxes).  This assumes that search traffic is naturally
partitioned by siteId (but I may be misunderstanding).

>   a) Minimize the number of instances of SOLR. If I have 3 web
>      applications, each with 12 database tables to index, I don't want
>      to run 36 JVMs. I think introducing an objectType would address
>      this.

Another possible option is to run multiple Solr instances (webapps)
per appserver... I recall someone else going after this solution.

>   b) Optimize retrieval when I have some knowledge that I can use to
>      define partitions of data. This may actually be more appropriate
>      for Lucene itself, but I see SOLR pretty well positioned to
>      address. One approach is to introduce a "partitionField" that
>      SOLR would use to figure out if a new index is required. For each
>      unique value of the partitionField, we create a separate physical
>      index. If the query does NOT contain a term for the
>      partitionField, we use a multi reader to search across all
>      indexes. If the query DOES contain the term, we only search
>      across those partitions.

While that approach might be better w/o caching, it might be worse
with caching... it really depends on the nature of the index and the
queries.
It would really complicate Solr's caching though, since a cache item
would only be valid for certain combinations of sub-indices.
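
To make the caching concern concrete, here is a minimal sketch of the
partitionField routing described in the quoted text, plus a cache key
that records which sub-indexes a cached result was computed against.
None of this is Solr code: the function names, the dict-based query
representation, and the use of siteId as the partition field are all
assumptions for illustration.

```python
def select_partitions(query_terms, partition_field, all_partitions):
    """Pick which sub-indexes a query should search.

    `query_terms` is a hypothetical dict of field name -> value.
    If the query names a partition, search only that one;
    otherwise search every partition.
    """
    value = query_terms.get(partition_field)
    if value is None:
        return frozenset(all_partitions)
    if value in all_partitions:
        return frozenset([value])
    return frozenset()


def cache_key(filter_terms, partitions):
    """Build a cache key that includes the set of sub-indexes.

    A result cached for one combination of partitions cannot be
    reused for another, so the combination is part of the key.
    """
    return (frozenset(filter_terms.items()), partitions)
```

With a key like this, the same filter searched against one partition
and against all partitions occupies two separate cache entries, which
is where the extra complexity (and extra memory) would come from.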

>      We have tried using cached bitsets to implement this sort of
>      approach, but have found that when we have one large document set
>      partitioned into much smaller sets (e.g. 1-10% of the total
>      document space), creating separate indexes gives us a much higher
>      boost in performance.

I assume this was with Lucene and not Solr?
Solr has better/faster filter representations... (and if I ever get
around to finishing it, a faster BitSet implementation too).
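
For illustration only, one general way a filter can be cheaper than a
full bitset when it matches a small fraction of the documents is to
switch representation by size.  The sketch below is not a description
of Solr's internals; the function names and the 5% threshold are
assumptions made up for this example, and doc ids are assumed to be
smaller than `max_doc`.

```python
def make_filter(doc_ids, max_doc, threshold=0.05):
    """Store a filter as a hash set when sparse, a bitset when dense."""
    ids = set(doc_ids)
    if len(ids) < max_doc * threshold:
        return ("sparse", frozenset(ids))
    bits = bytearray((max_doc + 7) // 8)  # one bit per document
    for doc_id in ids:
        bits[doc_id >> 3] |= 1 << (doc_id & 7)
    return ("dense", bytes(bits))


def contains(flt, doc_id):
    """Test whether `doc_id` passes the filter."""
    kind, data = flt
    if kind == "sparse":
        return doc_id in data
    return bool(data[doc_id >> 3] & (1 << (doc_id & 7)))
```

A filter covering 1-10% of the document space would sit near the
threshold here, so the cutoff would need tuning against real data.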

-Yonik
