I definitely like the idea of support for multiple indexes based on partitioning data that is NOT tied to a predefined element named objectType. If we combine this with Chris' mention of completing the work to support multiple schemas via multiple webapps in the same servlet container, then I no longer see an immediate need to have more than one schema per webapp. The concept would be:

* One schema per webapp, multiple webapps per JVM
* Partitioning of data into multiple indexes in each webapp, based on logic that you provide (see the sketch below)
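Purely to illustrate that second bullet, here is a minimal sketch of what the routing could look like in plain Lucene. None of this exists in Solr today; the class name and the partition map are hypothetical, while MultiReader and IndexSearcher are the real Lucene classes:

    // Hypothetical sketch: route a search to a single partition, or to
    // all partitions via a MultiReader, based on a partition value
    // (e.g. a site or customer id) extracted from the request.
    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;

    public class PartitionedSearcherFactory {

        // One IndexReader per unique value of the partition field.
        private final Map<String, IndexReader> partitions;

        public PartitionedSearcherFactory(Map<String, IndexReader> partitions) {
            this.partitions = partitions;
        }

        // A request that names a known partition searches only that
        // physical index; anything else searches all of them.
        public IndexSearcher searcherFor(String partitionValue) throws IOException {
            IndexReader reader =
                partitionValue == null ? null : partitions.get(partitionValue);
            if (reader != null) {
                return new IndexSearcher(reader);
            }
            IndexReader[] all =
                partitions.values().toArray(new IndexReader[partitions.size()]);
            return new IndexSearcher(new MultiReader(all));
        }
    }

This matches the "partitionField" behavior described further down in the quoted discussion: a term on the partition field narrows the search to one index, and its absence falls back to a multi reader across all of them.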
For our own applications, my preference is to migrate away from our homegrown solution to SOLR before investing further in what we currently have built. I will plan on testing performance a bit more formally to see if SOLR out of the box would work for us. Note that in our present environment, performance improved significantly (a factor of ~10) when we partitioned data into multiple indexes, though our tests were very rough. I would be very happy to contribute time to expand SOLR to provide initial support for the partitioning concept, as I believe this will prove critical when we evaluate how our database structure maps to a query index.

One last note: last night, I spent a bit of time looking into what exactly it would mean to add support for object types in SOLR. I modified the code base to support an object type tag in the schema, producing a working proof of concept (I'm happy to send a sample schema if anybody is interested). The main changes:

* Modify IndexSchema to keep an object type
* Provide a factory in SolrCore that returns the correct instance of SolrCore based on object type (roughly as sketched below)
* Modify loading of the schema to load one copy per object type
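To give a flavor of the factory change, here is a minimal sketch; every name in it is hypothetical (this is not the actual diff, and the real SolrCore is currently a singleton):

    // Hypothetical sketch: keep one core, each with its own loaded copy
    // of the schema, per object type, and return the matching one on request.
    import java.util.HashMap;
    import java.util.Map;

    public class ObjectTypeCoreRegistry {

        /** Stand-in for SolrCore: pairs an object type with its schema copy. */
        public static class Core {
            final String objectType;
            final String schemaResource; // e.g. "schema-product.xml" (illustrative)

            Core(String objectType, String schemaResource) {
                this.objectType = objectType;
                this.schemaResource = schemaResource;
            }
        }

        private final Map<String, Core> cores = new HashMap<String, Core>();

        /** Register the schema copy loaded for one object type. */
        public synchronized void register(String objectType, String schemaResource) {
            cores.put(objectType, new Core(objectType, schemaResource));
        }

        /** The factory: hand back the core that matches the object type. */
        public synchronized Core coreFor(String objectType) {
            Core core = cores.get(objectType);
            if (core == null) {
                throw new IllegalArgumentException("unknown object type: " + objectType);
            }
            return core;
        }
    }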
I really do like where this conversation has gone, and if the community does choose to support multiple object types, on the surface (to a newcomer) it appears highly doable.

-Mike

-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 13, 2006 2:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Interest in Extending SOLR

Michael,

I'm not sure that objectType should be tied to which index something is stored in. If Solr does evolve multiple index support, one use case would be partitioning data based on factors other than objectType (documentType). It would seem more flexible for clients (the direct updater or querier of Solr) to identify which index should be used. Of course each index could have its own schema, but it shouldn't be mandatory... it seems like a new index should be able to be created on-the-fly somehow, perhaps using an existing index as a template.

On 4/12/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
> We did rough tests and found that creating multiple indexes performed
> better at run time, especially as the logic to determine what results
> should be presented to which customer became more complex.

I would expect searching a small index to be somewhat faster than searching a large index with the small one embedded in it. How much faster, though? Is it really worth the effort to separate things out? When you did the benchmarks, did you make sure to discount the first queries (because of first-use norm and FieldCache loading)? All of that can be done in the background...

I'm not arguing against extending Solr to support multiple indices, but wondering if you could start using it as-is until such support is well hashed out. It seems so, since this appears to be an issue of performance (an optimization) and not functionality, right?

Another easy optimization you might be able to make external to Solr is to segment your site data into different Solr collections (on different boxes). This assumes that search traffic is naturally partitioned by siteId (but I may be misunderstanding).

> a) Minimize the number of instances of SOLR. If I have 3 web
>    applications, each with 12 database tables to index, I don't want
>    to run 36 JVMs.

I think introducing an objectType would address this. Another possible option is to run multiple Solr instances (webapps) per appserver... I recall someone else going after this solution.

> b) Optimize retrieval when I have some knowledge that I can use to
>    define partitions of data. This may actually be more appropriate
>    for Lucene itself, but I see SOLR as pretty well positioned to
>    address it. One approach is to introduce a "partitionField" that
>    SOLR would use to figure out if a new index is required. For each
>    unique value of the partitionField, we create a separate physical
>    index. If the query does NOT contain a term for the
>    partitionField, we use a multi reader to search across all
>    indexes. If the query DOES contain the term, we only search
>    across those partitions.

While that approach might be better without caching, it might be worse with caching... it really depends on the nature of the index and the queries. It would also really complicate Solr's caching, since a cache item would only be valid for certain combinations of sub-indices.

> We have tried using cached bitsets to implement this sort of
> approach, but have found that when we have one large document set
> partitioned into much smaller sets (e.g. 1-10% of the total
> document space), creating separate indexes gives us a much higher
> boost in performance.

I assume this was with Lucene and not Solr? Solr has better/faster filter representations... (and if I ever get around to finishing it, a faster BitSet implementation too).

-Yonik
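For concreteness, here is a hedged sketch of the cached-bitset filtering being compared above against separate physical indexes, using stock Lucene of this era. The "siteId" field name comes from the discussion; the class and method names are illustrative:

    // Hedged sketch: confine searches to one site's slice of a single
    // large index using a cached bitset filter instead of a separate index.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    public class SiteFilteredSearch {

        // QueryFilter caches its bitset per IndexReader inside the filter
        // instance, so reuse one filter per siteId rather than building a
        // new one for every request.
        private final Map<String, QueryFilter> filters =
            new HashMap<String, QueryFilter>();

        public synchronized Hits search(IndexSearcher searcher, Query userQuery,
                                        String siteId) throws IOException {
            QueryFilter filter = filters.get(siteId);
            if (filter == null) {
                filter = new QueryFilter(new TermQuery(new Term("siteId", siteId)));
                filters.put(siteId, filter);
            }
            // The first use pays the full bitset computation; later queries
            // only pay the cost of intersecting hits with the cached bitset.
            return searcher.search(userQuery, filter);
        }
    }

Whether this beats one physical index per partition is exactly the trade-off debated above: the filter keeps one big index (and one set of caches), while separate indexes shrink the term dictionaries and postings each query must touch.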