I definitely like the idea of support for multiple indexes based on partitioning data that is NOT tied to a predefined element named objectType. If we combine this with Chris' mention of completing the work to support multiple schemas via multiple webapps in the same servlet container, then I no longer see an immediate need to have more than one schema per webapp. The concept would be:

* One schema per webapp, multiple webapps per JVM
* Partitioning of data into multiple indexes in each webapp, based on logic that you provide (see the sketch below)
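Purely to illustrate that second bullet, here is a minimal sketch of what the routing could look like in plain Lucene. None of this exists in Solr today; the class name and the partition map are hypothetical, while MultiReader and IndexSearcher are the real Lucene classes:

    // Hypothetical sketch: route a search to a single partition, or to
    // all partitions via a MultiReader, based on a partition value
    // (e.g. a site or customer id) extracted from the request.
    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;

    public class PartitionedSearcherFactory {

        // One IndexReader per unique value of the partition field.
        private final Map<String, IndexReader> partitions;

        public PartitionedSearcherFactory(Map<String, IndexReader> partitions) {
            this.partitions = partitions;
        }

        // A request that names a known partition searches only that
        // physical index; anything else searches all of them.
        public IndexSearcher searcherFor(String partitionValue) throws IOException {
            IndexReader reader =
                partitionValue == null ? null : partitions.get(partitionValue);
            if (reader != null) {
                return new IndexSearcher(reader);
            }
            IndexReader[] all =
                partitions.values().toArray(new IndexReader[partitions.size()]);
            return new IndexSearcher(new MultiReader(all));
        }
    }

This matches the "partitionField" behavior described further down in the quoted discussion: a term on the partition field narrows the search to one index, and its absence falls back to a multi reader across all of them.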
For our own applications, my preference is to migrate away from our homegrown solution to SOLR before investing further in what we currently have built. I will plan on testing performance a bit more formally to see if SOLR out of the box would work for us. Note that in our present environment, performance improved significantly (a factor of ~10) when we partitioned data into multiple indexes, though our tests were very rough. I would be very happy to contribute time to expand SOLR to provide initial support for the partitioning concept, as I believe this will prove critical when we evaluate how our database structure maps to a query index.

One last note: last night, I spent a bit of time looking into what exactly it would mean to add support for object types in SOLR. I modified the code base to support an object type tag in the schema, producing a working proof of concept (I'm happy to send a sample schema if anybody is interested). The main changes:

* Modify IndexSchema to keep an object type
* Provide a factory in SolrCore that returns the correct instance of SolrCore based on object type (roughly as sketched below)
* Modify loading of the schema to load one copy per object type
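To give a flavor of the factory change, here is a minimal sketch; every name in it is hypothetical (this is not the actual diff, and the real SolrCore is currently a singleton):

    // Hypothetical sketch: keep one core, each with its own loaded copy
    // of the schema, per object type, and return the matching one on request.
    import java.util.HashMap;
    import java.util.Map;

    public class ObjectTypeCoreRegistry {

        /** Stand-in for SolrCore: pairs an object type with its schema copy. */
        public static class Core {
            final String objectType;
            final String schemaResource; // e.g. "schema-product.xml" (illustrative)

            Core(String objectType, String schemaResource) {
                this.objectType = objectType;
                this.schemaResource = schemaResource;
            }
        }

        private final Map<String, Core> cores = new HashMap<String, Core>();

        /** Register the schema copy loaded for one object type. */
        public synchronized void register(String objectType, String schemaResource) {
            cores.put(objectType, new Core(objectType, schemaResource));
        }

        /** The factory: hand back the core that matches the object type. */
        public synchronized Core coreFor(String objectType) {
            Core core = cores.get(objectType);
            if (core == null) {
                throw new IllegalArgumentException("unknown object type: " + objectType);
            }
            return core;
        }
    }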
I really do like where this conversation has gone, and if the community does choose to support multiple object types, on the surface (to a newcomer) it appears highly doable.

-Mike

-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 13, 2006 2:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Interest in Extending SOLR

Michael,

I'm not sure that objectType should be tied to which index something is stored in. If Solr does evolve multiple index support, one use case would be partitioning data based on factors other than objectType (documentType). It would seem more flexible for clients (the direct updater or querier of Solr) to identify which index should be used. Of course each index could have its own schema, but it shouldn't be mandatory... it seems like a new index should be able to be created on-the-fly somehow, perhaps using an existing index as a template.

On 4/12/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
> We did rough tests and found that creating multiple indexes performed
> better at run time, especially as the logic to determine what results
> should be presented to which customer became more complex.

I would expect searching a small index to be somewhat faster than searching a large index with the small one embedded in it. How much faster, though? Is it really worth the effort to separate things out? When you did the benchmarks, did you make sure to discount the first queries (because of first-use norm and FieldCache loading)? All of that can be done in the background...

I'm not arguing against extending Solr to support multiple indices, but wondering if you could start using it as-is until such support is well hashed out. It seems so, since this appears to be an issue of performance (an optimization) and not functionality, right?

Another easy optimization you might be able to make external to Solr is to segment your site data into different Solr collections (on different boxes). This assumes that search traffic is naturally partitioned by siteId (but I may be misunderstanding).

> a) Minimize the number of instances of SOLR. If I have 3 web
>    applications, each with 12 database tables to index, I don't want
>    to run 36 JVMs.

I think introducing an objectType would address this. Another possible option is to run multiple Solr instances (webapps) per appserver... I recall someone else going after this solution.

> b) Optimize retrieval when I have some knowledge that I can use to
>    define partitions of data. This may actually be more appropriate
>    for Lucene itself, but I see SOLR as pretty well positioned to
>    address it. One approach is to introduce a "partitionField" that
>    SOLR would use to figure out if a new index is required. For each
>    unique value of the partitionField, we create a separate physical
>    index. If the query does NOT contain a term for the
>    partitionField, we use a multi reader to search across all
>    indexes. If the query DOES contain the term, we only search
>    across those partitions.

While that approach might be better without caching, it might be worse with caching... it really depends on the nature of the index and the queries. It would also really complicate Solr's caching, since a cache item would only be valid for certain combinations of sub-indices.

> We have tried using cached bitsets to implement this sort of
> approach, but have found that when we have one large document set
> partitioned into much smaller sets (e.g. 1-10% of the total
> document space), creating separate indexes gives us a much higher
> boost in performance.

I assume this was with Lucene and not Solr? Solr has better/faster filter representations... (and if I ever get around to finishing it, a faster BitSet implementation too).

-Yonik
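For concreteness, here is a hedged sketch of the cached-bitset filtering being compared above against separate physical indexes, using stock Lucene of this era. The "siteId" field name comes from the discussion; the class and method names are illustrative:

    // Hedged sketch: confine searches to one site's slice of a single
    // large index using a cached bitset filter instead of a separate index.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    public class SiteFilteredSearch {

        // QueryFilter caches its bitset per IndexReader inside the filter
        // instance, so reuse one filter per siteId rather than building a
        // new one for every request.
        private final Map<String, QueryFilter> filters =
            new HashMap<String, QueryFilter>();

        public synchronized Hits search(IndexSearcher searcher, Query userQuery,
                                        String siteId) throws IOException {
            QueryFilter filter = filters.get(siteId);
            if (filter == null) {
                filter = new QueryFilter(new TermQuery(new Term("siteId", siteId)));
                filters.put(siteId, filter);
            }
            // The first use pays the full bitset computation; later queries
            // only pay the cost of intersecting hits with the cached bitset.
            return searcher.search(userQuery, filter);
        }
    }

Whether this beats one physical index per partition is exactly the trade-off debated above: the filter keeps one big index (and one set of caches), while separate indexes shrink the term dictionaries and postings each query must touch.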