Yonik -

> So the number of filters is equal to the number of sites?
> How many sites are there?

Today: When new customers join, we generally don't do anything
special. Currently we have roughly 400 customers, most of which have
one site each. Note that a few customers have as many as 50 sites. In
total, we probably filter data in 500 unique ways, before we actually
search on the query string entered by the user. Of the 500 unique ways
in which we filter data, there are approximately 50 for which we would
prefer to use a unique index. I don't have 100% accurate numbers, but
these should be in the ballpark.

Future: We are planning to expand on the concepts we've developed to
integrate Lucene and hopefully SOLR into other applications. One in
particular:

  * Provides a core data set of 100K records
  * Allows each of 1,000 customers to create their own view of that
    data
  * In theory, our overall dataset may contain up to 100K * 1,000
    records (100M), but we know that at any given time, only 100K
    records should be made available.

We did rough tests and found that creating multiple indexes performed
better at run time, especially as the logic to determine what results
should be presented to which customer became more complex.

> Support for indexing from CSV files as well as simple pulling from a
> database is on our "todo" list: http://wiki.apache.org/solr/TaskList

I had seen this on the TODO list. I'm offering to contribute this
piece when we've got an idea of overall fit...

> How would one identify what index (or SolrCore) an update is
> targeted to?

This is a good question. I think the query interface itself would have
to be extended. That is, a new parameter would have to be introduced
which identifies the objectType you would like to search/update. If
omitted, the default object type would be used. In our current system,
we set the objectType to the name of the database table and thus can
issue queries like:

  search.jsp?tableName=users&queryString=email:michael.bryzek

> What is the relationship between the multiple indices...
> do queries ever go across multiple indices, or would there be an
> "objectType" parameter passed in as part of the query?

In our case, there is no relationship between the multiple indices,
but I do see value here (more on this below). In our specific case, we
have a one-to-one mapping between a database table and a Lucene index
and have not needed to search across tables. I think the value of the
objectType is this true independence. If you are indexing similar
data, use a field on your data. If your data sets are truly different,
use a different object type.

> What is the purpose of multiple indices... is it so search results
> are always restricted to a single site, but it's not practical to
> have that many Solr instances? It looks like the indices are
> partitioned along the lines of object type, and not site-id though.

Your questions and comments are good. Thinking about them has helped
me to clarify what exactly we're trying to accomplish. I think it
boils down to these goals:

a) Minimize the number of instances of SOLR. If I have 3 web
   applications, each with 12 database tables to index, I don't want
   to run 36 JVMs. I think introducing an objectType would address
   this.

b) Optimize retrieval when I have some knowledge that I can use to
   define partitions of data. This may actually be more appropriate
   for Lucene itself, but I see SOLR as pretty well positioned to
   address it. One approach is to introduce a "partitionField" that
   SOLR would use to figure out if a new index is required. For each
   unique value of the partitionField, we create a separate physical
   index. If the query does NOT contain a term for the partitionField,
   we use a multi reader to search across all indexes. If the query
   DOES contain the term, we only search across those partitions.

We have tried using cached bitsets to implement this sort of approach,
but have found that when we have one large document set partitioned
into much smaller sets (e.g.
1-10% of the total document space), creating separate indexes gives us
a much higher boost in performance.

-Mike

-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Wed 4/12/06 11:54 AM
To: solr-user@lucene.apache.org
Cc:
Subject: Re: Interest in Extending SOLR

Welcome Michael,

On 4/12/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
> * Integrated support for partitioning - database tables can be
>   partitioned for scalability reasons. The most common scenario for
>   us is to partition off data for our largest customers. For
>   example, imagine a users table:
>
>     * user_id
>     * email_address
>     * site_id
>
>   where site_id refers to the customer to whom the user
>   belongs. Some sites aggregate data... i.e. one of our customers
>   may have 100 sites. When indexing, we create a separate index to
>   store only data for a given site. This precomputes one of our more
>   expensive computations for search - a filter for all users that
>   belong to a given site.

So the number of filters is equal to the number of sites?
How many sites are there?

> * Decoupled infrastructure - we wanted the ability to fully scale
>   our search application independent of our database application

That makes total sense... we do the same thing.

> * High speed indexing - we initially moved data from the database to
>   Lucene via XML documents. We found that to index even a 100k
>   documents, it was much faster to move the data in CSV files
>   (smaller files, less intensive processing).

Support for indexing from CSV files as well as simple pulling from a
database is on our "todo" list: http://wiki.apache.org/solr/TaskList

> IDEAS:
>
> Looking through SOLR, I've identified the following main categories of
> change. I would love to hear comments and feedback from this group.

It would be nice to make any changes as general as possible, while
still solving your particular problem.
I think I understand many of the internal changes you outlined, but
I'm not sure yet exactly what problem you are trying to solve, and how
the multiple indices will be used.

- How would one identify what index (or SolrCore) an update is
  targeted to?
- What is the relationship between the multiple indices... do queries
  ever go across multiple indices, or would there be an "objectType"
  parameter passed in as part of the query?
- What is the purpose of multiple indices... is it so search results
  are always restricted to a single site, but it's not practical to
  have that many Solr instances? It looks like the indices are
  partitioned along the lines of object type, and not site-id though.

-Yonik
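
For anyone following the thread, the "partitionField" routing described
in the reply (one physical index per unique value, searching a single
index when the query names the partition and all indexes otherwise)
might be sketched roughly as below. This is only an illustration in
Python, with plain in-memory lists standing in for Lucene indexes. The
names PartitionedIndex, add, indexes_for and search, and the example.com
addresses, are hypothetical and are not part of any Solr or Lucene API.

```python
from collections import defaultdict


class PartitionedIndex:
    """Toy stand-in for one physical index per partitionField value."""

    def __init__(self, partition_field):
        self.partition_field = partition_field
        # partition value -> list of documents (each "index" is a plain list)
        self.partitions = defaultdict(list)

    def add(self, doc):
        # Route the document to the index for its partition value,
        # creating that index on first use.
        self.partitions[doc[self.partition_field]].append(doc)

    def indexes_for(self, query):
        # The query names the partition: search only that index.
        if self.partition_field in query:
            value = query[self.partition_field]
            return [self.partitions[value]] if value in self.partitions else []
        # Otherwise search across every index (the "multi reader" case).
        return list(self.partitions.values())

    def search(self, query):
        # A document matches when every query field equals its own value.
        return [
            doc
            for index in self.indexes_for(query)
            for doc in index
            if all(doc.get(field) == value for field, value in query.items())
        ]


index = PartitionedIndex("site_id")
index.add({"user_id": 1, "email_address": "a@example.com", "site_id": 10})
index.add({"user_id": 2, "email_address": "b@example.com", "site_id": 20})
index.add({"user_id": 3, "email_address": "a@example.com", "site_id": 20})

# Restricted to site 20: only that site's index is scanned.
one_site = index.search({"site_id": 20, "email_address": "a@example.com"})
# No site_id term: every index is scanned.
all_sites = index.search({"email_address": "a@example.com"})
```

The point of the sketch is only the routing decision in indexes_for: a
query carrying the partition term touches one small index, which is
where the reply reports the performance gain over cached bitsets on a
single large index.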