All - My apologies in advance for a rather long email message, especially from a first-time poster to this list. I'm looking at using SOLR to replace our custom HTTP / XML infrastructure for Lucene, which we built to integrate tightly with our web apps running in an Oracle, non-Java environment.
Evaluating our migration led me to a few considerations that I would like to propose to this group both for feedback and feasibility within SOLR. We would be very happy contributing effort to build on top of SOLR to the extent the community finds value in the work.

BACKGROUND:

United eWay is an application server provider offering highly customizable web applications for philanthropic purposes. We've created many database backed web applications, offering search until recently via Oracle's interMedia product. We decided to move all search out of Oracle and into Lucene in late 2005. Our infrastructure is based on AOLServer and a variant of OpenACS, neither of which offered good integration with Java or one of the ports of Lucene.

Prior to learning about SOLR, we deployed our own HTTP / XML based services meeting the needs that we had:

* Tight database integration - indexing a table in Oracle requires the execution of several stored procedures. That is, we provided an API in the database to synchronize the database table schema with the schema that we used for indexing.

* Integrated support for partitioning - database tables can be partitioned for scalability reasons. The most common scenario for us is to partition off data for our largest customers. For example, imagine a users table:

    * user_id
    * email_address
    * site_id

  where site_id refers to the customer to whom the user belongs. Some sites aggregate data, i.e. one of our customers may have 100 sites. When indexing, we create a separate index to store only data for a given site. This precomputes one of our more expensive computations for search - a filter for all users that belong to a given site.

* Decoupled infrastructure - we wanted the ability to fully scale our search application independent of our database application

* High speed indexing - we initially moved data from the database to Lucene via XML documents.
  We found that to index even 100k documents, it was much faster to move the data in CSV files (smaller files, less intensive processing).

IDEAS:

Looking through SOLR, I've identified the following main categories of change. I would love to hear comments and feedback from this group. My preference would be to build these changes directly into SOLR rather than maintain our own application, but this presupposes interest from the community.

The general thought is to introduce the concept of an objectType into the schema. For example:

    <objectType name="users">
      <fields>
        <field name="id" type="string" indexed="false" stored="true"/>
        <field name="email_address" type="text" indexed="false" stored="true"/>
        <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
        <copyField source="id" dest="text"/>
        <copyField source="email_address" dest="text"/>
        <uniqueKey>id</uniqueKey>
      </fields>
    </objectType>

Within one global schema for SOLR, we would provide the ability to define which fields are available for which types of objects, and how they are analyzed. Each object type would then be stored in an independent Lucene index.

I've dug a bit into the codebase to see what impact this would have. This is a relatively large conceptual change, but I believe it is doable given the nicely separated core package:

* Provide a factory to get a SolrCore instance (i.e. replace SolrCore.getSolrCore with SolrCore.getInstance(String objectType))

* Modify getInstanceDir, newSearcher, initIndex to accept an objectType

* Provide backwards compatibility by providing a new schema file (e.g. schema-typed.xml). Include a 'default' object type for folks that would like to preserve the existing treatment of schemas in SOLR. Users would provide either the existing schema.xml file (resulting in one default object type) or the schema-typed.xml file.

Your comments and thoughts would be much appreciated.

Best,

Michael Bryzek
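
Sketch for discussion: the factory idea above (one core per object type, each with its own index, plus a 'default' type for backwards compatibility) might look roughly like the following. This is only an illustration under stated assumptions, not existing SOLR code. TypedCore and TypedCoreRegistry are hypothetical names, TypedCore is a placeholder standing in for SolrCore, and the "<instanceDir>/<objectType>/index" directory layout is assumed purely for the example. It uses Java 8 or later.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Placeholder for a per-type core. Hypothetical: the real SolrCore is not used here. */
final class TypedCore {
    private final String objectType;
    private final String indexDir;

    TypedCore(String objectType, String indexDir) {
        this.objectType = objectType;
        this.indexDir = indexDir;
    }

    String getObjectType() { return objectType; }

    String getIndexDir() { return indexDir; }
}

/** Hypothetical factory: one lazily created core per object type. */
final class TypedCoreRegistry {
    /** Type used when the caller does not name one (the existing schema.xml case). */
    static final String DEFAULT_TYPE = "default";

    private final String instanceDir;
    private final Map<String, TypedCore> cores = new ConcurrentHashMap<>();

    TypedCoreRegistry(String instanceDir) {
        this.instanceDir = instanceDir;
    }

    /** Returns the core for the given object type, creating it on first use. */
    TypedCore getInstance(String objectType) {
        String type = (objectType == null || objectType.isEmpty()) ? DEFAULT_TYPE : objectType;
        // Assumed layout: each object type gets its own index directory.
        return cores.computeIfAbsent(type, t -> new TypedCore(t, instanceDir + "/" + t + "/index"));
    }

    /** Backwards-compatible accessor: callers unaware of object types get the default core. */
    TypedCore getInstance() {
        return getInstance(DEFAULT_TYPE);
    }
}
```

For example, new TypedCoreRegistry("solr/data").getInstance("users") would return the same core object on every call for "users", while the no-argument getInstance() keeps existing single-schema callers working against the 'default' type. Keeping the lookup in one registry also gives a single place where getInstanceDir, newSearcher and initIndex could later pick up the object type.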