All -

My apologies in advance for a rather long email message, especially
from a first-time poster to this list. I'm looking at using SOLR to
replace our custom HTTP / XML infrastructure for Lucene that we built
to tightly integrate with our web apps running in an Oracle, non-Java
environment.

Evaluating our migration led me to a few considerations that I would
like to propose to this group both for feedback and feasibility within
SOLR. We would be very happy contributing effort to build on top of
SOLR to the extent the community finds value in the work.

BACKGROUND:

United eWay is an application service provider offering highly
customizable web applications for philanthropic purposes. We've
created many database-backed web applications, offering search until
recently via Oracle's interMedia product.

We decided to move all search out of Oracle and into Lucene in late
2005. Our infrastructure is based on AOLServer and a variant of
OpenACS, neither of which offered good integration with Java or one of
the ports of Lucene.

Prior to learning about SOLR, we deployed our own HTTP / XML based
services meeting the needs that we had:

  * Tight database integration - indexing a table in Oracle requires
    the execution of several stored procedures. That is, we provided
    an API in the database to synchronize the database table schema
    with the schema that we used for indexing.

  * Integrated support for partitioning - database tables can be
    partitioned for scalability reasons. The most common scenario for
    us is to partition off data for our largest customers. For
    example, imagine a users table:

     * user_id
     * email_address
     * site_id

    where site_id refers to the customer to whom the user
    belongs. Some sites aggregate data... i.e. one of our customers
    may have 100 sites. When indexing, we create a separate index to
    store only data for a given site. This precomputes one of our more
    expensive computations for search - a filter for all users that
    belong to a given site.

  * Decoupled infrastructure - we wanted the ability to fully scale
    our search application independent of our database application

  * High speed indexing - we initially moved data from the database to
    Lucene via XML documents. We found that to index even 100k
    documents, it was much faster to move the data in CSV files
    (smaller files, less intensive processing).
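
As a rough illustration of the partitioning idea above, the sketch
below routes a row to an index name based on its site_id. Every name
in it (SitePartitionRouter, indexFor, the "_site_" suffix) is
hypothetical and comes from neither SOLR nor Lucene. It also assumes
that only the largest sites are partitioned off and that all other
sites share one index per table.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: these names are illustrative, not part of SOLR or Lucene.
final class SitePartitionRouter {
    // Sites large enough to be partitioned off into their own index (assumption).
    private final Set<Integer> partitionedSites;

    SitePartitionRouter(Set<Integer> partitionedSites) {
        this.partitionedSites = new HashSet<>(partitionedSites);
    }

    /** Returns the name of the index that holds rows of `table` for `siteId`. */
    String indexFor(String table, int siteId) {
        // A partitioned site gets a dedicated index, so a search within that
        // site needs no extra site_id filter at query time.
        if (partitionedSites.contains(siteId)) {
            return table + "_site_" + siteId;
        }
        // Every other site shares the table-wide index.
        return table;
    }

    public static void main(String[] args) {
        SitePartitionRouter router =
                new SitePartitionRouter(new HashSet<>(Arrays.asList(42)));
        System.out.println(router.indexFor("users", 42));
        System.out.println(router.indexFor("users", 7));
    }
}
```

With this layout, a search scoped to a partitioned site opens only
that site's index. Searches against the shared index would still need
a site_id filter.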


IDEAS:

Looking through SOLR, I've identified the following main categories of
change. I would love to hear comments and feedback from this group. My
preference would be to build these changes directly into SOLR rather
than maintain our own application, but that presupposes interest from
the community. The general thought is to introduce the concept of an
objectType into the schema. For example:

 <objectType name="users">
   <fields>
     <field name="id" type="string" indexed="false" stored="true"/>
     <field name="email_address" type="text" indexed="false" stored="true"/>

     <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
     <copyField source="id" dest="text"/>
     <copyField source="email_address" dest="text"/>

     <uniqueKey>id</uniqueKey>
   </fields>         
 </objectType>

 Within one global schema for SOLR, we would provide the ability
 to define which fields are available for which types of objects,
 and how they are analyzed. Each object type would then be stored
 in an independent Lucene index.

 I've dug a bit into the codebase to see what impact this would
 have. The change is a relatively large conceptual change, but I
 believe doable given the nicely separated core package:

   * Provide a factory to get a SolrCore instance (i.e. replace
     SolrCore.getSolrCore with SolrCore.getInstance(String objectType))

   * Modify getInstanceDir, newSearcher, initIndex to accept an objectType

   * Provide backwards compatibility by providing a new schema file
     (e.g. schema-typed.xml). Include a 'default' object type for
     folks that would like to preserve the existing treatment of
     schemas in SOLR. Users would provide either the existing
     schema.xml file (resulting in one default object type) or the
     schema-typed.xml file.
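
 The factory and the backwards-compatible default described above
 might look roughly like the sketch below. TypedCore is a
 hypothetical stand-in, not the real SolrCore class, and the
 directory layout in getIndexDir is an assumption made only for
 illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for SolrCore; the real class looks different.
final class TypedCore {
    // Object type used when only the existing schema.xml is provided.
    static final String DEFAULT_TYPE = "default";

    // One core per object type, created lazily and then reused.
    private static final Map<String, TypedCore> CORES = new ConcurrentHashMap<>();

    private final String objectType;

    private TypedCore(String objectType) {
        this.objectType = objectType;
    }

    /** Returns the single core for the given object type. */
    static TypedCore getInstance(String objectType) {
        return CORES.computeIfAbsent(objectType, TypedCore::new);
    }

    /** Backwards-compatible entry point: callers get the default type. */
    static TypedCore getInstance() {
        return getInstance(DEFAULT_TYPE);
    }

    String getObjectType() {
        return objectType;
    }

    /** Assumed layout: each object type has its own index directory. */
    String getIndexDir(String instanceDir) {
        return instanceDir + "/" + objectType + "/index";
    }

    public static void main(String[] args) {
        System.out.println(TypedCore.getInstance("users").getIndexDir("solr"));
        System.out.println(TypedCore.getInstance().getIndexDir("solr"));
    }
}
```

 Keeping a no-argument getInstance() that delegates to the default
 type is one way existing callers could keep working unchanged while
 new callers pass an explicit object type.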



Your comments and thoughts would be much appreciated.

Best,
Michael Bryzek
