New to Solr/Lucene design question

Yogesh Chawla - PD Tue, 20 Jan 2009 14:30:08 -0800

Hello All,
We are using SOLR/Lucene as the search engine for an application
we are designing.  The application is a workflow application that can
receive different types of documents.


For example, we are currently working on getting booking documents but
will also accept arrest documents later this year.

We have defined a custom schema that incorporates some schemas designed
by federal consortiums.  From those schemas we pluck out values that we want 
SOLR/Lucene to index and search on and we go from our instance document to
a SOLR document.

The fields in our schema.xml look like this:

<fields>
   <!-- record-uri, unique identifier for any type of record -->
   <field name="record-uri" type="string" indexed="true" stored="true" required="true" />
   <!-- stash-filepath, path to the entire XML document on the file system -->
   <field name="stash-filepath" type="string" indexed="true" stored="true" required="true" />
   <!-- stash-content THIS IS THE FIELD I HAVE QUESTIONS ABOUT -->
   <field name="stash-content" type="string" indexed="true" stored="true" termVectors="true" multiValued="true" omitNorms="true" />
</fields>

Above, there is a field called "stash-content".  The goal is to take any
searchable data from any document type and put it in this field.  For example,
we would store data like this in XML format:


<add>
  <doc>
    <field name="stash-content">arrestee_firstname_Yogesh</field>
    <field name="stash-content">arrestee_lastname_Chawla</field>
    <field name="stash-content">arrestee_middlename_myMiddleName</field>
  </doc>
</add>
The advantage of this approach is that we can add new document types to
search on and, as long as they use the same semantics (such as
arrestee_firstname), we won't have to update any code.  It also keeps
the code simple and generic for any document type.

We can do a "starts with" search on first name with a query like
arrestee_firstname_Y*.  We had to use _ instead of a space so that the
whole value is searched as a single string, rather than each word being
searched separately when a query is performed.  (Hope that makes sense.)

The con could be a performance hit.

The other approach is to add fields explicitly like this:

<add>
  <doc>
    <field name="arrestee_firstname">Yogesh</field>
    <field name="arrestee_lastname">Chawla</field>
    <field name="arrestee_middlename">myMiddleName</field>
  </doc>
</add>
This approach seems more traditional.  The pro is that it is
straightforward.  The con is that every time we add a new document type
to search on, we have to update schema.xml and the Java code that creates
SOLR documents.
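
As a rough sketch of what that schema.xml update would involve, the explicit
fields for the example above might be declared as follows.  The attribute
values (type="string", indexed and stored both true) are assumptions copied
from our existing field definitions, not a configuration we have tested:

```xml
<fields>
   <!-- Hypothetical explicit fields for approach 2: one entry per searchable value. -->
   <!-- Attribute values are assumed to mirror the existing record-uri field. -->
   <field name="arrestee_firstname" type="string" indexed="true" stored="true" />
   <field name="arrestee_lastname" type="string" indexed="true" stored="true" />
   <field name="arrestee_middlename" type="string" indexed="true" stored="true" />
</fields>
```

Each new document type (booking, arrest, and so on) would need its own set of
entries like these, which is the maintenance cost described above.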

The number of documents that we will eventually want to search on is about 5 
million.  However, this will take a while
to ramp up to and we are more immediately looking at searching on about 100,000.

I am new to SOLR and just inherited this project with approach number 1.  Is 
this something that is going to bite us in the
future?

Thanks,
Yogesh
