A third option: use dynamic fields. Add a dynamic field called "*_stash". This lets you add new fields to documents down the road without changing schema.xml, while still letting you query on fields like "arresteeFirstName_stash" without extra overhead.
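As a rough sketch of the dynamic field suggestion, the declaration and a matching document might look like the following. The type "string" and the indexed/stored settings are assumptions that mirror the other fields in this thread, and the field name is only the example used above, so adjust them to your own schema.

```xml
<!-- schema.xml: goes inside the existing <fields> section.
     Any field whose name ends in _stash is accepted without a schema change.
     type/indexed/stored are assumed values, not a recommendation. -->
<dynamicField name="*_stash" type="string" indexed="true" stored="true"/>

<!-- Update message: the field below is never declared explicitly,
     it simply matches the *_stash pattern. -->
<add>
  <doc>
    <field name="record-uri">example-record-uri</field>
    <field name="stash-filepath">/example/path/to/document.xml</field>
    <field name="arresteeFirstName_stash">Yogesh</field>
  </doc>
</add>
```

With something like this in place, a starts-with search on first name would be written as arresteeFirstName_stash:Y* instead of packing the field name into the value.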
-Todd Feak

-----Original Message-----
From: Yogesh Chawla - PD [mailto:premiergenerat...@yahoo.com]
Sent: Tuesday, January 20, 2009 2:30 PM
To: solr-user@lucene.apache.org
Subject: New to Solr/Lucene design question

Hello All,

We are using SOLR/Lucene as the search engine for an application we are designing. The application is a workflow application that can receive different types of documents. For example, we are currently working on getting booking documents but will also accept arrest documents later this year.

We have defined a custom schema that incorporates some schemas designed by federal consortiums. From those schemas we pluck out the values that we want SOLR/Lucene to index and search on, and we go from our instance document to a SOLR document.

The fields in our schema.xml look like this:

<fields>
  <!-- record-uri, unique identifier for any type of record -->
  <field name="record-uri" type="string" indexed="true" stored="true" required="true" />
  <!-- stash-filepath, path to the entire XML document on the file system -->
  <field name="stash-filepath" type="string" indexed="true" stored="true" required="true" />
  <!-- stash-content THIS IS THE FIELD I HAVE QUESTIONS ABOUT -->
  <field name="stash-content" type="string" indexed="true" stored="true" termVectors="true" multiValued="true" omitNorms="true"/>
</fields>

Above, there is a field called "stash-content". The goal is to take any searchable data from any document type and put it in this field. For example, we would store data like this in XML format:

<add>
  <doc>
    <field name="stash-content">arrestee_firstname_Yogesh</field>
    <field name="stash-content">arrestee_lastname_Chawla</field>
    <field name="stash-content">arrestee_middlename_myMiddleName</field>
  </doc>
</add>

The advantage of such an approach is that we can add new document types to search on and, as long as they use the same semantics such as arrestee_firstname, we won't have to update any code.
It also makes the code simple and generic for any document type. We can search on first name with a starts-with query like this: arrestee_firstname_Y*. We had to use "_" instead of a space so that a query searches the value as a single string rather than word by word. (Hope that makes sense.) The con could be a performance hit.

The other approach is to add fields explicitly like this:

<add>
  <doc>
    <field name="arrestee_firstname">Yogesh</field>
    <field name="arrestee_lastname">Chawla</field>
    <field name="arrestee_middlename">myMiddleName</field>
  </doc>
</add>

This approach seems more traditional. The pro is that it is straightforward. The con is that every time we add a new document type to search on, we have to update schema.xml and the Java code that creates SOLR documents.

The number of documents that we will eventually want to search on is about 5 million. However, it will take a while to ramp up to that, and more immediately we are looking at searching on about 100,000.

I am new to SOLR and just inherited this project with approach number 1. Is this something that is going to bite us in the future?

Thanks,
Yogesh