I have a large database table with many document records, and I plan to use SOLR to improve search over those documents.
The twist here is that perhaps 50% of the records will originate from outside sources, and sometimes those records may be updated versions of documents we already have. Currently, a human visually examines the incoming information, performs a few document searches, and decides whether a new document must be created or an existing one should be updated. We would like to automate the matching to some extent, and it occurs to me that SOLR might be useful for this as well.

Each document has many attributes that can be used for matching, and the attributes are all in lookup tables. For example, there is a "location" field that might be something like "Central Public Library, Crawford, NE" for the row with id #4444. The incoming document might have something like "Crawford Central Public Library, Nebraska", which ideally would map to #4444 as well.

I'm currently thinking that a two-phase import might work. First, we use SOLR to try to get a list of attribute ids for the incoming document. Those can be used in ordinary database queries to find the primary keys of potential matches. Then we use SOLR again to search that reduced list against the unstructured information, essentially by including those primary keys as part of the search.

I was looking at the DIH example here: http://wiki.apache.org/solr/DataImportHandler and it is clear, but it is obviously slanted toward finding the products. I need to find the categories so that I can *then* find the products, if that makes sense.

Any suggestions on how to proceed? My first thought is that I should set up two SOLR instances, one for indexing only the attributes, and one for the documents themselves.

Thanks in advance for any help.

cheers,
Travis
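P.P.S. Rather than two full SOLR instances, I assume two cores in a single instance would also work; something like this legacy-style `solr.xml` is what I'm imagining (core names and directories are made up):

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- phase 1: index of attribute/lookup values only -->
    <core name="attributes" instanceDir="attributes" />
    <!-- phase 2: full-text index of the documents themselves -->
    <core name="documents" instanceDir="documents" />
  </cores>
</solr>
```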
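P.S. In case it helps to see the shape of the two phases, here is a rough, self-contained sketch of the logic I have in mind. The token-overlap scoring in phase 1 is just a local stand-in for what a query against a SOLR attribute index would do, and all field names (`id`, the lookup values, etc.) are made up for illustration; only the phase-2 part reflects real SOLR query parameters (`q` plus an `fq` filter query over candidate primary keys).

```python
import re


def tokens(s):
    """Lowercase word tokens, so word order and punctuation don't matter."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


# Phase 1: map incoming free-text attribute values to lookup-table ids.
def match_attribute(incoming, lookup, threshold=0.5):
    """Return lookup ids whose value shares enough tokens with `incoming`.

    `lookup` is {id: canonical_value}. Jaccard similarity here is a crude
    stand-in for a real SOLR fuzzy/phrase query against an attribute core.
    """
    probe = tokens(incoming)
    hits = []
    for attr_id, value in lookup.items():
        cand = tokens(value)
        score = len(probe & cand) / len(probe | cand)
        if score >= threshold:
            hits.append((score, attr_id))
    # Best match first.
    return [attr_id for _, attr_id in sorted(hits, reverse=True)]


# Phase 2: given candidate document primary keys (found by ordinary SQL
# queries over the matched attribute ids), build SOLR params that search
# the unstructured text but only within those candidates.
def build_phase2_params(free_text, candidate_pks):
    pk_filter = "id:(" + " OR ".join(str(pk) for pk in candidate_pks) + ")"
    return {
        "q": free_text,   # the unstructured incoming text
        "fq": pk_filter,  # restrict results to candidate primary keys
        "rows": 10,
    }


locations = {
    4444: "Central Public Library, Crawford, NE",
    4445: "Crawford County Courthouse, Crawford, NE",
}
ids = match_attribute("Crawford Central Public Library, Nebraska", locations)
print(ids)  # [4444]
print(build_phase2_params("updated annual report", [4444, 5120])["fq"])
```

Note that "NE" vs. "Nebraska" don't match under plain token overlap; in a real setup a SOLR synonym filter on the attribute field would cover that.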