I have a large database table with many document records, and I plan to use
SOLR to improve the searching for the documents.

The twist here is that perhaps 50% of the records will originate from
outside sources, and sometimes those records may be updated versions of
documents we already have.  Currently, a human visually examines the
incoming information, performs a few document searches, and decides whether
a new document must be created or an existing one should be updated.  We
would like to automate that matching to some extent, and it occurs to me
that SOLR might be useful for this as well.

Each document has many attributes that can be used for matching.  The
attributes are all in lookup tables.  For example, there is a "location"
field that might be something like "Central Public Library, Crawford, NE"
for the row with id #4444.  The incoming document might have something like
"Crawford Central Public Library, Nebraska", which ideally would map to
#4444 as well.
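To make the location example concrete, here is a minimal sketch of the kind of normalization Solr's analysis chain (lowercasing, tokenizing, synonym expansion) would do for such fields.  The synonym map and the Jaccard scoring are stand-ins for illustration, not Solr's actual scoring:

```python
# Toy fuzzy matcher: token-set overlap after normalization.
# STATE_SYNONYMS is a hypothetical synonym table; in Solr this role
# would be played by a SynonymFilter in the field's analysis chain.

STATE_SYNONYMS = {"ne": "nebraska"}

def tokens(text):
    """Lowercase, strip commas, expand known state abbreviations."""
    words = text.lower().replace(",", " ").split()
    return {STATE_SYNONYMS.get(w, w) for w in words}

def overlap_score(a, b):
    """Jaccard similarity of the two token sets (1.0 = identical sets)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

stored = "Central Public Library, Crawford, NE"
incoming = "Crawford Central Public Library, Nebraska"
score = overlap_score(stored, incoming)  # word order no longer matters
```

With the abbreviation expanded, the two strings produce identical token sets, so they match perfectly even though the word order differs.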

I'm currently thinking that a two-phase import might work.  First, we use
SOLR to try to get a list of attribute ids for the incoming document.
Those can be used for ordinary database queries to find primary keys of
potential matches.  Then we use SOLR again to search the reduced list for
the unstructured information, essentially by including those primary keys as
part of the search.
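The two phases above could be expressed as two Solr queries, where the second query restricts results to the candidate primary keys via a filter query.  This sketch only builds the query parameters (no live server assumed); the core names and the "value"/"id" field names are assumptions:

```python
# Sketch of the two-phase lookup as Solr query strings.
# Phase 1 runs against a hypothetical "attributes" index;
# phase 2 against the "documents" index, filtered to candidates.

from urllib.parse import urlencode

def phase1_attribute_query(field, text):
    """Phase 1: search the attribute index for candidate attribute ids."""
    return urlencode({"q": f"{field}:({text})", "fl": "id", "wt": "json"})

def phase2_document_query(text, candidate_ids):
    """Phase 2: full-text search restricted to candidate primary keys,
    using an fq (filter query) over the id field."""
    fq = "id:(" + " OR ".join(str(i) for i in candidate_ids) + ")"
    return urlencode({"q": text, "fq": fq, "wt": "json"})
```

Between the two phases, the attribute ids returned by phase 1 would feed an ordinary database query that yields the candidate document primary keys passed to phase 2.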

I was looking at the example for DIH here:
http://wiki.apache.org/solr/DataImportHandler and it is clear, but it is
obviously slanted toward finding the products.  I need to find the
categories so that I can *then* find the products, if that makes sense.

Any suggestions on how to proceed?  My first thought is that I should set up
two SOLR instances, one for indexing only attributes, and one for the
documents themselves.
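One way to realize that split is two cores in a single Solr instance rather than two separate instances.  As a rough sketch of what the schema.xml field definitions might look like (all field and type names here are assumptions, not a working config):

```xml
<!-- "attributes" core: one Solr document per lookup-table row -->
<field name="id"        type="string"       indexed="true" stored="true"/>
<field name="attr_type" type="string"       indexed="true" stored="true"/>
<field name="value"     type="text_general" indexed="true" stored="true"/>

<!-- "documents" core: the documents, carrying resolved attribute ids -->
<field name="id"          type="string"       indexed="true" stored="true"/>
<field name="location_id" type="string"       indexed="true" stored="true"/>
<field name="text"        type="text_general" indexed="true" stored="true"/>
```

Keeping both cores in one instance avoids running two servers while still giving each index its own schema and analysis chain.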

Thanks in advance for any help.

cheers,

Travis
