I am working on a component for indexing documents from a database of medical 
records. The information is spread across several tables, and I am supposed to 
index records for patient sets of varying sizes so that others can run IR 
experiments on them. Each patient record has one or more main documents 
associated with it, and each main document has zero or more addenda. (Main 
documents and addenda are treated alike for the most part, except for a 
parent-record field that is null for main documents and holds the number of the 
main document for addenda. Addenda cannot have addenda.) Each main document 
also has one or more diagnosis records. I am trying to figure out the 
best-performing way to select all of the records for each patient: the main 
documents, the addenda, and the diagnoses.

I first tried indexing sets of these records with DataImportHandler and nested 
Entity blocks, similar to the Full Import example on the 
http://wiki.apache.org/solr/DataImportHandler page: an outer select for all 
patients and main records in a data set, with nested selects that fetch the 
addenda and the diagnoses for each one. It didn't run very fast. A database 
resource person who looked into it with me pointed out that this issues one SQL 
query for addenda and one for diagnoses per main document, roughly a million 
queries of each kind for the million documents in a typical set of 10,000 
patients, which is very inefficient, and suggested I find a different way of 
getting the data.
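
For reference, the configuration was along these lines (table and column names 
here are made up rather than our real schema); the nested entities are what 
make DIH run a separate query for every outer row:

    <dataConfig>
      <dataSource driver="..." url="..." user="..." password="..."/>
      <document>
        <!-- outer entity: one row per main document in the patient set -->
        <entity name="mainDoc"
                query="SELECT doc_id, patient_id, doc_text FROM main_documents
                       WHERE patient_set = '${dataimporter.request.set}'">
          <!-- inner entities: DIH runs each of these once per outer row,
               which is where the millions of small queries come from -->
          <entity name="addendum"
                  query="SELECT addendum_text FROM documents
                         WHERE parent_record = '${mainDoc.doc_id}'"/>
          <entity name="diagnosis"
                  query="SELECT diagnosis_code FROM diagnoses
                         WHERE doc_id = '${mainDoc.doc_id}'"/>
        </entity>
      </document>
    </dataConfig>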

I switched to SolrJ, and I am trying to decide between two ways of indexing 
this data. One would be to use a single large SQL statement to get all of the 
data for a patient set. The results would contain duplication because of the 
way the tables are joined, which I would need to sort out in the Java code, but 
that is doable.
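
For that first option, the Java side would look roughly like this (table, 
column, and field names are made up, the SolrJ class names are from the 4.x 
API, and it assumes the big query is ordered by document id so that all of the 
rows for one main document arrive together):

    import java.sql.*;
    import java.util.*;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class JoinIndexer {
        // Stand-in for the real joined query; ORDER BY doc_id keeps all
        // rows for one main document adjacent in the result set.
        static final String BIG_JOIN_SQL = "SELECT ... ORDER BY doc_id";

        public static void index(Connection conn, String solrUrl) throws Exception {
            SolrServer solr = new HttpSolrServer(solrUrl);
            Statement stmt = conn.createStatement();
            stmt.setFetchSize(500);           // stream instead of buffering the whole join
            ResultSet rs = stmt.executeQuery(BIG_JOIN_SQL);

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            String currentId = null;
            SolrInputDocument doc = null;
            // Sets absorb the duplication the join produces (each addendum
            // row repeats once per diagnosis row and vice versa).
            Set<String> addenda = new LinkedHashSet<String>();
            Set<String> diagnoses = new LinkedHashSet<String>();

            while (rs.next()) {
                String docId = rs.getString("doc_id");
                if (!docId.equals(currentId)) {
                    flush(doc, addenda, diagnoses, batch);
                    doc = new SolrInputDocument();
                    doc.addField("id", docId);
                    doc.addField("patient_id", rs.getString("patient_id"));
                    doc.addField("text", rs.getString("doc_text"));
                    addenda.clear();
                    diagnoses.clear();
                    currentId = docId;
                }
                String addendum = rs.getString("addendum_text");
                if (addendum != null) addenda.add(addendum);
                String diagnosis = rs.getString("diagnosis_code");
                if (diagnosis != null) diagnoses.add(diagnosis);

                if (batch.size() >= 1000) {   // keep the client's memory use bounded
                    solr.add(batch);
                    batch.clear();
                }
            }
            flush(doc, addenda, diagnoses, batch);
            if (!batch.isEmpty()) solr.add(batch);
            solr.commit();
        }

        private static void flush(SolrInputDocument doc, Set<String> addenda,
                                  Set<String> diagnoses, List<SolrInputDocument> batch) {
            if (doc == null) return;
            for (String a : addenda) doc.addField("addendum", a);   // multivalued fields
            for (String d : diagnoses) doc.addField("diagnosis", d);
            batch.add(doc);
        }
    }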

The other way would be to:

1. Get all of the main document data with one SQL query, create index 
documents from it, and store them in the index.

2. Issue a second SQL query that gets all of the addenda for the patients in 
the data set, along with the id of the main document each addendum belongs to; 
retrieve the main documents from the index, add the addenda fields, and put 
the documents back in the index (see the sketch after this list).

3. Do the same with the diagnosis data.
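
One thing I have read about that might avoid the retrieve step in 2 and 3 is 
Solr's atomic updates, where the client sends only the new field values and 
Solr merges them into the stored document. A rough sketch of step 2 done that 
way (this assumes Solr 4.x or later, that all fields are stored, and made-up 
table and field names):

    import java.sql.*;
    import java.util.*;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    static void addAddenda(Connection conn, SolrServer solr) throws Exception {
        Statement stmt = conn.createStatement();
        stmt.setFetchSize(500);
        ResultSet rs = stmt.executeQuery(
            "SELECT parent_record, addendum_text FROM documents "
            + "WHERE parent_record IS NOT NULL");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        while (rs.next()) {
            SolrInputDocument partial = new SolrInputDocument();
            // id of the main document this addendum belongs to
            partial.addField("id", rs.getString("parent_record"));
            Map<String, Object> append = new HashMap<String, Object>();
            append.put("add", rs.getString("addendum_text")); // "add" appends to a multivalued field
            partial.addField("addendum", append);
            batch.add(partial);
            if (batch.size() >= 1000) { solr.add(batch); batch.clear(); }
        }
        if (!batch.isEmpty()) solr.add(batch);
        solr.commit();
    }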
It would be great to keep the main document data retrieved from the database 
in a hash table, update each of those objects with its addenda and diagnoses, 
and write fully filled-out documents to the index once, but I don't have 
enough memory to do that for the patient sets I am working with now, and this 
indexing process is expected to scale up to patient sets ten times as large, 
and eventually much larger than that.

Essentially, for the second approach I am wondering whether a Lucene index can 
serve as a hash table for storing intermediate results, and whether SolrJ has 
an API for retrieving individual index documents so they can be updated. It 
would basically shift from iterating over SQL queries to iterating over Lucene 
index updates. If this is also likely to be slow, or the SolrJ API doesn't 
provide a way to do it, or there are other problems with it, I can go with 
selecting all of the data in one large query and dealing with the duplication.
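
To make that concrete, this is roughly the retrieve-update-re-add cycle I have 
in mind for each addendum (field names are made up; I realize every field 
would have to be stored, and the step 1 documents would have to be committed 
before they are searchable):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    static void appendAddendum(SolrServer solr, String mainDocId,
                               String addendumText) throws Exception {
        // Look up the indexed main document by its uniqueKey.
        QueryResponse resp = solr.query(new SolrQuery("id:" + mainDocId));
        if (resp.getResults().getNumFound() == 0) return;
        SolrDocument stored = resp.getResults().get(0);

        // Copy the stored fields into a new input document, append the new
        // addendum value, and re-add it; adding a document with the same id
        // replaces the old one in the index.
        SolrInputDocument updated = new SolrInputDocument();
        for (String field : stored.getFieldNames()) {
            if ("_version_".equals(field)) continue; // skip Solr's internal field if present
            updated.addField(field, stored.getFieldValue(field));
        }
        updated.addField("addendum", addendumText);
        solr.add(updated);
    }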
Thanks,
Mike
