Denormalization IS the best practice, generally.

You still haven't demonstrated an "exponential" increase. If there is any "exponential" increase, that is your own doing and you should simply NOT do that!

The total number of documents would be the number of rows in your second table. Denormalization simply means that each of those documents would also hold some patient data - denormalized.

If you want, you can store some separate, sparse documents with patient data only, and then store only the patient ID in the data documents. Or, you can have a second Solr collection that parallels your SQL tables. It is a simple efficiency question of what operations you perform most commonly.

Do patients really change their names with any great frequency?

The bottom line is that each Solr document needs to have the fields that users will be searching on. There is some limited JOIN capability, but it is limited and may be more trouble than it is worth.
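For reference, the limited join capability mentioned here is Solr's join query parser. A minimal sketch of what such a query might look like, assuming a "patients" core with IDPatient/Name fields and image documents carrying Image_IDPatient (the core and field names here are assumptions, not from the original mail):

```python
from urllib.parse import urlencode

# Sketch of Solr's {!join} query parser syntax. This matches image
# documents whose Image_IDPatient equals the IDPatient of patient
# documents whose Name matches, joining across an assumed "patients" core.
def join_query_params(patient_name):
    q = ("{!join fromIndex=patients from=IDPatient to=Image_IDPatient}"
         + "Name:" + patient_name)
    return {"q": q, "wt": "json"}

params = join_query_params("Peter")
query_string = urlencode(params)  # ready to append to /select?
```

Joins like this are evaluated per query and can get expensive on large indexes, which is part of why they may be more trouble than they are worth.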

You might be better off doing a two-stage lookup: query one Solr collection by patient name to get the patient ID, and then query the second collection using that ID and the other fields.
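The two-stage lookup could be sketched as follows. The collection and field names ("patients", "images", Name, IDPatient, Image_IDPatient, Image_Name) are assumptions for illustration:

```python
from urllib.parse import urlencode

# Stage 1: find matching patients; only their IDs are needed.
def stage_one_params(patient_name):
    return {"q": "Name:" + patient_name, "fl": "IDPatient", "wt": "json"}

# Stage 2: filter image documents by the patient IDs found in stage one.
def stage_two_params(patient_ids, image_name="*"):
    ids = " OR ".join(str(i) for i in patient_ids)
    return {"q": "Image_Name:" + image_name,
            "fq": "Image_IDPatient:(" + ids + ")",
            "wt": "json"}

# Example: suppose stage one returned patient IDs 1, 2 and 3 for "Peter".
params = stage_two_params([1, 2, 3], "Dog")
query_string = urlencode(params)
```

Putting the ID list in an fq keeps the second query cacheable by filter cache for repeated patient lookups.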

Maybe you simply need to give us some example queries - then we can suggest structures to make those queries efficient.

NRT doesn't mean that occasional bulk updates won't fall outside near-real-time behavior; it simply means that common queries will usually see changes in near real time.

Maybe you need an app layer above Solr that manages patient names and does the two-part query so that data documents can be queried with patient ID.

-- Jack Krupansky

-----Original Message----- From: zbindigonzales
Sent: Monday, November 26, 2012 6:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Near Realtime with denormalized Data

Hello Erick

Thanks for your response.

The main problem we have is that the data is denormalized, and this
increases the number of documents to index exponentially.

Let's say we have the following two tables.

--------------------        --------------------
Table: Patient              Table: Image
--------------------        --------------------
Field: IDPatient            Field: IDImage
Field: Name                 Field: IDPatient
                            Field: Name


Then we create a view to denormalize the data.

create or replace view solrview as select * from patient p left join image i
on p.idpatient = i.idpatient

If we select from the view, we get the following rows:

Patient_IDPatient | Patient_Name | Image_IDImage | Image_IDPatient | Image_Name
        1         |    Peter     |       1       |        1        |    Dog
        2         |    Peter     |       2       |        1        |    Cat
        3         |    Peter     |       3       |        1        |    Ant
        4         |    Josh      |       4       |        1        |    Deer
        5         |    Kent      |       5       |        2        |    Dog
        6         |    Martin    |       6       |        2        |    Dog
        7         |    David     |               |                 |
        8         |    Eric      |               |                 |


This is the configuration we use in the data-config.xml

<entity name="solrEntity" pk="solrview_PK"
        query="SELECT * FROM solrview"
        deltaQuery="SELECT PATIENT_IDPATIENT, IMAGE_IDIMAGE FROM solrview
                    WHERE PATIENT_IAMODIFICATIONDATE > '${dataimporter.last_index_time}'
                    UNION
                    SELECT PATIENT_IDPATIENT, IMAGE_IDIMAGE FROM solrview
                    WHERE IMAGE_IAMODIFICATIONDATE > '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM solrview
                          WHERE PATIENT_IDPATIENT = '${dataimporter.delta.PATIENT_IDPATIENT}'
                          AND IMAGE_IDIMAGE = '${dataimporter.delta.IMAGE_IDIMAGE}'"
        transformer="ClobTransformer"/>


If a user now updates the Patient_Name "Peter" to "Frank", three rows are
already affected.

So even if we manage to index 6k documents/second, there is still the problem
that 80k rows can be affected by a change as small as updating the
patient_name field.


What I am now trying to do is reduce the fields that get updated.
For example, if there are only changes on the patient table, I only need
to reindex the fields from the patient table.
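Updating only the changed fields is essentially what Solr 4's atomic updates do: instead of reindexing whole documents, you send just the changed field with a "set" modifier. A minimal sketch of the update payload (the unique-key field name "id" and the field name Patient_Name are assumptions; atomic updates also require all fields to be stored and an updateLog in solrconfig.xml):

```python
import json

# Build a Solr atomic-update payload that renames a patient in every
# affected document, touching only the Patient_Name field.
def rename_patient_payload(doc_ids, new_name):
    return json.dumps(
        [{"id": str(doc_id), "Patient_Name": {"set": new_name}}
         for doc_id in doc_ids]
    )

# Renaming "Peter" to "Frank" touches the three affected documents:
payload = rename_patient_payload([1, 2, 3], "Frank")
# POST this to /update?commit=true with Content-Type: application/json
```

Internally Solr still rewrites each whole document, but the caller no longer has to fetch and resend the unchanged fields.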

Writing our own SolrJ program isn't an option, because changes to the
database can come from different services. It would be too error-prone to
integrate such a SolrJ program everywhere data gets written.

The reason for posting this issue is to see if somebody has already run into
this situation and has a better "best practice" :-)


Best Regards Sandro

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Near-Realtime-with-denormalized-Data-tp4022072p4022323.html Sent from the Solr - User mailing list archive at Nabble.com.