First, I would invest the most effort in developing good test cases and a 
good test harness for the ETL software itself. If validation in production 
does encounter errors, treat each one as a bug in your code, and always add 
the failing case to your test harness.

Also, the row-level validation can and should be driven by metadata. I'm 
assuming you have a mapping between RDBMS table names and Solr entity types? 
And, for any given entity type, a table that maps Solr field names and 
datatypes to their RDBMS equivalents? My assumption would be that the ETL 
process itself uses such metadata. The same metadata could be used for 
production data validation. My inclination would be to integrate granular / 
row-level validation into the ETL job itself.
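
A rough sketch of what I mean by metadata-driven row-level validation, in 
Python. Everything named here is an assumption for illustration: the 
ENTITY_METADATA layout, the table, column, and Solr field names, and the 
converters. Your own ETL metadata will look different, but the idea is that 
one mapping drives both the load and the check.

```python
from datetime import date

# Hypothetical metadata: for each Solr entity type, the source table and a
# list of (rdbms_column, solr_field, converter) triples. All names invented.
ENTITY_METADATA = {
    "person": {
        "table": "stg_person",
        "fields": [
            ("person_id", "id", str),
            ("full_name", "name_s", str),
            ("birth_date", "birth_dt", lambda d: d.isoformat() + "T00:00:00Z"),
        ],
    },
}


def validate_row(entity_type, db_row, solr_doc):
    """Return a list of mismatch descriptions; an empty list means a match."""
    errors = []
    for column, field, convert in ENTITY_METADATA[entity_type]["fields"]:
        raw = db_row.get(column)
        # Apply the same conversion the ETL job would apply before indexing.
        expected = None if raw is None else convert(raw)
        actual = solr_doc.get(field)
        if expected != actual:
            errors.append(
                f"{entity_type}.{field}: expected {expected!r}, got {actual!r}"
            )
    return errors


# Example: one staging row and the document that was indexed for it.
row = {"person_id": 42, "full_name": "Ada Lovelace",
       "birth_date": date(1815, 12, 10)}
doc = {"id": "42", "name_s": "Ada Lovelace",
       "birth_dt": "1815-12-10T00:00:00Z"}
print(validate_row("person", row, doc))  # prints []
```

Calling something like validate_row from inside the ETL job, right after a 
document is built or fetched back, keeps the check next to the code that can 
break it.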

For summary validation, if you re-index from scratch every time, just run 
some facet queries and compare the counts to the equivalent summaries of the 
SQL input data (assuming you are familiar with SQL "group by" and "having" 
clauses). If you use incremental loads, make sure you can associate the 
loaded data with the ETL job that loaded it (timestamp, batch ID, etc.). Then 
simply scope the facet queries to the batch in question and compare to the 
SQL summary.
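
A minimal sketch of the summary comparison, again in Python. The staging 
table stg_doc, its columns, the Solr field doc_type_s, and the batch ID "B1" 
are all made up for the example, and the Solr side is a canned response 
rather than a live query. It assumes the default JSON response format, where 
each entry under facet_fields is a flat list alternating value and count. A 
request along the lines of 
q=*:*&fq=batch_id_s:B1&rows=0&facet=true&facet.field=doc_type_s would be one 
way to produce such a response.

```python
import sqlite3


def sql_counts(conn, table, column, batch_id):
    """GROUP BY summary of the SQL input for one batch."""
    # table and column are assumed to come from trusted ETL metadata.
    sql = (f"SELECT {column}, COUNT(*) FROM {table} "
           f"WHERE batch_id = ? GROUP BY {column}")
    return {str(value): count for value, count in conn.execute(sql, (batch_id,))}


def facet_counts(solr_response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = solr_response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def diff_counts(expected, actual):
    """Return {value: (sql_count, solr_count)} for every value that differs."""
    keys = sorted(set(expected) | set(actual))
    return {k: (expected.get(k, 0), actual.get(k, 0))
            for k in keys if expected.get(k, 0) != actual.get(k, 0)}


# In-memory stand-in for the staging table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_doc (doc_type TEXT, batch_id TEXT)")
conn.executemany(
    "INSERT INTO stg_doc VALUES (?, ?)",
    [("A", "B1"), ("A", "B1"), ("A", "B1"),
     ("B", "B1"), ("B", "B1"), ("A", "B0")],
)

# Canned Solr response for the same batch; one type-B document is missing.
solr_response = {"facet_counts": {"facet_fields": {"doc_type_s": ["A", 3, "B", 1]}}}

mismatches = diff_counts(sql_counts(conn, "stg_doc", "doc_type", "B1"),
                         facet_counts(solr_response, "doc_type_s"))
print(mismatches)  # prints {'B': (2, 1)}
```

An empty result means the batch summaries agree; anything else tells you 
which values to drill into with the row-level check.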


-----Original Message-----
From: marotosg [mailto:marot...@gmail.com] 
Sent: Monday, March 02, 2015 6:32 AM
To: solr-user@lucene.apache.org
Subject: Validate data Indexed and versioning

Hi,

I am trying to define a way of validating that my index has the same content 
as my database.
I am indexing a very complex denormalized version of the database with many 
items and nested documents. I have an indexing service which pulls records 
from a staging table (created by an ETL process) and transforms them into 
XML, which is then posted to Solr.

Is there any general approach to check whether an indexed document matches 
its database row?

One option I see is to create an additional service that runs against both 
Solr and the database and validates that they hold the same data, but this is 
going to be very intensive.
I was leaning more toward having Solr report which record was indexed and 
what it contains, such as the number of nested docs of type A, B, etc.

Any suggestions would help.

Thanks

Regards



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Validate-data-Indexed-and-versioning-tp4190304.html
Sent from the Solr - User mailing list archive at Nabble.com.

