: Does anyone have more experience doing this kind of stuff and wants to share?

My advice: don't.

I work with (or work with people who work with) about two dozen Solr 
indexes -- we don't attempt to update a single one of them in any sort of 
transactional way.  Some of them are updated in "real time" (ie: as soon as 
the authoritative DB is updated by some code, the same code updates the 
Solr index); some of them are updated in batch (ie: once every N minutes 
code checks a log of all logical objects modified/deleted from the DB and 
sends the adds/deletes to Solr); and some are only ever rebuilt from scratch 
every N hours (because the data in them isn't very time sensitive and 
rebuilding from scratch is easier than dealing with incremental or batch 
updates).
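
To make the batch flavor a bit more concrete, here is a minimal sketch of 
the "check the log, send the adds/deletes" step.  Everything named here 
(Change, collapse, to_update_messages, the shape of the change log) is made 
up for illustration, not something from our real code, and actually POSTing 
the messages to Solr is left out:

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class Change:
    """One row of a hypothetical change log kept next to the authoritative DB."""
    object_id: str
    fields: Optional[dict]  # None means the object was deleted


def collapse(changes):
    """Keep only the last change per object; later log entries win."""
    latest = {}
    for change in changes:
        latest[change.object_id] = change
    return list(latest.values())


def to_update_messages(changes):
    """Turn the collapsed changes into separate <add> and <delete> messages."""
    messages = []
    for change in collapse(changes):
        if change.fields is None:
            root = ET.Element("delete")
            ET.SubElement(root, "id").text = change.object_id
        else:
            root = ET.Element("add")
            doc = ET.SubElement(root, "doc")
            ET.SubElement(doc, "field", name="id").text = change.object_id
            for name, value in change.fields.items():
                ET.SubElement(doc, "field", name=name).text = str(value)
        messages.append(ET.tostring(root, encoding="unicode"))
    return messages
```

Each message goes to Solr on its own, and if one fails the batch job can 
just pick the same object up again on its next run.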

But as I said: we never attempt to be transactional about it, for a few 
reasons:
  1) why should it be part of the transaction?  a Solr index is a 
denormalized/inverted index of data ... why should a tool (or any other 
process) be prevented from writing to an authoritative data store just 
because a non-authoritative copy of that data can't be updated?  ... if 
you used MySQL with replication, would you really want to block all writes 
to the master just because there's a glitch in replicating to a slave?
  2) why worry about it?  It's really a non-issue.  If an add or 
delete fails it's usually either developer error (ie: the code 
generating your add statements thinks there's a field that doesn't 
exist), a transient timeout (maybe because of a commit in progress) or 
network glitch (have the client retry once or twice), or in very rare 
instances the whole Solr index is completely corrupted (either from disk 
failure, or OOM due to a huge spike in load) and we want to revert 
to a backup of the index in the short term and rebuild the index from 
scratch to play it safe.
  3) why limit yourself?  you're going to want the ability to trigger 
arbitrary indexing of your data objects at any time -- if for no other 
reason than so that when you decide to add a field to your index you can 
reindex them all -- so why make your index updating code inherently tied 
to your DB updating code?
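
The "have the client retry once or twice" bit from point 2 can be as 
simple as the sketch below.  The names (TransientUpdateError, 
send_with_retry, the send callable) are hypothetical; how you detect a 
timeout or network glitch depends entirely on whatever HTTP client you 
use:

```python
import time


class TransientUpdateError(Exception):
    """Hypothetical error type for timeouts and network glitches."""


def send_with_retry(send, message, retries=2, delay_seconds=0.0):
    """Call send(message); on a transient failure retry up to `retries` times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return send(message)
        except TransientUpdateError:
            if attempts > retries:
                raise  # give up; the DB write is unaffected either way
            time.sleep(delay_seconds)
```

Anything that still fails after the retries is most likely one of the 
other two cases (developer error or a broken index), and retrying 
won't help with those.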


As for your specific question along the lines of "why can't we do a 
mix of <add>s and <delete>s all as part of one update message?" the answer 
is "because no one ever wrote any code to parse messages like that."  BUT! 
... that's not the question you really want to ask.  the question you 
really want to ask is: "*IF* someone wrote code to allow a mix of <add>s 
and <delete>s all as part of one update message, would it solve my problem 
of wanting to be able to modify my solr index transactionally?" and the 
answer is "No."  Even if Solr accepted update messages that looked 
like this...

    <update>
       <delete><id>42</id></delete>
       <add><field name="id">7</field><field name="a">bb</field></add>
       <add><field name="id">666</field><field name="a">cccc</field></add>
    </update>

...the low level Lucene calls that it would be making internally still 
aren't transactional.  The "delete" and the first "add" might succeed, but 
if there were then some kind of internal error, or a timeout because the 
first add took a while (maybe it triggered a segment merge), and the second 
add didn't happen, the first two commands would still have been 
executed, and there would be no way to "rollback".
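
As a toy illustration of that failure mode (this is NOT Solr or Lucene 
code, just an in-memory stand-in with made-up names ToyIndex and 
apply_batch), here is what "apply the commands in order with no rollback" 
looks like when the last one blows up:

```python
class ToyIndex:
    """In-memory stand-in for an index that has no rollback."""

    def __init__(self, docs):
        self.docs = dict(docs)

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def add(self, doc):
        if "id" not in doc:
            raise ValueError("document has no id")  # simulated internal error
        self.docs[doc["id"]] = doc


def apply_batch(index, commands):
    """Apply commands in order; return how many completed before a failure."""
    done = 0
    try:
        for op, arg in commands:
            if op == "delete":
                index.delete(arg)
            else:
                index.add(arg)
            done += 1
    except ValueError:
        pass  # nothing undoes the commands that already ran
    return done
```

Feed it a delete, a good add, and a bad add, and the index ends up with 
the delete and the first add applied even though the "message" as a 
whole failed.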

In a nutshell: you would be no better off than if your client code had 
sent all three as separate update messages.


-Hoss
