Hello,

I'm using solr 4.4. I have a solr core with a schema defining a bunch of 
different fields, and among them, a date field:
- date: indexed and stored       // the date used at search time
In practice it's a TrieDateField, but I don't think that's relevant to the 
problem.

It also has a multivalued, non-required "string" field named "tags" which 
contains, well, a list of tags for some of the documents.

So far, so good: everything works as expected and I'm glad.
I'm able to perform partial (or atomic) updates on the tags field whenever it 
gets modified, and I love it.
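
For illustration, such an atomic update can be sent in Solr's XML update 
format roughly as follows. The uniqueKey field name (id) and the values are 
made up for the example; update="add" appends a value to the multivalued 
field, whereas update="set" would replace the whole list.

```xml
<add>
  <doc>
    <!-- "id" as the uniqueKey field and "doc-1" as its value are assumptions -->
    <field name="id">doc-1</field>
    <!-- append one tag; all other fields are left out of the request -->
    <field name="tags" update="add">some-tag</field>
  </doc>
</add>
```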

Now I have a new source that also pushes updates to the same Solr core. 
Unfortunately, that source's incoming documents have their date in another 
field, of the same type, named created_time instead of date.
- created_time: stored only      // some documents come in with this field set
To be able to sort any document by time, I decided to ask solr to copy the 
contents of the field created_time to the field named date:
 <copyField source="created_time" dest="date" />
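
Put together, the relevant part of the schema looks roughly like the sketch 
below. The type name tdate (the TrieDateField type from the example schema) 
and the indexed flag on tags are assumptions; the other attributes follow the 
description above.

```xml
<!-- date used at search time: indexed and stored -->
<field name="date" type="tdate" indexed="true" stored="true" />
<!-- date field of the new source: stored only -->
<field name="created_time" type="tdate" indexed="false" stored="true" />
<!-- optional list of tags; indexed="true" is an assumption -->
<field name="tags" type="string" indexed="true" stored="true" multiValued="true" />

<!-- note: date is both stored and a copyField destination -->
<copyField source="created_time" dest="date" />
```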

I updated my schema and reloaded my core and everything seemed fine. In fact, I 
did break something 8-)
But I figured it out later…
Quoting http://wiki.apache.org/solr/Atomic_Updates#Caveats_and_Limitations :
> all fields in your SchemaXml must be configured as stored="true" except for 
> fields which are <copyField/> destinations -- which must be configured as 
> stored="false"


However at that time, I was not aware of the limitation and I was able to sort 
by time across all the documents in my solr core.
I then decided to make sure that partial (or atomic) updates could still be 
performed, and then I was surprised:
* documents from the more recent source (having both a date and a created_time 
field) are updated fine, the date field is kept (the copyField directive is 
replayed, I guess)
* documents from the first source (having only the date field set) are however 
a little bit less lucky: the date gets lost in the process (it looks like the 
date field was overwritten when the copyField directive ran with nothing in 
its source field)

I then became aware of the caveats and limitations of atomic updates, but now I 
want to understand why ;-)

So my question is: how does copyField behave differently between a normal 
(classic) update and a partial (atomic) update?
In practice, I don't understand why the targets of every copyField directive 
are *always* cleared during partial updates.
Could the destination field be cleared only when one of the copyField's source 
fields is present in the atomic update? Maybe this was avoided because it 
would have added complexity where it does not belong (updates must be fast), 
but that's just an idea.

I have two ways to handle my problem:
1/ Create a stored="false" search_date field and have two copyField 
directives, one for the original "date" field and another one for the newer 
"created_time" field, and make the search application rely on the search_date 
field
2/ Since I have some control over the second source pushing documents, I can 
make sure that documents are pushed with the same date field, and work around 
the limitation by removing the copyField directive entirely.
Since it simplifies my Solr schema, I chose option #2.
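
For the record, option #1 could look roughly like the sketch below. The type 
name tdate is an assumption, and search_date is kept single-valued, which 
assumes that each document carries only one of date or created_time; a 
document with both would put two values into search_date.

```xml
<!-- both source fields stay stored, as atomic updates require -->
<field name="date" type="tdate" indexed="true" stored="true" />
<field name="created_time" type="tdate" indexed="false" stored="true" />

<!-- copyField destination: indexed for search and sorting, but not stored -->
<field name="search_date" type="tdate" indexed="true" stored="false" />

<copyField source="date" dest="search_date" />
<copyField source="created_time" dest="search_date" />
```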

Thank you very much for your attention

Tanguy
