On 8/8/2013 10:47 AM, Rasmussen, Chris wrote:
I'm running a 4.2 SolrCloud instance with multiple servers/shards.  As I'm indexing data, 
I review the results of the STATUS commands and note an extremely high number of 
"deletedDocs".  I've combed through the source data to verify whether I'm 
sending duplicate document ids, but haven't been able to find any.  I'm starting to 
wonder whether the field is a red herring?

Is the deleted document counter an accurate reflection of documents marked 
deleted in the collection?  My assumption is that if I send a document with the 
same document id, Solr will mark the existing document as deleted and then insert the 
new one.  Then at merge time the deleted documents are purged from the index.  
I've noted that the total deleted document count goes down when the indexes 
are merged.

The deletedDocs number is the accurate count of documents deleted in that specific SolrCore -- that particular replica of the shard.
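
As a rough illustration of the per-core nature of that number, here is a minimal Python sketch that derives a deleted count for each core from an already-parsed CoreAdmin STATUS response. It assumes the response has the shape "status" -> core name -> "index" -> "numDocs"/"maxDoc" and uses the fact that maxDoc minus numDocs is the number of documents marked deleted. The function name, core names, and sample numbers are hypothetical and only serve the example.

```python
def deleted_docs_per_core(status_response):
    """Return {core_name: deleted_count} from a parsed STATUS response.

    Assumes the layout status -> <core> -> index -> numDocs / maxDoc.
    """
    result = {}
    for core_name, core in status_response.get("status", {}).items():
        index = core.get("index", {})
        # maxDoc counts every document slot, including ones marked deleted;
        # numDocs counts only live documents. The difference is the deleted count.
        result[core_name] = index.get("maxDoc", 0) - index.get("numDocs", 0)
    return result


# Hypothetical sample: two replicas of the same shard. Both hold the same
# live documents, but each core merges on its own schedule, so the deleted
# counts can differ.
sample = {
    "status": {
        "collection1_shard1_replica1": {"index": {"numDocs": 900, "maxDoc": 1000}},
        "collection1_shard1_replica2": {"index": {"numDocs": 900, "maxDoc": 950}},
    }
}

counts = deleted_docs_per_core(sample)
for name in sorted(counts):
    print(name, counts[name])
```

Comparing the counts core by core, rather than summing them for the whole collection, makes it easier to see whether one replica is simply behind on merging.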

You are correct about what happens if you index a document that already exists with the same value in the uniqueKey field. The old one is deleted (which just marks it as deleted) and the new one is inserted. You are also correct about deleted documents not actually disappearing from an index segment until that segment is merged, or you do an optimize.
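
The mark-then-purge behavior can be sketched with a toy model. This is not Lucene's or Solr's actual code; the class ToyIndex and its methods are invented for illustration, and the model ignores everything except the uniqueKey and a live/deleted flag.

```python
class ToyIndex:
    """Toy model of an index made of segments (illustrative only)."""

    def __init__(self):
        # Each segment is a list of [key, is_live] entries.
        self.segments = [[]]

    def add(self, key):
        # Overwrite by uniqueKey: mark any live copy as deleted, but leave
        # it in its segment, then append the new copy.
        for segment in self.segments:
            for entry in segment:
                if entry[0] == key and entry[1]:
                    entry[1] = False
        self.segments[-1].append([key, True])

    def commit(self):
        # Start a new segment for subsequent adds.
        if self.segments[-1]:
            self.segments.append([])

    def num_docs(self):
        return sum(1 for seg in self.segments for e in seg if e[1])

    def deleted_docs(self):
        return sum(1 for seg in self.segments for e in seg if not e[1])

    def merge(self):
        # Rewrite all segments into one, dropping entries marked deleted.
        live = [e for seg in self.segments for e in seg if e[1]]
        self.segments = [live]


index = ToyIndex()
index.add("doc1")
index.add("doc2")
index.commit()
index.add("doc1")  # same uniqueKey: the old copy is only marked deleted
before_merge = (index.num_docs(), index.deleted_docs())
index.merge()      # the marked copy is purged
after_merge = (index.num_docs(), index.deleted_docs())
print(before_merge, after_merge)
```

In this model the deleted count rises with every overwrite and falls only when a merge rewrites the affected segment, which matches the pattern described in the question.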

If you restart any of your Solr servers, or a problem happens such that the cluster timeout is exceeded and everything gets out of sync, then you might end up in a situation where the transaction log is replayed, which reindexes old documents and might increase the deleted count.

Thanks,
Shawn
