On 8/8/2013 10:47 AM, Rasmussen, Chris wrote:
I'm running a 4.2 SOLRCloud instance with multiple servers/shards. As I'm indexing data,
I review the results of the STATUS commands and note an extremely high number of
"deletedDocs". I've combed through the source data to verify whether I'm
sending duplicate document ids, but haven't been able to find any. I'm starting to
wonder whether the field is a red herring?
Is the deleted document counter an accurate reflection of documents marked
deleted in the collection? My assumption is that if I send a document with the
same document id, Solr will mark the document as deleted and then insert the
new one. Then at merge time the deleted documents are purged from the index.
I've noted that the total deleted document count goes down when the indexes
are merged.
The deletedDocs number is the accurate count of documents deleted in
that specific SolrCore -- that particular replica of the shard.
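
To compare the count across replicas, the per-core numbers can be pulled
out of a CoreAdmin STATUS response. The following is only a sketch: it
assumes the response was requested with wt=json and already parsed into
a dict, and that each core's "index" section carries "numDocs", "maxDoc"
and possibly "deletedDocs". The function name and the core names in the
sample are made up for illustration. When "deletedDocs" is absent, it
falls back to maxDoc - numDocs, which is how Lucene defines the deleted
count.

```python
def deleted_docs_per_core(status_response):
    """Map each core name to its deleted-document count.

    status_response is assumed to be the parsed JSON of a CoreAdmin
    STATUS call; the layout used here is an assumption, not a guarantee.
    """
    result = {}
    for core_name, core_info in status_response.get("status", {}).items():
        index = core_info.get("index", {})
        if "deletedDocs" in index:
            # Use the reported value when the server provides it.
            result[core_name] = index["deletedDocs"]
        elif "maxDoc" in index and "numDocs" in index:
            # Otherwise derive it: deleted = maxDoc - numDocs.
            result[core_name] = index["maxDoc"] - index["numDocs"]
    return result


# Hypothetical sample data: two replicas of the same shard.
sample = {
    "status": {
        "collection1_shard1_replica1": {
            "index": {"numDocs": 900, "maxDoc": 1000, "deletedDocs": 100}
        },
        "collection1_shard1_replica2": {
            "index": {"numDocs": 900, "maxDoc": 950}
        },
    }
}

for name, count in sorted(deleted_docs_per_core(sample).items()):
    print(name, count)
```

Two replicas of one shard can legitimately report different deleted
counts, since each core merges its own segments on its own schedule.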
You are correct about what happens if you index a document that already
exists with the same value in the uniqueKey field. The old one is
deleted (which just marks it as deleted) and the new one is inserted.
You are also correct about deleted documents not actually disappearing
from an index segment until that segment is merged, or you do an optimize.
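
The overwrite-then-merge behavior described above can be illustrated
with a toy model. This is not how Lucene is implemented; ToyIndex and
its methods are hypothetical names, and merge() here rewrites every
segment at once (closer to an optimize than to a normal merge, which
picks only some segments). It is meant only to show why the deleted
count climbs during indexing and falls after a merge.

```python
class ToyIndex:
    """Toy model of uniqueKey overwrites; not Lucene's implementation."""

    def __init__(self):
        # Each segment is a list of [doc_id, is_live] entries.
        self.segments = [[]]

    def add(self, doc_id):
        # An existing live copy with the same id is only marked deleted.
        for segment in self.segments:
            for entry in segment:
                if entry[0] == doc_id and entry[1]:
                    entry[1] = False
        # The new version goes into the newest segment.
        self.segments[-1].append([doc_id, True])

    def commit(self):
        # Start a new segment; older segments are left untouched.
        self.segments.append([])

    def merge(self):
        # Rewrite all segments into one, keeping live documents only.
        live = [e for seg in self.segments for e in seg if e[1]]
        self.segments = [live]

    def max_doc(self):
        return sum(len(seg) for seg in self.segments)

    def num_docs(self):
        return sum(1 for seg in self.segments for e in seg if e[1])

    def deleted_docs(self):
        return self.max_doc() - self.num_docs()


index = ToyIndex()
index.add("doc-1")
index.add("doc-2")
index.commit()
index.add("doc-1")  # same id again: old copy is marked deleted
print(index.num_docs(), index.max_doc(), index.deleted_docs())
index.merge()       # deleted entries are dropped during the rewrite
print(index.num_docs(), index.max_doc(), index.deleted_docs())
```

In this model the number of live documents never changes when an id is
re-sent; only maxDoc and the deleted count move.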
If you restart any of your Solr servers, or a problem happens such that
the cluster timeout is exceeded and everything gets out of sync, then
you might end up in a situation where the transaction log is replayed,
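which reindexes old documents and might increase the deleted count,
because each replayed document that already exists in the index marks
the existing copy as deleted.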
Thanks,
Shawn