Dwane:

DBQ for very large deletes is “iffy”. The problem is this: Solr must lock out _all_ indexing for _all_ replicas while the DBQ runs, and that can take a long time. This is a consequence of distributed computing. Imagine a scenario where one of the documents affected by the DBQ is added by some other process while the DBQ is in flight. That update has to be processed in order relative to the DBQ, but the DBQ can take a long time to find and delete its docs. This has other implications as well: if updates don’t complete in a timely manner, the leader can throw the replicas into recovery...
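Just so we’re talking about the same thing, here’s roughly how the two styles look from SolrJ. This is only an illustration; the ZooKeeper ensemble, collection name and DOC_ID field are made-up placeholders:

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;

    public class DeleteStyles {
      public static void main(String[] args) throws Exception {
        // Placeholders: substitute your own ZooKeeper ensemble and collection.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zk1:2181,zk2:2181,zk3:2181"),
            Optional.empty()).build()) {

          // Delete-by-query: a single request, but every replica must finish it
          // before updates queued behind it can be applied.
          client.deleteByQuery("myCollection", "DOC_ID:(id1 OR id2 OR id3)");

          // Delete-by-id: routed per document, no collection-wide hold-up.
          client.deleteById("myCollection", Arrays.asList("id1", "id2", "id3"));
          client.commit("myCollection");
        }
      }
    }

The first form is the one that makes every replica hold up other updates while it runs; the second is the cheap one.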
So best practice is to go ahead and use delete-by-id. Do note that this means you’re responsible for resolving the ordering issue above, but in your case it sounds like you’re guaranteed that none of the docs being deleted will be modified during the operation, so you can ignore it.

What I’d do is use streaming to get my IDs (this is not using the link you provided, it’s essentially doing that patch yourself, but on the client) and use those to generate delete-by-id requests. It’s just something like:

    create search stream source
    while (more tuples) {
        assemble delete-by-id request (perhaps one with multiple IDs)
        send to Solr
    }

Don’t forget to send the last batch of deletes if you’re sending batches... I have ;)

Joel Bernstein’s blog is the most authoritative source, see: https://joelsolr.blogspot.com/2015/04/the-streaming-api-solrjio-basics.html. I don’t know whether that example is up to date, but it’ll give you an idea of where to start. And Joel is pretty responsive about questions...

I’d package up maybe 1,000 IDs per request. I regularly package up that many updates, and deletes are relatively cheap, so you’ll avoid the overhead of establishing a request for every ID. This may seem contrary to the points above about DBQ taking a long time, but we’re talking orders-of-magnitude differences between deleting 1,000 docs by ID and querying/deleting vastly larger numbers, plus this does not require all the indexes to be locked. Your users likely won’t notice this running, so while it’s usually good practice to do maintenance during off hours, I wouldn’t stress about it.

And a question you didn’t ask, for extra credit ;). The streaming expression will _not_ reflect any changes to the collection while it runs. The underlying index searcher is kept open for the duration, and it only knows about segments that were closed when it started. But let’s assume your autocommit interval expires while this process is running and opens a new searcher: _other_ requests from other clients _will_ see the changes. Again, I doubt you care, since I’m assuming your orphan records are never seen by other clients anyway.
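To make that loop concrete, here’s a rough, untested SolrJ sketch of the whole thing (essentially the SOLR-14241 idea done on the client). The ZooKeeper ensemble, collection name, the query that finds your orphans, and the DOC_ID field are all placeholders for whatever matches your setup, and DOC_ID is assumed to have docValues so the /export handler can stream it:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.io.SolrClientCache;
    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
    import org.apache.solr.client.solrj.io.stream.StreamContext;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class OrphanDeleter {
      public static void main(String[] args) throws Exception {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181";   // placeholder
        String collection = "myCollection";             // placeholder
        int batchSize = 1000;

        // /export streams every match; it needs fl and sort on docValues fields.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("q", "orphan_flag:true");            // placeholder query that finds the orphans
        params.set("fl", "DOC_ID");
        params.set("sort", "DOC_ID asc");
        params.set("qt", "/export");

        CloudSolrStream stream = new CloudSolrStream(zkHost, collection, params);
        StreamContext context = new StreamContext();
        SolrClientCache cache = new SolrClientCache();
        context.setSolrClientCache(cache);
        stream.setStreamContext(context);

        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList(zkHost), Optional.empty()).build()) {
          List<String> batch = new ArrayList<>(batchSize);
          try {
            stream.open();
            while (true) {
              Tuple tuple = stream.read();
              if (tuple.EOF) {
                break;
              }
              batch.add(tuple.getString("DOC_ID"));
              if (batch.size() >= batchSize) {
                client.deleteById(collection, batch);   // one request per 1,000 IDs
                batch.clear();
              }
            }
            if (!batch.isEmpty()) {                     // don't forget the last batch!
              client.deleteById(collection, batch);
            }
            client.commit(collection);
          } finally {
            stream.close();
            cache.close();
          }
        }
      }
    }

The parts that matter are the batching and the final flush of the leftover IDs; everything else is plumbing. A single commit at the end (or just letting your autoCommit settings handle it) is plenty.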
Best,
Erick

> On May 25, 2020, at 7:48 PM, Dwane Hall <dwaneh...@hotmail.com> wrote:
>
> Hey Solr users,
>
> I'd really appreciate some community advice if somebody can spare some time to assist me. My question relates to initially deleting a large amount of unwanted data from a Solr Cloud collection, and then advice on the best patterns for managing delete operations on a regular basis. We have a situation where data in our index can be 're-mastered' and as a result orphan records are left dormant and unneeded in the index (think of a scenario similar to client resolution where an entity can switch between golden records depending on the information available at the time). I'm considering removing these dormant records with a large initial bulk delete, and then running a delete process on a regular maintenance basis. The initial record backlog is ~50 million records in a ~1.2 billion document index (~4%), and the maintenance deletes are small in comparison, ~20,000/week.
>
> So with this scenario in mind I'm wondering what my best approach is for the initial bulk delete:
>
> 1. Do nothing with the initial backlog and remove the unwanted documents during the next large reindexing process?
> 2. Delete by query (DBQ) with a specific delete query using the document ids?
> 3. Delete by id (DBID)?
>
> Are there any significant performance advantages to using DBID over a specific DBQ? Should I break the delete operations up into batches of, say, 1,000, 10,000, 100,000, N DOC_IDs at a time if I take this approach?
>
> The Solr Reference Guide mentions that DBQ ignores the commitWithin parameter, but you can specify multiple documents to remove with an OR (||) clause in a DBQ, i.e.
>
> Option 1 – Delete by id
> {"delete":["<id1>","<id2>"]}
>
> Option 2 – Delete by query (commitWithin ignored)
> {"delete":{"query":"DOC_ID:(<id1> || <id2>)"}}
>
> Shawn also provides a great explanation of the DBQ process in this user group post from 2015 (https://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html).
>
> I follow the Solr release notes fairly closely and also noticed this excellent addition and discussion from Hossman and committers in the Solr 8.5 release, and it looks ideal for this scenario (https://issues.apache.org/jira/browse/SOLR-14241). Unfortunately we're still on the 7.7.2 branch and are unable to take advantage of the streaming deletes feature.
>
> If I do implement a weekly delete maintenance regime, is there any advice the community can offer from experience? I'll definitely want to avoid times of heavy indexing, but how do deletes affect query performance? Will users notice decreased performance during delete operations, so that they should be avoided during peak query windows as well?
>
> As always, any advice is greatly appreciated,
>
> Thanks,
>
> Dwane
>
> Environment
> SolrCloud 7.7.2, 30 shards, 2 replicas
> ~3 qps during peak times