Another update: I removed the replicas to avoid the replication doing a full copy. I am now able to delete sizeable chunks of data, but the overall index size remains the same even after the deletes - it does not seem to go down.

I understand that Solr would do this in the background, but I don't see any decrease in the overall index size even after 1-2 hours. I can see a bunch of ".del" files in the index directory, but they do not seem to get cleaned up. Is there any way to monitor/follow the progress of index compaction? Also, does triggering "optimize" from the admin UI help to compact the index size on disk?
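In case it is relevant, this is roughly what I was planning to try next (untested - the host/collection are the same placeholders used elsewhere in this thread): an explicit commit with expungeDeletes, and then a full optimize down to one segment.

  curl 'http://host:port/solr/coll-name1/update?commit=true&expungeDeletes=true'
  curl 'http://host:port/solr/coll-name1/update?optimize=true&maxSegments=1'

My understanding is that expungeDeletes only merges segments that contain deleted documents, while optimize rewrites the whole index (and needs enough free disk for a temporary extra copy), so the second one may not be realistic for an index this size - please correct me if I have that wrong.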
Thanks
Vinay

On 14 April 2014 12:19, Vinay Pothnis <poth...@gmail.com> wrote:

> Some update:
>
> I removed the auto-warm configurations for the various caches and reduced
> the cache sizes. I then issued a call to delete a day's worth of data
> (800K documents).
>
> There was no out of memory this time - but some of the nodes went into
> recovery mode. I was able to catch some logs this time around, and this is
> what I see:
>
> ****************
> WARN [2014-04-14 18:11:00.381] [org.apache.solr.update.PeerSync] PeerSync: core=core1_shard1_replica2 url=http://host1:8983/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
> INFO [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy] PeerSync Recovery was not successful - trying replication. core=core1_shard1_replica2
> INFO [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy] Starting Replication Recovery. core=core1_shard1_replica2
> INFO [2014-04-14 18:11:00.535] [org.apache.solr.cloud.RecoveryStrategy] Begin buffering updates. core=core1_shard1_replica2
> INFO [2014-04-14 18:11:00.536] [org.apache.solr.cloud.RecoveryStrategy] Attempting to replicate from http://host2:8983/solr/core1_shard1_replica1/. core=core1_shard1_replica2
> INFO [2014-04-14 18:11:00.536] [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> INFO [2014-04-14 18:11:01.964] [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http client, config:connTimeout=5000&socketTimeout=20000&allowCompression=false&maxConnections=10000&maxConnectionsPerHost=10000
> INFO [2014-04-14 18:11:01.969] [org.apache.solr.handler.SnapPuller] No value set for 'pollInterval'. Timer Task not started.
> INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller] Master's generation: 1108645
> INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller] Slave's generation: 1108627
> INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller] Starting replication process
> INFO [2014-04-14 18:11:02.007] [org.apache.solr.handler.SnapPuller] Number of files in latest index in master: 814
> INFO [2014-04-14 18:11:02.007] [org.apache.solr.core.CachingDirectoryFactory] return new directory for /opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
> INFO [2014-04-14 18:11:02.008] [org.apache.solr.handler.SnapPuller] Starting download to NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007 lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f6570fe; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
> ****************
>
> So, it looks like the number of updates is too large for the regular peer
> sync, and the recovery then falls back to a full copy of the index.
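If that limit has since been made configurable, my guess is that it would be through the update log settings in solrconfig.xml - something along these lines (the element names and values here are my assumption from what I have read, not something I have verified on our version):

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- assumption: keep more recent updates in the transaction log so that
         PeerSync can catch up without falling back to a full index copy;
         the old limit is the 100 mentioned in that thread -->
    <int name="numRecordsToKeep">10000</int>
    <int name="maxNumLogsToKeep">100</int>
  </updateLog>

If anyone can confirm whether this is supported on 4.x, that would be great.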
> And since our index is very large (350G), this is causing the cluster to
> stay in recovery mode forever - trying to copy that huge index.
>
> I also read in this thread that there is a limit of 100 documents:
> http://lucene.472066.n3.nabble.com/Recovery-too-many-updates-received-since-start-td3935281.html
>
> I wonder if this has been updated to make that configurable since that
> thread.
>
> If not, the only option I see is to do a "trickle" delete of 100
> documents per second or something.
>
> Also - the other suggestion of using "distrib=false" might not help,
> because the issue currently is that the replication is going into "full
> copy".
>
> Any thoughts?
>
> Thanks
> Vinay
>
> On 14 April 2014 07:54, Vinay Pothnis <poth...@gmail.com> wrote:
>
>> Yes, that is our approach. We did try deleting a day's worth of data at
>> a time, and that resulted in OOM as well.
>>
>> Thanks
>> Vinay
>>
>> On 14 April 2014 00:27, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>
>>> Hi;
>>>
>>> I mean you can divide the range (i.e. one week per delete instead of
>>> one month) and check whether you still get an OOM or not.
>>>
>>> Thanks;
>>> Furkan KAMACI
>>>
>>> 2014-04-14 7:09 GMT+03:00 Vinay Pothnis <poth...@gmail.com>:
>>>
>>>> Aman,
>>>> Yes - Will do!
>>>>
>>>> Furkan,
>>>> What do you mean by 'bulk delete'?
>>>>
>>>> -Thanks
>>>> Vinay
>>>>
>>>> On 12 April 2014 14:49, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>>>
>>>>> Hi;
>>>>>
>>>>> Do you get any problems when you index your data? On the other hand,
>>>>> deleting in bulk and reducing the number of documents per delete may
>>>>> help you not to hit OOM.
>>>>>
>>>>> Thanks;
>>>>> Furkan KAMACI
>>>>>
>>>>> 2014-04-12 8:22 GMT+03:00 Aman Tandon <amantandon...@gmail.com>:
>>>>>
>>>>>> Vinay, please share your experience after trying this solution.
>>>>>>
>>>>>> On Sat, Apr 12, 2014 at 4:12 AM, Vinay Pothnis <poth...@gmail.com> wrote:
>>>>>>
>>>>>>> The query is something like this:
>>>>>>>
>>>>>>> curl -H 'Content-Type: text/xml' --data '<delete><query>param1:(val1 OR
>>>>>>> val2) AND -param2:(val3 OR val4) AND date_param:[1383955200000 TO
>>>>>>> 1385164800000]</query></delete>'
>>>>>>> 'http://host:port/solr/coll-name1/update?commit=true'
>>>>>>>
>>>>>>> Trying to restrict the number of documents deleted via the date
>>>>>>> parameter.
>>>>>>>
>>>>>>> Had not tried the "distrib=false" option. I could give that a try.
>>>>>>> Thanks for the link! I will check on the cache sizes and autowarm
>>>>>>> values, and will also try disabling the caches while deleting.
>>>>>>>
>>>>>>> Thanks Erick and Shawn for your inputs!
>>>>>>>
>>>>>>> -Vinay
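Just to make the "trickle"/smaller-slices idea above concrete, this is the kind of thing I have in mind - the same delete query as quoted above, but run over small slices of the date range with a pause in between. This is an untested sketch; the slice size, sleep, and host/collection are placeholders:

  #!/bin/bash
  # Delete in small date slices instead of one huge range, pausing between slices.
  start=1383955200000            # range start (epoch millis)
  end=1385164800000              # range end (epoch millis)
  step=3600000                   # one hour per slice

  while [ "$start" -lt "$end" ]; do
    next=$((start + step))
    [ "$next" -gt "$end" ] && next=$end   # don't overshoot the last slice
    curl -H 'Content-Type: text/xml' --data \
      "<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR val4) AND date_param:[$start TO $next]</query></delete>" \
      'http://host:port/solr/coll-name1/update?commit=true'
    sleep 5                      # give the nodes time to catch up between slices
    start=$next
  done

The hope is that each slice removes few enough documents that peer sync can keep up and the replicas never fall back to a full copy.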
>>>>>>> On 11 April 2014 15:28, Shawn Heisey <s...@elyograg.org> wrote:
>>>>>>>
>>>>>>>> On 4/10/2014 7:25 PM, Vinay Pothnis wrote:
>>>>>>>>
>>>>>>>>> When we tried to delete the data through a query - say 1 day's/month's
>>>>>>>>> worth of data. But after deleting just 1 month's worth of data, the
>>>>>>>>> master node is going out of memory - heap space.
>>>>>>>>>
>>>>>>>>> Wondering if there is any way to incrementally delete the data without
>>>>>>>>> affecting the cluster adversely.
>>>>>>>>
>>>>>>>> I'm curious about the actual query being used here. Can you share it,
>>>>>>>> or a redacted version of it? Perhaps there might be a clue there.
>>>>>>>>
>>>>>>>> Is this a fully distributed delete request? One thing you might try,
>>>>>>>> assuming Solr even supports it, is sending the same delete request
>>>>>>>> directly to each shard core with distrib=false.
>>>>>>>>
>>>>>>>> Here's a very incomplete list about how you can reduce Solr heap
>>>>>>>> requirements:
>>>>>>>>
>>>>>>>> http://wiki.apache.org/solr/SolrPerformanceProblems#Reducing_heap_requirements
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shawn
>>>>>>
>>>>>> --
>>>>>> With Regards
>>>>>> Aman Tandon
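Noting here for when I revisit Shawn's distrib=false suggestion: my reading is that it means sending the same delete directly to each shard's core rather than to the collection, roughly like the command below, repeated once per shard. The core name and host are taken from the log snippet earlier in this thread, and - as Shawn says - whether the update handler actually honors distrib=false on our version is something I would need to verify first.

  curl -H 'Content-Type: text/xml' --data '<delete><query>...same query as above...</query></delete>' \
    'http://host2:8983/solr/core1_shard1_replica1/update?commit=true&distrib=false'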