An optimize should, indeed, reduce the index size. Be aware that it may consume up to 2x the disk space while it runs. You may also try expungeDeletes, see here: https://wiki.apache.org/solr/UpdateXmlMessages
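For reference, the XML forms look roughly like this (host, port, and collection name below are placeholders - substitute your own):

curl 'http://host:port/solr/coll-name1/update' -H 'Content-Type: text/xml' \
  --data '<commit expungeDeletes="true"/>'    # only rewrites segments that contain deleted docs

curl 'http://host:port/solr/coll-name1/update' -H 'Content-Type: text/xml' \
  --data '<optimize maxSegments="1"/>'        # full optimize; needs up to 2x disk while it runs

expungeDeletes is usually cheaper than a full optimize because it only merges the segments that actually carry deletions, but either one will reclaim the space held by the ".del" files.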
Best,
Erick

On Wed, Apr 16, 2014 at 12:47 AM, Vinay Pothnis <poth...@gmail.com> wrote:
> Another update:
>
> I removed the replicas - to avoid the replication doing a full copy. I am
> able to delete sizeable chunks of data.
> But the overall index size remains the same even after the deletes. It does
> not seem to go down.
>
> I understand that Solr would do this in the background - but I don't seem to
> see the decrease in overall index size even after 1-2 hours.
> I can see a bunch of ".del" files in the index directory, but it does
> not seem to get cleaned up. Is there any way to monitor/follow the progress
> of index compaction?
>
> Also, does triggering "optimize" from the admin UI help to compact the
> index size on disk?
>
> Thanks
> Vinay
>
>
> On 14 April 2014 12:19, Vinay Pothnis <poth...@gmail.com> wrote:
>
>> Some update:
>>
>> I removed the auto warm configurations for the various caches and reduced
>> the cache sizes. I then issued a call to delete a day's worth of data (800K
>> documents).
>>
>> There was no out of memory this time - but some of the nodes went into
>> recovery mode. I was able to catch some logs this time around and this is
>> what I see:
>>
>> ****************
>> WARN [2014-04-14 18:11:00.381] [org.apache.solr.update.PeerSync]
>> PeerSync: core=core1_shard1_replica2 url=http://host1:8983/solr
>> too many updates received since start -
>> startingUpdates no longer overlaps with our currentUpdates
>> INFO [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
>> PeerSync Recovery was not successful - trying replication.
>> core=core1_shard1_replica2
>> INFO [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
>> Starting Replication Recovery. core=core1_shard1_replica2
>> INFO [2014-04-14 18:11:00.535] [org.apache.solr.cloud.RecoveryStrategy]
>> Begin buffering updates. core=core1_shard1_replica2
>> INFO [2014-04-14 18:11:00.536] [org.apache.solr.cloud.RecoveryStrategy]
>> Attempting to replicate from http://host2:8983/solr/core1_shard1_replica1/.
>> core=core1_shard1_replica2
>> INFO [2014-04-14 18:11:00.536]
>> [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http
>> client,
>> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
>> INFO [2014-04-14 18:11:01.964]
>> [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http
>> client,
>> config:connTimeout=5000&socketTimeout=20000&allowCompression=false&maxConnections=10000&maxConnectionsPerHost=10000
>> INFO [2014-04-14 18:11:01.969] [org.apache.solr.handler.SnapPuller] No
>> value set for 'pollInterval'.
>> Timer Task not started.
>> INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
>> Master's generation: 1108645
>> INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
>> Slave's generation: 1108627
>> INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
>> Starting replication process
>> INFO [2014-04-14 18:11:02.007] [org.apache.solr.handler.SnapPuller]
>> Number of files in latest index in master: 814
>> INFO [2014-04-14 18:11:02.007]
>> [org.apache.solr.core.CachingDirectoryFactory] return new directory for
>> /opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
>> INFO [2014-04-14 18:11:02.008] [org.apache.solr.handler.SnapPuller]
>> Starting download to
>> NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
>> lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f6570fe;
>> maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
>>
>> ****************
>>
>>
>> So, it looks like the number of updates is too large for regular peer-sync
>> recovery, and it then falls back to a full copy of the index. And since our
>> index size is very large (350G), this is causing the cluster to go into
>> recovery mode forever - trying to copy that huge index.
>>
>> I also read in this thread
>> http://lucene.472066.n3.nabble.com/Recovery-too-many-updates-received-since-start-td3935281.html
>> that there is a limit of 100 documents.
>>
>> I wonder if that limit has been made configurable since that thread. If not,
>> the only option I see is to do a "trickle" delete of 100 documents per
>> second or something.
>>
>> Also - the other suggestion of using "distrib=false" might not help
>> because the issue currently is that the replication is going to "full copy".
>>
>> Any thoughts?
>>
>> Thanks
>> Vinay
>>
>>
>> On 14 April 2014 07:54, Vinay Pothnis <poth...@gmail.com> wrote:
>>
>>> Yes, that is our approach. We did try deleting a day's worth of data at a
>>> time, and that resulted in OOM as well.
>>>
>>> Thanks
>>> Vinay
>>>
>>>
>>> On 14 April 2014 00:27, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>>
>>>> Hi;
>>>>
>>>> I mean you can divide the range (i.e. one week at each delete instead of
>>>> one month) and try to check whether you still get an OOM or not.
>>>>
>>>> Thanks;
>>>> Furkan KAMACI
>>>>
>>>>
>>>> 2014-04-14 7:09 GMT+03:00 Vinay Pothnis <poth...@gmail.com>:
>>>>
>>>> > Aman,
>>>> > Yes - Will do!
>>>> >
>>>> > Furkan,
>>>> > What do you mean by 'bulk delete'?
>>>> >
>>>> > -Thanks
>>>> > Vinay
>>>> >
>>>> >
>>>> > On 12 April 2014 14:49, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>>> >
>>>> > > Hi;
>>>> > >
>>>> > > Do you get any problems when you index your data? On the other hand,
>>>> > > deleting in bulks and reducing the size of documents may help you not to
>>>> > > hit OOM.
>>>> > >
>>>> > > Thanks;
>>>> > > Furkan KAMACI
>>>> > >
>>>> > >
>>>> > > 2014-04-12 8:22 GMT+03:00 Aman Tandon <amantandon...@gmail.com>:
>>>> > >
>>>> > > > Vinay, please share your experience after trying this solution.
>>>> > > >
>>>> > > >
>>>> > > > On Sat, Apr 12, 2014 at 4:12 AM, Vinay Pothnis <poth...@gmail.com> wrote:
>>>> > > >
>>>> > > > > The query is something like this:
>>>> > > > >
>>>> > > > > curl -H 'Content-Type: text/xml' --data
>>>> > > > > '<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR val4)
>>>> > > > > AND date_param:[1383955200000 TO 1385164800000]</query></delete>'
>>>> > > > > 'http://host:port/solr/coll-name1/update?commit=true'
>>>> > > > >
>>>> > > > > We are trying to restrict the number of documents deleted via the date
>>>> > > > > parameter.
>>>> > > > >
>>>> > > > > I had not tried the "distrib=false" option. I could give that a try.
>>>> > > > > Thanks for the link! I will check on the cache sizes and autowarm
>>>> > > > > values. Will try to disable the caches when I am deleting and give
>>>> > > > > that a try.
>>>> > > > >
>>>> > > > > Thanks Erick and Shawn for your inputs!
>>>> > > > >
>>>> > > > > -Vinay
>>>> > > > >
>>>> > > > >
>>>> > > > > On 11 April 2014 15:28, Shawn Heisey <s...@elyograg.org> wrote:
>>>> > > > >
>>>> > > > > > On 4/10/2014 7:25 PM, Vinay Pothnis wrote:
>>>> > > > > >
>>>> > > > > >> We tried to delete the data through a query - say 1 day's or 1 month's
>>>> > > > > >> worth of data. But after deleting just 1 month's worth of data, the
>>>> > > > > >> master node is going out of memory - heap space.
>>>> > > > > >>
>>>> > > > > >> Wondering if there is any way to incrementally delete the data
>>>> > > > > >> without affecting the cluster adversely.
>>>> > > > > >>
>>>> > > > > >
>>>> > > > > > I'm curious about the actual query being used here. Can you share it,
>>>> > > > > > or a redacted version of it? Perhaps there might be a clue there?
>>>> > > > > >
>>>> > > > > > Is this a fully distributed delete request? One thing you might try,
>>>> > > > > > assuming Solr even supports it, is sending the same delete request
>>>> > > > > > directly to each shard core with distrib=false.
>>>> > > > > >
>>>> > > > > > Here's a very incomplete list about how you can reduce Solr heap
>>>> > > > > > requirements:
>>>> > > > > >
>>>> > > > > > http://wiki.apache.org/solr/SolrPerformanceProblems#Reducing_heap_requirements
>>>> > > > > >
>>>> > > > > > Thanks,
>>>> > > > > > Shawn
>>>> > > > > >
>>>> > > > > >
>>>> > > > >
>>>> > > >
>>>> > > >
>>>> > > >
>>>> > > > --
>>>> > > > With Regards
>>>> > > > Aman Tandon
>>>> > > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
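Regarding the "trickle" delete idea mentioned up-thread: purely as an illustration, a loop along these lines would keep each delete batch small. The field names and endpoint just mirror the query quoted above, and the window size and sleep are arbitrary placeholders to tune, not a recommendation:

START=1383955200000   # epoch millis, same range as the quoted query
END=1385164800000
STEP=3600000          # delete roughly one hour of data per batch
FROM=$START
while [ "$FROM" -lt "$END" ]; do
  TO=$((FROM + STEP))
  curl -H 'Content-Type: text/xml' \
    --data "<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR val4) AND date_param:[$FROM TO $TO]</query></delete>" \
    'http://host:port/solr/coll-name1/update?commit=true'
  sleep 5             # give the replicas time to keep up between batches
  FROM=$TO
done

Committing on every batch as above is the simplest thing to reason about, but you may want to commit less often (or rely on autoCommit) if the commits themselves become the bottleneck.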