bq: Will it get split at any point later?

"Split" is a little ambiguous here. Will it be copied into two or more segments? No. Will it disappear? Possibly. Eventually this segment will be merged away if you add enough documents to the system. Consider this scenario: you add 1M docs to your system and it results in 10 segments (numbers made up). Then you optimize, and you have 1M docs in 1 segment. Fine so far. Now you add 750K of those docs over again, which deletes the old copies from the 1 big segment. Your merge policy will, at some point, select this segment to merge and it'll disappear...

FWIW,
er...@pedantic.com

On Thu, Apr 17, 2014 at 7:24 AM, Vinay Pothnis <poth...@gmail.com> wrote:

Thanks a lot Shalin!

On 16 April 2014 21:26, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

You can specify the maxSegments parameter, e.g. maxSegments=5, while optimizing.

--
Regards,
Shalin Shekhar Mangar.

On Thu, Apr 17, 2014 at 6:46 AM, Vinay Pothnis <poth...@gmail.com> wrote:

Hello,

A couple of follow-up questions:

* When the optimize command is run, it looks like it creates one big segment (forceMerge = 1). Will it get split at any point later? Or will that big segment remain?

* Is there any way to maintain the number of segments but still merge to reclaim the space held by deleted documents? In other words, can I issue "forceMerge=20"? If so, what would the command look like? Any examples of this?

Thanks
Vinay
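To illustrate the form being asked about above, a forceMerge that stops at 20 segments rather than 1 can be sent to the update handler along these lines; the host, port, and collection name are placeholders, and the exact parameter handling is worth double-checking against your Solr version:

# keep roughly 20 segments instead of merging down to a single one
curl -H 'Content-Type: text/xml' --data '<optimize maxSegments="20"/>' \
    'http://host:port/solr/coll-name1/update'

# or, equivalently, as URL parameters on the update handler
curl 'http://host:port/solr/coll-name1/update?optimize=true&maxSegments=20'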
On 16 April 2014 07:59, Vinay Pothnis <poth...@gmail.com> wrote:

Thank you Erick!
Yes - I am using the expungeDeletes option.

Thanks for the note on disk space for the optimize command. I should have enough space for that. What about the heap space requirement? I hope it can do the optimize with the memory that is allocated to it.

Thanks
Vinay

On 16 April 2014 04:52, Erick Erickson <erickerick...@gmail.com> wrote:

The optimize should, indeed, reduce the index size. Be aware that it may consume 2x the disk space. You may also try expungeDeletes; see here: https://wiki.apache.org/solr/UpdateXmlMessages

Best,
Erick
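For reference, expungeDeletes rides along on a commit rather than an optimize; a sketch of the two usual forms, reusing the placeholder host and collection from above (it only rewrites segments that actually contain deleted documents, so how much space it reclaims depends on how the deletes are spread across segments):

# XML update message form
curl -H 'Content-Type: text/xml' --data '<commit expungeDeletes="true"/>' \
    'http://host:port/solr/coll-name1/update'

# URL parameter form
curl 'http://host:port/solr/coll-name1/update?commit=true&expungeDeletes=true'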
On Wed, Apr 16, 2014 at 12:47 AM, Vinay Pothnis <poth...@gmail.com> wrote:

Another update:

I removed the replicas - to avoid the replication doing a full copy. I am able to delete sizeable chunks of data. But the overall index size remains the same even after the deletes. It does not seem to go down.

I understand that Solr would do this in the background - but I don't see a decrease in overall index size even after 1-2 hours. I can see a bunch of ".del" files in the index directory, but they do not seem to get cleaned up. Is there any way to monitor/follow the progress of index compaction?

Also, does triggering "optimize" from the admin UI help to compact the index size on disk?

Thanks
Vinay

On 14 April 2014 12:19, Vinay Pothnis <poth...@gmail.com> wrote:

Some update:

I removed the auto-warm configurations for the various caches and reduced the cache sizes. I then issued a call to delete a day's worth of data (800K documents).

There was no out-of-memory error this time - but some of the nodes went into recovery mode. I was able to catch some logs this time around, and this is what I see:

****************
WARN [2014-04-14 18:11:00.381] [org.apache.solr.update.PeerSync] PeerSync: core=core1_shard1_replica2 url=http://host1:8983/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
INFO [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy] PeerSync Recovery was not successful - trying replication. core=core1_shard1_replica2
INFO [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy] Starting Replication Recovery. core=core1_shard1_replica2
INFO [2014-04-14 18:11:00.535] [org.apache.solr.cloud.RecoveryStrategy] Begin buffering updates. core=core1_shard1_replica2
INFO [2014-04-14 18:11:00.536] [org.apache.solr.cloud.RecoveryStrategy] Attempting to replicate from http://host2:8983/solr/core1_shard1_replica1/. core=core1_shard1_replica2
INFO [2014-04-14 18:11:00.536] [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
INFO [2014-04-14 18:11:01.964] [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http client, config:connTimeout=5000&socketTimeout=20000&allowCompression=false&maxConnections=10000&maxConnectionsPerHost=10000
INFO [2014-04-14 18:11:01.969] [org.apache.solr.handler.SnapPuller] No value set for 'pollInterval'. Timer Task not started.
INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller] Master's generation: 1108645
INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller] Slave's generation: 1108627
INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller] Starting replication process
INFO [2014-04-14 18:11:02.007] [org.apache.solr.handler.SnapPuller] Number of files in latest index in master: 814
INFO [2014-04-14 18:11:02.007] [org.apache.solr.core.CachingDirectoryFactory] return new directory for /opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
INFO [2014-04-14 18:11:02.008] [org.apache.solr.handler.SnapPuller] Starting download to NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007 lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f6570fe; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
****************

So it looks like the number of updates is too large for regular peer sync, and recovery then falls back to a full copy of the index. And since our index is very large (350G), this keeps the cluster in recovery mode forever - trying to copy that huge index.

I also read in the thread http://lucene.472066.n3.nabble.com/Recovery-too-many-updates-received-since-start-td3935281.html that there is a limit of 100 documents.

I wonder if this has been made configurable since that thread. If not, the only option I see is to do a "trickle" delete of 100 documents per second or something.

Also - the other suggestion of using "distrib=false" might not help, because the issue currently is that the replication is going to "full copy".

Any thoughts?

Thanks
Vinay
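A trickle delete along those lines can be scripted rather than run as one big delete-by-query; the sketch below walks the same date range in one-hour slices with a pause between batches so the replicas never receive a large burst of updates at once. The field names and range are the ones from the original delete command (quoted further down in the thread); the slice size and pacing are guesses that would need tuning:

# walk the date range in one-hour slices, pausing between batches
START=1383955200000   # start of the range, epoch millis
END=1385164800000     # end of the range, epoch millis
STEP=3600000          # one hour in millis
FROM=$START
while [ "$FROM" -lt "$END" ]; do
  TO=$((FROM + STEP))
  # delete one slice; re-matching a boundary document in the next slice is harmless
  curl -H 'Content-Type: text/xml' --data \
    "<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR val4) AND date_param:[$FROM TO $TO]</query></delete>" \
    'http://host:port/solr/coll-name1/update?commit=true'
  FROM=$TO
  sleep 5             # throttle so replicas can keep up via peer sync
done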
On 14 April 2014 07:54, Vinay Pothnis <poth...@gmail.com> wrote:

Yes, that is our approach. We did try deleting a day's worth of data at a time, and that resulted in OOM as well.

Thanks
Vinay

On 14 April 2014 00:27, Furkan KAMACI <furkankam...@gmail.com> wrote:

Hi;

I mean you can divide the range (i.e. one week at each delete instead of one month) and check whether you still get an OOM or not.

Thanks;
Furkan KAMACI

2014-04-14 7:09 GMT+03:00 Vinay Pothnis <poth...@gmail.com>:

Aman,
Yes - will do!

Furkan,
What do you mean by 'bulk delete'?

-Thanks
Vinay

On 12 April 2014 14:49, Furkan KAMACI <furkankam...@gmail.com> wrote:

Hi;

Do you get any problems when you index your data? On the other hand, deleting in bulks and reducing the size of the documents may help you avoid hitting OOM.

Thanks;
Furkan KAMACI

2014-04-12 8:22 GMT+03:00 Aman Tandon <amantandon...@gmail.com>:

Vinay, please share your experience after trying this solution.

--
With Regards
Aman Tandon

On Sat, Apr 12, 2014 at 4:12 AM, Vinay Pothnis <poth...@gmail.com> wrote:

The query is something like this:

curl -H 'Content-Type: text/xml' --data '<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR val4) AND date_param:[1383955200000 TO 1385164800000]</query></delete>' 'http://host:port/solr/coll-name1/update?commit=true'

Trying to restrict the number of documents deleted via the date parameter.

Had not tried the "distrib=false" option. I could give that a try. Thanks for the link! I will check on the cache sizes and autowarm values. Will try and disable the caches when I am deleting and give that a try.

Thanks Erick and Shawn for your inputs!

-Vinay
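Disabling autowarming is a solrconfig.xml change along these lines; the cache elements and attributes are the standard ones, but the sizes shown are only illustrative:

<!-- autowarmCount="0" stops a new searcher from replaying cached entries after each commit -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>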
On 11 April 2014 15:28, Shawn Heisey <s...@elyograg.org> wrote:

On 4/10/2014 7:25 PM, Vinay Pothnis wrote:
> When we tried to delete the data through a query - say 1 day/month's worth of data. But after deleting just 1 month's worth of data, the master node is going out of memory - heap space.
>
> Wondering is there any way to incrementally delete the data without affecting the cluster adversely.

I'm curious about the actual query being used here. Can you share it, or a redacted version of it? Perhaps there might be a clue there?

Is this a fully distributed delete request? One thing you might try, assuming Solr even supports it, is sending the same delete request directly to each shard core with distrib=false.

Here's a very incomplete list about how you can reduce Solr heap requirements:

http://wiki.apache.org/solr/SolrPerformanceProblems#Reducing_heap_requirements

Thanks,
Shawn
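A sketch of what that per-shard, non-distributed delete might look like, using one of the core names that appears in the recovery logs earlier in the thread; as Shawn notes, whether the update handler honors distrib=false for deletes is something to verify first:

# send the delete straight to a single shard core, bypassing distribution
curl -H 'Content-Type: text/xml' --data \
  '<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR val4) AND date_param:[1383955200000 TO 1385164800000]</query></delete>' \
  'http://host1:8983/solr/core1_shard1_replica2/update?commit=true&distrib=false'
# ...then repeat against one core of each remaining shard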