Thanks Erick!
On 17 April 2014 08:35, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: Will it get split at any point later?
>
> "Split" is a little ambiguous here. Will it be copied into two or more
> segments? No. Will it disappear? Possibly. Eventually this segment
> will be merged if you add enough documents to the system. Consider
> this scenario:
> you add 1M docs to your system and it results in 10 segments (numbers
> made up). Then you optimize, and you have 1M docs in 1 segment. Fine
> so far.
>
> Now you add 750K of those docs over again, which will delete them from
> the 1 big segment. Your merge policy will, at some point, select this
> segment to merge and it'll disappear...
>
> FWIW,
> er...@pedantic.com
>
> On Thu, Apr 17, 2014 at 7:24 AM, Vinay Pothnis <poth...@gmail.com> wrote:
> > Thanks a lot Shalin!
> >
> >
> > On 16 April 2014 21:26, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
> >
> >> You can specify the maxSegments parameter, e.g. maxSegments=5, while
> >> optimizing.
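For reference, that maxSegments option can be passed with the optimize message
itself. A rough, untested sketch, using the same placeholder host and collection
name that appear later in this thread (the value 20 is just an example):

  curl -H 'Content-Type: text/xml' \
       --data '<optimize maxSegments="20" waitSearcher="false"/>' \
       'http://host:port/solr/coll-name1/update'

This asks Lucene to merge down to at most 20 segments rather than the single
large segment a plain optimize (forceMerge=1) produces.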
> >>
> >> On Thu, Apr 17, 2014 at 6:46 AM, Vinay Pothnis <poth...@gmail.com> wrote:
> >>
> >> > Hello,
> >> >
> >> > A couple of follow-up questions:
> >> >
> >> > * When the optimize command is run, it looks like it creates one big
> >> > segment (forceMerge = 1). Will it get split at any point later? Or
> >> > will that big segment remain?
> >> >
> >> > * Is there any way to maintain the number of segments but still merge
> >> > to reclaim the space from deleted documents? In other words, can I
> >> > issue "forceMerge=20"? If so, what would the command look like? Any
> >> > examples for this?
> >> >
> >> > Thanks
> >> > Vinay
> >> >
> >> >
> >> > On 16 April 2014 07:59, Vinay Pothnis <poth...@gmail.com> wrote:
> >> >
> >> > > Thank you Erick!
> >> > > Yes - I am using the expunge deletes option.
> >> > >
> >> > > Thanks for the note on disk space for the optimize command. I should
> >> > > have enough space for that. What about the heap space requirement? I
> >> > > hope it can do the optimize with the memory that is allocated to it.
> >> > >
> >> > > Thanks
> >> > > Vinay
> >> > >
> >> > >
> >> > > On 16 April 2014 04:52, Erick Erickson <erickerick...@gmail.com> wrote:
> >> > >
> >> > >> The optimize should, indeed, reduce the index size. Be aware that it
> >> > >> may consume 2x the disk space. You may also try expungeDeletes, see
> >> > >> here: https://wiki.apache.org/solr/UpdateXmlMessages
> >> > >>
> >> > >> Best,
> >> > >> Erick
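The expungeDeletes option mentioned above is a flag on the commit message
documented on that wiki page. A minimal, untested sketch against the same
placeholder host and collection:

  curl -H 'Content-Type: text/xml' \
       --data '<commit expungeDeletes="true"/>' \
       'http://host:port/solr/coll-name1/update'

Roughly speaking, this asks Lucene to merge away segments that carry deleted
documents without forcing the whole index into one segment, though it can still
rewrite a large part of the index if most segments contain deletes.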
> >> > >>
> >> > >> On Wed, Apr 16, 2014 at 12:47 AM, Vinay Pothnis <poth...@gmail.com> wrote:
> >> > >> > Another update:
> >> > >> >
> >> > >> > I removed the replicas - to avoid the replication doing a full copy.
> >> > >> > I am able to delete sizeable chunks of data.
> >> > >> > But the overall index size remains the same even after the deletes.
> >> > >> > It does not seem to go down.
> >> > >> >
> >> > >> > I understand that Solr would do this in the background - but I don't
> >> > >> > seem to see the decrease in overall index size even after 1-2 hours.
> >> > >> > I can see a bunch of ".del" files in the index directory, but it
> >> > >> > does not seem to get cleaned up. Is there any way to monitor/follow
> >> > >> > the progress of index compaction?
> >> > >> >
> >> > >> > Also, does triggering "optimize" from the admin UI help to compact
> >> > >> > the index size on disk?
> >> > >> >
> >> > >> > Thanks
> >> > >> > Vinay
> >> > >> >
> >> > >> >
> >> > >> > On 14 April 2014 12:19, Vinay Pothnis <poth...@gmail.com> wrote:
> >> > >> >
> >> > >> >> Some update:
> >> > >> >>
> >> > >> >> I removed the auto warm configurations for the various caches and
> >> > >> >> reduced the cache sizes. I then issued a call to delete a day's
> >> > >> >> worth of data (800K documents).
> >> > >> >>
> >> > >> >> There was no out of memory this time - but some of the nodes went
> >> > >> >> into recovery mode. I was able to catch some logs this time around
> >> > >> >> and this is what I see:
> >> > >> >>
> >> > >> >> ****************
> >> > >> >> WARN [2014-04-14 18:11:00.381] [org.apache.solr.update.PeerSync]
> >> > >> >> PeerSync: core=core1_shard1_replica2 url=http://host1:8983/solr
> >> > >> >> too many updates received since start - startingUpdates no longer
> >> > >> >> overlaps with our currentUpdates
> >> > >> >> INFO [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
> >> > >> >> PeerSync Recovery was not successful - trying replication.
> >> > >> >> core=core1_shard1_replica2
> >> > >> >> INFO [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
> >> > >> >> Starting Replication Recovery. core=core1_shard1_replica2
> >> > >> >> INFO [2014-04-14 18:11:00.535] [org.apache.solr.cloud.RecoveryStrategy]
> >> > >> >> Begin buffering updates. core=core1_shard1_replica2
> >> > >> >> INFO [2014-04-14 18:11:00.536] [org.apache.solr.cloud.RecoveryStrategy]
> >> > >> >> Attempting to replicate from http://host2:8983/solr/core1_shard1_replica1/.
> >> > >> >> core=core1_shard1_replica2
> >> > >> >> INFO [2014-04-14 18:11:00.536] [org.apache.solr.client.solrj.impl.HttpClientUtil]
> >> > >> >> Creating new http client,
> >> > >> >> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> >> > >> >> INFO [2014-04-14 18:11:01.964] [org.apache.solr.client.solrj.impl.HttpClientUtil]
> >> > >> >> Creating new http client,
> >> > >> >> config:connTimeout=5000&socketTimeout=20000&allowCompression=false&maxConnections=10000&maxConnectionsPerHost=10000
> >> > >> >> INFO [2014-04-14 18:11:01.969] [org.apache.solr.handler.SnapPuller]
> >> > >> >> No value set for 'pollInterval'. Timer Task not started.
> >> > >> >> INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
> >> > >> >> Master's generation: 1108645
> >> > >> >> INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
> >> > >> >> Slave's generation: 1108627
> >> > >> >> INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
> >> > >> >> Starting replication process
> >> > >> >> INFO [2014-04-14 18:11:02.007] [org.apache.solr.handler.SnapPuller]
> >> > >> >> Number of files in latest index in master: 814
> >> > >> >> INFO [2014-04-14 18:11:02.007] [org.apache.solr.core.CachingDirectoryFactory]
> >> > >> >> return new directory for
> >> > >> >> /opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
> >> > >> >> INFO [2014-04-14 18:11:02.008] [org.apache.solr.handler.SnapPuller]
> >> > >> >> Starting download to
> >> > >> >> NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
> >> > >> >> lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f6570fe;
> >> > >> >> maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
> >> > >> >>
> >> > >> >> ****************
> >> > >> >>
> >> > >> >> So, it looks like the number of updates is too large for regular
> >> > >> >> (peer sync) replication, and it then falls back to a full copy of
> >> > >> >> the index. And since our index is very large (350G), this causes
> >> > >> >> the cluster to go into recovery mode forever - trying to copy that
> >> > >> >> huge index.
> >> > >> >>
> >> > >> >> I also read in the thread
> >> > >> >> http://lucene.472066.n3.nabble.com/Recovery-too-many-updates-received-since-start-td3935281.html
> >> > >> >> that there is a limit of 100 documents.
> >> > >> >>
> >> > >> >> I wonder if this has been made configurable since that thread. If
> >> > >> >> not, the only option I see is to do a "trickle" delete of 100
> >> > >> >> documents per second or something.
> >> > >> >>
> >> > >> >> Also - the other suggestion of using "distrib=false" might not
> >> > >> >> help, because the issue currently is that the replication is going
> >> > >> >> into a "full copy".
> >> > >> >>
> >> > >> >> Any thoughts?
> >> > >> >>
> >> > >> >> Thanks
> >> > >> >> Vinay
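If it comes to the "trickle" delete described above, one way to sketch it is the
untested shell loop below. The host, collection, and the date_param range are the
placeholders already used in this thread, and "id" is assumed to be the uniqueKey
field - this is a sketch, not a tested recipe:

  # Fetch up to 100 matching ids, delete just that batch, commit, pause, and
  # repeat until nothing matches. In practice you would probably commit less
  # often than once per batch.
  while :; do
    ids=$(curl -s 'http://host:port/solr/coll-name1/select?q=date_param:%5B1383955200000+TO+1385164800000%5D&fl=id&rows=100&wt=csv&csv.header=false')
    [ -z "$ids" ] && break
    # wrap each id in <id>...</id> and send a single delete message per batch
    batch=$(printf '%s\n' "$ids" | sed 's|.*|<id>&</id>|' | tr -d '\n')
    curl -s -H 'Content-Type: text/xml' --data "<delete>$batch</delete>" \
         'http://host:port/solr/coll-name1/update?commit=true' > /dev/null
    sleep 1
  done

Deleting by id in small, throttled batches keeps each update batch small enough
for peer sync (which, per the nabble thread above, gives up beyond roughly 100
updates), at the cost of a much slower overall purge.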
> >> > >> >>
> >> > >> >> On 14 April 2014 07:54, Vinay Pothnis <poth...@gmail.com> wrote:
> >> > >> >>
> >> > >> >>> Yes, that is our approach. We did try deleting a day's worth of
> >> > >> >>> data at a time, and that resulted in OOM as well.
> >> > >> >>>
> >> > >> >>> Thanks
> >> > >> >>> Vinay
> >> > >> >>>
> >> > >> >>>
> >> > >> >>> On 14 April 2014 00:27, Furkan KAMACI <furkankam...@gmail.com> wrote:
> >> > >> >>>
> >> > >> >>>> Hi;
> >> > >> >>>>
> >> > >> >>>> I mean you can divide the range (i.e. one week at each delete
> >> > >> >>>> instead of one month) and try to check whether you still get an
> >> > >> >>>> OOM or not.
> >> > >> >>>>
> >> > >> >>>> Thanks;
> >> > >> >>>> Furkan KAMACI
> >> > >> >>>>
> >> > >> >>>>
> >> > >> >>>> 2014-04-14 7:09 GMT+03:00 Vinay Pothnis <poth...@gmail.com>:
> >> > >> >>>>
> >> > >> >>>> > Aman,
> >> > >> >>>> > Yes - Will do!
> >> > >> >>>> >
> >> > >> >>>> > Furkan,
> >> > >> >>>> > What do you mean by 'bulk delete'?
> >> > >> >>>> >
> >> > >> >>>> > -Thanks
> >> > >> >>>> > Vinay
> >> > >> >>>> >
> >> > >> >>>> >
> >> > >> >>>> > On 12 April 2014 14:49, Furkan KAMACI <furkankam...@gmail.com> wrote:
> >> > >> >>>> >
> >> > >> >>>> > > Hi;
> >> > >> >>>> > >
> >> > >> >>>> > > Do you get any problems when you index your data? On the
> >> > >> >>>> > > other hand, deleting in batches and reducing the batch size
> >> > >> >>>> > > may help you not to hit OOM.
> >> > >> >>>> > >
> >> > >> >>>> > > Thanks;
> >> > >> >>>> > > Furkan KAMACI
> >> > >> >>>> > >
> >> > >> >>>> > >
> >> > >> >>>> > > 2014-04-12 8:22 GMT+03:00 Aman Tandon <amantandon...@gmail.com>:
> >> > >> >>>> > >
> >> > >> >>>> > > > Vinay, please share your experience after trying this solution.
> >> > >> >>>> > > >
> >> > >> >>>> > > >
> >> > >> >>>> > > > On Sat, Apr 12, 2014 at 4:12 AM, Vinay Pothnis <poth...@gmail.com> wrote:
> >> > >> >>>> > > >
> >> > >> >>>> > > > > The query is something like this:
> >> > >> >>>> > > > >
> >> > >> >>>> > > > > curl -H 'Content-Type: text/xml' --data '<delete><query>param1:(val1 OR
> >> > >> >>>> > > > > val2) AND -param2:(val3 OR val4) AND date_param:[1383955200000 TO
> >> > >> >>>> > > > > 1385164800000]</query></delete>'
> >> > >> >>>> > > > > 'http://host:port/solr/coll-name1/update?commit=true'
> >> > >> >>>> > > > >
> >> > >> >>>> > > > > Trying to restrict the number of documents deleted via the
> >> > >> >>>> > > > > date parameter.
> >> > >> >>>> > > > >
> >> > >> >>>> > > > > Had not tried the "distrib=false" option. I could give that
> >> > >> >>>> > > > > a try. Thanks for the link! I will check on the cache sizes
> >> > >> >>>> > > > > and autowarm values, and will try disabling the caches when
> >> > >> >>>> > > > > I am deleting.
> >> > >> >>>> > > > >
> >> > >> >>>> > > > > Thanks Erick and Shawn for your inputs!
> >> > >> >>>> > > > >
> >> > >> >>>> > > > > -Vinay
> >> > >> >>>> > > > >
> >> > >> >>>> > > > >
> >> > >> >>>> > > > > On 11 April 2014 15:28, Shawn Heisey <s...@elyograg.org> wrote:
> >> > >> >>>> > > > >
> >> > >> >>>> > > > > > On 4/10/2014 7:25 PM, Vinay Pothnis wrote:
> >> > >> >>>> > > > > >
> >> > >> >>>> > > > > >> We tried to delete the data through a query - say, a
> >> > >> >>>> > > > > >> day's or a month's worth of data at a time. But after
> >> > >> >>>> > > > > >> deleting just 1 month's worth of data, the master node
> >> > >> >>>> > > > > >> is going out of memory - heap space.
> >> > >> >>>> > > > > >>
> >> > >> >>>> > > > > >> Wondering if there is any way to incrementally delete
> >> > >> >>>> > > > > >> the data without affecting the cluster adversely.
> >> > >> >>>> > > > > >>
> >> > >> >>>> > > > > >
> >> > >> >>>> > > > > > I'm curious about the actual query being used here. Can
> >> > >> >>>> > > > > > you share it, or a redacted version of it? Perhaps there
> >> > >> >>>> > > > > > might be a clue there?
> >> > >> >>>> > > > > >
> >> > >> >>>> > > > > > Is this a fully distributed delete request? One thing you
> >> > >> >>>> > > > > > might try, assuming Solr even supports it, is sending the
> >> > >> >>>> > > > > > same delete request directly to each shard core with
> >> > >> >>>> > > > > > distrib=false.
> >> > >> >>>> > > > > >
> >> > >> >>>> > > > > > Here's a very incomplete list of ways to reduce Solr heap
> >> > >> >>>> > > > > > requirements:
> >> > >> >>>> > > > > >
> >> > >> >>>> > > > > > http://wiki.apache.org/solr/SolrPerformanceProblems#Reducing_heap_requirements
> >> > >> >>>> > > > > >
> >> > >> >>>> > > > > > Thanks,
> >> > >> >>>> > > > > > Shawn
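To make the per-shard idea above concrete - keeping the caveat that it is not
certain Solr honours this for updates - a hypothetical sketch; the core names are
made up (only core1_shard1_replica1/2 actually appear in the logs earlier in the
thread), and the query is the same placeholder date range:

  # Send the same delete straight to each shard core; distrib=false is intended
  # to stop the core from fanning the request out again. Untested.
  for core in core1_shard1_replica1 core1_shard2_replica1; do
    curl -H 'Content-Type: text/xml' \
         --data '<delete><query>date_param:[1383955200000 TO 1385164800000]</query></delete>' \
         "http://host:port/solr/$core/update?commit=true&distrib=false"
  done

Whether the update handler actually respects distrib=false for deletes is exactly
the open question raised above, so treat this as a starting point rather than a
recipe.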
> >> > >> >>>> > > >
> >> > >> >>>> > > > --
> >> > >> >>>> > > > With Regards
> >> > >> >>>> > > > Aman Tandon
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.