Hi Alexander

Thanks for the heads-up on the script. It seems to work fine for me, at 
least on Quincy, but Pacific's way of handling CRUSH map changes and 
remapping is not compatible with it.

________________________________
From: Alexander Patrakov <[email protected]>
Sent: Friday, January 17, 2025 17:51
To: Anthony D'Atri <[email protected]>
Cc: Kasper Rasmussen <[email protected]>; [email protected] 
<[email protected]>
Subject: Re: [ceph-users] Re: Adding Rack to crushmap - Rebalancing multiple PB 
of data - advice/experience

Hello Kasper,

Please be aware that the current "upmap-remapped" script is flaky. It
might just refuse to work, with this message:

Error loading remapped pgs

This has been traced to the fact that "ceph pg ls remapped -f json"
sets its stderr to non-blocking mode, and that stderr is the same file
descriptor that jq (next in the pipeline) writes to. jq can therefore
get -EAGAIN and terminate prematurely.

The problem is tracked as https://tracker.ceph.com/issues/67505

Retrying the script might help.
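
Until the fixed release is out, one possible workaround (untested
sketch, and it assumes the diagnosis above) is to give ceph a private
stderr, so the O_NONBLOCK flag it sets lands on its own file
description instead of the one jq shares:

    # the real script uses a jq filter; "jq ." is just a placeholder
    ceph pg ls remapped -f json 2>/dev/null | jq .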

What's worse is that the whole reason for adding jq to the
upmap-remapped script is another Ceph bug: ceph sometimes outputs
invalid JSON (a literal inf or nan where a number should be), and this
became much more common with Reef, as new fields were added that are
commonly inf or nan. This is tracked as
https://tracker.ceph.com/issues/66215, and a fix has been merged but
not yet released.
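
Until then, a crude stopgap (hedged; a blind regex could in principle
also touch string values, so treat it as a band-aid, not a fix) is to
rewrite the bare inf/nan tokens to null before jq parses them:

    ceph pg ls remapped -f json 2>/dev/null \
        | sed -E 's/:[[:space:]]*-?(inf|nan)/: null/g' \
        | jq .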

Maybe you should look into alternative tools, like
https://github.com/digitalocean/pgremapper
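
With pgremapper, the rough equivalent of upmap-remapped (hedged; check
its README for the current subcommands and flags) is:

    # map remapped PGs back to their current OSDs, cancelling the
    # pending backfill so data can then be moved gradually
    pgremapper cancel-backfill --yes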


On Fri, Jan 17, 2025 at 11:43 PM Anthony D'Atri <[email protected]> wrote:
>
>
>
> > On Jan 17, 2025, at 6:02 AM, Kasper Rasmussen 
> > <[email protected]> wrote:
> >
> > However I'm concerned with the amount of data that needs to be rebalanced, 
> > since the cluster holds multiple PB, and I'm looking for review of/input 
> > for my plan, as well as words of advice/experience from someone who has 
> > been in a similar situation.
>
> Yep, that’s why you want to use upmap-remapped.  Otherwise the thundering 
> herd of data shuffling will DoS your client traffic, esp. since you’re using 
> spinners.  Count on pretty much all data moving in the process, and the 
> convergence taking …. maybe a week?
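>
> The usual pattern with the script (hedged, from its documentation; 
> verify against the version you run) is to preview the generated 
> commands, then pipe them to a shell:
>
>   ./upmap-remapped.py        # prints ceph osd pg-upmap-items commands
>   ./upmap-remapped.py | sh   # apply them, mapping PGs back in place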
>
> > On Pacific: Data is marked as "degraded", and not misplaced as expected. I 
> > also see above 2000% degraded data (but that might be another issue)
> >
> > On Quincy: Data is marked as misplaced - which seems correct.
>
>
> I’m not specifically familiar with such a change, but that could be mainly 
> cosmetic, a function of how the percentage is calculated for objects / PGs 
> that are multiply remapped.
>
> In the depths of time I had clusters that would sometimes show a negative 
> number of RADOS objects to recover; it would bounce above and below zero a 
> few times as it converged to zero.
>
>
> > Instead, balancing has been done by a cron job executing: ceph osd 
> > reweight-by-utilization 112 0.05 30
>
> I used a similar strategy with older releases.  Note that this will 
> complicate your transition, as those relative weights are a function of the 
> CRUSH topology, so when the topology changes, likely some reweighted OSDs 
> will get much less than their fair share, and some will get much more.  How 
> full is your cluster (ceph df)?  It might not be a bad idea to incrementally 
> revert those all to 1.00000 if you have the capacity, and disable the cron 
> job.
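>
> A rough sketch of reverting them gradually (hypothetical pacing and step 
> size; wait for the cluster to settle between rounds):
>
>   # raise each overridden reweight by 0.05 per round, capped at 1.0
>   ceph osd df -f json \
>     | jq -r '.nodes[] | select(.reweight < 1.0)
>              | "\(.id) \([.reweight + 0.05, 1.0] | min)"' \
>     | while read id w; do ceph osd reweight "$id" "$w"; done
>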
> You’ll also likely want to switch to the balancer module for the 
> upmap-remapped strategy to incrementally move your data around.  Did you have 
> it disabled for a specific reason?
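>
> Enabling it is the standard sequence (worth double-checking the 
> min-compat-client first, since upmap needs luminous or newer clients):
>
>   ceph osd set-require-min-compat-client luminous
>   ceph balancer mode upmap
>   ceph balancer on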
>
> Updating to Reef before migrating might be to your advantage so that you can 
> benefit from performance and efficiency improvements since Pacific.
>
>



--
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
