> On May 21, 2025, at 6:04 AM, Anthony Fecarotta <[email protected]> wrote:
> 
> Thanks, Anthony.
> 
> Each R740 is as follows:
> 
> 1 BOSS Boot Drive (Proxmox)
> 1 Samsung PM1725a NVMe SSD (used as a DB Disk for each OSD)

Older mixed-use SKU.  Mixed-use is overkill in most cases.  How large is this 
drive, i.e. how big a DB slice does each OSD get?  If you have 24x SAS SSDs 
sharing this single PCIe Gen 3 SSD for offload, I might rethink that strategy.

With NVMe devices offloading WAL+DB from spinners, conventional wisdom is that 
the ratio should be no more than 10:1.
Offloading from SAS or SATA SSDs is done much less frequently, and to see a 
benefit you would need a much smaller ratio, maybe 4-5:1.
Also note that if you have 24 OSDs offloading to a single NVMe, when that 
device fails, you’ll have to recover an entire host’s worth of data.  Do you 
have that much headroom?
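The slice math is worth doing explicitly.  A minimal sketch; the 1600 GB 
capacity below is an assumption (PM1725a SKUs range from 0.8 to 6.4 TB), so 
substitute your actual drive:

```shell
# Back-of-envelope: DB slice per OSD when one shared NVMe backs every OSD
# in the host.  1600 GB is an assumed PM1725a capacity -- substitute yours.
nvme_gb=1600
osds=24
slice=$(( nvme_gb / osds ))
echo "${slice} GB of DB per OSD"   # ~66 GB with these assumed numbers
```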

Honestly I would just leave the BlueStore WAL+DB colocated on the SAS SSDs 
and use the NVMe drives for something else, like RGW index or CephFS metadata 
pools, or a faster but smaller RBD pool.
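If you go that route, CRUSH device classes make it easy to steer a pool onto 
the NVMe drives only.  A sketch using stock Ceph tooling; the pool name is 
hypothetical:

```shell
# Create a replicated CRUSH rule restricted to the "nvme" device class.
# OSDs report their device class automatically; verify with "ceph osd tree".
ceph osd crush rule create-replicated nvme-only default host nvme

# Point a pool at the new rule -- "cephfs_metadata" is a placeholder name.
ceph osd pool set cephfs_metadata crush_rule nvme-only
```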

If you haven’t bought the gear yet, buy all-NVMe chassis and drives instead.  
Also, do you have at least … 96 threads / vcores per server?

> 24 OSDs — SAS SSD 12 Gb/s
> 1 PERC H730P RAID card (in HBA mode)
> 25 Gb/s two-port NICs (NDC, not PCIe)

NDC? Oh, a mezzanine / OCP card.  Still PCIe, just not the AIC form factor.  
It conserves the limited number of AIC slots in a server and reduces the need 
for risers.

> I was considering replacing the PERC H740P RAID card with the HBA330 Mini 
> Monolithic; but only if it would improve performance.

I doubt that it would.

> I have Cisco Nexus 9000 switches with the capability to go to 40GbE or 100GbE.

Remember that 40GE is, under the hood, 4x10GE lanes, while 100GE is 4x25GE.  
The latter is the newer tech; I’d solve for future-proofing and the lower 
latency.

> I was considering adding 40GbE NICs to each node in the cluster, to be used 
> as a private network for Ceph.

So today you only have the public network?  What is your saturation like?  You 
should be able to get stats from your switches, and also from node_exporter’s 
interface metrics.  Chances are that you aren’t close to saturating 25GE, 
especially if bonded.  A separate replication network is an artifact of the 
days when clusters used 100Mb/s or 1GE ethernet, and CRUSH was less efficient 
with respect to planning backfill and recovery.  With modern releases I usually 
advocate not bothering with a replication network; it’s additional expense and 
introduces a subtle failure mode.
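Measure before buying.  node_exporter’s `node_network_*_bytes_total` counters 
come straight from `/proc/net/dev`, which you can also sample directly.  A 
rough sketch; the interface name is a placeholder:

```shell
# Sample an interface's byte counters one second apart and print Mbit/s.
# These are the same counters node_exporter exports as
# node_network_receive_bytes_total / node_network_transmit_bytes_total.
iface_mbps() {
    dev="$1"
    rx1=$(awk -v d="$dev:" '$1 == d {print $2}'  /proc/net/dev)
    tx1=$(awk -v d="$dev:" '$1 == d {print $10}' /proc/net/dev)
    sleep 1
    rx2=$(awk -v d="$dev:" '$1 == d {print $2}'  /proc/net/dev)
    tx2=$(awk -v d="$dev:" '$1 == d {print $10}' /proc/net/dev)
    echo "$dev rx $(( (rx2 - rx1) * 8 / 1000000 )) Mbit/s," \
         "tx $(( (tx2 - tx1) * 8 / 1000000 )) Mbit/s"
}
iface_mbps lo    # substitute your 25GE interface, e.g. eno1
```

Run it during your busiest window; compare the result against 25000 Mbit/s per 
port to see how far from saturation you really are.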

> Then doing some type of link aggregation with the two 25GbE ports on the NDC 
> NIC. I do not know much about Link Aggregation/LACP, but I'm currently 
> studying it, as it seems to be a necessity in a production environment.

You aren’t doing bonding now?  By all means do it.  Unless the cluster has 
more than, say, 30 nodes across 4+ racks, I would counsel strongly bonding for 
resilience.  Larger clusters lose a smaller percentage of capacity when a host 
or rack is unavailable and can thus tolerate an unbonded link failure better.

I would suggest bonding those 25GE ports for a single public network, 
especially if you can have them fed by distinct TORs or aggregation switches.  
You want your network to keep running when you do maintenance on a single 
switch: firmware updates, configuration reloads, etc.  Bonding is about 
resilience.  With the 
right xmit_hash_policy you can enjoy more throughput too, but that itself 
shouldn’t be the primary motivation.
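As a concrete starting point, an 802.3ad (LACP) bond with a layer3+4 hash 
looks roughly like this under netplan; interface names and the address are 
placeholders, and the switch ports must be configured as a matching LACP 
port-channel:

```yaml
# /etc/netplan/01-bond.yaml (sketch -- adjust names and addressing to your site)
network:
  version: 2
  ethernets:
    eno1: {}
    eno2: {}
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      parameters:
        mode: 802.3ad                    # LACP
        lacp-rate: fast
        mii-monitor-interval: 100
        transmit-hash-policy: layer3+4   # hash on IP+port, spreads multiple flows
      addresses: [192.0.2.10/24]         # example address (TEST-NET-1)
```

After `netplan apply`, `cat /proc/net/bonding/bond0` shows whether LACP 
negotiation actually succeeded on both links.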


> 
> Thoughts?
> 
> 
> Regards, 
> Anthony Fecarotta
> Founder & President
> [email protected]
> 224-339-1182 | (855) 625-0300
> 1 Mid America Plz Flr 3, Oakbrook Terrace, IL 60181
> www.linehaul.ai
> On Sun May 18, 2025, 08:09 PM GMT, Anthony D'Atri 
> <[email protected]> wrote:
> My experience is that an IR HBA with FBWC and supercap can somewhat reduce 
> latency with slow media.  Wrapping each drive in a VD to enable WB caching, 
> though, is extra work and confounds drive metrics.
> 
> It’s also been my experience that cache modules / BBUs can be flaky, and 
> these things really need additional monitoring (hint: iDRAC isn’t enough).
> 
> If you don’t already have optional cache / BBU aka CV, I wouldn’t spend the $ 
> retrofitting, especially on systems from 6-7 years ago.  Put the $ toward 
> faster networking or NVMe-enabled systems.  The R640 / R740 can be had in 
> all-NVMe chassis and is an inexpensive way to break out of the SAS/SATA trap.
> 
> There are no fewer than 40 R740 chassis types, if you count risers.  This 
> document
> 
> https://dl.dell.com/manuals/common/dellemc-nvme-io-topologies-poweredge.pdf
> 
> gives you a taste.
> 
> 
> Assuming that these are H330mini or similar, and all SAS/SATA, I personally 
> would set the HBA mode / personality to JBOD / passthrough / HBA and act like 
> the RoC isn’t there.  Including boot drives if you don’t have BOSS.  ymmv.
> 
> Whatever you do, I advise using DSU to update firmware.  Old firmware on 
> these LSI / PERC / Avago / Broadcom HBAs can present significant issues.
> 
> On May 18, 2025, at 8:11 AM, Anthony Fecarotta <[email protected]> wrote:
> 
> Does running a RAID controller in HBA mode (not to be confused with IT mode) 
> impact Ceph performance compared to using a dedicated HBA card? Is there any 
> documentation or benchmarking data showing improved performance with true HBA 
> hardware?
> 
> For what it's worth my cluster is on Dell PowerEdge R740 machines.
> 
> Thank you for your insights.
> 
> 
> Regards,
> Anthony Fecarotta

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
