>
> Hi Anthony,
> We will be using Samsung SSD 870 QVO 8TB disks on
> all OSD servers.
Your choices are yours to make, but for what it’s worth, I would not use these.
* They are client-class, not designed for enterprise workloads or duty cycle
* Best I can tell this SKU lacks PLP (power loss protection), which can result
in corrupted or lost data
* QLC can be just smurfy for object storage workloads that are read-mostly, but
can be disappointing for RBD or small objects/files
* 3 year warranty instead of the 5 years typical for enterprise SKUs
* Slow writes once the SLC cache fills; these drives are designed for
intermittent desktop workloads, not sustained enterprise workloads.
* Rated endurance for a 4KB random write workload is ~0.33 DWPD over the 3
year warranty period, which normalized to the 5 year warranty period typical
of enterprise SKUs works out to ~0.20 DWPD (see the sketch below).
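To make the endurance arithmetic explicit, here is a minimal sketch. The 2880
TBW figure is what I believe the 8TB 870 QVO is rated for -- treat it as an
assumption and check the spec sheet for your exact SKU:

    # DWPD arithmetic for an 8 TB QLC SSD.
    # TBW is assumed from the published spec sheet -- verify for your SKU.
    TBW = 2880           # rated terabytes written (assumed)
    CAPACITY_TB = 8      # drive capacity in TB

    def dwpd(tbw: float, capacity_tb: float, warranty_years: float) -> float:
        """Drive writes per day sustainable over the warranty period."""
        return tbw / (capacity_tb * 365 * warranty_years)

    print(f"Over the 3-year warranty: {dwpd(TBW, CAPACITY_TB, 3):.2f} DWPD")  # ~0.33
    print(f"Normalized to 5 years:    {dwpd(TBW, CAPACITY_TB, 5):.2f} DWPD")  # ~0.20
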
If you expect a low write workload and have VERY limited performance
expectations, maybe they’d work for you, but especially don’t think you can
safely do replication size=2 or EC 2+1 / 3+1. A few months ago someone in the
community sent me money unsolicited, *begging* me to make their cluster of
these drives faster. There was nothing I could do short of recommending that
they be replaced with a more appropriate SKU.
> One more thing: I want to know whether CephFS supports mounting with
> FsCache on clients?
I find some references on the net to people doing this, but have zero
experience with it.
> 500T data stored in the cluster will be accessed by
> the jobs running on the clients nodes and we need super fast read
> performance.
Client-class media are incompatible with super fast anything. I don’t recall
you mentioning the network — bonded 10GE at least?
> For that we do have an additional cache disk installed on all the
> client nodes. And the way NFSv4 supports mounting an NFS share with FsCache
> on client hosts, CephFS also supports that.
You would do better to invest in enterprise cluster tech than in band-aids that
may or may not work well.
{Good,Fast,Cheap} Pick Any Two.
Trite but so often true.
>
> On those 4x non-OSD nodes, I will probably run ldap and HTCondor service.
> But mds node will not be used for anything other than mds daemon.
>
> Thanks,
> Gagan
>
>
>
> On Fri, Apr 11, 2025 at 8:45 PM Anthony D'Atri <[email protected]>
> wrote:
>
>>
>>
>>> On Apr 11, 2025, at 4:04 AM, gagan tiwari
>>> <[email protected]> wrote:
>>>
>>> Hi Anthony,
>>> Thanks for the reply!
>>>
>>> We will be using CephFS to access Ceph Storage from clients. So, this
>>> will need MDS daemon also.
>>
>> MDS is single-threaded, so unlike most Ceph daemons it benefits more from
>> a high-frequency CPU than from core count.
>>
>>> So, based on your advice, I am thinking of having 4 Dell PowerEdge
>>> servers. 3 of them will run 3 Monitor daemons and one of them will run
>>> MDS daemon.
>>>
>>> These Dell Servers will have following hardware :-
>>>
>>> 1. 4 cores ( 8 threads ) ( Can go for 8 core and 16 threads )
>>>
>>> 2. 64G RAM
>>>
>>> 3. 2x4T Samsung SSD with RAID 1 to install OS and run monitor and
>>> metadata services.
>>
>> That probably suffices for a small cluster. Are those Samsungs
>> enterprise?
>>
>>
>>> OSD nodes will be upgraded to have 32 cores ( 64 threads ). Disk and RAM
>>> will remain same ( 128G and 22X8T Samsung SSD )
>>
>> Which Samsung SSD? Using client SKUs for OSDs has a way of leading to
>> heartbreak.
>>
>> 64 threads would be better for a 22x OSD node, though still a bit light.
>> Are these SATA or NVMe?
>>
>>> Actually, I want to use the OSD nodes to run OSD daemons and not any
>>> other daemons, which is why I am thinking of having 4 additional Dell
>>> servers as mentioned above.
>>
>> Colocation of daemons is common these days, especially with smaller
>> clusters.
>>
>>>
>>> Please advise if this plan will be better.
>>
>> That’ll work, but unless you already have those quite-modest 4x non-OSD
>> nodes sitting around idle, you might consider just going with the OSD nodes
>> and bumping the CPU again so you can colocate all the daemons.
>>
>>>
>>> Thanks,
>>> Gagan
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 9, 2025 at 8:12 PM Anthony D'Atri <[email protected]>
>>> wrote:
>>>
>>>>
>>>>>
>>>>> We would start deploying Ceph with 4 hosts ( HP Proliant servers ) each
>>>>> running RockyLinux 9.
>>>>>
>>>>> One of the hosts called ceph-adm will be smaller one and will have
>>>>> following hardware :-
>>>>>
>>>>> 2x4T SSD with raid 1 to install OS on.
>>>>>
>>>>> 8 Core with 3600MHz freq.
>>>>>
>>>>> 64G RAM
>>>>>
>>>>> We are planning to run all Ceph daemons except the OSD daemon, like
>>>>> monitor, metadata, etc. on this host.
>>>>
>>>> 8 core == 16 threads? Are you provisioning this node because you have it
>>>> lying around idle?
>>>>
>>>> Note that you will want *at least* 3 Monitor (mon) daemons, which
>>>> must be on different nodes. 5 is better, but at least 3. You’ll also
>>>> have Grafana, Prometheus, and MDS (if you’re going with CephFS vs using
>>>> S3 object storage or RBD block).
>>>>
>>>> 8c is likely on the light side for all of that. You would also benefit
>>>> from not having that node be a single point of failure. I would suggest,
>>>> if you can, raising this node to the spec of the planned 3x OSD nodes so
>>>> you have 4x equivalent nodes, and spreading the non-OSD daemons across
>>>> them.
>>>>
>>>> Note also that your OSD nodes will also have node_exporter, crash, and
>>>> other boilerplate daemons.
>>>>
>>>>
>>>>> We will have 3 hosts to run OSD which will store actual data.
>>>>>
>>>>> Each OSD host will have following hardware
>>>>>
>>>>> 2x4T SSD with raid 1 to install OS on.
>>>>>
>>>>> 22X8T SSD to store data ( OSDs ) ( without partition ). We will use the
>>>>> entire disk without partitions.
>>>>
>>>> SAS, SATA, or NVMe SSDs? Which specific model? You really want to avoid
>>>> client (desktop) models for Ceph, but you likely do not need to pay for
>>>> higher-endurance mixed-use SKUs.
>>>>
>>>>> Each OSD host will have 128G RAM ( No swap space )
>>>>
>>>> Thank you for skipping swap. Some people are really stuck in the past in
>>>> that regard.
>>>>
>>>>> Each OSD host will have 16 cores.
>>>>
>>>> So 32 threads total? That is very light for 22 OSDs + other daemons. For
>>>> HDD OSDs a common rule of thumb is at minimum 2 threads per OSD; for
>>>> SAS/SATA SSDs, 4; for NVMe SSDs, 6. Plus margin for the OS and other
>>>> processes.
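Adding a quick sketch here to put numbers on that rule of thumb; the per-OSD
figures and the 8-thread reserve for the OS and boilerplate daemons are my
assumptions, not hard limits:

    # Rough CPU-thread budget for one OSD node, per the rule of thumb above.
    # Per-OSD figures and the OS reserve are assumptions, not hard limits.
    THREADS_PER_OSD = {"hdd": 2, "sata_ssd": 4, "nvme_ssd": 6}

    def threads_needed(num_osds: int, media: str, os_reserve: int = 8) -> int:
        """Suggested minimum hardware threads for a node hosting num_osds OSDs."""
        return num_osds * THREADS_PER_OSD[media] + os_reserve

    print(threads_needed(22, "sata_ssd"))  # 96 suggested vs. the 32 proposed
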
>>>>
>>>>> All 4 hosts will connect to each via 10G nic.
>>>>
>>>> Two ports with bonding? Redundant switches?
>>>>
>>>>> The 500T data
>>>>
>>>> The specs you list above include 528 TB of *raw* space. Be advised that
>>>> with three OSD nodes, you will necessarily be doing replication; for
>>>> safety, replication with size=3. Taking into consideration TB vs TiB and
>>>> headroom, you’re looking at ~133 TiB of usable space. You could go with
>>>> size=2 for roughly half again as much usable space (~200 TiB), but at
>>>> increased risk of data unavailability or loss when drives/hosts fail or
>>>> reboot.
>>>>
>>>> With at least 4 OSD nodes, even if they aren’t fully populated with
>>>> capacity drives, you could do EC for a more favorable raw:usable ratio,
>>>> at the expense of slower writes and recovery. With 4 nodes you could in
>>>> theory do EC 2+2 for ~200 TiB of usable space, with 5 you could do 3+2
>>>> for ~240 TiB usable, etc.
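A quick sketch of that capacity arithmetic, for reference; the ~0.83 headroom
factor (margin below the full ratio) is my assumption:

    # Usable-capacity arithmetic for 3 nodes x 22 x 8 TB drives.
    RAW_TB = 3 * 22 * 8                # 528 TB raw
    RAW_TIB = RAW_TB * 1e12 / 2**40    # ~480 TiB raw
    HEADROOM = 0.83                    # assumed margin below the full ratio

    def usable_tib(data_chunks: int, total_chunks: int) -> float:
        """Usable TiB for replication (1 of size) or EC (k of k+m)."""
        return RAW_TIB * data_chunks / total_chunks * HEADROOM

    print(f"size=3 : {usable_tib(1, 3):.0f} TiB")   # ~133 TiB
    print(f"size=2 : {usable_tib(1, 2):.0f} TiB")   # ~199 TiB
    print(f"EC 2+2 : {usable_tib(2, 4):.0f} TiB")   # ~199 TiB
    print(f"EC 3+2 : {usable_tib(3, 5):.0f} TiB")   # ~239 TiB
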
>>>>
>>>>> will be accessed by the clients. We need to have
>>>>> read performance as fast as possible.
>>>>
>>>> Hope your SSDs are enterprise NVMe.
>>>>
>>>>> We can't afford data loss and downtime.
>>>>
>>>> Then no size=2 for you.
>>>>
>>>>> So, we want to have a Ceph
>>>>> deployment which serves our purpose.
>>>>>
>>>>> So, please advise me if the plan that I have designed will serve our
>>>>> purpose.
>>>>> Or is there a better way , please advise that.
>>>>>
>>>>> Thanks,
>>>>> Gagan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> We have an HP storage server with 12 SSDs of 5T each and have set up
>>>>> hardware RAID6 on these disks.
>>>>>
>>>>> HP storage server has 64G RAM and 18 cores.
>>>>>
>>>>> So, please advise how I should go about setting up Ceph on it to have
>>>>> the best read performance. We need the fastest read performance.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Gagan
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]