I think the fundamental underlying question is how SolrCloud is run. I’m under the impression that most deployments of SolrCloud tend to use all collections/shards all the time, in which case unloading cores is not overly useful.
The use case for transient cores is when different collections, or different shards of a collection, have different usage patterns and might spend relatively long periods without being used, in which case unloading them from memory makes sense. This is the case in a multi-tenant hosted environment such as the one Salesforce runs on top of SolrCloud (with ZERO replicas).

Are there other similar use cases for SolrCloud in the industry?

Ilan

On Tue 10 Sep 2024 at 18:15, Pierre Salagnac <pierre.salag...@gmail.com> wrote:

> Starting a thread to discuss transient core support in SolrCloud, and
> mostly to figure out if Solr users would be interested in it.
>
> Transient cores allow a Solr node to not keep all cores in memory.
> Basically, a given core may be dynamically loaded (if not already loaded)
> to answer a request, and then unloaded later to free up some memory for
> another core. This is a memory saver at the cost of a higher CPU
> consumption. Depending on the cluster usage pattern, it may be very useful
> or counter productive. A cluster with many cores/collections that are not
> updated or queried concurrently will perform much better with transient
> cores and appropriate tuning (let say we don't handle same data during the
> day from during the night)
>
> That's a quite old feature, but as far as I know, it never worked with
> SolrCloud (worked only in standalone mode).
> This feature has been deprecated with SOLR-16591. My understanding is this
> was mostly because of the lack of support in cloud mode, as most users now
> run SolrCloud.
>
>
> I've recently worked in our internal fork to make transient cores work with
> SolrCloud with internal implementation of ZERO/SIP-20 replicas.[1] This for
> sure takes some shortcuts since ZERO replicas don't support all Solr
> features. Even if this does not run in production at scale yet, I reached a
> point where I'm confident that [transient cores] + [ZERO replicas] can
> work.
> Transposing this work to NRT/TLOG/PULL replicas, I see only one pain point:
> recovery is supposed to happen when we open the core. By skipping the core
> opening at start-up, we also skip recovering/replicating cores from peers.
> And by re-opening a core later, not sure how to make sure the replication
> does not interact wrongly.
>
> Beside this last point that I don't know how to address right now, I don't
> see any blocker in extending the logic from standalone to SolrCloud.
>
>
> Now, I want to ask whether other people are interested in transient cores.
> If yes, I can start by contributing the changes that make sense without
> SIP-20, with the long term goal to un-deprecate this feature eventually.
> If not, I'll just let the feature die.
>
> Thanks
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/SOLR/SIP-20%3A+Separation+of+Compute+and+Storage+in+SolrCloud
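
For anyone following the thread who has not used transient cores, here is a rough sketch of the load-on-demand / unload-when-full behaviour described in the quoted message. It is only an illustration, not Solr's actual implementation: the class name TransientCoreCacheSketch, the loader function and the maxLoaded limit are all made up for this example, and the "core" is just a generic object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical illustration only; NOT Solr's transient core cache.
 * Keeps at most maxLoaded "cores" in memory and unloads the least
 * recently used one when a new core has to be loaded.
 */
class TransientCoreCacheSketch<C> {
    private final int maxLoaded;                 // assumed tuning knob
    private final Function<String, C> loader;    // assumed "open the core" hook
    private final List<String> evicted = new ArrayList<>();
    private final LinkedHashMap<String, C> loaded;

    TransientCoreCacheSketch(int maxLoaded, Function<String, C> loader) {
        this.maxLoaded = maxLoaded;
        this.loader = loader;
        // accessOrder=true makes iteration order least-recently-used first.
        this.loaded = new LinkedHashMap<String, C>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, C> eldest) {
                if (size() > TransientCoreCacheSketch.this.maxLoaded) {
                    // A real system would close the core here to free memory.
                    evicted.add(eldest.getKey());
                    return true;
                }
                return false;
            }
        };
    }

    /** Returns the core, loading it first if it is not in memory. */
    synchronized C getOrLoad(String coreName) {
        C core = loaded.get(coreName);           // get() refreshes the LRU position
        if (core == null) {
            core = loader.apply(coreName);       // the CPU cost of (re)loading
            loaded.put(coreName, core);          // may unload the eldest core
        }
        return core;
    }

    synchronized boolean isLoaded(String coreName) {
        return loaded.containsKey(coreName);     // does not touch the LRU order
    }

    synchronized List<String> evictedNames() {
        return new ArrayList<>(evicted);
    }
}
```

With maxLoaded set to 2, requesting cores "a", "b", "a" and then "c" unloads "b", since "a" was used more recently. The trade-off mentioned in the quoted message shows up in getOrLoad: every miss pays the loading cost again, so this only helps when few cores are in use at the same time.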