I've decided to take the approach of waiting for the expected number of
nodes to become available before initializing the collection. Here is the
script I am using:

https://github.com/apache/incubator-sdap-nexus/blob/91b15ce0b123d652eaa1f5eb589a835ae3e77ceb/docker/solr/cloud-init/create-collection.py

This script will be deployed (using Kubernetes) alongside every Solr node
and started at the same time as Solr. I use a lock in ZooKeeper to
ensure that only one node ever attempts to create the collection.
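The core of the approach looks roughly like this (a simplified sketch
using the kazoo client; the hosts, collection name, counts, and lock path
are placeholders, and the actual script linked above differs in its
details and error handling):

import time

import requests
from kazoo.client import KazooClient

ZK_HOST = "zk-hs:2181"                    # placeholder ZooKeeper connection string
SOLR_URL = "http://localhost:8983/solr"   # placeholder: this pod's Solr node
COLLECTION = "mycollection"               # placeholder collection name
EXPECTED_NODES = 4                        # placeholder: how many nodes to wait for

zk = KazooClient(hosts=ZK_HOST)
zk.start()

# Block until every expected Solr node has registered itself under /live_nodes.
while len(zk.get_children("/live_nodes")) < EXPECTED_NODES:
    time.sleep(5)

# ZooKeeper-backed lock so that only one node ever issues the CREATE call.
with zk.Lock("/create-collection-lock"):
    if not zk.exists("/collections/" + COLLECTION):
        resp = requests.get(SOLR_URL + "/admin/collections", params={
            "action": "CREATE",
            "name": COLLECTION,
            "numShards": 2,               # placeholder shard/replica counts
            "replicationFactor": 2,
        })
        resp.raise_for_status()

zk.stop()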

I still think this could be done without any actual nodes running, so that
the collection is ready as soon as the cluster starts, but this approach
fits my purpose for now.

- Frank

On Wed, Jan 9, 2019 at 7:22 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> First, for a given data set, I can easily double or halve
> the size of the index on disk depending on what options
> I choose for my fields; things like how many times I may
> need to copy fields to support various use-cases,
> whether I need to store the input for some, all, or no
> fields, whether I enable docValues, whether I need to
> support phrase queries, and on and on....
>
> Even assuming you can estimate the eventual size,
> it doesn't help much. As one example, if you choose
> stored="true", the index size will grow by roughly 50% of
> the raw data size. But that data doesn't really affect
> searching much, since it doesn't need to be
> RAM-resident in the same way your terms data needs
> to be. So in order to be performant, I may need anywhere
> from a fraction of the raw index size on disk to multiples
> of the index size on disk in terms of RAM.
>
> So you see where this is going. I'm not against your
> suggestion, but I have strong doubts as to its
> feasibility given all the variables I've seen. We can revisit
> this after you've had a chance to kick the tires; I suspect
> we'll have more shared context on which to base
> the discussion.
>
> Best,
> Erick
>
> On Wed, Jan 9, 2019 at 5:12 PM Frank Greguska <fg...@apache.org> wrote:
> >
> > Thanks. I am no Solr expert, so I may be over-simplifying things a bit
> > in my ignorance.
> >
> > "No. The replicas are in a "down" state the Solr instances are brought
> back
> > up" Why can't I dictate (at least initially) the "up" state somehow? It
> > seems Solr keeps track of where replicas were deployed so that the
> cluster
> > 'heals' itself when all nodes are back. At deployment, I know which nodes
> > should be available so the collection could be unavailable until all
> > expected nodes are up.
> >
> > Thank you for the pointer to the createNodeSet parameter; that might
> > prove useful.
> >
> > "I think the thing I'm getting stuck on is how in the world the
> > Solr code could know enough to "do the right thing". How many
> > docs do you have? How big are they? How much do you expect
> > to grow? What kinds of searches do you want to support?"
> >
> > Solr can't know these things. But I, as the deployer/developer, might.
> > For example, say I know my initial data size and that the index will be
> > 10 TB. If I have 2 nodes with 5 TB disks each, then I must have 2 shards
> > because the index won't fit on one node. If instead I have 4 nodes with
> > 5 TB disks, I could still have 2 shards but with replicas, or I could
> > choose no replicas but more shards. This is what I mean by the
> > shard/replica decision being partially dependent on available hardware:
> > there are some decisions I could make knowing my planned deployment, so
> > that when I start the cluster it is immediately functional, rather than
> > first starting the cluster, then creating the collection, then making it
> > available.
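> >
> > As a back-of-the-envelope sketch of that arithmetic (using the numbers
> > from the example above):
> >
> > import math
> >
> > index_tb, disk_tb, nodes = 10, 5, 4          # 10 TB index, 5 TB disks, 4 nodes
> > min_shards = math.ceil(index_tb / disk_tb)   # 2: the index must be split to fit
> > max_replication = nodes // min_shards        # 2: copies per shard the hardware allows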
> >
> > You may be right that it is a small and complicated concern, because I
> > really only need to care about it once, when I am first deploying my
> > cluster. But everyone who needs to stand up a SolrCloud cluster needs to
> > do it. My guess is most people either do it manually as a one-time
> > operations task or write a custom script to do it automatically, as I am
> > attempting. That seems like a good candidate for a new feature.
> >
> > - Frank
> >
> > On Wed, Jan 9, 2019 at 4:18 PM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > bq.  do all 100 replicas move to the one remaining node?
> > >
> > > No. The replicas are in a "down" state until the Solr instances
> > > are brought back up (I'm skipping autoscaling here, but
> > > even that wouldn't move all the replicas to the one remaining
> > > node).
> > >
> > > bq.  what the collection *should* look like based on the
> > > hardware I am deploying to.
> > >
> > > With the caveat that the Solr instances have to be up, this
> > > is entirely possible. First of all, you can provide a "createNodeSet"
> > > to the create command to specify exactly what Solr nodes you
> > > want used for your collection. There's a special "EMPTY"
> > > value that _almost_ does what you want: it creates
> > > no replicas, just the configuration in ZooKeeper. Thereafter,
> > > though, you have to ADDREPLICA (which you can do with the
> > > "node" parameter) to place each replica exactly where you want.
> > >
> > > bq. how many shards are at least partially dependent on the
> > > available hardware
> > >
> > > Not if you're using compositeID routing. The number of shards
> > > is fixed at creation time, although you can split them later.
> > >
> > > I don't think you can use bin/solr create_collection with the
> > > EMPTY createNodeSet, so you need at least one
> > > Solr node running to create your skeleton collection.
> > >
> > > I think the thing I'm getting stuck on is how in the world the
> > > Solr code could know enough to "do the right thing". How many
> > > docs do you have? How big are they? How much do you expect
> > > to grow? What kinds of searches do you want to support?
> > >
> > > But more power to you if you can figure out how to support the kind
> > > of thing you want. Personally I think it's harder than you might
> > > think and not broadly useful. I've been wrong more times than I like
> > > to recall, so maybe you have an approach that would get around
> > > the tigers hiding in the grass that I think are out there...
> > >
> > > Best,
> > > Erick
> > >
> > >
> > > On Wed, Jan 9, 2019 at 3:04 PM Frank Greguska <fg...@apache.org>
> wrote:
> > > >
> > > > Thanks for the response. You do raise good points.
> > > >
> > > > Say I reverse your example: I have a 10-node cluster with a 10-shard
> > > > collection and a replication factor of 10. Now I kill 9 of my nodes;
> > > > do all 100 replicas move to the one remaining node? I believe the
> > > > answer is: well, that depends on the configuration.
> > > >
> > > > I'm thinking about it from the initial cluster-planning side of things.
> > > > The decisions about auto-scaling, how many replicas, and even how many
> > > > shards are at least partially dependent on the available hardware. So
> > > > at deployment time I would expect there to be a way of defining what
> > > > the collection *should* look like based on the hardware I am deploying
> > > > to. Obviously this could change at runtime, and I may need to add
> > > > nodes, split shards, etc.
> > > >
> > > > As it is now, it seems I need to deploy my cluster, then write a
> > > > custom script to ensure each node I expect to be there is running,
> > > > and only then create my collection with the desired shards and
> > > > replication.
> > > >
> > > > - Frank
> > > >
> > > > On Wed, Jan 9, 2019 at 2:14 PM Erick Erickson <
> erickerick...@gmail.com>
> > > > wrote:
> > > >
> > > > > How would you envision that working? When would the
> > > > > replicas actually be created and under what heuristics?
> > > > >
> > > > > Imagine this is possible, and there are a bunch of
> > > > > placeholders in ZK for a 10-shard collection with
> > > > > a replication factor of 10 (100 replicas all told). Now
> > > > > I bring up a single Solr instance. Should all 100 replicas
> > > > > be created immediately? Wait for N Solr nodes to be
> > > > > brought online? On some command?
> > > > >
> > > > > My gut feel is that this would be fraught with problems
> > > > > and not very valuable to many people. If you could create
> > > > > the "template" in ZK without any replicas actually being created,
> > > > > then at some later point say "make it so", I don't see the
> > > > > advantage over just the current setup. And I do think it would
> > > > > take considerable effort.
> > > > >
> > > > > Net-net is I'd like to see a much stronger justification
> > > > > before anyone embarks on something like this. First, as
> > > > > I mentioned above, I think it'd be a lot of effort; second, I
> > > > > virtually guarantee it'd introduce significant bugs. How
> > > > > would it interact with autoscaling, for instance?
> > > > >
> > > > > Best,
> > > > > Erick
> > > > >
> > > > > On Wed, Jan 9, 2019 at 9:59 AM Frank Greguska <fg...@apache.org>
> > > wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I am trying to bootstrap a SolrCloud installation and I ran into
> > > > > > an issue that seems rather odd. I see it is possible to bootstrap
> > > > > > a configuration set from an existing SOLR_HOME using
> > > > > >
> > > > > > ./server/scripts/cloud-scripts/zkcli.sh -zkhost ${ZK_HOST} \
> > > > > >     -cmd bootstrap -solrhome ${SOLR_HOME}
> > > > > >
> > > > > > but this does not create a collection; it just uploads a
> > > > > > configuration set.
> > > > > >
> > > > > > Furthermore, I cannot use
> > > > > >
> > > > > > bin/solr create
> > > > > >
> > > > > > to create a collection and link it to my bootstrapped configuration
> > > > > > set, because it requires Solr to already be running.
> > > > > >
> > > > > > I'm hoping someone can shed some light on why this is the case.
> > > > > > It seems like a collection is just some znodes stored in ZooKeeper
> > > > > > that contain configuration settings and such. Why should I not be
> > > > > > able to create those nodes before Solr is running?
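> > > > > >
> > > > > > Browsing ZooKeeper after creating a collection, what I see is
> > > > > > roughly (simplified):
> > > > > >
> > > > > > /configs/<configName>/...        the uploaded configuration set
> > > > > > /collections/<name>/state.json   shard and replica layout
> > > > > > /collections/<name>/...          leader election, terms, etc.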
> > > > > >
> > > > > > I'd like to open a feature request for this if one does not
> > > > > > already exist, and if I am not missing something obvious.
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > Frank Greguska
> > > > >
> > >
>
