A couple of additions:

I had a system that indexed log files. I created a new collection each day
(some 20m log events/day), plus collection aliases called today, week and
month that aggregated the relevant daily collections. That way, accessing
the “today” alias would always get you to the right place, and I could
unload, or delete, collections over a certain age.
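
In case it's useful, the alias step is just the Collections API CREATEALIAS
call, re-run each day against the relevant collections (the names below are
only illustrative):

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=today&collections=logs_20150402
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=week&collections=logs_20150402,logs_20150401,...

Re-issuing CREATEALIAS with the same name simply repoints the alias at the
new set of collections.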

Second thing - some months ago, I created a pull request against pysolr
that added ZooKeeper support. Please try it, use it, and comment on the
PR, as it hasn’t been merged yet - I’m keen to get feedback on whether it
works for you. When I tested it, it happily noticed a node going down and
redirected traffic to another host within 200ms, completely transparently.
I will likely start using it in a project of my own in the next few weeks.
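
If you want to try it, usage is roughly along these lines (check the PR
itself for the exact class names and arguments - treat this as a sketch):

    from pysolr import SolrCloud, ZooKeeper

    # Point the client at the ZooKeeper ensemble instead of a single Solr node
    zk = ZooKeeper("zkhost1:2181,zkhost2:2181,zkhost3:2181")
    solr = SolrCloud(zk, "mycollection")

    results = solr.search("*:*")

The client watches the cluster state in ZooKeeper, so when a node drops out
it simply stops sending requests to it.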

Upayavira

On Thu, Apr 2, 2015, at 09:00 PM, Erick Erickson wrote:
> See inline:
> 
> On Thu, Apr 2, 2015 at 12:36 PM, Ben Hsu <ben....@criticalmedia.com>
> wrote:
> > Hello
> >
> > I am playing with Solr 5 right now, to see if its cloud features can replace
> > what we have with Solr 3.6, and I have some questions, some newbie and some
> > not so newbie
> >
> > Background: the documents we are putting in Solr have a date field. The
> > majority of our searches are restricted to documents created within the
> > last week, but searches do go back 60 days. Documents older than 60 days
> > are removed from the repo. We also want high availability in case a machine
> > becomes unavailable.
> >
> > Our current method, using Solr 3.6, is to split the data into 1-day chunks;
> > within each day the data is split into several shards, and each shard has 2
> > replicas. Our code generates the list of cores to be queried based on the
> > time range in the query. Cores that fall off the 60-day range are deleted
> > through Solr's RESTful API.
> >
> > This all sounds a lot like what Solr Cloud provides, so I started looking
> > at Solr Cloud's features.
> >
> > My newbie questions:
> >
> >  - it looks like the way to write a document is to pick a node (possibly
> > using a LB), send it to that node, and let Solr figure out which node that
> > document is supposed to go to. Is this the recommended way?
> 
> [EOE] That's totally fine. If you're using SolrJ, a better way is to use
> CloudSolrClient, which sends the docs to the proper leader, thus saving one
> hop.
> 
> >  - similarly, can I just randomly pick a core (using the demo example:
> > http://localhost:7575/solr/#/gettingstarted_shard1_replica2/query ), query
> > it, and let it scatter out the queries to the appropriate cores, and send
> > me the results back? Will it give me back results from all the shards?
> 
> [EOE] Yes. Actually, you don't even have to pick a core, just a collection.
> The # is totally unneeded, it's just part of navigating around the UI. So this
> should work:
> http://localhost:7575/solr/gettingstarted/query?q=*:*
> 
> >  - is there a recommended Python library?
> [EOE] Unsure. If you do find one, check that it has the equivalent of the
> CloudSolrClient support (i.e. it is ZooKeeper-aware), as I expect that would
> take the most effort.
> 
> >
> > My hopefully less newbie questions:
> >  - does Solr auto-detect when a node becomes unavailable, and stop sending
> > queries to it?
> 
> [EOE] Yes, that's what Zookeeper is all about. As each Solr node comes up, it
> registers itself as a listener for collection state changes. ZK detects a node
> dying and notifies all the remaining nodes that nodeX is out of commission,
> and they adjust accordingly.
> 
> >  - when the master node dies and the cluster elects a new master, what
> > happens to writes?
> [EOE] Stop thinking master/slave! It's "leaders" and "replicas" (although I'm
> trying to use "leaders" and "followers"). The critical bit is that on an
> update, the raw document is forwarded from the leader to all followers, so
> they can come and go. You simply cannot rely on a particular node that is a
> leader remaining the leader. For instance, if you bring up your nodes in a
> different order tomorrow, the leaders and followers won’t be the same.
> 
> 
> >  - what happens when a node is unavailable?
> [EOE] SolrCloud "does the right thing" and keeps on chugging. See the comments
> about auto-detect above. The exception is that if _all_ the nodes hosting a
> shard go down, you cannot add to the index, and queries will fail unless you
> set shards.tolerant=true.
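> 
> A query that tolerates missing shards would look something like this (using
> the gettingstarted collection from the demo):
> 
> http://localhost:7575/solr/gettingstarted/query?q=*:*&shards.tolerant=true
> 
> You get back whatever the live shards can answer instead of an error.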
> 
> >  - what is the procedure when a shard becomes too big for one machine, and
> > needs to be split?
> [EOE] There is the Collections API SPLITSHARD command you can use. This means
> that you increase by powers of two, though; there's no such thing as adding,
> say, one new shard to a 4-shard cluster.
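> 
> As a rough sketch (collection and shard names are just the demo's), this
> splits shard1 of the gettingstarted collection into two sub-shards:
> 
> http://localhost:7575/solr/admin/collections?action=SPLITSHARD&collection=gettingstarted&shard=shard1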
> 
> You can also reindex from scratch.
> 
> You can also "overshard" when you initially create your collection and host
> multiple shards and/or replicas on a single machine, then physically move them
> when the aggregate size exceeds your boundaries.
> 
> >  - what is the procedure when we lose a machine and the node needs replacing?
> [EOE] Use the Collections API DELETEREPLICA command on the replicas on the dead
> node. Then use the Collections API ADDREPLICA command on new machines.
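> 
> A rough illustration (the shard, replica and host names here are made up; the
> real ones can be read from the Cloud screen or clusterstate.json):
> 
> ...solr/admin/collections?action=DELETEREPLICA&collection=gettingstarted&shard=shard1&replica=core_node3
> ...solr/admin/collections?action=ADDREPLICA&collection=gettingstarted&shard=shard1&node=newhost:8983_solr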
> 
> >  - how would we quickly bulk delete data within a date range?
> [EOE]
> ...solr/update?commit=true&stream.body=<delete><query>date_field:[DATE1 TO DATE2]</query></delete>
> 
> You can take explicit control of where your docs go by various routing
> schemes. The default is to route based on a hash of the id field, but if you
> choose, you can route all docs based on the value of a field (_route_) or
> based on the first part of the unique key with the bang (!) operator.
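> 
> For instance, with the default compositeId router, giving a log event an id
> like 20150402!event-12345 sends every document whose id starts with
> "20150402!" to the same shard (the prefix here is only an illustration).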
> 
> Do note, though, that one of the consequences of putting all of a
> day's data on a single shard (or subset of shards) is that you
> concentrate all your searching on those machines, and the other ones
> can be idle. At times you can get better throughput by just letting
> the docs be distributed randomly. That's what I'd start with
> anyway.....
> 
> Best,
> Erick
