You said that the data expires, but you haven't said
how many docs you need to host at a time. At 10M/second
inserts you'll... need a boatload of shards. All of the
conversation about one beefy machine vs. lots of not-so-beefy
machines should wait until you answer that question. For
instance, indexing some _very_ simple documents on my
laptop can hit 10,000 docs/second/shard. So you're talking
1,000 shards here. Indexing more complex docs I might
get 1,000 docs/second/shard, so then you'd need 10,000
shards. Don't take these as hard numbers; I'm
just trying to emphasize that you'll need to do scaling
exercises to see if what you want to do is reasonable given
your constraints.
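
If it helps, here's the back-of-envelope version of that arithmetic as a
sketch in Java. The per-shard rates are the illustrative numbers above, not
measurements from your documents or hardware:

// Back-of-envelope shard estimate. The per-shard rates are the illustrative
// laptop numbers above, NOT measurements from your setup.
public class ShardEstimate {
    public static void main(String[] args) {
        long targetDocsPerSec = 10_000_000L;         // stated ingest target

        long simpleDocsPerShardPerSec = 10_000L;     // very simple docs
        long complexDocsPerShardPerSec = 1_000L;     // more complex docs

        System.out.println("Shards (simple docs):  "
            + targetDocsPerSec / simpleDocsPerShardPerSec);   // 1,000
        System.out.println("Shards (complex docs): "
            + targetDocsPerSec / complexDocsPerShardPerSec);  // 10,000
    }
}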

If those 10M docs/second are bursty and you can stand some
latency, then that's one set of considerations. If it's steady-state
it's another. In either case you need some _serious_ design work
before you go forward.

And then you want to facet (fq clauses aren't nearly as expensive)
and want 2-second commit intervals.
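
A 2-second visibility target normally means soft commits (autoSoftCommit
maxTime=2000 in solrconfig.xml) rather than explicit commits per request; the
SolrJ-side equivalent is commitWithin. A minimal sketch, assuming SolrJ 6.x
with placeholder ZooKeeper hosts and collection name:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TwoSecondVisibility {
    public static void main(String[] args) throws Exception {
        // ZK hosts and the collection name are placeholders.
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                .build()) {
            client.setDefaultCollection("events");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("type_s", "example");

            // Ask Solr to make this doc searchable within ~2 seconds instead
            // of committing explicitly on every request.
            client.add(doc, 2000);
        }
    }
}

Hard commits for durability are a separate knob (autoCommit with
openSearcher=false is the usual pattern), and frequent soft commits are not
free at this indexing rate, so that's another thing to measure.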

You _really_ need to stand up some test systems and see what
performance you can get before launching off on trying to do the
whole thing. Fortunately, you can stand up, say, a 4 shard system
and tune it and drive it as fast as you possibly can and extrapolate
from there.
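
Something like the sketch below, assuming SolrJ 6.x; the collection/config
names, field names and batch size are placeholders, and a real test would use
several indexing threads/clients and documents that look like yours:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.common.SolrInputDocument;

public class FourShardPoc {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                .build()) {

            // 4 shards, 1 replica each; the "myconfig" configset must already
            // have been uploaded to ZooKeeper.
            CollectionAdminRequest.createCollection("perf_test", "myconfig", 4, 1)
                .process(client);
            client.setDefaultCollection("perf_test");

            int batchSize = 10_000;
            long indexed = 0;
            long start = System.nanoTime();

            while (indexed < 10_000_000L) {          // drive it as hard as you can
                List<SolrInputDocument> batch = new ArrayList<>(batchSize);
                for (int i = 0; i < batchSize; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", UUID.randomUUID().toString());
                    doc.addField("payload_s", "synthetic"); // use realistic docs!
                    batch.add(doc);
                }
                client.add(batch, 2000);             // 2 s commitWithin, as above
                indexed += batchSize;
            }

            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("%,d docs in %.1f s = %,.0f docs/s on 4 shards%n",
                indexed, secs, indexed / secs);
        }
    }
}

Whatever number that prints is the thing to extrapolate from, and the
extrapolation is only as good as how realistic the documents and query load
are.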

But to reiterate. This is a very high indexing rate that very few
organizations have attempted. You _really_ need to do a
proof-of-concept _then_ plan.

Here's the long form of this argument:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Fri, Dec 16, 2016 at 5:19 AM, GW <thegeofo...@gmail.com> wrote:
> Layer 2 bridge SAN is just for my Apache/apps on Conga so they can be spun
> up on any host with a static IP. This has nothing to do with Solr, which is
> running on plain old hardware.
>
> SolrCloud is on a real cluster, not on a SAN.
>
> The bit about dead with no error: I got this from a post I made asking
> about the best way to deploy apps. I was shown some code on making your app
> ZooKeeper aware. I am just getting to this, so I'm talking from my ass. A ZK
> aware program will have a list of nodes ready for business versus a plain
> old round robin. If data on a machine is corrupted you can get 0 docs found,
> while a ZK aware app will know that node is shite.
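
In SolrJ terms, "ZK aware" mostly comes down to using CloudSolrClient: it
reads the cluster state from ZooKeeper and only routes requests to replicas
ZK considers live, whereas a plain round robin over fixed node URLs has no
idea a node is sick. A minimal sketch with placeholder ZK hosts and
collection name:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ZkAwareQuery {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                .build()) {
            client.setDefaultCollection("mycollection");
            QueryResponse rsp = client.query(new SolrQuery("*:*"));
            System.out.println("Found " + rsp.getResults().getNumFound() + " docs");
        }
    }
}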
>
>
>
>
>
>
>
> On 16 December 2016 at 07:20, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
>
>> On Fri, Dec 16, 2016 at 12:39 PM, GW <thegeofo...@gmail.com> wrote:
>>
>> > Dorian,
>> >
>> > From my reading, my belief is that you just need some beefy machines for
>> > your zookeeper ensemble so they can think fast.
>>
>> Zookeeper needs to think fast enough for cluster state/changes. So I think
>> it scales with the number of machines/collections/shards and not documents.
>>
>> > After that your issues are
>> > complicated by drive I/O, which I believe is solved by using shards. If
>> > you have a collection running on top of a single drive array it should not
>> > compare to writing to a dozen drive arrays. So a whole bunch of light-duty
>> > machines that have a decent amount of memory and are barely able to process
>> > faster than their drive I/O will serve you better.
>> >
>> My dataset will be smaller than total memory, so I expect no query to hit
>> disk.
>>
>> >
>> > I think the Apache big data mandate was to be horizontally scalable to
>> > infinity with cheap consumer hardware. In my mind's eye you are not going
>> > to get crazy input rates without a big horizontal drive system.
>> >
>> There is overhead with small machines, and very big machines get pricey,
>> so something in the middle: either a small cluster of big machines or a big
>> cluster of small machines.
>>
>> >
>> > I'm in the same boat. All the scaling and rollout documentation seems to
>> > reference the Witch Doctor's secret handbook.
>> >
>> > I just started making my applications ZK aware and am really just
>> > starting to understand the architecture. After a whole year I still feel
>> > weak while at the same time I have traveled far. I still feel like an
>> > amateur.
>> >
>> > My plans are to use bridge tools in Linux so all my machines are sitting
>> > on the switch at layer 2, then use Conga to monitor which apps need to be
>> > running. If a server dies, its apps are spun up on one of the other
>> > servers using the original IP and MAC address through a bridge firewall
>> > gateway, so there is no holdup with MAC phreaking like with layer 3.
>> > Layer 3 does not like to see a route change with a MAC address. My apps
>> > will be on a SAN ~ data on as many shards/machines as financially possible.
>> >
>> By Conga do you mean https://sourceware.org/cluster/conga/spec/ ?
>> Also, a SAN may/will suck, as someone answered in your thread.
>>
>> >
>> > I was going to put a bunch of Apache web servers in round robin to talk
>> > to Solr but discovered that a Solr node can be dead and not report errors.
>> >
>> Please explain "dead but no error" a bit more.
>>
>> > It's all rough at the moment, but it makes total sense to send Solr
>> > requests based on what ZK says is available versus a round robin.
>> >
>> Yes, like I and another commenter wrote on your thread.
>>
>> >
>> > Will keep you posted on my roll out if you like.
>> >
>> > Best,
>> >
>> > GW
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On 16 December 2016 at 03:31, Dorian Hoxha <dorian.ho...@gmail.com>
>> > wrote:
>> >
>> > > Hello searchers,
>> > >
>> > > I'm researching Solr for a project that would require a max insert
>> > > rate of 10M/s and some heavy facet+fq on top of that, though at low QPS.
>> > >
>> > > And I'm trying to find blogs/slides where people have used some big
>> > > machines instead of hundreds of small ones.
>> > >
>> > > 1. The largest I've found is this
>> > > <https://sbdevel.wordpress.com/2016/11/30/70tb-16b-docs-4-machines-1-solrcloud/>
>> > > with 16 cores + 384GB RAM, but they were using 25 (!) Solr 4 instances
>> > > per server, which seems wasteful to me?
>> > >
>> > > I know that 1 Solr instance can have a max of ~29-30GB heap because GC
>> > > is wasteful/sucks after that, and you should leave the rest to the OS
>> > > for file cache.
>> > > 2. But do you think 1 instance will be able to fully use a 256GB/20-core
>> > > machine?
>> > >
>> > > 3. Care to share your findings/links about big-machine clusters?
>> > >
>> > > Thank You
>> > >
>> >
>>
