John:

The MapReduceIndexerTool (in contrib) is intended for bulk indexing in
a Hadoop ecosystem. This doesn't preclude home-grown setups of course,
but it's available OOB. The only tricky bit is at the end: either your
Solr indexes live on HDFS, in which case MRIT can merge them into a
live Solr cluster, or you have to copy them from HDFS to your
local-disk indexes (and, of course, get the shards right). It's a
pretty slick utility: it reads from ZooKeeper to figure out the number
of shards required and does the whole map/reduce thing to distribute
the work.
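
For reference, a go-live run looks roughly like this; treat it as a
sketch, since the jar name, paths, ZK address and collection below are
placeholders and vary by distribution:

  hadoop jar solr-map-reduce-*.jar org.apache.solr.hadoop.MapReduceIndexerTool \
    --zk-host zk1:2181/solr \
    --collection mycollection \
    --morphline-file morphline.conf \
    --output-dir hdfs://namenode/tmp/outdir \
    --go-live \
    hdfs://namenode/data/indir

The --go-live option is the "merge into a live cluster" path I
mentioned; without it you get the per-shard indexes under the output
dir and copying them into place is up to you.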

As an aside, it uses EmbeddedSolrServer to do _exactly_ the same thing
as indexing to a live Solr installation; it reads the configs from ZK, etc.
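
Just to make the aside concrete, the embedded route boils down to
something like this (the solr home path and core name are made up;
MRIT itself wires this up from the configs it pulls from ZK):

  import java.nio.file.Paths;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class EmbeddedIndexSketch {
    public static void main(String[] args) throws Exception {
      // no HTTP involved; same SolrJ add/commit API as a remote server
      EmbeddedSolrServer server =
          new EmbeddedSolrServer(Paths.get("/path/to/solr/home"), "mycore");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      server.add(doc);
      server.commit();
      server.close();
    }
  }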

Then there's spark-solr, a way to index from Spark jobs directly to a
live Solr cluster. The throughput there is limited by how many
docs/second you can process on each shard times the number of shards.
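
That isn't the spark-solr API itself, but what every worker ends up
doing is plain SolrJ against the live cluster, roughly like the sketch
below (ZK address, collection and field names are placeholders, and
the builder details vary a bit by SolrJ version):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexSketch {
    public static void main(String[] args) throws Exception {
      try (CloudSolrClient client =
               new CloudSolrClient.Builder().withZkHost("zk1:2181/solr").build()) {
        client.setDefaultCollection("mycollection");
        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", Integer.toString(i));
          doc.addField("body_t", "some text " + i);
          batch.add(doc);
          if (batch.size() == 1000) { // batch the adds; one doc per request kills throughput
            client.add(batch);
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {
          client.add(batch);
        }
        client.commit();
      }
    }
  }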

BTW, in a highly optimized-for-updates setup I've seen 1M+ docs/second
achieved. Don't try this at home; it takes quite a bit of
infrastructure...

As Yago says, adding replicas imposes a penalty; I've typically seen
20-30% in terms of indexing throughput. You can ameliorate this by
adding more shards, but that adds other complexities.

But I cannot over-emphasize how much "it depends" (tm). I was setting
up a stupid-simple index where all I wanted was a bunch of docs with
exactly one simple field plus the ID. On my laptop I was seeing 50K
docs/second in a single shard.

Then for another test case I was indexing an ngrammed field
(minGramSize=2, maxGramSize=32) and was seeing < 100 docs/second.
There's simply no way to translate from raw data size to hardware
specs, unfortunately.
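
For the curious, "ngrammed" here means a field type along these lines
(schema.xml sketch, names made up); every token gets expanded into all
of its 2-32 character grams, which is where the time goes:

  <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="32"/>
    </analyzer>
  </fieldType>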

Best,
Erick

On Sat, Sep 24, 2016 at 10:48 AM, Toke Eskildsen <t...@statsbiblioteket.dk> 
wrote:
> Regarding a 12TB index:
>
> Yago Riveiro <yago.rive...@gmail.com> wrote:
>
>> Our cluster is small for the data we hold (12 machines with SSD and 32 GB of
>> RAM), but we don't need sub-second queries; we need faceting with high
>> cardinality (in worst-case scenarios we aggregate 5M unique string values).
>
>> At peak insert rates we can handle around 25K docs per second without any
>> issue, with 2 replicas, without compromising reads or putting a node under
>> stress. Nodes under stress can eject themselves from the ZooKeeper cluster
>> due to a GC pause or a lack of CPU to communicate.
>
> I am surprised that you manage to have this working on that hardware. As you 
> have replicas, it seems to me that you handle 2*12TB of index with 12*32GB of 
> RAM? This is very close to our setup (22TB of index with 320GB of RAM 
> (updated last week from 256GB) per machine), but we benefit hugely from 
> having a static index.
>
> I assume the SSDs are local? How much memory do you use for heap on each 
> machine?
>
> - Toke Eskildsen
