John: The MapReduceIndexerTool (in contrib) is intended for bulk indexing in a Hadoop ecosystem. This doesn't preclude home-grown setups of course, but it's available OOB. The only tricky bit is at the end: either your Solr indexes live on HDFS, in which case MRIT can merge them into a live Solr cluster, or you have to copy them from HDFS to your local-disk indexes (and, of course, get the shards right). It's a pretty slick utility; it reads ZooKeeper to work out how many shards are required and does the whole map/reduce thing to distribute the work.
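If it helps, here's roughly what driving it programmatically looks like. This is a sketch from memory, not a recipe: the flag names may differ slightly between versions (check the tool's --help), and the ZK address, collection name and HDFS paths below are placeholders for your environment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.solr.hadoop.MapReduceIndexerTool;

public class BulkIndex {
  public static void main(String[] args) throws Exception {
    String[] toolArgs = {
        "--zk-host", "zk1:2181,zk2:2181,zk3:2181/solr",       // the tool reads #shards etc. from ZK
        "--collection", "mycollection",
        "--morphline-file", "/local/path/morphline.conf",     // how to parse/map the input records
        "--output-dir", "hdfs://namenode:8020/tmp/mrit-out",  // where the built shard indexes land
        "--go-live",                                           // merge the results into the live cluster
        "hdfs://namenode:8020/data/input"                      // input files to index
    };
    int rc = ToolRunner.run(new Configuration(), new MapReduceIndexerTool(), toolArgs);
    System.exit(rc);
  }
}

The same arguments apply if you launch it with "hadoop jar" instead; leave off --go-live and you're in the copy-the-indexes-from-HDFS-yourself case I mentioned.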
As an aside, it uses EmbeddedSolrServer to do _exactly_ the same thing as indexing to a Solr installation, reads the configs from ZK etc. (rough sketch at the very bottom of this mail).

Then there's SparkSolr, a way to index from M/R jobs directly to live Solr setups. The throughput there is limited by how many docs/second each shard can process times the number of shards. BTW, in a highly optimized-for-updates setup I've seen 1M+ docs/second achieved. Don't try this at home, it takes quite a bit of infrastructure....

As Yago says, adding replicas imposes a penalty; I've typically seen 20-30% in terms of indexing throughput. You can ameliorate this by adding more shards, but that adds other complexities.

But I cannot over-emphasize how much "it depends" (tm). I was setting up a stupid-simple index where all I wanted was a bunch of docs with exactly one simple field plus the ID. On my laptop I was seeing 50K docs/second in a single shard. Then for another test case I was indexing an ngrammed field (mingram=2, maxgram=32) and was seeing < 100 docs/second. There's simply no way to translate from raw data size to hardware specs, unfortunately.

Best,
Erick

On Sat, Sep 24, 2016 at 10:48 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> Regarding a 12TB index:
>
> Yago Riveiro <yago.rive...@gmail.com> wrote:
>
>> Our cluster is small for the data we hold (12 machines with SSD and 32G of
>> RAM), but we don't need sub-second queries, we need facets with high
>> cardinality (in worst-case scenarios we aggregate 5M unique string values).
>
>> In a peak of inserts we can handle around 25K docs per second without any
>> issue with 2 replicas and without compromising reads or putting a node under
>> stress. Stressed nodes can eject themselves from the ZooKeeper cluster due to
>> a GC pause or a lack of CPU to communicate.
>
> I am surprised that you manage to have this working on that hardware. As you
> have replicas, it seems to me that you handle 2*12TB of index with 12*32GB of
> RAM? This is very close to our setup (22TB of index with 320GB of RAM
> (updated last week from 256GB) per machine), but we benefit hugely from
> having a static index.
>
> I assume the SSDs are local? How much memory do you use for heap on each
> machine?
>
> - Toke Eskildsen
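P.S. Since I mentioned EmbeddedSolrServer above, this is roughly what that usage looks like. A minimal sketch only; the solr home path, core name and field names are made up. The point is that these are exactly the same SolrClient calls you'd make against a live cluster, just without HTTP in the middle.

import java.nio.file.Paths;

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EmbeddedIndexing {
  public static void main(String[] args) throws Exception {
    // Points at a local solr home containing the configs (MRIT pulls the equivalent from ZK).
    try (EmbeddedSolrServer server =
             new EmbeddedSolrServer(Paths.get("/path/to/solrhome"), "mycore")) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "1");
      doc.addField("text", "hello embedded world");
      server.add(doc);    // same add/commit API as HttpSolrClient or CloudSolrClient
      server.commit();
    }
  }
}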