Re: Indexing throughput

2018-05-02 Thread Shawn Heisey
On 5/2/2018 10:58 AM, Greenhorn Techie wrote:
> The current hardware profile for our production cluster is 20 nodes, each
> with 24 cores and 256 GB of memory. Data being indexed is very structured
> in nature and is about 30 columns or so, out of which half of them are
> categorical with a defined list …

Re: Indexing throughput

2018-05-02 Thread Greenhorn Techie
Thanks Walter and Erick for the valuable suggestions. We shall try out various values for the number of shards, as well as the other tuning parameters discussed in earlier threads. Kind Regards

On 2 May 2018 at 18:24:31, Erick Erickson (erickerick...@gmail.com) wrote:
> I've seen 1.5 M docs/second. Basically …

Re: Indexing throughput

2018-05-02 Thread Erick Erickson
I've seen 1.5 M docs/second. Basically the indexing throughput is gated by two things:
1> the number of shards. Indexing throughput essentially scales up reasonably linearly with the number of shards.
2> the indexing program that pushes data to Solr. Before thinking Solr is the bottleneck, check how …
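
As an illustration of point 2>, below is a minimal sketch of a batched, multi-threaded SolrJ loader. The endpoint, collection name, field names, batch size, and thread count are all placeholders to tune for your own cluster; the only point is to batch documents and keep several update requests in flight so the client program does not become the bottleneck.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchIndexer {
    // Placeholder endpoint and sizes; tune batch size and thread count per cluster.
    private static final String SOLR_URL = "http://localhost:8983/solr/mycollection";
    private static final int BATCH_SIZE = 1000;
    private static final int THREADS = 8;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);

        try (SolrClient client = new HttpSolrClient.Builder(SOLR_URL).build()) {
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);

            // Stand-in for the real data source (CSV reader, JDBC cursor, etc.).
            for (int i = 0; i < 1_000_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("category_s", "cat" + (i % 10));
                batch.add(doc);

                if (batch.size() == BATCH_SIZE) {
                    final List<SolrInputDocument> toSend = batch;
                    batch = new ArrayList<>(BATCH_SIZE);
                    // Send each full batch on a worker thread so document building
                    // and HTTP round trips overlap.
                    pool.submit(() -> {
                        try {
                            client.add(toSend);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    });
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);

            // Single commit at the end; rely on autoCommit during the load.
            client.commit();
        }
    }
}

In practice, committing after every batch tends to throttle throughput; the usual approach is to commit once at the end and let autoCommit handle durability during the load.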

Re: Indexing throughput

2018-05-02 Thread Walter Underwood
We have a similar-sized cluster, 32 nodes with 36 processors and 60 GB of RAM each (EC2 c4.8xlarge). The collection is 24 million documents with four shards. The cluster is Solr 6.6.2. All storage is SSD EBS. We built a simple batch loader in Java. We get about one million documents per minute with …
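
For scale: at roughly one million documents per minute, a full rebuild of the 24 million document collection works out to about 24,000,000 ÷ 1,000,000 ≈ 24 minutes, before commits and cache warming.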