When I created the collection using "solrctl" I did not specify a replication factor. So after I read your email, I went looking for the current replication factor and couldn't find it. Where do I find the current replication factor? I don't see it on the Solr Admin panel of a node. Nor do I seem to get it from any of the REST APIs (in general, I seem to get very little info from the REST APIs in 4.4).
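One place the replica count can usually be read from, offered as a sketch and not as something confirmed in this thread: Solr 4.x SolrCloud keeps the cluster layout in a ZooKeeper node named clusterstate.json (also browsable from the admin UI's Cloud tree view), and the number of entries under each shard's "replicas" key is the effective replication factor. The collection name, node names, and sample JSON below are made up for illustration, and the exact layout of the file on a given 4.4/CDH install is an assumption.

```python
import json


def replicas_per_shard(clusterstate, collection):
    """Return {shard_name: replica_count} for one collection.

    Assumes the Solr 4.x clusterstate.json layout:
    {collection: {"shards": {shard: {"replicas": {core_node: {...}}}}}}
    """
    shards = clusterstate[collection]["shards"]
    return {name: len(shard.get("replicas", {})) for name, shard in shards.items()}


# Hypothetical sample, standing in for the content of clusterstate.json
# as copied out of ZooKeeper. Names and values are illustrative only.
SAMPLE_CLUSTERSTATE = json.loads("""
{
  "events": {
    "shards": {
      "shard1": {
        "state": "active",
        "replicas": {
          "core_node1": {"state": "active", "node_name": "host1:8983_solr", "leader": "true"}
        }
      },
      "shard2": {
        "state": "active",
        "replicas": {
          "core_node2": {"state": "active", "node_name": "host2:8983_solr", "leader": "true"}
        }
      }
    }
  }
}
""")

if __name__ == "__main__":
    counts = replicas_per_shard(SAMPLE_CLUSTERSTATE, "events")
    for shard, count in sorted(counts.items()):
        print(f"{shard}: {count} replica(s)")
```

A count of 1 for every shard would mean each shard has only its leader, i.e. no Solr-level replication.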
Also, information about SolrCloud replication is confusing.
https://www.mail-archive.com/solr-user@lucene.apache.org/msg96344.html

With HDFS as the store, I would think Solr replication would be redundant. No?

Each document/event is only a few hundred bytes (less than 500 bytes, I would say).

Thanks,

Tim

On Tue, Feb 3, 2015 at 5:03 PM, Mark Miller <markrmil...@gmail.com> wrote:

> What is your replication factor and doc size?
>
> Replication can affect performance a fair amount more than it should
> currently.
>
> For the number of nodes, that doesn’t sound like it matches what I’ve seen
> unless those are huge documents or you have some slow analyzer in the chain
> or something.
>
> Without replication, with relatively small docs and decent hardware, I’d
> expect around 10,000-12,000 doc’s per node. Replication can up to half that
> by some reports. Larger doc size or other outliers might cut some off as
> well.
>
> Solr 4.4 is pretty ancient in SolrCloud terms at this point in general by
> the way.
>
> - Mark
>
> http://about.me/markrmiller
>
> On Feb 3, 2015, at 7:47 PM, Tim Smith <secs...@gmail.com> wrote:
> >
> > Hi,
> >
> > I have a SolrCloud (Solr 4.4, writing to HDFS on CDH-5.3) collection
> > configured to be populated by flume Morphlines sink. The flume agent reads
> > data from Kafka and writes to the Solr collection.
> >
> > The issue is that Solr indexing rate is abysmally poor (~6k docs/sec at
> > best, dips to a few hundred per sec) across the cluster. The incoming
> > data/document rate is about 30-40k/second.
> >
> > I have gone wide/thin with 18 nodes and each with 8GB (Java) + 4GB
> > (non-heap) memory and narrow/thick with current set of 5 dedicated nodes
> > each with 36GB (Java) and 16GB (non-heap) memory (18 shards with the former
> > config and 5 shards, right now).
> >
> > On the flume side, I have gone from 5 flume instances, each with a single
> > sink to 5 sinks for each flume instance. I have tweaked batchSize and
> > batchDuration.
> >
> > I checked ZooKeeper loads and don't see it stressed. Neither are the
> > datanodes. On the Solr nodes, solr is consuming all the allocated memory
> > (32GB) but I don't see solr hitting any CPU limits.
> >
> > *But*, indexing rate stubbornly stays at ~6k docs/sec. When I bounce the
> > flume agent, it jumps up momentarily to several hundreds of thousands but
> > then comes down to ~6k/sec and the flume channels get saturated within
> > seconds.
> >
> > Any clues/pointers for troubleshooting will be appreciated?
> >
> >
> > Thanks,
> >
> > Tim
> >
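
To put the figures quoted in this thread side by side, here is a rough back-of-the-envelope calculation. It rests on several assumptions that the thread does not state: the 10,000-12,000 per-node figure is read as documents per second, throughput is assumed to scale linearly with node count, and "up to half" is taken as a worst-case 50% replication penalty. The function name is illustrative.

```python
def cluster_rate(nodes, per_node_low, per_node_high, replication_penalty=1.0):
    """Rough cluster-wide indexing range, assuming linear scaling across nodes."""
    return (nodes * per_node_low * replication_penalty,
            nodes * per_node_high * replication_penalty)


NODES = 5                    # current "narrow/thick" setup from the thread
OBSERVED = 6_000             # docs/sec reported across the cluster
INCOMING = (30_000, 40_000)  # docs/sec arriving from Kafka

# Per-node range quoted in the reply, read here as docs/sec (assumption).
no_replication = cluster_rate(NODES, 10_000, 12_000)
# Worst case quoted: replication costing up to half the throughput.
with_replication = cluster_rate(NODES, 10_000, 12_000, replication_penalty=0.5)

if __name__ == "__main__":
    print("expected without replication:", no_replication)
    print("expected with replication (worst case):", with_replication)
    print("observed / low expected estimate:", OBSERVED / no_replication[0])
    print("incoming range:", INCOMING)
```

Under these assumptions, five nodes would be expected to index roughly 50,000-60,000 docs/sec without replication and 25,000-30,000 in the worst replicated case, so the observed ~6,000 docs/sec is well below either estimate.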