Hi Erick et al,

Thanks a lot for the response. Your explanation seems very plausible and I'd love to investigate those points further.
Batching the docs (for me surprisingly) improved the numbers:

Buffer size    secs    MB/s          Docs/s
N:500          1117    34.4077538    2400.72695
N:100          1073    35.8186962    2499.17241
N:10           1170    32.849112     2291.97607
N:5            1234    31.1454303    2173.10535
N:3            1433    26.8202798    1871.32729
N:2            1758    21.862037     1525.37656
N:1            2307    16.6594976    1162.38058

It looks like the larger the buffer (in terms of number of documents), the faster the processing, at least up to a point (N:100 was slightly faster than N:500). I thought the gains would not be so high, since (1) Solr buffers updates itself, and (2) the documents are pretty large.

The SolrJ API has changed a bit over the last few releases, and it is becoming incredibly difficult to find working code. You mentioned that I can connect to the zkHost directly. I tried [1], [2], and [3] and their variants without any success (the returned object was null). How would it look in the 7.2+ branch? (I am currently running the embedded ZooKeeper; Solr runs on 9999, so ZooKeeper should be on 10999 [4].)

I am impressed by the number of metrics I can get from Solr with my very limited knowledge. You mentioned that there are 200+ metrics one can query about the system. As the primary source of information, would you recommend:

https://lucene.apache.org/solr/guide/7_4/collections-api.html

Can you maybe expand this list with additional references?
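For reference, here is a minimal sketch of what the ZooKeeper-based connection might look like with the SolrJ 7.2+ Builder. This is an assumption based on the 7.x Builder signatures, not verified against your setup: the two-argument constructor takes the ZK hosts plus an Optional chroot, and passing null there (as in [3] below) rather than Optional.empty() is one plausible reason that variant failed.

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class ZkConnectSketch {
    public static void main(String[] args) throws Exception {
        // Sketch for SolrJ 7.2+: connect via the ZooKeeper ensemble
        // instead of a list of Solr URLs. The second argument is the
        // ZK chroot; use Optional.empty() rather than null.
        CloudSolrClient solrClient = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:10999"), Optional.empty())
            .withConnectionTimeout(10000)
            .withSocketTimeout(60000)
            .build();
        solrClient.setDefaultCollection("de_wiki_man");
        // ... index as before, then:
        solrClient.close();
    }
}
```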
Cheers,
Arturas

Refs:

[1]
String zkHostString = "localhost:10999";
SolrClient solrClient = new CloudSolrClient(zkHostString, true);
solrClient.setDefaultCollection("de_wiki_man");

[2]
String zkHostString = "localhost:10999";
SolrClient solrClient = new CloudSolrClient.Builder().withZkHost(zkHostString).build();

[3]
ArrayList<String> zkHosts = new ArrayList<>();
zkHosts.add("localhost:10999");
solrClient = new CloudSolrClient.Builder(zkHosts, null)
    .withConnectionTimeout(1000000)
    .withSocketTimeout(6000000)
    .build();
solrClient.setDefaultCollection("de_wiki_man");

[4]
C:\WINDOWS\system32>netstat -aon | grep 13984
  TCP    0.0.0.0:9999        0.0.0.0:0          LISTENING    13984
  TCP    0.0.0.0:10999       0.0.0.0:0          LISTENING    13984
  TCP    127.0.0.1:8999      0.0.0.0:0          LISTENING    13984
  TCP    127.0.0.1:62888     127.0.0.1:62889    ESTABLISHED  13984
  TCP    127.0.0.1:62889     127.0.0.1:62888    ESTABLISHED  13984
  TCP    127.0.0.1:62891     127.0.0.1:62892    ESTABLISHED  13984
  TCP    127.0.0.1:62892     127.0.0.1:62891    ESTABLISHED  13984
  TCP    127.0.0.1:62900     127.0.0.1:62901    ESTABLISHED  13984
  TCP    127.0.0.1:62901     127.0.0.1:62900    ESTABLISHED  13984
  TCP    127.0.0.1:62902     127.0.0.1:62903    ESTABLISHED  13984
  TCP    127.0.0.1:62903     127.0.0.1:62902    ESTABLISHED  13984
  TCP    127.0.0.1:62904     127.0.0.1:62905    ESTABLISHED  13984
  TCP    127.0.0.1:62905     127.0.0.1:62904    ESTABLISHED  13984
  TCP    127.0.0.1:62906     127.0.0.1:62907    ESTABLISHED  13984
  TCP    127.0.0.1:62907     127.0.0.1:62906    ESTABLISHED  13984
  TCP    [::]:9999           [::]:0             LISTENING    13984
  TCP    [::]:10999          [::]:0             LISTENING    13984
  TCP    [::1]:10999         [::1]:62893        ESTABLISHED  13984
  TCP    [::1]:62893         [::1]:10999        ESTABLISHED  13984

On Wed, Jul 4, 2018 at 6:06 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> First, I usually prefer to construct your CloudSolrClient by
> using the Zookeeper ensemble string rather than URLs,
> although that's probably not a cure for your problem.
>
> Here's what I _think_ is happening. If you're slamming Solr
> with a lot of updates, you're doing a lot of merging.
> At some point, when there are a lot of merges going on, incoming
> updates block until one or more merge threads are done.
>
> At that point, I suspect your client is timing out. And (perhaps)
> if you used the Zookeeper ensemble instead of HTTP, the
> cluster state fetch would go away. I suspect that another
> issue would come up, but....
>
> It's also possible this would all go away if you increased your
> timeouts significantly. That's still a "set it and hope" approach
> rather than a totally robust solution, though.
>
> Let's assume that the above works and you start getting timeouts.
> You can back off the indexing rate at that point, or just go to
> sleep for a while. This isn't what you'd like for a permanent solution,
> but it may let you get by.
>
> There's work afoot to separate out update thread pools from query
> thread pools so _querying_ doesn't suffer when indexing is heavy,
> but that hasn't been implemented yet. This could also address
> your cluster state fetch error.
>
> You will get significantly better throughput if you batch your
> docs and use client.add(list_of_documents), BTW.
>
> Another possibility is to use the new metrics (since Solr 6.4). They
> provide over 200 metrics you can query, and it's quite
> possible that they'd help your clients know when to self-throttle,
> but AFAIK, there's nothing built in to help you there.
>
> Best,
> Erick
>
> On Wed, Jul 4, 2018 at 2:32 AM, Arturas Mazeika <maze...@gmail.com> wrote:
> > Hi Solr Folk,
> >
> > I am trying to push Solr to the limit and sometimes I succeed. The
> > question is how to not go over it, e.g., avoid:
> >
> > java.lang.RuntimeException: Tried fetching cluster state using the node
> > names we knew of, i.e. [192.168.56.1:9998_solr, 192.168.56.1:9997_solr,
> > 192.168.56.1:9999_solr, 192.168.56.1:9996_solr].
> > However, succeeded in obtaining the cluster state from none of them.
> > If you think your Solr cluster is up and is accessible, you could try
> > re-creating a new CloudSolrClient using working solrUrl(s) or zkHost(s).
> >     at org.apache.solr.client.solrj.impl.HttpClusterStateProvider.getState(HttpClusterStateProvider.java:109)
> >     at org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1113)
> >     at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:845)
> >     at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:818)
> >     at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
> >     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:173)
> >     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
> >     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:152)
> >     at com.asc.InsertDEWikiSimple$SimpleThread.run(InsertDEWikiSimple.java:132)
> >
> > Details:
> >
> > I am benchmarking a SolrCloud setup on a single machine (an Intel i7 with
> > 8 "CPU cores", an SSD as well as an HDD) using the German Wikipedia
> > collection. I created a cluster with 4 nodes, 4 shards, and replication
> > factor 2 on the same machine (and managed to push the CPU or SSD to the
> > hardware limits, i.e., ~200MB/s, ~100% CPU). Now I wanted to see what
> > happens if I push the HDD to the limits. Reading the files from the SSD
> > (I am able to scan the collection at an actual rate of 400-500MB/s) with
> > 16 threads, I tried to send those to the Solr cluster with all indexes
> > on the HDD.
> >
> > Clearly Solr needs to deal with a very slow hard drive (10-20MB/s actual
> > rate). If the cluster is not touched, SolrJ may start losing connections
> > after a few hours. If one checks the status of the cluster, it may happen
> > sooner.
> > After the connection is lost, the cluster calms down with writing
> > after half a dozen minutes.
> >
> > What would be a reasonable way to push to the limit without going over?
> >
> > The exact parameters are:
> >
> > - 4 nodes running with 2GB RAM each
> > - Schema:
> >
> > <fieldType name="ft_wiki_de" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.GermanMinimalStemFilterFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > <fieldType name="ft_url" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer>
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > <fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
> > <field name="id" type="uuid" indexed="true" stored="true" required="true"/>
> > <field name="_root_" type="uuid" indexed="true" stored="false" docValues="false"/>
> >
> > <field name="size" type="pint" indexed="true" stored="true"/>
> > <field name="time" type="pdate" indexed="true" stored="true"/>
> > <field name="content" type="ft_wiki_de" indexed="true" stored="true"/>
> > <field name="url" type="ft_url" indexed="true" stored="true"/>
> >
> > <field name="_version_" type="plong" indexed="false" stored="false"/>
> >
> > I SolrJ-connect once:
> >
> > ArrayList<String> urls = new ArrayList<>();
> > urls.add("http://localhost:9999/solr");
> > urls.add("http://localhost:9998/solr");
> > urls.add("http://localhost:9997/solr");
> > urls.add("http://localhost:9996/solr");
> >
> > solrClient = new CloudSolrClient.Builder(urls)
> >     .withConnectionTimeout(10000)
> >     .withSocketTimeout(60000)
> >     .build();
> > solrClient.setDefaultCollection("de_wiki_man");
> >
> > and then execute in 16 threads while there is anything to execute:
> >
> > Path
> >     p = getJobPath();
> > String content = new String(Files.readAllBytes(p));
> > UUID id = UUID.randomUUID();
> > SolrInputDocument doc = new SolrInputDocument();
> >
> > BasicFileAttributes attr = Files.readAttributes(p, BasicFileAttributes.class);
> >
> > doc.addField("id", id.toString());
> > doc.addField("content", content);
> > doc.addField("time", attr.creationTime().toString());
> > doc.addField("size", content.length());
> > doc.addField("url", p.getFileName().toAbsolutePath().toString());
> > solrClient.add(doc);
> >
> > to go through all the wiki html files.
> >
> > Cheers,
> > Arturas
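Erick's batching advice (client.add(list_of_documents)) can be sketched independently of SolrJ. Below is a minimal buffering sketch where the flush target is a plain callback; the names (DocBatcher, flush) are illustrative, not SolrJ API. In the indexing loop above, the callback would wrap solrClient.add(batch), and flush() would be called once more when a thread runs out of files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal client-side batching sketch (illustrative names, not SolrJ API):
// buffer up to batchSize documents, then hand the whole list to a flush
// callback in one call instead of one call per document.
public class DocBatcher<D> {
    private final int batchSize;
    private final Consumer<List<D>> flusher;
    private final List<D> buffer = new ArrayList<>();

    public DocBatcher(int batchSize, Consumer<List<D>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    public void add(D doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered; call once more at the end of indexing
    // so the final partial batch is not lost.
    public void flush() {
        if (!buffer.isEmpty()) {
            flusher.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> batchCounts = new ArrayList<>();
        DocBatcher<String> batcher =
            new DocBatcher<>(100, batch -> batchCounts.add(batch.size()));
        for (int i = 0; i < 250; i++) {
            batcher.add("doc-" + i);
        }
        batcher.flush(); // final partial batch
        System.out.println(batchCounts); // [100, 100, 50]
    }
}
```

With 16 threads, each thread would own its own DocBatcher (or the buffer would need synchronization), which matches the per-thread loop above.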