Hi Solr Folk,

I am trying to push Solr to the limit, and sometimes I succeed. The
question is how not to go over it, i.e., how to avoid:

java.lang.RuntimeException: Tried fetching cluster state using the node
names we knew of, i.e. [192.168.56.1:9998_solr, 192.168.56.1:9997_solr,
192.168.56.1:9999_solr, 192.168.56.1:9996_solr]. However, succeeded in
obtaining the cluster state from none of them.If you think your Solr
cluster is up and is accessible, you could try re-creating a new
CloudSolrClient using working solrUrl(s) or zkHost(s).
        at org.apache.solr.client.solrj.impl.HttpClusterStateProvider.
getState(HttpClusterStateProvider.java:109)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(
CloudSolrClient.java:1113)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.
requestWithRetryOnStaleState(CloudSolrClient.java:845)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.request(
CloudSolrClient.java:818)
        at org.apache.solr.client.solrj.SolrRequest.process(
SolrRequest.java:194)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:173)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:152)
        at com.asc.InsertDEWikiSimple$SimpleThread.run(
InsertDEWikiSimple.java:132)


Details:

I am benchmarking a SolrCloud setup on a single machine (an Intel i7 with
8 "CPU cores", an SSD as well as an HDD) using the German Wikipedia
collection. I created a cluster of 4 nodes with 4 shards and replication
factor 2 on that machine, and managed to push the CPU or the SSD to the
hardware limits (~200 MB/s, ~100% CPU). Now I wanted to see what happens
if I push the HDD to its limits. Reading the files from the SSD (where I
can scan the collection at an actual rate of 400-500 MB/s) with 16
threads, I send them to a Solr cluster that keeps all its indexes on the
HDD.

Clearly, Solr has to deal with a very slow hard drive (10-20 MB/s actual
rate). If the cluster is left alone, SolrJ may start losing connections
after a few hours; if one also checks the cluster status while indexing,
it can happen sooner. After the connection is lost, the cluster's write
activity calms down within half a dozen minutes.

What would be a reasonable way to push to the limit without going over?
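
One idea I had (I am not sure it is the right approach, so comments are
very welcome) is to stop treating a failed add() as fatal and instead
retry with exponential backoff, so the indexing threads slow down by
themselves when the cluster is saturated. A rough sketch of what I mean
(the retry count and sleep values are just guesses, this is not what I
run today):

        // Hypothetical helper: retry add() with exponential backoff so the
        // sender backs off when the cluster cannot keep up. maxRetries and
        // the sleep values are arbitrary guesses.
        void addWithBackoff(SolrClient client, SolrInputDocument doc)
                throws Exception {
            int maxRetries = 5;
            long sleepMs = 1000;
            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                try {
                    client.add(doc);
                    return;
                } catch (Exception e) {
                    if (attempt == maxRetries) throw e;  // give up eventually
                    Thread.sleep(sleepMs);
                    sleepMs *= 2;                        // 1s, 2s, 4s, 8s, ...
                }
            }
        }

Would something like that be enough, or is there a more idiomatic way?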

The exact parameters are:

- 4 nodes, each running with 2 GB of RAM
- Schema:

  <fieldType name="ft_wiki_de" class="solr.TextField"
positionIncrementGap="100">
     <analyzer>
       <charFilter class="solr.HTMLStripCharFilterFactory"/>
       <tokenizer  class="solr.StandardTokenizerFactory"/>
       <filter     class="solr.GermanMinimalStemFilterFactory"/>
       <filter     class="solr.LowerCaseFilterFactory"/>
     </analyzer>
  </fieldType>

  <fieldType name="ft_url" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer  class="solr.StandardTokenizerFactory"/>
       <filter     class="solr.LowerCaseFilterFactory"/>
     </analyzer>
  </fieldType>

  <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
  <field name="id" type="uuid" indexed="true" stored="true" required="true"/>
  <field name="_root_" type="uuid" indexed="true" stored="false"
docValues="false" />

  <field name="size"    type="pint"       indexed="true" stored="true"/>
  <field name="time"    type="pdate"      indexed="true" stored="true"/>
  <field name="content" type="ft_wiki_de" indexed="true" stored="true"/>
  <field name="url"     type="ft_url"     indexed="true" stored="true"/>

  <field name="_version_" type="plong"        indexed="false" stored="false"/>

I connect via SolrJ once:

        ArrayList<String> urls = new ArrayList<>();
        urls.add("http://localhost:9999/solr";);
        urls.add("http://localhost:9998/solr";);
        urls.add("http://localhost:9997/solr";);
        urls.add("http://localhost:9996/solr";);

        solrClient = new CloudSolrClient.Builder(urls)
            .withConnectionTimeout(10000)
            .withSocketTimeout(60000)
            .build();
        solrClient.setDefaultCollection("de_wiki_man");
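
(Side note: the exception above suggests re-creating the client with
zkHost(s) instead of solrUrl(s). I have not tried that yet; I believe the
ZooKeeper-based builder would look roughly like the following, where
"localhost:2181" is only a placeholder for whatever address my cluster's
ZooKeeper actually listens on:)

        // Untried alternative: read the cluster state from ZooKeeper
        // instead of the listed node URLs. "localhost:2181" is a
        // placeholder, not my actual ZK address.
        solrClient = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty())
            .withConnectionTimeout(10000)
            .withSocketTimeout(60000)
            .build();
        solrClient.setDefaultCollection("de_wiki_man");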

Each of the 16 threads then executes the following, as long as there are
files left to process:

        Path p = getJobPath();   // next wiki HTML file to index
        String content = new String(Files.readAllBytes(p));
        UUID id = UUID.randomUUID();
        SolrInputDocument doc = new SolrInputDocument();

        BasicFileAttributes attr =
            Files.readAttributes(p, BasicFileAttributes.class);

        doc.addField("id",      id.toString());
        doc.addField("content", content);
        doc.addField("time",    attr.creationTime().toString());
        doc.addField("size",    content.length());
        doc.addField("url",     p.getFileName().toAbsolutePath().toString());
        solrClient.add(doc);


to go through all the wiki html files.
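
I also wonder whether I should batch documents instead of calling add()
for each one, to reduce the number of requests hitting the overloaded
cluster. Roughly what I have in mind, per thread (the batch size of 100
is an arbitrary guess):

        // Per-thread buffer, flushed every 100 documents (arbitrary size).
        List<SolrInputDocument> batch = new ArrayList<>();
        // ... inside the loop, after building doc exactly as above ...
        batch.add(doc);                // instead of solrClient.add(doc)
        if (batch.size() >= 100) {
            solrClient.add(batch);     // one request for the whole batch
            batch.clear();
        }
        // ... after the last file, flush whatever is left ...
        if (!batch.isEmpty()) {
            solrClient.add(batch);
        }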

Cheers,
Arturas
