Hi,
I have a 260M-document index (90 GB) with this structure:
<field name="fragment" type="text_general" indexed="true" stored="true"
multiValued="false" termVectors="false" termPositions="false"
termOffsets="false" />
<field name="parentId" type="long" indexed="false" stored="true"
multiValued="false"/>
<field name="fragmentContentType" type="string" indexed="false"
stored="true" multiValued="false"/>
<field name="creationDate" type="date" indexed="true" stored="true"
multiValued="false"/>
<field name="creationTimestamp" type="date" indexed="true" stored="true"
multiValued="false"/>
<field name="visibility" type="string" indexed="true" stored="true"
multiValued="false"/>
<field name="category" type="string" indexed="true" stored="true"
multiValued="false"/>
<field name="marked" type="string" indexed="true" stored="true"
multiValued="false"/>
<!-- catchall field, containing all other searchable text fields
(implemented via copyField further on in this schema) -->
<field name="text" type="text_general" indexed="true" stored="false"
multiValued="true"/>
<copyField source="fragment" dest="text"/>
<copyField source="parentId" dest="text"/>
<copyField source="fragmentContentType" dest="text"/>
<copyField source="creationDate" dest="text"/>
<copyField source="visibility" dest="text"/>
<copyField source="category" dest="text"/>
<copyField source="marked" dest="text"/>
where the fragment field contains XML messages.
There is a search function that returns the messages matching a search
criterion.
TARGET:
To find the configuration that gives the best response time for a SolrCloud
cluster of two Solr instances running on 2 VMs with 8 cores and 32 GB.
TEST RESULTS:
1. Configurations:
   1. The best configuration without replicas:
      - CONF1: 16 shards of 17M documents (8 per VM)
   2. Configurations with replicas:
      - CONF2: 8 shards of 35M documents with a replication factor of 1
      - CONF3: 16 shards of 35M documents with a replication factor of 1
2. Executed tests:
   - sequential requests
   - 5 parallel requests
   - 10 parallel requests
   - 20 parallel requests
   in two scenarios: with and without indexing in progress.
The calls are:
http://localhost:8983/solr/sepa/select?q=+fragment%3A*AAA*+&fq=marked%3AT&fq=-fragmentContentType%3ABULK&start=0&rows=100&sort=creationTimestamp+desc%2Cid+asc
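For anyone who wants to reproduce the tests, here is a minimal Python sketch of the procedure described above. It rebuilds the request URL from its decoded parameters and measures average and maximum response time at a given level of parallelism. The thread-pool harness, the function names, and the number of requests per run are assumptions made for illustration; this is not the tool the numbers in this post were measured with.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8983/solr/sepa/select"

# Decoded form of the query parameters used in the test calls.
PARAMS = [
    ("q", " fragment:*AAA* "),
    ("fq", "marked:T"),
    ("fq", "-fragmentContentType:BULK"),
    ("start", "0"),
    ("rows", "100"),
    ("sort", "creationTimestamp desc,id asc"),
]


def build_url():
    """Re-encode the parameters; safe="*" keeps the wildcards unescaped."""
    return BASE_URL + "?" + urlencode(PARAMS, safe="*")


def fetch(url):
    """Issue one request and read the whole response body."""
    with urlopen(url) as response:
        response.read()


def timed(call):
    """Return the wall-clock duration of one call, in seconds."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def run_load_test(call, parallel, total):
    """Run `total` calls on `parallel` worker threads.

    Returns (average, maximum) duration in seconds.
    """
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        durations = list(pool.map(lambda _: timed(call), range(total)))
    return statistics.mean(durations), max(durations)


def main():
    # Hypothetical driver: 5 requests per worker is an arbitrary choice.
    url = build_url()
    for parallel in (1, 5, 10, 20):
        avg, worst = run_load_test(lambda: fetch(url), parallel, parallel * 5)
        print(f"{parallel} parallel: avg={avg:.1f}s max={worst:.1f}s")
```

Calling `main()` against a running instance prints one line per parallelism level. `run_load_test` takes any callable, so the timing logic can be checked without a Solr server.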
3. Test results
All the tests showed disk I/O of about 100 MB/s while data was being loaded
into the disk cache, 20 GB of disk cache in use, and 100% CPU utilization
(all 8 cores).
No indexing (average time, maximum time):

               CONF1          CONF2          CONF3
               avg    max     avg    max     avg    max
sequential     4,1    6,9     12,3   17,4    6,9    9,9
5 parallel     15,6   19,1    32,5   34,2    33,2   37,5
10 parallel    23,6   30,2    45,4   49      46     51
20 parallel    48     52,2    64,6   74      68     83

During indexing (average time, maximum time):
(Is it possible to see the total indexing throughput in the Solr admin
console? I can only find it for a single shard.)

               CONF1          CONF2          CONF3
               avg    max     avg    max     avg    max
sequential     7,7    9,5     12,3   19      10     18,9
5 parallel     26,8   28,4    39     40,8    36,5   41,9
10 parallel    31,8   37,8    56,6   62,9    63,7   64,1
20 parallel    42     52,5    79     116     85     120

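To make the comparison between configurations easier to read, the short Python sketch below divides the average time of each configuration with replicas by the CONF1 average for the same test. The values are the no-indexing averages reported above, with decimal commas written as decimal points; the function and variable names are only illustrative.

```python
# Average response times copied from the no-indexing results
# (decimal commas written as decimal points).
NO_INDEXING_AVG = {
    "sequential": {"CONF1": 4.1, "CONF2": 12.3, "CONF3": 6.9},
    "5 parallel": {"CONF1": 15.6, "CONF2": 32.5, "CONF3": 33.2},
    "10 parallel": {"CONF1": 23.6, "CONF2": 45.4, "CONF3": 46.0},
    "20 parallel": {"CONF1": 48.0, "CONF2": 64.6, "CONF3": 68.0},
}


def slowdown_vs_conf1(table):
    """For each test, divide every other configuration's average by CONF1's."""
    return {
        test: {
            conf: round(avg / times["CONF1"], 2)
            for conf, avg in times.items()
            if conf != "CONF1"
        }
        for test, times in table.items()
    }


for test, ratios in slowdown_vs_conf1(NO_INDEXING_AVG).items():
    print(test, ratios)
```

A ratio above 1 means the configuration is slower than CONF1 for that test. The same function can be applied to the averages measured during indexing.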
I have two questions:
- The response times of the configurations with replicas are worse than
those of the configuration without replicas (about three times worse in the
sequential test). Is this an expected result?
- Why don't replicas help to reduce the response time while documents are
being inserted and updated in the index?