Hi everyone, I've recently been working on an installation that uses SolrCloud to index 45M documents into 8 shards on 2 VMs running 64-bit Ubuntu (with another 2 identical VMs set up for replicas). The reason we're using so many shards for a relatively small index is that there are complex filtering requirements at search time, to restrict users to items they are licensed to view; initial tests demonstrated that multiple shards would be required to handle this.
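To give a rough idea of the filtering: each search is restricted by the user's licence entitlements, conceptually something like the sketch below. This is purely illustrative - the host, collection name, and field names are placeholders, and the real filters are considerably more complex than a flat OR over licence ids.

    # Illustrative only: a cut-down version of the kind of licence-restricted
    # query we run. Collection and field names below are placeholders.
    import json
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr/items/select"   # placeholder collection

    def licensed_search(text, license_ids):
        """Search for 'text', restricted to items the user is licensed to view."""
        params = {
            "q": text,
            # The real restriction is far more involved; a flat OR over
            # licence ids just shows the shape of the filter.
            "fq": "license_id:(" + " OR ".join(license_ids) + ")",
            "rows": 10,
            "wt": "json",
        }
        url = SOLR + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            return json.loads(resp.read().decode("utf-8"))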
The total size of the index is about 140GB, and each VM has 16GB RAM (32GB total) and 4 CPU units. I know this is far below what would normally be recommended for an index of this size, and I'm working on persuading the customer to increase the RAM (basically, telling them it won't work otherwise). Performance is currently pretty poor, and I would expect more RAM to improve things. However, there are a couple of other oddities that concern me.

The first is that I've been reindexing a fixed set of 500 docs to test indexing and commit performance (with soft commits within 60s). The time taken to complete a hard commit after this is longer than I'd expect, and highly variable - from 10s to 70s (see the P.S. for a rough sketch of how I'm timing this). This makes me wonder whether the SAN (which provides all the storage for these VMs and the customer's several other VMs) is being saturated periodically. I grabbed some iostat output on different occasions to (possibly) show the variability:

    Device:  tps    Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
    sdb      64.50  0.00        2476.00     0         4952
    ...
    sdb      8.90   0.00        348.00      0         6960
    ...
    sdb      1.15   0.00        43.20       0         864

The other thing that confuses me is that after a Solr restart or hard commit, search times average about 1.2s under light load. After running the same set of queries for 5-6 iterations this improves to about 0.1s. However, in either case - cold or warm - iostat reports no device reads at all:

    Device:  tps    Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
    sdb      0.40   0.00        8.00        0         160
    ...
    sdb      0.30   0.00        10.40       0         104

(the writes are due to logging). This implies to me that the 'hot' blocks are being completely cached in RAM - so why the variation in search time, and why does it take several iterations to speed up? The Solr caches are only lightly used by these tests and there are no evictions, and GC is not a significant overhead. Each Solr shard runs in a separate JVM with a 1GB heap.

I don't have a great deal of experience in low-level performance tuning, so please forgive any naivety. Any ideas of what to do next would be greatly appreciated. I don't currently have details of the VM implementation, but I can get hold of them if relevant.

thanks,
Tom
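P.S. In case the methodology matters: the commit and query timings above were taken with a small harness along the lines of the sketch below. The host, collection name, and query set shown here are placeholders, not the real ones.

    # Rough sketch of how the hard-commit and repeated-query timings are taken.
    # Collection name and query set are placeholders.
    import time
    import urllib.parse
    import urllib.request

    BASE = "http://localhost:8983/solr/items"   # placeholder collection

    def timed_get(url):
        """Issue a GET request and return the elapsed wall-clock time in seconds."""
        start = time.time()
        with urllib.request.urlopen(url) as resp:
            resp.read()
        return time.time() - start

    # Hard commit after reindexing the fixed 500-doc set; this is the number
    # that varies between roughly 10s and 70s.
    commit_secs = timed_get(BASE + "/update?commit=true&waitSearcher=true")
    print("hard commit took %.1fs" % commit_secs)

    # Run the same query set repeatedly; the average drops from ~1.2s on the
    # first pass to ~0.1s by the 5th or 6th pass.
    queries = ["example query one", "example query two"]   # placeholder query set
    for iteration in range(6):
        times = []
        for q in queries:
            params = urllib.parse.urlencode({"q": q, "rows": 10, "wt": "json"})
            times.append(timed_get(BASE + "/select?" + params))
        print("pass %d: average %.2fs" % (iteration + 1, sum(times) / len(times)))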