Hello everyone,
I have a small challenge with performance testing a SolrCloud setup. I have 10
shards, and each shard is supposed to hold an index of ~200GB. However, I
only have a single 200GB index, because building another index from
different data would take too long. I hope to somehow use this one index on
all 10 shards and make it behave as if the documents were different on each
shard. Building more indexes from new data is not an option.

A query to SolrCloud is a two-phase operation. First, all shards
receive the query and return IDs and ranking. The merger then removes
duplicate IDs, and the full documents are retrieved.

When I copy this index to all shards and make a request, the following
happens:
Phase one: All shards receive the query and return IDs + ranking
(actually the same set from all shards). This part is realistic enough.
Phase two: The IDs are merged and the documents are retrieved. This part is
not realistic (I/O-wise) compared to a setup where the documents are spread
out between the shards.
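
To make the problem concrete, here is a minimal sketch of the merge step in plain Java. It is not Solr's actual code: the class MergeSketch, the Hit type and the rule for which duplicate is kept (highest score) are all assumptions made for illustration. It only shows how identical ID sets from 10 shards collapse into one set after de-duplication.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the phase-one merge; not Solr's real classes.
class MergeSketch {

    // One phase-one hit: a document ID and its score.
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }

        double score() {
            return score;
        }
    }

    // Merge per-shard hit lists: keep one hit per ID (assumed: the highest
    // score wins), sort best first, and return at most `rows` hits.
    static List<Hit> merge(List<List<Hit>> shardResults, int rows) {
        Map<String, Hit> byId = new LinkedHashMap<>();
        for (List<Hit> shard : shardResults) {
            for (Hit hit : shard) {
                byId.merge(hit.id, hit, (a, b) -> a.score >= b.score ? a : b);
            }
        }
        List<Hit> merged = new ArrayList<>(byId.values());
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return new ArrayList<>(merged.subList(0, Math.min(rows, merged.size())));
    }

    public static void main(String[] args) {
        // Every shard holds a copy of the same index, so every shard
        // returns the same IDs with the same scores.
        List<Hit> sameOnEveryShard = List.of(new Hit("doc1", 2.0), new Hit("doc2", 1.5));
        List<List<Hit>> tenShards = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            tenShards.add(sameOnEveryShard);
        }
        // 20 phase-one hits go in, only 2 distinct IDs come out.
        System.out.println(merge(tenShards, 10).size());
    }
}
```

With copied indexes, phase two therefore fetches far fewer distinct documents than it would if each shard held different data.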

Is there any way I can 'fake' this and have each shard return a
prefixed_id in phase one, with the prefix then undone again when
retrieving the documents in phase two? I have tried making the hack in
org.apache.solr.handler.component.QueryComponent and a few other classes,
but without success (the result set is always empty). I do not need to index
any new documents, which would also be a challenge with this hack because of
the ID hash interval for the shards.
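
As a minimal sketch of the ID mapping described above, in plain Java and independent of Solr: the class name ShardIdPrefix, the '#' separator and the shard names are assumptions chosen for illustration, and the sketch does not show where such a mapping would hook into QueryComponent.

```java
// Hypothetical helper for the prefixed_id idea; not part of Solr.
class ShardIdPrefix {

    // Assumed separator; shard names must not contain it.
    static final char SEPARATOR = '#';

    // Phase one: make an ID unique per shard by prepending the shard name.
    static String addPrefix(String shard, String id) {
        if (shard.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("shard name contains separator: " + shard);
        }
        return shard + SEPARATOR + id;
    }

    // Which shard a prefixed ID came from.
    static String shardOf(String prefixedId) {
        return prefixedId.substring(0, separatorIndex(prefixedId));
    }

    // Phase two: recover the ID that is actually stored in the index.
    static String stripPrefix(String prefixedId) {
        return prefixedId.substring(separatorIndex(prefixedId) + 1);
    }

    private static int separatorIndex(String prefixedId) {
        int i = prefixedId.indexOf(SEPARATOR);
        if (i < 0) {
            throw new IllegalArgumentException("not a prefixed ID: " + prefixedId);
        }
        return i;
    }

    public static void main(String[] args) {
        // The same stored ID looks different on each shard...
        String a = addPrefix("shard1", "doc42");
        String b = addPrefix("shard2", "doc42");
        System.out.println(a.equals(b));
        // ...and maps back to the original ID for the document fetch.
        System.out.println(stripPrefix(a).equals(stripPrefix(b)));
    }
}
```

Both directions have to be applied consistently: if phase two looks up the prefixed ID without stripping it, no stored document will match.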

Does anyone have a good idea how to make this hack work?

From,
Thomas Egense
