Hello everyone,

I have a small challenge performance testing a SolrCloud setup. I have 10 shards, and each shard is supposed to have an index size of ~200GB. However, I only have a single 200GB index, because it would take too long to build another index from different data. I hope to somehow use this one index on all 10 shards and make it behave as if the documents were different on each shard. Building more indexes from new data is not an option.
Querying a SolrCloud is a two-phase operation. First, all shards receive the query and return IDs and ranking. The merger then removes duplicate IDs, and after that the full documents are retrieved.

When I copy this index to all shards and make a request, the following happens:

Phase one: All shards receive the query and return IDs + ranking (actually the same set from every shard). This part is realistic enough.

Phase two: The IDs are merged, and retrieving the documents is not realistic (IO-wise) compared to a setup where the documents really are spread out between the shards.

Is there any way I can 'fake' this somehow and have the shards return a prefixed_id in phase one, which then has to be undone when retrieving the documents in phase two?

I have tried making the hack in org.apache.solr.handler.component.QueryComponent and a few other classes, but with no success (the result set is always empty). I do not need to index any new documents, which would also be a challenge with this hack due to the ID hash interval for the shards.

Does anyone have a good idea for how to make this hack work?

From,
Thomas Egense
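
P.S. To make the prefixed_id idea concrete, here is a minimal sketch of the ID round trip on its own, outside Solr. Everything in it is an assumption made for illustration: the class ShardIdPrefixer, its methods, and the "s<shard>_" prefix format are hypothetical names, not Solr API, and the sketch says nothing about where in QueryComponent such a mapping would have to be applied.

```java
// Hypothetical helper, not part of Solr: maps an ID from the copied index
// to a shard-unique ID (phase one) and back again (phase two).
final class ShardIdPrefixer {
    // Assumed prefix format: "s<shard>_", e.g. "s3_doc42".
    private static final char SEP = '_';

    private ShardIdPrefixer() {}

    // Phase one: make the ID unique to one shard so the merger
    // does not collapse identical IDs coming from different shards.
    static String prefix(int shard, String id) {
        return "s" + shard + SEP + id;
    }

    // Phase two: recover the ID that is actually stored in the index.
    // Only the first separator is used, so IDs containing '_' survive.
    static String strip(String prefixedId) {
        return prefixedId.substring(separatorIndex(prefixedId) + 1);
    }

    // The shard number encoded in a prefixed ID.
    static int shardOf(String prefixedId) {
        return Integer.parseInt(prefixedId.substring(1, separatorIndex(prefixedId)));
    }

    // Position of the separator; rejects strings without an "s<shard>_" prefix.
    private static int separatorIndex(String prefixedId) {
        int sep = prefixedId.indexOf(SEP);
        if (!prefixedId.startsWith("s") || sep < 2) {
            throw new IllegalArgumentException("Not a prefixed id: " + prefixedId);
        }
        return sep;
    }
}
```

With this mapping, shard 3 would report "doc42" as ShardIdPrefixer.prefix(3, "doc42"), and the fetch in phase two would look up ShardIdPrefixer.strip(...) of that value on the shard given by ShardIdPrefixer.shardOf(...).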