In thinking about the testing for QueryLimits, I'm somewhat unsatisfied. Ideally we would be validating limit expirations at various points in the request process, but the current tests are all thumbs: they do things like wasting CPU when query limits are checked. That reliably generates a limit expiration, but it's bad for test times and makes it difficult to predict exactly where in the process the failure occurs.
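For anyone who hasn't looked at those tests, the CPU-burning approach amounts to roughly the following. This is a minimal sketch and not the actual test code: CpuBudget and burnUntilExceeded are hypothetical names standing in for a CPU-time limit and the test's busy loop, and the 20 ms budget is an arbitrary value chosen for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuBurnSketch {

  // Hypothetical stand-in for a CPU-time limit; NOT the real QueryLimits API.
  static final class CpuBudget {
    private final ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    private final boolean cpuClock =
        bean.isCurrentThreadCpuTimeSupported() && bean.isThreadCpuTimeEnabled();
    private final long budgetNanos;
    private final long startNanos;

    CpuBudget(long budgetNanos) {
      this.budgetNanos = budgetNanos;
      this.startNanos = now();
    }

    // Thread CPU time when the JVM provides it, otherwise wall-clock time.
    private long now() {
      return cpuClock ? bean.getCurrentThreadCpuTime() : System.nanoTime();
    }

    boolean exceeded() {
      return now() - startNanos > budgetNanos;
    }
  }

  // Spins until the budget reports expiry; returns how many checks it took.
  static long burnUntilExceeded(CpuBudget budget) {
    long spins = 0;
    while (!budget.exceeded()) {
      spins++;
    }
    return spins;
  }

  public static void main(String[] args) {
    CpuBudget budget = new CpuBudget(20_000_000L); // 20 ms, arbitrary
    long wallStart = System.nanoTime();
    long spins = burnUntilExceeded(budget);
    long wallMs = (System.nanoTime() - wallStart) / 1_000_000L;
    System.out.println("expired after " + spins + " checks, ~" + wallMs + " ms wall time");
  }
}
```

Every run pays at least the full budget in CPU time, and the number of checks before expiry varies with the machine and scheduler, which is why the point of failure is hard to pin down.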
This morning I spent some time trying to think of places where it might be good to explicitly cause a limit expiration via test injection, and as I did so I realized that it would be equally good to be able to inject other tragedies, such as exceptions or pauses. The thought experiment below imagines a SolrCloudTestCase that simulates a distributed request, where we want to understand and validate correct behavior when something disruptive happens at various points in the coordinator or in the shard requests.

I'm still pondering whether this is achievable, but I would like feedback on the following potential testing points. I expressed them as an enum because that helped me think about it, but I'm not wedded to the actual enum here. I wrote it as limits-oriented, but am now pondering whether it could be more generalized.

    public enum LimitExpirationLocation {
      // Limits should be expired at the entry into the coordinator's
      // SolrDispatchFilter
      BEFORE_ALL_COORD,

      // Limits should fail prior to processing the request with a
      // specified component
      BEFORE_COMP_REQ_COORD,

      // Limits should expire after some requests to shards have been issued
      // by the coordinator, but before any response is processed
      BEFORE_ONE_COORD,

      // Limits should expire at entry to SolrDispatchFilter for one
      // shard request
      BEFORE_SRCH_ONE_SHARD,

      // Limits should expire at entry to SolrDispatchFilter for all
      // shard requests
      BEFORE_SRCH_ALL_SHARD,

      // Limits should expire during the Lucene search process
      // todo: REQUEST PURPOSES such as TOP_IDs?
      DURING_LUCENE,

      // Limits should expire after the Lucene search, before results are
      // rendered, for just one shard
      BEFORE_REND_ONE_SHARD,

      // Limits should expire after the Lucene search, before results are
      // rendered, on all shards
      BEFORE_REND_ALL_SHARD,

      // Limits should expire after the first response is received by
      // the coordinator
      AFTER_ONE_COORD,

      // Limits should expire after all responses have returned to the
      // coordinator, but before the response has been rendered into
      // json/xml/etc
      AFTER_RESPONSES_COORD,

      // Limits should expire before a component is called
      BEFORE_COMP_RESP_COORD,

      // Limits should expire while the coordinator is interpreting
      // the responses
      DURING_RENDER_COORD,

      // Limits should expire at or after the point where the response is
      // being returned (typically should have no effect)
      DURING_TRANSMIT_COORD
    }

Thoughts? Any interesting points to expire/fail that I missed?

One thing I'm thinking is that it would be super useful for any such test injection to be able to ensure that it only blows up the target request. A supplied request ID propagated to the shard requests might help with that. This sounds something like request tracing, but I'm not sure whether the tracing infrastructure would expose anything useful there (I haven't looked). Partly I'm leery of forcing such tests to employ tracing; I also haven't yet checked whether tracing works properly in CloudTestCases.

-Gus

--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)