Generic Test Injection Points for Cloud

Gus Heck Tue, 27 Aug 2024 07:41:15 -0700

In thinking about the testing for QueryLimits, I'm somewhat unsatisfied.
It seems like it would be ideal to validate limit expirations at various
points in the request process, but the current tests are all thumbs: they
do things like wasting CPU when query limits are checked. That reliably
generates a limit expiration, but it's bad for test times, and it's
difficult to predict exactly where in the process the failure occurs.


This morning I spent some time trying to think of places it might be good
to explicitly cause a Limit Expiration via test injection, and as I did so
I realized that it would be equally good to be able to inject other
tragedies such as exceptions or pauses as well.

The thought experiment below imagines a SolrCloudTestCase that simulates a
distributed request, where we would want to understand/validate correct
behavior when something disruptive happens at various points in the
Coordinator or in the Shard requests. I'm still pondering whether this is
achievable, but I would like feedback on the following potential testing
points (I expressed it as an enum because that helped me think about it,
but I'm not wedded to the actual enum here). I was writing it as Limits
oriented, but am now pondering if it could be more generalized.

  public enum LimitExpirationLocation {
    // Limits should be expired at the entry into the coordinator's
    // Solr Dispatch Filter
    BEFORE_ALL_COORD,

    // Limits should fail prior to processing the request with a
    // specified component
    BEFORE_COMP_REQ_COORD,

    // Limits should expire after some requests to shards have been
    // issued by the coordinator, but before any response is processed.
    BEFORE_ONE_COORD,

    // Limits should expire at entry to SolrDispatchFilter for one
    // shard request
    BEFORE_SRCH_ONE_SHARD,

    // Limits should expire at entry to SolrDispatchFilter for all
    // shard requests
    BEFORE_SRCH_ALL_SHARD,

    // Limits should expire during the lucene search process
    DURING_LUCENE,

    // todo: REQUEST PURPOSES such as TOP_IDs?

    // Limits should expire after lucene search, before results are
    // rendered for just one shard
    BEFORE_REND_ONE_SHARD,

    // Limits should expire after lucene search, before results are
    // rendered on all shards.
    BEFORE_REND_ALL_SHARD,

    // Limits should expire after the first response is received by
    // the coordinator
    AFTER_ONE_COORD,

    // Limits should expire after all responses have returned to the
    // coordinator but before the response has been rendered into
    // json/xml/etc
    AFTER_RESPONSES_COORD,

    // Limits should expire before a component is called
    BEFORE_COMP_RESP_COORD,

    // Limits should expire while the coordinator is interpreting the
    // responses
    DURING_RENDER_COORD,

    // Limits should expire at or after the response is being returned
    // (typically should have no effect)
    DURING_TRANSMIT_COORD
  }
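
As a rough illustration of what "more generalized" could look like, here
is a minimal sketch in which a location is decoupled from what happens
there (an exception, a pause, or a forced limit expiration). Every name in
it (TestInjectionSketch, InjectionPoint, Disruption, arm, hit) is
hypothetical and invented for this example; none of it is existing Solr
API, and only a few of the locations above are repeated to keep it short.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; none of these names exist in Solr. */
final class TestInjectionSketch {

  /** A subset of the locations above, named independently of limits. */
  enum InjectionPoint {
    BEFORE_ALL_COORD, BEFORE_SRCH_ONE_SHARD, DURING_LUCENE, AFTER_RESPONSES_COORD
  }

  /** What should go wrong when an armed point is reached. */
  interface Disruption {
    void apply() throws Exception;
  }

  private static final Map<InjectionPoint, Disruption> ARMED = new ConcurrentHashMap<>();

  /** Called by a test to arm one location with one disruption. */
  static void arm(InjectionPoint point, Disruption disruption) {
    ARMED.put(point, disruption);
  }

  /** Called by a test in teardown so nothing leaks into other tests. */
  static void reset() {
    ARMED.clear();
  }

  /** Would be called from request handling code; a no-op unless armed. */
  static void hit(InjectionPoint point) throws Exception {
    Disruption disruption = ARMED.get(point);
    if (disruption != null) {
      disruption.apply();
    }
  }

  /** Disruption that throws the supplied exception. */
  static Disruption throwing(Exception e) {
    return () -> { throw e; };
  }

  /** Disruption that stalls the calling thread for the given time. */
  static Disruption pausing(long millis) {
    return () -> Thread.sleep(millis);
  }
}
```

A test would call something like
arm(InjectionPoint.DURING_LUCENE, throwing(new IllegalStateException("boom")))
before issuing the query and reset() afterwards. A limit expiration could
be just another Disruption, for example one that flips a flag consulted by
a test-only limit, though how that would hook into QueryLimits is left
open here.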

Thoughts? Any interesting points to expire/fail that I missed? One thing
I'm thinking is that it would be super useful for any such test injection
to be able to ensure that it only blows up the target request, so a
supplied request ID propagated to shard requests might help. That sounds
something like request tracing, but I'm not sure if the tracing
infrastructure would expose anything useful there (haven't looked). Partly
I'm leery of forcing such tests to employ tracing (I'm not sure if that's
working properly in CloudTestCases either; haven't looked at that yet).
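
To make the "only blow up the target request" idea concrete, here is a
minimal sketch of an injection that fires only when the ID carried by the
current request matches the one the test armed. The class and method names
(TargetedInjectionSketch, arm, onInjectionPoint) are hypothetical, and how
the ID would actually travel from the coordinator to shard requests (a
request parameter, a header, tracing context) is an open assumption that
the sketch does not address.

```java
import java.util.Objects;

/** Hypothetical sketch only; none of these names exist in Solr. */
final class TargetedInjectionSketch {

  /** Immutable pairing of the targeted request ID and what to do to it. */
  private static final class Armed {
    final String requestId;
    final Runnable action;

    Armed(String requestId, Runnable action) {
      this.requestId = Objects.requireNonNull(requestId);
      this.action = Objects.requireNonNull(action);
    }
  }

  // A single volatile reference keeps the ID and the action consistent.
  private static volatile Armed armed;

  /** Called by a test: disrupt only the request carrying this ID. */
  static void arm(String requestId, Runnable action) {
    armed = new Armed(requestId, action);
  }

  /** Called by a test in teardown. */
  static void reset() {
    armed = null;
  }

  /**
   * Would be called at an injection point with the ID of the request
   * currently being processed (null if the request carries none).
   */
  static void onInjectionPoint(String currentRequestId) {
    Armed a = armed;
    if (a != null && a.requestId.equals(currentRequestId)) {
      a.action.run();
    }
  }
}
```

With this shape, unrelated requests that pass through the same injection
point, such as background or concurrent test traffic, would see no effect,
and requests with no ID at all would never match.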

-Gus
-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
