I've noticed that some of my queries take so long (5 min+) that by the time they return, there is no longer any plausible use for the search results. I've started calling these zombie queries because, well, they should be dead, but they just won't die. Instead, they stick around, wasting my Solr box's CPU, RAM, and I/O resources, and potentially causing more legitimate queries to stack up. (Regarding "stacking up", see SOLR-138.)
I may be able to prevent some of this by optimizing my index settings and by disallowing certain things, such as prefix wildcard queries (e.g. *ar). Right now, though, I'm most interested in figuring out how to get more robust server-side search timeouts in Solr. That would seem to strike a good balance between two goals:

1) I would like to allow users to attempt potentially expensive queries, such as queries with lots of wildcards or ranges.

2) I would like to make sure that potentially expensive queries don't turn into zombies -- especially long-lasting zombies.

For example, I think some of my users might be willing to wait a minute or two for certain classes of search to complete. But after that point, I'd really like to say enough is enough.

[Background]

While my load is pretty low (it's not a public-facing site), some of my queries are monsters that can take, say, over 5 minutes. (I don't know how much longer than 5 minutes they might take. Some of them might take hours, for all I know, if allowed to run to completion!) The biggest culprits currently seem to be wildcard queries. This is made worse by the fact that I've allowed prefix wildcard searches on an index with a large # of terms, and worse yet by word bigram indexing.

I've implemented the "timeAllowed" search timeout feature introduced in SOLR-502, and this does catch some searches that would have become zombies (some proximity searches, for example). But the timeAllowed mechanism does not catch everything. And, as I understand it, it's powerless to do anything about, say, a wildcard expansion that is taking forever.

The question is how to proceed.

[Option 1: Wait for someone to bring timeAllowed support to more parts of Solr search]

This might be nice, and I sort of assume it will happen eventually. I'd like a more immediate solution, though. Any thoughts on how hard it would be to add the timeout to, say, wildcard expansion?
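(For reference, my current timeAllowed usage looks roughly like this -- host, port, and the particular value are just illustrative:

```
http://localhost:8983/solr/select?q=foo*&timeAllowed=120000
```

where timeAllowed is in milliseconds. As far as I can tell it only bounds the document-collection phase, which is why the query-rewriting/expansion work escapes it.)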
I haven't figured out whether I know enough about Solr yet to work on this myself.

[Option 2: Add gross timeout support to StandardRequestHandler?]

What if I modified StandardRequestHandler so that, when it was invoked, the following would happen:

* spawn a new thread t to do the work that StandardRequestHandler would normally do
* start thread t
* sleep, waiting either for thread t to finish or for a timer to go off
* after waking up, check whether the timer went off; if so, terminate thread t

This would kill any runaway zombie queries. But maybe it would also have horrible side effects. Is it wishful thinking to believe that this might not screw up reference counting, or create deadlocks, or anything else?

[Option 3: Servlet container-level solutions?]

I thought Jetty and friends might have an option along the lines of "if a request is taking longer than x seconds, then abort the thread handling it". This seems troublesome in practice, though:

1) I can't find a servlet container whose documentation clearly states that this is possible.

2) I played with Jetty, and maxIdleTime sounded like it *might* cause this behavior, but experiments suggest otherwise.

3) This behavior sounds dangerous, unless you can convince the servlet container to abort only index-reading threads, while leaving index-writing threads alone.

Thanks for any advice,
Chris
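P.S. In case it clarifies Option 2, here's the rough shape of the spawn-and-wait logic I have in mind (class and method names are made up, not actual Solr code). Note the likely fatal flaw: Future.cancel(true) only *interrupts* the worker thread, and Thread.stop() is deprecated as unsafe, so a thread stuck in code that never checks for interruption can't actually be killed this way:

```java
import java.util.concurrent.*;

public class TimeoutWrapper {
    // Run a task with a hard timeout; on timeout, interrupt the worker.
    // Interruption only helps if the task periodically checks its
    // interrupted status -- which is exactly what long-running index
    // code may not do.
    static <T> T runWithTimeout(Callable<T> task, long timeoutMs)
            throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<T> future = exec.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: sends an interrupt only
            throw e;
        } finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast task completes normally.
        System.out.println(runWithTimeout(() -> "done", 1000));
        // A slow task is abandoned after the timeout fires.
        try {
            runWithTimeout(() -> { Thread.sleep(10_000); return "never"; },
                           200);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

Even if the interrupt never lands, the request-handling thread at least gets to return an error to the client promptly; the worker thread would still leak until it finishes on its own, which is part of why I'm nervous about this option.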