We’re seeing a very odd phenomenon on a client project here in the UK.  
Queries (as in read-only, no updates) slow down dramatically (from 1.2 seconds 
to 30-40 seconds or longer) while “Jobs” are running that do relatively light 
updates.  But only on multi-node clusters (3 nodes in this case).

Details:
MarkLogic 8.0-3.2
Production (pre-launch): A three node cluster running on Linux in AWS
JVM app nodes (also in AWS) that perform different tasks, talking to the same 
ML cluster

QA is a single E+D MarkLogic node in AWS

   The operational scenario is this.

o Prod cluster (3 nodes) has about 14+ million documents (articles and books).
o Some number of “API app nodes” which present a REST API dedicated to queries
o Some number of “worker bee” nodes that process batch jobs for ingestion and 
content enrichment

   The intention is that the worker bees handle the slow, lumpy work of 
processing and validating content before ingesting it into ML.  There is a job 
processing framework that is used on the worker bees to queue, throttle and 
process jobs asynchronously.

   The API nodes respond to queries from the web app front end and other 
clients within the system to do searches, fetch documents, etc.  These, for the 
most part, are pure queries that don’t do any updates.

   The issue we’ve bumped up against is this: We have a worker bee job that 
enriches content by, for a particular content document (such as an article), 
taking each associated binary and submitting it to a thumbnail service.  A 
thread then polls the service until the results are ready.  Those results are 
then written to the content document with URIs of the thumbnail images.

   In the course of processing these jobs, this is what happens (several can 
run at once, but we see this problem even with only one running):

   o A job is pulled off the queue.  The queue is just a bunch of job XML 
documents in ML.
   o The job’s state is updated to running in its XML doc
   o Code starts running in the JVM to process the job
   o During execution, messages can be logged for the job, which results in a 
node insert to the job XML doc
   o The thumbnail job reads a list of binary references from the content doc
   o For each one it issues a request to an external service, then starts a 
polling thread to check for good completion
      o There can be up to 10 of these polling threads going at once
      o They are waiting most of the time, not talking to ML
   o Messages can be logged to the job doc in the previous step, but the 
content doc is not touched
   o When the thumbnail result is ready, then the results are inserted into the 
content doc in ML
   o The job finishes up and updates the state of the job doc
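The bounded polling described above (at most 10 poller threads, each spending most of its time asleep between checks) can be sketched in plain Java. This is a minimal illustration, not our actual job framework: the class and method names are hypothetical, and the thumbnail-service poll is stubbed out with a short sleep.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the worker-bee polling pattern: each binary gets
// a poller task, but a semaphore caps live pollers at 10.
public class PollerSketch {
    static final int MAX_POLLERS = 10;

    // Returns the highest number of pollers observed running at once.
    static int runPollers(int binaries) throws InterruptedException {
        Semaphore permits = new Semaphore(MAX_POLLERS);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        for (int i = 0; i < binaries; i++) {
            pool.submit(() -> {
                try {
                    permits.acquire();  // blocks once 10 pollers are live
                    int now = active.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    // Stand-in for "poll the thumbnail service until done";
                    // the real code waits 4-5 seconds between polls.
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    active.decrementAndGet();
                    permits.release();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak concurrent pollers: " + runPollers(25));
    }
}
```

The point of the sketch is that the pollers themselves barely touch ML; almost all of their wall-clock time is spent blocked in sleep or on the semaphore.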

   There is some lock contention for the job doc from multiple threads logging 
messages, but it’s not normally significant.  We see the deadlocks logged by ML 
at debug level; they seem to resolve within a few milliseconds as expected, and 
the updates always complete quickly.

   When the results come back and the content doc is updated, there can be 
contention there as well.  Some jitter is introduced to prevent the pollers 
from all waking up at once, but again this shouldn’t matter even if they do.
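For what it’s worth, the jitter is just a random offset on top of the base poll delay, so pollers that start together drift apart. A minimal sketch, assuming a 4 s base and up to 1 s of spread (the names and exact constants here are illustrative, not our production values):

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the poll-interval jitter: a fixed base delay
// plus a random 0..1 s offset, so pollers don't all wake simultaneously.
public class PollJitter {
    static final long BASE_MS = 4000;    // base delay between polls
    static final long JITTER_MS = 1000;  // random spread on top

    static long nextDelayMs() {
        return BASE_MS + ThreadLocalRandom.current().nextLong(JITTER_MS + 1);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println("next poll in " + nextDelayMs() + " ms");
        }
    }
}
```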

   The odd phenomenon is that while one of these jobs is running (spending most 
of its time waiting, about 4-5 seconds between polls) on one of the worker bee 
nodes, a query sent from one of the API JVM nodes will take many tens of 
seconds to complete.  Once the job has finished, query times return to normal 
(a few milliseconds to 1-2 seconds, depending on the specifics of the query).

   So the mystery is this: why would a pure query apparently block for a long 
time in this scenario?  Queries should run lock free, so even if there is lock 
contention happening with the thumbnail job, queries should not be held up.  
MarkLogic is not busy at all; nothing else is going on.

   This doesn’t happen on a single node, which makes me suspect something to do 
with cross-node lock propagation.  But like I said, logging doesn’t indicate 
any sort of pathological lock storm or anything like that.

   If someone can give me some assurance that the latest ML release will solve 
this problem I’d be happy to recommend that to the client.  But I’ve reviewed 
all the documented bug fixes since 8.0-3 and nothing seems relevant.

   This is a rather urgent problem, since all this thumbnail processing must be 
completed soon without making the rest of the system unusable.

   Thanks in advance.

---
Ron Hitchens {[email protected]}  +44 7879 358212

_______________________________________________
General mailing list
[email protected]
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general