Hi Ron,

It is hard to say for sure, but there have been many bug fixes since 8.0-3.2 that could account for some or all of this.
Do you have an environment where you can try out the latest (8.0-5.4)?

-Danny

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Ron Hitchens
Sent: Thursday, June 16, 2016 1:18 PM
To: MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] Mysterious, Dramatic Query Slowdown on Multi-Node Cluster

We’re seeing a very odd phenomenon on a client project here in the UK. Queries (as in read-only, no updates) slow down dramatically (from 1.2 seconds to 30-40 seconds or longer) while “Jobs” are running that do relatively light updates. But only on multi-node clusters (three nodes in this case).

Details:

o MarkLogic 8.0-3.2
o Production (pre-launch): a three-node cluster running on Linux in AWS
o JVM app nodes (also in AWS) that perform different tasks, talking to the same ML cluster
o QA is a single E+D MarkLogic node in AWS

The operational scenario is this:

o The prod cluster (3 nodes) has about 14+ million documents (articles and books).
o Some number of “API app nodes” present a REST API dedicated to queries.
o Some number of “worker bee” nodes process batch jobs for ingestion and content enrichment.

The intention is that the worker bees handle the slow, lumpy work of processing and validating content before ingesting it into ML. A job-processing framework on the worker bees queues, throttles and processes jobs asynchronously. The API nodes respond to queries from the web app front end and other clients within the system to do searches, fetch documents, etc. These, for the most part, are pure queries that don’t do any updates.

The issue we’ve bumped up against is this: We have a worker bee job that enriches content by, for a particular content document (such as an article), taking each associated binary and submitting it to a thumbnail service. A thread then polls the service until the results are ready. Those results are then written to the content document as URIs of the thumbnail images.
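The submit-then-poll pattern described above could be sketched roughly as follows. This is a minimal sketch: the thumbnail service's real API is not shown in the post, so the `ThumbnailService` interface, method names, and retry limits here are all illustrative assumptions.

```java
import java.util.List;
import java.util.Optional;

public class ThumbnailJob {

    // Hypothetical stand-in for the external thumbnail service; the real
    // service's API is not described in the post.
    interface ThumbnailService {
        String submit(String binaryUri);                 // returns a request id
        Optional<List<String>> poll(String requestId);   // empty until results ready
    }

    // Submit one binary, then poll until the thumbnail URIs come back,
    // sleeping between attempts. The caller would write the returned URIs
    // into the content doc in ML (not shown here).
    static List<String> enrich(ThumbnailService svc, String binaryUri,
                               long pollIntervalMs, int maxPolls)
            throws InterruptedException {
        String id = svc.submit(binaryUri);
        for (int i = 0; i < maxPolls; i++) {
            Optional<List<String>> result = svc.poll(id);
            if (result.isPresent()) {
                return result.get();
            }
            Thread.sleep(pollIntervalMs);
        }
        throw new IllegalStateException(
            "thumbnails not ready after " + maxPolls + " polls");
    }
}
```

Per the post, up to 10 such polling threads can run at once, and while waiting they do not talk to ML at all.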
In the course of processing these jobs, this is what happens (several can run at once, but we see this problem even with only one running):

o A job is pulled off the queue. The queue is just a bunch of job XML documents in ML.
o The job’s state is updated to running in its XML doc.
o Code starts running in the JVM to process the job.
o During execution, messages can be logged for the job, which results in a node insert into the job XML doc.
o The thumbnail job reads a list of binary references from the content doc.
o For each one it issues a request to an external service, then starts a polling thread to check for successful completion.
o There can be up to 10 of these polling threads going at once.
o They are waiting most of the time, not talking to ML.
o Messages can be logged to the job doc as in the earlier step, but the content doc is not touched.
o When the thumbnail result is ready, the results are inserted into the content doc in ML.
o The job finishes up and updates the state of the job doc.

There is some lock contention for the job doc from multiple threads logging messages, but it’s not normally significant. We see the deadlocks logged by ML at debug level; they seem to resolve within a few milliseconds, as expected, and the updates always complete quickly. When the results come back and the content doc is updated, there can be contention there as well. Some jitter is introduced to prevent the pollers from all waking up at once, but again this shouldn’t matter even if they do.

The odd phenomenon is that while one of these jobs is running (spending most of its time waiting, about 4-5 seconds between polls) on one of the worker bee nodes, a query sent from one of the API JVM nodes will take many tens of seconds to complete. Once the job has finished, query times return to normal (a few milliseconds to 1-2 seconds, depending on the specifics of the query).

So the mystery is this: why would a pure query apparently block for a long time in this scenario?
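The jitter mentioned above, which spreads out the pollers' wakeups so they don't all contend on the job doc at the same instant, might look something like this. The base interval matches the 4-5 seconds between polls described in the post; the jitter bound is an illustrative assumption, not the project's actual value.

```java
import java.util.Random;

public class PollJitter {
    static final long BASE_DELAY_MS = 4000;  // ~4-5 s between polls, per the post
    static final long MAX_JITTER_MS = 1000;  // illustrative: spread wakeups over an extra second

    // Next delay for a polling thread: the base interval plus a random
    // jitter, so that concurrent pollers wake (and briefly lock the job
    // doc for logging) at slightly different times.
    static long nextDelayMs(Random rng) {
        return BASE_DELAY_MS + (long) (rng.nextDouble() * MAX_JITTER_MS);
    }
}
```

Each polling thread would sleep for `nextDelayMs(rng)` milliseconds between service checks instead of a fixed interval.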
Queries should run lock free, so even if there is lock contention happening with the thumbnail job, queries should not be held up. MarkLogic is not busy at all; nothing else is going on. This doesn’t happen on a single node, which makes me suspect something to do with cross-node lock propagation. But as I said, logging doesn’t indicate any sort of pathological lock storm or anything like that.

If someone can give me some assurance that the latest ML release will solve this problem, I’d be happy to recommend that to the client. But I’ve reviewed all the documented bug fixes since 8.0-3 and nothing seems relevant. This is a rather urgent problem, since all this thumbnail processing must be completed soon without making the rest of the system unusable.

Thanks in advance.

--- Ron Hitchens {[email protected]} +44 7879 358212

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
