shenyu0127 commented on issue #11099:
URL: https://github.com/apache/pinot/issues/11099#issuecomment-1661325913

   The test is still flaky. 
   
   In https://github.com/apache/pinot/pull/11121, I tried to fix the test by
   - reducing the data size so the servers do not go OOM.
   - reducing the RAM threshold for killing a thread so that we can still kill 
a RAM expensive thread
   
   Now the problem is if the RAM threshold is low, it might kill the broker's 
main thread for the query [1], then the test hangs for ever.
   
   I tried to increase the RAM threshold, the problem is sometimes no broker 
sub thread gets killed and the test fails. It is hard to control the data size 
and RAM sampling to ensure we always kill a broker sub thread but not the 
broker main thread.
   
   Killing a broker main thread is also an expected behavior but it is hard to 
assert that in a test (the exception thrown when the main thread is killed is 
[2]).
   
   [1]
   ```
   2023-07-31T11:11:12.0617231Z 11:11:11.877 INFO 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] [brw-0] 
Adding thread to _threadLocalEntry: brw-0
   2023-07-31T11:11:12.0618241Z 11:11:11.878 WARN 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] 
[CPUMemThreadAccountant] Heap used bytes 2045771776 exceeds critical level 
1717986918
   2023-07-31T11:11:12.0631653Z 11:11:11.879 ERROR 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] 
[CPUMemThreadAccountant] Query 729626673000000002 got picked because using 
11104832 bytes of memory, actual kill committed true}
   2023-07-31T11:11:12.0637965Z 11:11:11.879 ERROR 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] 
[CPUMemThreadAccountant] Current task status recorded is 
{Thread[jersey-server-managed-async-executor-0,5,main]=ThreadEntry{_currentThreadTaskStatus=idle,
 _errorStatus=null}, 
Thread[pqw-0,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqr-0,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-2,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[jersey-server-managed-async-executor-1,5,main]=ThreadEntry{_currentThreadTaskStatus=idle,
 _errorStatus=null}, 
Thread[pqw-0,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-1,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqr-0,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqr-1,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, _errorStatus=nu
 ll}, Thread[pqw-0,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqr-1,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-2,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[brw-0,7,main]=ThreadEntry{_currentThreadTaskStatus=TaskEntry{_queryId='729626673000000002',
 _taskId=0, _rootThread=Thread[jersey-server-managed-async-executor-1,5,main]}, 
_errorStatus=null}, 
Thread[pqr-1,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-2,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqr-0,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-1,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-1,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}}
   2023-07-31T11:11:12.0644004Z 11:11:11.976 WARN 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] 
[CPUMemThreadAccountant] Query aggregation results 
{729626673000000002=AggregatedStats{_queryId='729626673000000002', 
_anchorThread=Thread[jersey-server-managed-async-executor-1,5,main], 
_isAnchorThread=true, _exceptionAtomicReference=null, _allocatedBytes=11104832, 
_cpuNS=0}} for the previous kill.
   2023-07-31T11:11:12.0645488Z 11:11:11.997 INFO 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] [pqw-3] 
Adding thread to _threadLocalEntry: pqw-3
   2023-07-31T11:11:12.0646521Z 11:11:12.005 INFO 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] [pqw-3] 
Adding thread to _threadLocalEntry: pqw-3
   2023-07-31T11:11:12.0647593Z 11:11:12.006 WARN 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] 
[CPUMemThreadAccountant] Heap used bytes 1965031424 exceeds critical level 
1717986918
   2023-07-31T11:11:12.0648634Z 11:11:12.006 ERROR 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] 
[CPUMemThreadAccountant] Query 729626673000000002 got picked because using 
11104832 bytes of memory, actual kill committed true}
   2023-07-31T11:11:12.0656280Z 11:11:12.006 ERROR 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] 
[CPUMemThreadAccountant] Current task status recorded is 
{Thread[jersey-server-managed-async-executor-0,5,main]=ThreadEntry{_currentThreadTaskStatus=idle,
 _errorStatus=null}, 
Thread[pqw-0,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqr-0,7,main]=ThreadEntry{_currentThreadTaskStatus=TaskEntry{_queryId='Broker_10.1.0.58_18099_729626673000000003_O',
 _taskId=-1, _rootThread=Thread[pqr-0,7,main]}, _errorStatus=null}, 
Thread[pqw-2,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-3,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[jersey-server-managed-async-executor-1,5,main]=ThreadEntry{_currentThreadTaskStatus=idle,
 _errorStatus=null}, 
Thread[pqw-0,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, Thread[pqw-1,5,main]=ThreadEntry{_currentThreadTaskStatu
 s=idle, _errorStatus=null}, 
Thread[pqr-0,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqr-1,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-0,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqr-1,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-2,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[brw-0,7,main]=ThreadEntry{_currentThreadTaskStatus=TaskEntry{_queryId='729626673000000002',
 _taskId=0, _rootThread=Thread[jersey-server-managed-async-executor-1,5,main]}, 
_errorStatus=java.lang.RuntimeException:  Query 729626673000000002 got killed 
because using 11104832 bytes of memory on BROKER: Broker_10.1.0.58_18099, 
exceeding the quota}, 
Thread[pqr-1,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-2,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, Thread[pqw-3,5,main]=ThreadEn
 try{_currentThreadTaskStatus=idle, _errorStatus=null}, 
Thread[pqr-0,7,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-1,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}, 
Thread[pqw-1,5,main]=ThreadEntry{_currentThreadTaskStatus=idle, 
_errorStatus=null}}
   2023-07-31T11:11:12.0663482Z 11:11:12.007 WARN 
[PerQueryCPUMemAccountantFactory$PerQueryCPUMemResourceUsageAccountant] 
[CPUMemThreadAccountant] Query aggregation results 
{729626673000000003=AggregatedStats{_queryId='729626673000000003', 
_anchorThread=Thread[jersey-server-managed-async-executor-1,5,main], 
_isAnchorThread=true, _exceptionAtomicReference=null, _allocatedBytes=0, 
_cpuNS=0}, 729626673000000002=AggregatedStats{_queryId='729626673000000002', 
_anchorThread=Thread[jersey-server-managed-async-executor-1,5,main], 
_isAnchorThread=false, _exceptionAtomicReference=java.lang.RuntimeException:  
Query 729626673000000002 got killed because using 11104832 bytes of memory on 
BROKER: Broker_10.1.0.58_18099, exceeding the quota, _allocatedBytes=11104832, 
_cpuNS=0}} for the previous kill.
   2023-07-31T11:11:17.8392954Z 11:11:12.006 ERROR [PinotClientRequest] 
[jersey-server-managed-async-executor-1] Caught exception while processing POST 
request
   2023-07-31T11:11:17.8394641Z java.lang.InterruptedException: null
   2023-07-31T11:11:17.8396275Z         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081)
 ~[?:?]
   2023-07-31T11:11:17.8398661Z         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1369)
 ~[?:?]
   2023-07-31T11:11:17.8529568Z         at 
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:278) ~[?:?]
   2023-07-31T11:11:17.8530816Z         at 
org.apache.pinot.core.transport.AsyncQueryResponse.getFinalResponses(AsyncQueryResponse.java:86)
 ~[classes/:?]
   2023-07-31T11:11:17.8532048Z         at 
org.apache.pinot.broker.requesthandler.SingleConnectionBrokerRequestHandler.processBrokerRequest(SingleConnectionBrokerRequestHandler.java:115)
 ~[classes/:?]
   2023-07-31T11:11:17.8533325Z         at 
org.apache.pinot.broker.requesthandler.BaseBrokerRequestHandler.handleRequest(BaseBrokerRequestHandler.java:701)
 ~[classes/:?]
   2023-07-31T11:11:17.8534409Z         at 
org.apache.pinot.broker.requesthandler.BaseBrokerRequestHandler.handleRequest(BaseBrokerRequestHandler.java:260)
 ~[classes/:?]
   2023-07-31T11:11:17.8535545Z         at 
org.apache.pinot.broker.requesthandler.BrokerRequestHandlerDelegate.handleRequest(BrokerRequestHandlerDelegate.java:107)
 ~[classes/:?]
   2023-07-31T11:11:17.8536622Z         at 
org.apache.pinot.broker.api.resources.PinotClientRequest.executeSqlQuery(PinotClientRequest.java:243)
 ~[classes/:?]
   2023-07-31T11:11:17.8537641Z         at 
org.apache.pinot.broker.api.resources.PinotClientRequest.processSqlQueryPost(PinotClientRequest.java:156)
 ~[classes/:?]
   2023-07-31T11:11:17.8538497Z         at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
   2023-07-31T11:11:17.8539230Z         at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 ~[?:?]
   2023-07-31T11:11:17.8540056Z         at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]
   2023-07-31T11:11:17.8540721Z         at 
java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
   2023-07-31T11:11:17.8542098Z         at 
org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
 ~[jersey-server-2.39.jar:?]
   2023-07-31T11:11:17.8543895Z         at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
 ~[jersey-server-2.39.jar:?]
   2023-07-31T11:11:17.8545560Z         at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
 ~[jersey-server-2.39.jar:?]
   2023-07-31T11:11:17.8547243Z         at 
org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:159)
 ~[jersey-server-2.39.jar:?]
   2023-07-31T11:11:17.8548825Z         at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
 ~[jersey-server-2.39.jar:?]
   2023-07-31T11:11:17.8550132Z         at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
 ~[jersey-server-2.39.jar:?]
   2023-07-31T11:11:17.8551253Z         at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.lambda$apply$0(ResourceMethodInvoker.java:390)
 ~[jersey-server-2.39.jar:?]
   2023-07-31T11:11:17.8552269Z         at 
org.glassfish.jersey.server.ServerRuntime$AsyncResponder$2$1.run(ServerRuntime.java:830)
 [jersey-server-2.39.jar:?]
   2023-07-31T11:11:17.8553080Z         at 
org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) 
[jersey-common-2.39.jar:?]
   2023-07-31T11:11:17.8553822Z         at 
org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) 
[jersey-common-2.39.jar:?]
   2023-07-31T11:11:17.8554582Z         at 
org.glassfish.jersey.internal.Errors.process(Errors.java:292) 
[jersey-common-2.39.jar:?]
   2023-07-31T11:11:17.8555357Z         at 
org.glassfish.jersey.internal.Errors.process(Errors.java:274) 
[jersey-common-2.39.jar:?]
   2023-07-31T11:11:17.8556125Z         at 
org.glassfish.jersey.internal.Errors.process(Errors.java:244) 
[jersey-common-2.39.jar:?]
   2023-07-31T11:11:17.8557022Z         at 
org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
 [jersey-common-2.39.jar:?]
   2023-07-31T11:11:17.8557976Z         at 
org.glassfish.jersey.server.ServerRuntime$AsyncResponder$2.run(ServerRuntime.java:825)
 [jersey-server-2.39.jar:?]
   2023-07-31T11:11:17.8558644Z         at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
   2023-07-31T11:11:17.8559272Z         at 
java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
   2023-07-31T11:11:17.8559980Z         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
   2023-07-31T11:11:17.8560669Z         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
   2023-07-31T11:11:17.8561193Z         at 
java.lang.Thread.run(Thread.java:829) [?:?]
   ```
   
   [2] 
   ```
   java.io.IOException: 
org.apache.pinot.common.exception.HttpErrorStatusException: Got error status 
code: 500 (Internal Server Error) with reason: "Failed to get a reason, 
exception: com.fasterxml.jackson.databind.exc.MismatchedInputException: No 
content to map due to end-of-input
    at [Source: (String)""; line: 1, column: 0]" while sending request: 
http://localhost:18099/query/sql
        at 
org.apache.pinot.controller.helix.ControllerTest.sendPostRequest(ControllerTest.java:795)
        at 
org.apache.pinot.integration.tests.ClusterTest.postQuery(ClusterTest.java:473)
        at 
org.apache.pinot.integration.tests.ClusterTest.postQuery(ClusterTest.java:450)
        at 
org.apache.pinot.integration.tests.OfflineClusterMemBasedBrokerQueryKillingTest.lambda$testDigestOOMMultipleQueries$1(OfflineClusterMemBasedBrokerQueryKillingTest.java:225)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
   Caused by: org.apache.pinot.common.exception.HttpErrorStatusException: Got 
error status code: 500 (Internal Server Error) with reason: "Failed to get a 
reason, exception: com.fasterxml.jackson.databind.exc.MismatchedInputException: 
No content to map due to end-of-input
    at [Source: (String)""; line: 1, column: 0]" while sending request: 
http://localhost:18099/query/sql
        at 
org.apache.pinot.common.utils.http.HttpClient.wrapAndThrowHttpException(HttpClient.java:448)
        at 
org.apache.pinot.controller.helix.ControllerTest.sendPostRequest(ControllerTest.java:791)
        ... 8 more
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to