Feature request /Discussion: JK loadbalancer improvements for high load
Hi all,

Sorry for the noise, but it seems it is time to come back to the discussion about load balancing for JVMs (Tomcat).

Background: we recently ran benchmark and smoke tests of our product at the Sun High Tech Centre in Langen (Germany). Apache 2.2.4 was used as the web server, 10x Tomcat 5.5.25 as the containers, and JK connector 1.2.23 with the busyness algorithm as the load balancer.

Under high load we observed strange behaviour: some Tomcat workers temporarily received a disproportionate load, often ten times higher than the others, for relatively long periods. As a result, response times that usually stay under 500 ms went up to 20+ seconds, which in turn made the overall test results almost two times worse than estimated.

At first we were quite confused, because we were sure it was not a problem of the JVM configuration, and we suspected the reason was in the LB logic of mod_jk; both assumptions turned out to be right. What actually happened was this: the LB sends requests and the sessions become sticky, so it keeps sending the following requests to the same cluster node. At a certain point the JVM starts a major garbage collection (full GC) and spends the 20 seconds mentioned above. During that time jk continues to send new requests as well as the requests sticky to that node, which leads to the situation where one node breaks the SLA on response times.

I have been searching the web for a while for a load balancer implementation that takes GC activity into account and reduces the load when a JVM is close to a major collection, but found nothing. Once again, load balancing of JVMs under load is a real issue in production: with optimally distributed load you can not only lower costs, but also prevent a bad customer experience, not to mention broken SLAs.

Feature request: all lb algorithms should be extended with a bidirectional connection to the JVM:
JVM -> LB: old gen size and current occupancy.
LB -> JVM: prevent node overload and advise GC, depending on a parameterized percentage of free old gen space.

All ideas and comments are appreciated.

Regards,
Yefym.
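P.S. For reference, a minimal workers.properties sketch of the kind of setup described above; the worker names, hosts and ports are placeholders (only two of the ten nodes are shown), not our actual configuration:

  # Load balancer worker using the busyness method (method=B)
  worker.list=loadbalancer
  worker.loadbalancer.type=lb
  worker.loadbalancer.method=B
  worker.loadbalancer.sticky_session=1
  worker.loadbalancer.balance_workers=node01,node02

  # AJP13 workers for two of the Tomcat 5.5 instances
  worker.node01.type=ajp13
  worker.node01.host=tc-node01.example.com
  worker.node01.port=8009

  worker.node02.type=ajp13
  worker.node02.host=tc-node02.example.com
  worker.node02.port=8009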
Re: Feature request /Discussion: JK loadbalancer improvements for high load
Hi Rainer,

It seems we still have an issue in the JVM config: we had not noticed the concurrent mode GC failures that led to the Tomcat overload and the bad response times.

14896.148: [Full GC 14896.148: [CMS (concurrent mode failure): 1102956K->326641K(2359296K), 4.0684463 secs] 1280549K->326641K(3067136K), [CMS Perm : 131071K->62120K(131072K)], 4.0690589 secs]
15346.161: [Full GC 15346.161: [CMS (concurrent mode failure): 1546160K->324956K(2359296K), 3.4321099 secs] 1637284K->324956K(3067136K), [CMS Perm : 131071K->6K(131072K)], 3.4327402 secs]

Unfortunately the concurrent collectors in Java need 30%-50% of the heap as headroom for themselves, so if the JVM keeps allocating objects in the tenured generation and there is not enough memory left for the concurrent GC to finish its work, it falls back to a stop-the-world collection.

In any case, these events are visible inside the JVM, and it would be really nice if the load balancer could be smart enough to react to them, e.g. by redirecting the load. Agreed, this problem is not easy to solve and may require features on the JVM side that are not implemented yet.

Regards,
Y.

--------
Yefym Dmukh
InterComponentWare AG, R&D Personal Health Record
Industriestraße 41, 69190 Walldorf (Baden), Germany
Phone: +49 6227 385 122
Mail: [EMAIL PROTECTED]
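For completeness, this is the kind of CMS tuning that usually addresses concurrent mode failures: start the concurrent cycle earlier and leave the collector more headroom. The heap sizes, the 60% threshold and the logging flags below are example values only, not a recommendation for this particular application:

  CATALINA_OPTS="-Xms3072m -Xmx3072m \
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
    -XX:CMSInitiatingOccupancyFraction=60 \
    -XX:+UseCMSInitiatingOccupancyOnly \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"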
Re: Feature request /Discussion: JK loadbalancer improvements for high load
Mladen Turk wrote:
>You have an oxymoron here. With session stickiness you willingly
>tear down the load balancer's correctness because you don't wish to /
>cannot have session replication.
>Even with the smartest LB, and even with two-way communication,
>it is only possible to make that work in a non-sticky-session
>topology. Otherwise you would lose sessions on each GC cycle.
>However, like Rainer said, the solution is to choose
>the appropriate GC strategy for a web-based application.

Generally you are right, but the ideal world is not the reality: we use the Apache MyFaces implementation of JSF, where the core session state is about 500 KB per session; compressing the state kills the CPU, and its size kills the session replication/failover approach. So we have what we have, and we are trying to get the best out of it.

BTW, what about the bidirectional JVM-LB connection and the stop-the-world GC managed by the LB, as a keep-it-simple approach?
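Just to illustrate the JVM -> LB half of the idea: the old gen size and occupancy are already exposed through the standard java.lang.management API, so a small agent or status servlet on each node could report them to the balancer. A rough sketch; the OldGenStatus class and the pool-name matching are illustrative only:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Reports old/tenured generation occupancy, e.g. for a load balancer to poll. */
public class OldGenStatus {

    /** Returns old gen usage as a fraction of its maximum, or -1 if the pool is not found. */
    public static double oldGenOccupancy() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName().toLowerCase();
            // Heap pools are named "Tenured Gen", "CMS Old Gen" or "PS Old Gen",
            // depending on the collector in use.
            if (pool.getType() == MemoryType.HEAP
                    && (name.contains("old") || name.contains("tenured"))) {
                MemoryUsage usage = pool.getUsage();
                if (usage.getMax() > 0) {
                    return (double) usage.getUsed() / usage.getMax();
                }
            }
        }
        return -1.0; // pool not found or max size unknown
    }

    public static void main(String[] args) {
        System.out.printf("old gen occupancy: %.0f%%%n", oldGenOccupancy() * 100);
    }
}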
Re: Feature request /Discussion: JK loadbalancer improvements for high load
Mladen wrote:
>This won't help much. The sticky session requests must be served
>by the same node (group of nodes if using domain clustering),
>and your requests will still be delayed by the JVM instance GC cycle
>(it has to happen sometime, and you cannot depend on request
>void intervals).

The main point was to use the stop-the-world collectors PROACTIVELY, managed by the lb, in order to:
- eliminate the memory and CPU overhead of the ConcurrentXXX/ParallelXXX collectors (i.e. up to 50% extra memory),
- prevent the overloading of the single node that falls into a GC cycle,
- reduce the complexity of JVM tuning.

The algorithm: when the heap is almost full, stop sending requests to that node and load the other nodes instead (wait some time to give the current workers a chance to finish their jobs), then advise a stop-the-world collection. Heap free again? Resume the load.

Certainly this makes sense (if at all) only for setups with many small JVMs that have little CPU (or CPU limited to the current needs) and plenty of memory, running stateless applications or applications with a short session lifecycle. It might also make sense for domain installations, where during GC cycles the other domain members take over the jobs.

All of this makes sense only if the load balancer manages not only the load but also the GC cycles: a kind of ideal JVM-level load balancer for high load.

Regards,
Y.
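P.S. A rough sketch of what the node-side half of that algorithm could look like, reusing the OldGenStatus helper from the earlier mail. The 75% threshold, the drain wait and the "draining" flag (and how the balancer would read it) are made-up placeholders, and System.gc() is only a request that becomes a no-op if -XX:+DisableExplicitGC is set:

import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Node-side sketch of the proactive GC idea: when old gen occupancy crosses a
 * threshold, mark the node as draining (for the balancer to poll), wait for
 * in-flight requests to finish, request a stop-the-world collection and resume.
 */
public class ProactiveGcAgent implements Runnable {

    private static final double OCCUPANCY_THRESHOLD = 0.75; // placeholder value
    private static final long DRAIN_WAIT_MILLIS = 5000;     // let current workers finish

    private final AtomicBoolean draining = new AtomicBoolean(false);

    /** The balancer (or a status servlet it polls) would read this flag. */
    public boolean isDraining() {
        return draining.get();
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                if (OldGenStatus.oldGenOccupancy() >= OCCUPANCY_THRESHOLD) {
                    draining.set(true);              // lb should stop sending new requests here
                    Thread.sleep(DRAIN_WAIT_MILLIS); // give in-flight requests a chance
                    System.gc();                     // request a stop-the-world collection
                    draining.set(false);             // heap is free again, resume the load
                }
                Thread.sleep(1000);                  // polling interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}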
Re: Feature request /Discussion: JK loadbalancer improvements for high load
>Question, why would you need GC, heap or CPU stats from the Tomcat
>machine at all?
>In most cases, none of these stats can predict response times.
>The best way would be to simply record a history of response times, and
>then mod_jk can have algorithms (maybe even pluggable) on how to use
>those stats for future requests.

As for me, the busyness algorithm in mod_jk is almost perfect, both as an idea and as an implementation; the only weak point is that it balances the JVMs without any insight into what happens inside them. The point is that the JVM has such an artifact as GC, which under the circumstances amounts to a denial of service for a while :) and the load balancer notices it too late; even worse, for one reason or another (stickiness or internal logic with job queues) the lb does not react accordingly at the right time.

The other issues like threading and CPU are not from me; in my opinion they are of no use here. I was talking only about JVM load balancing that reacts properly to JVM-specific things.