Hello again, 

I wanted to fill in anyone who's interested in this topic. Here are my 
findings.
To recap, I stress-tested one instance of CAS 6.6.7 (in my setup, Hazelcast 
stores the tickets) and found that when a user makes concurrent ST 
generation requests with a single TGT, some of those requests return HTTP 
404 with "given TGT is not found" even though the TGT exists. 

Here's the Apache JMeter test plan for this scenario (the first thread 
group gets a TGT, saves it to a parameter named TGT, and uses Groovy to 
copy it into a global variable named sharedTGT; the second thread group 
uses it to send concurrent ST generation HTTP requests): 
https://yusufgunduz.tr/cas/ExampleConcurrentStGenerationTestPlan.jmx

https://yusufgunduz.tr/cas/ss1.png
https://yusufgunduz.tr/cas/ss2.png
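
For readers without JMeter, the same scenario can be approximated in plain 
Java. This is only a rough sketch, not the test plan above: the class and 
method names are hypothetical, the base URL and service URL are placeholders, 
and it assumes the server certificate is trusted by the JVM (there is no 
equivalent of an "insecure" switch here).

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ConcurrentStRequests {

    // Builds POST {casBase}/v1/tickets/{tgt}?service={service} (hypothetical helper).
    static HttpRequest stRequest(String casBase, String tgt, String service) {
        String encoded = URLEncoder.encode(service, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(casBase + "/v1/tickets/" + tgt + "?service=" + encoded))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Accept", "text/plain")
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Sends the same request from `concurrency` threads at once and counts status codes.
    static Map<Integer, Integer> fire(HttpClient client, HttpRequest request, int concurrency)
            throws InterruptedException {
        Map<Integer, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < concurrency; i++) {
            pool.submit(() -> {
                try {
                    start.await(); // release all workers together
                    int status = client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
                    counts.merge(status, 1, Integer::sum);
                } catch (Exception e) {
                    counts.merge(-1, 1, Integer::sum); // -1 marks transport errors
                }
            });
        }
        start.countDown();
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder URLs; pass a real TGT as the first argument.
        HttpRequest request = stRequest("https://localhost:8443/cas", args[0], "https://localhost:8443");
        System.out.println(fire(HttpClient.newHttpClient(), request, 50));
    }
}
```

The printed map is keyed by HTTP status code, so any 404 entries show up 
next to the 200s.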

At first I thought my testing was wrong, so I ran more standardized load 
tests on my local machine and looked at the results. I recently found out 
that this is not the case at all. The reason, as I later saw, is what CAS 
does when generating STs: it updates the TGT state (the latest usage and 
expiration times, for example). Here are some trace logs from generating 
an ST: 

2025-06-07 12:47:13,530 TRACE [org.apereo.cas.ticket.AbstractTicket] - <Before updating ticket [TGT-1-G4YHsg40VoltRu4EIbo8Iu7oq-O9Qjjvo7u2GvtrQqiG--fYm1BDcto5avX7ySkh0ew-cas1]
Previous time used: [null]
Last time used: [2025-06-07T12:47:05.217906789Z]
Usage count: [0]>
2025-06-07 12:47:13,530 TRACE [org.apereo.cas.ticket.AbstractTicket] - <After updating ticket [TGT-1-G4YHsg40VoltRu4EIbo8Iu7oq-O9Qjjvo7u2GvtrQqiG--fYm1BDcto5avX7ySkh0ew-cas1]
Previous time used: [2025-06-07T12:47:05.217906789Z]
Last time used: [2025-06-07T12:47:13.530785797Z]
Usage count: [1]>
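
To make the trace above concrete, here is a minimal sketch of the kind of 
read-modify-write it shows. It is not the CAS AbstractTicket code; the class 
and field names are hypothetical and only mirror the three values in the 
log. The point is that every ST request changes shared TGT state, which is 
why the update needs some form of mutual exclusion.

```java
import java.time.Instant;

final class TicketUsageSketch {
    private Instant previousTimeUsed; // null until the first update, as in the log
    private Instant lastTimeUsed;
    private int usageCount;

    TicketUsageSketch(Instant createdAt) {
        this.lastTimeUsed = createdAt;
    }

    // Shift "last used" into "previous", stamp the new time, bump the counter.
    // synchronized only protects callers inside one JVM.
    synchronized void update(Instant now) {
        previousTimeUsed = lastTimeUsed;
        lastTimeUsed = now;
        usageCount++;
    }

    synchronized Instant previousTimeUsed() { return previousTimeUsed; }
    synchronized Instant lastTimeUsed() { return lastTimeUsed; }
    synchronized int usageCount() { return usageCount; }
}
```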

When concurrent ST requests arrive for one TGT, this state update (and the 
operations that follow it) is guarded by locks: a ReentrantLock in this 
case, which is fine for one JVM but not for distributed deployments, 
though it is configurable. To guard concurrent TGT updates, CAS creates a 
pool of locks, and if a request cannot acquire its lock within a given time 
period (3s is the default), that request returns HTTP 404 "TGT not found".
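
The behaviour described above can be sketched with plain JDK classes. This 
is a simplified illustration, not the CAS or Spring Integration 
implementation: the class name is hypothetical, and the pool size and 
timeout are constructor parameters here rather than the real defaults.

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

final class StripedLockSketch {
    private final ReentrantLock[] locks;
    private final long timeoutMillis;

    StripedLockSketch(int poolSize, long timeoutMillis) {
        this.locks = new ReentrantLock[poolSize];
        for (int i = 0; i < poolSize; i++) {
            locks[i] = new ReentrantLock();
        }
        this.timeoutMillis = timeoutMillis;
    }

    // Runs the action under the lock chosen by the key's hash.
    // If the lock is not acquired in time, the caller gets an empty Optional,
    // which the REST layer would then report as "TGT not found".
    <T> Optional<T> execute(Object key, Supplier<T> action) {
        ReentrantLock lock = locks[Math.floorMod(key.hashCode(), locks.length)];
        try {
            if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return Optional.empty();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(action.get());
        } finally {
            lock.unlock();
        }
    }
}
```

With many requests queued on the same key, the ones at the back of the 
queue wait longer than the timeout and come back empty, even though the 
ticket is still in the registry.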

For multi-node deployments, this single-JVM approach may not be a good 
idea for consistency (at least that is what my searches suggest), so I 
tried to override this behaviour with a distributed lock, which Hazelcast 
already provides under the name FencedLock. After reading the CAS docs (my 
hosted copy 
<https://yusufgunduz.tr/cas/6.6.x/ticketing/Ticket-Registry-Locking.html#default>), 
I implemented it like this in my overlay project: 

package com.example.cas.config;

import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

import com.example.cas.rest.exception.TooManyStGenerationRequestsException;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.lock.FencedLock;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.apereo.cas.util.lock.LockRepository;
import org.springframework.stereotype.Component;

@Slf4j
@RequiredArgsConstructor
@Component("hazelcastLockRepository")
public class HazelcastLockRepository implements LockRepository {

    // Same wait time as the 3s default mentioned above.
    private static final long LOCK_TIMEOUT_SECONDS = 3;

    private final HazelcastInstance hazelcastInstance;

    @Override
    public <T> Optional<T> execute(Object lockKey, Supplier<T> supplier) {
        String key = lockKey.toString();
        // Cluster-wide lock from the Hazelcast CP subsystem, one per lock key.
        FencedLock lock = hazelcastInstance.getCPSubsystem().getLock(key);
        boolean acquired = false;

        try {
            acquired = lock.tryLock(LOCK_TIMEOUT_SECONDS, TimeUnit.SECONDS);
            if (acquired) {
                LOGGER.debug("Acquired lock for key: {}", key);
                return Optional.ofNullable(supplier.get());
            } else {
                // Throw instead of returning Optional.empty(), so the caller
                // does not report a missing TGT.
                LOGGER.warn("Too many concurrent requests - lock timeout for key: {}", key);
                throw new TooManyStGenerationRequestsException(
                        "Too many concurrent requests for key: " + key);
            }
        } finally {
            // Only unlock what this call actually locked.
            if (acquired) {
                try {
                    lock.unlock();
                    LOGGER.debug("Released lock for key: {}", key);
                } catch (Exception e) {
                    LOGGER.warn("Failed to unlock key: {} - {}", key, e.getMessage(), e);
                }
            }
        }
    }
}


and registered it like this, as explained in the docs: 

/**
 * During ST generation the TGT state is updated (for example, the TGT
 * expiration time), and CAS has a lock mechanism to prevent possible stale data
 * on concurrent ST generation requests. By default,
 * {@link org.apereo.cas.util.lock.DefaultLockRepository}, which is suitable for
 * single-node use, is used together with
 * {@link org.springframework.integration.support.locks.DefaultLockRegistry}.
 * Under heavy load, when the 1024 configured
 * {@link java.util.concurrent.locks.ReentrantLock}s are exhausted, waiting for
 * the locked TGT returns empty ({@link java.util.Optional#empty}) instead of
 * raising an error, and because the TGT is then not found the user receives a
 * 404 "TGT not found" error.
 * Instead, I configured it here to use Hazelcast's FencedLock, which is already
 * in use and better suited to distributed deployments (e.g. CAS running
 * multi-pod on k8s), so that CAS uses it while generating STs and, when needed,
 * returns HTTP 429 Too Many Requests to tell the user that ST generation
 * requests are being sent too frequently.
 *
 * @param hazelcastInstance the Hazelcast instance that CAS configures internally.
 */
@Bean
public LockRepository casTicketRegistryLockRepository(HazelcastInstance hazelcastInstance) {
    return new HazelcastLockRepository(hazelcastInstance);
}

With this configuration, instead of per-JVM locks I can use cluster-wide 
distributed locks and throw a custom exception 
(TooManyStGenerationRequestsException) for the concurrent requests that 
can't acquire a lock (for better consistency, configure the 
*cas.ticket.registry.hazelcast.cluster.core.cp-member-count* setting and 
use the CP subsystem on your embedded Hazelcast). With this config the 
failing requests return 500 (instead of the default 404). You can instead 
do what the default implementation 
<https://github.com/apereo/cas/blob/v6.6.15.2/api/cas-server-core-api-util/src/main/java/org/apereo/cas/util/lock/DefaultLockRepository.java#L26-L38>
 
does and return Optional.empty() rather than throwing an exception, which 
results in HTTP 404 TGT not found, or handle the exception as described 
below.

I also updated the ST generation endpoint (in ServiceTicketResource 
<https://github.com/apereo/cas/blob/v6.6.15.2/support/cas-server-support-rest-core/src/main/java/org/apereo/cas/support/rest/resources/ServiceTicketResource.java#L115-L117>)
 
in my overlay project to catch this exception and return 429 Too Many 
Requests instead of 500, like this:
...
catch (final TooManyStGenerationRequestsException e) {
    return new ResponseEntity<>(
        "Too many concurrent ST generation requests for: " + StringEscapeUtils.escapeHtml4(tgtId),
        HttpStatus.TOO_MANY_REQUESTS);
}
...
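
To summarize the three outcomes discussed above in one place, here is a 
small framework-free sketch. It does not use the CAS or Spring types; the 
class, the exception, and the statusFor helper are hypothetical stand-ins 
for the lock repository result and the REST resource's error handling.

```java
import java.util.Optional;
import java.util.function.Supplier;

final class LockOutcomeSketch {

    // Stand-in for TooManyStGenerationRequestsException.
    static final class LockTimeoutException extends RuntimeException {
        LockTimeoutException(String message) {
            super(message);
        }
    }

    // Maps the result of a locked ST generation call to an HTTP status:
    // a value means the ST was created (200), an empty Optional is treated
    // like a missing TGT (404), and a lock timeout that is caught becomes 429.
    static int statusFor(Supplier<Optional<String>> lockedCall) {
        try {
            return lockedCall.get().isPresent() ? 200 : 404;
        } catch (LockTimeoutException e) {
            return 429;
        }
    }
}
```

Without the catch branch the exception would propagate and surface as a 
500, which is the behaviour the overlay change above is meant to replace.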

Here are screenshots of the failed ST generation requests from the same 
test case as above: 
https://yusufgunduz.tr/cas/ss3.png
https://yusufgunduz.tr/cas/ss4.png  

In summary, my question was why these 404s occur on concurrent ST 
generation with the same TGT, and the answer, I think, is the way CAS 
updates the TGT state when generating STs and how it protects that update 
against concurrent requests. 

For newer CAS versions, I hope that when Hazelcast ticketing is enabled, 
this kind of configuration can be auto-configured and used instead of the 
single-JVM protection, with more documentation on it.

As another hopeful suggestion, I would like to be able to connect CAS to an 
external Hazelcast cluster instead of auto-configuring an embedded one at 
startup, because of the memory an embedded Hazelcast consumes in production 
environments.

Thank you and have a nice day,
YG

On Saturday, 10 August 2024 at 12:24:41 UTC+3, Y G wrote:

> Hello Ray, 
> First of all, thank you for your response.
> After sending this mail, I ran the command with 200 concurrent requests 
> at 75 queries per second with the tool I mentioned. The first few runs 
> showed no 404s, but after running the command a 3rd or 4th time, 404s 
> started appearing on the display and in the report at the end of the test. 
> As for the queries-per-second setting of this mini tool, it shows the 
> achieved queries-per-second statistic when it completes, so I can safely 
> assume that it does apply the qps setting (I can't say whether it reaches 
> the given rate, though).
>
> I chose this tool, called oha, for convenience and for its display of 
> progress and results, but after your reply I will try JMeter with the load 
> tests mentioned in the docs. I think I'm testing a really niche case (using 
> one user's TGT to send concurrent requests to the ST generation endpoint).
>
> I'll check and get back on this. 
>
> Thanks and have a good day
>
> YG
>
> On Saturday, 10 August 2024 at 05:49:54 UTC+3, Ray Bon wrote:
>
>> Yusuf,
>>
>> How long did the test run before the 400's?
>> Do you experience this when only one cas server is running?
>> Is that really 75 queries per second?
>>
>>
>> You might want to try jMeter load tests included with the cas project, 
>> https://github.com/apereo/cas/tree/6.6.x/etc/loadtests
>> This way you can have more than one user and service.
>>
>> Ray 
>>
>> On Fri, 2024-08-09 at 11:17 -0700, Y G wrote:
>>
>> Hello all,
>> I'm having a weird issue when doing light load testing of a CAS server 
>> with Hazelcast enabled. 
>>
>> The steps I took: 
>> 1. Get a TGT for one user (by making an HTTP POST to /cas/v1/tickets with 
>> username and password, as explained here 
>> <https://apereo.github.io/cas/6.6.x/protocol/REST-Protocol-Request-TicketGrantingTicket.html>
>> )
>> 2. Using that one TGT, repeatedly call the ST generation REST endpoint 
>> (HTTP POST to */cas/v1/tickets/{{TGT}}?service={{MY-VALID-SERVICE-URL}}*, 
>> as explained here 
>> <https://apereo.github.io/cas/6.6.x/protocol/REST-Protocol-Request-ServiceTicket.html>)
>>  
>> with a mini HTTP load testing tool called *oha* 
>> <https://github.com/hatoo/oha/releases> (couldn't make ab work) and these 
>> bash commands: 
>>
>> export TGT={{TGT-VALUE-FROM-FIRST-STEP}}
>>
>> # -m POST                -> makes a POST request
>> # --insecure             -> allow insecure SSL
>> # -H ...                 -> add header
>> # -n 1000                -> total number of requests
>> # -q 75                  -> queries per second
>> # -c 500                 -> concurrent request count
>> ./oha -m POST \
>> --insecure \
>> -H "Content-Type: application/x-www-form-urlencoded" \
>> -H "Accept: text/plain" \
>> -n 1000 \
>> -q 75 \
>> -c 500 \
>> --latency-correction \
>> "https://localhost:8443/cas/v1/tickets/$TGT?service=https://localhost:8443"
>>
>>
>> *[image: oha-screenshot.png]*
>>
>> After a few runs with 100% successful HTTP 200 response codes, I started 
>> seeing HTTP 400 responses that return:
>> "TGT-1-********4ij6IiQ-myhostname could not be found or is considered 
>> invalid"
>>
>> I checked the logs and only saw INVALID_TICKET and 
>> REST_API_SERVICE_TICKET_FAILED audit logs; I couldn't see any more 
>> information even after setting the log level to `trace`. 
>>
>> I tried debugging it in the cas-overlay. What I could find is that an 
>> InvalidTicketException is thrown, but I couldn't find where it was thrown, 
>> and I cannot see the code beyond the 
>> CentralAuthenticationService.grantServiceTicket method (I couldn't find 
>> the implementation class of this; does anybody know it?) 
>>
>> I even tested this on a 3-node CAS cluster with embedded Hazelcast 
>> properly set up and had the same problem. I tried setting 
>> *cas.ticket.registry.core.enable-locking* to false (docs 
>> <https://apereo.github.io/cas/6.6.x/ticketing/Hazelcast-Ticket-Registry.html>) 
>> and saw no real change, only that the 404s turned into timeouts (I did not 
>> give a long timeout, but nonetheless)
>>
>> I read 
>> https://apereo.github.io/cas/6.6.x/ticketing/Ticket-Registry-Locking.html
>>  
>> but I need more knowledge about this topic, so here are my questions:
>>
>> 1. What's your opinion of this behaviour? Why do you think this happens? 
>> Is this issue really about registry locking?
>> 2. For a memory-backed ticket registry, the docs say (and I checked, it 
>> works) that the default ticket registry implementation works on a single 
>> node with no problems under load, but what about a clustered, 
>> Hazelcast-backed ticket registry? Has it been implemented? 
>> 3. I tried debugging the cas-overlay project but could not properly walk 
>> the stack trace. Can anybody show me the implementation of this interface 
>> method: CentralAuthenticationService.grantServiceTicket? What does this 
>> code do, where, and how can I debug it?
>>
>>
>> Thanks for your patience and have a nice weekend.
>> Yusuf
>>
>>
>>
>>
