chewbranca commented on code in PR #5602: URL: https://github.com/apache/couchdb/pull/5602#discussion_r2236594391
########## src/couch_srt/README.md: ########## @@ -0,0 +1,1025 @@ +# couch_srt: Couch Stats Resource Tracker aka CSRT + +The `couch_srt` app introduces the Couch Stats Resource Tracker, aka CSRT for +short. CSRT is a real time stats tracking system that tracks the quantity of +resources induced at the process level in a live queryable manner, while also +generating process lifetime reports containing statistics on the total resource +load of a request, as a function of CouchDB operations like dbs/docs opened, +view and changes rows read, changes returned vs processed, Javascript filter +usage, request duration, and more. This system is a paradigm shift in CouchDB +visibility and introspection, allowing for expressive real time querying +capabilities to introspect, understand, and aggregate CouchDB internal resource +usage, as well as powerful filtering facilities for conditionally generating +reports on "heavy usage" requests or "long/slow" requests. CSRT also extends +`recon:proc_window` with `couch_srt:proc_window`, allowing for the same style of +battle hardened introspection as Recon's excellent `proc_window`, but with the +sample window over any of the CSRT tracked CouchDB stats! + +CSRT does this by piggy-backing off of the existing metrics tracked by way of +`couch_stats:increment_counter` at the time when the local process makes those +metric increment calls: CSRT then updates an ets entry containing the context +information for the local process, such that global aggregate queries can be +performed against the ets table, and a process resource usage report can be +generated at the conclusion of the process's lifecycle. The ability +to do aggregate querying in real time, in addition to the process lifecycle +reports for post facto analysis over time, is a cornerstone of CSRT that is the +result of a series of iterations until a robust and scalable approach was built.
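The piggy-back can be pictured as a small hook in the increment path. This is an illustrative sketch only: `notify_metric/1`, `maybe_track_local/1`, the process-dictionary key, and `couch_srt:inc/2` are hypothetical names, not the actual `couch_srt` internals.

```erlang
%% Hypothetical sketch of the piggy-back idea, not the real implementation.
increment_counter(Name) ->
    ok = notify_metric(Name),     %% existing couch_stats counter update
    maybe_track_local(Name).      %% additionally bump this process's CSRT row

maybe_track_local(Name) ->
    case get(csrt_context) of     %% context stored when tracking is enabled
        undefined -> ok;
        PidRef -> couch_srt:inc(PidRef, Name)  %% hypothetical CSRT increment
    end.
```

The key property is that the hook runs in the same process that induced the metric, so the CSRT row for that process is only ever written by its owner.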
+ +The real time querying is achieved by way of a global ets table with +`read_concurrency`, `write_concurrency`, and `decentralized_counters` enabled. +Great care was taken to ensure that _zero_ concurrent writes to the same key +occur in this model, and this entire system is predicated on the fact that +incremental updates via `ets:update_counter` are *really* fast and +efficient, atomic and isolated, when coupled with +decentralized counters and write concurrency. Each process that calls +`couch_stats:increment_counter` tracks its local context in CSRT as well, with +zero concurrent writes from any other processes. Outside of the context setup +and teardown logic, _only_ operations to `ets:update_counter` are performed, one +per process invocation of `couch_stats:increment_counter`, and one for +coordinators to update worker deltas in a single batch, resulting in a 1:1 ratio +of ets calls to real time stats updates for the primary workloads. + +The primary achievement of CSRT is the core framework itself for concurrent +process local stats tracking and real time RPC delta accumulation in a scalable +manner that allows for real time aggregate querying and process lifecycle +reports. It took several versions to find a scalable and robust approach that +induced minimal impact on maximum system throughput. Now that the framework is +in place, it can be extended to track any further desired process local uses of +`couch_stats:increment_counter`. That said, the currently selected set of stats +to track was heavily influenced by the challenges in retroactively understanding +the quantity of resources induced by a query like `/db/_changes?since=$SEQ`, or +similarly, `/db/_find`.
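The table setup described above can be sketched as follows. The table name, key shape, and counter position are hypothetical; the ets options shown are the real flags the text refers to, and `ets:update_counter/3` is the real atomic-increment call.

```erlang
%% Sketch only: csrt_tracker and the row layout are hypothetical, not the
%% actual couch_srt table.
init() ->
    ets:new(csrt_tracker, [named_table, public, set,
                           {read_concurrency, true},
                           {write_concurrency, true},
                           {decentralized_counters, true}]).

%% Each process writes only to its own {Pid, Ref} key, so there are never
%% concurrent writes to the same row; ets:update_counter/3 applies the
%% increment atomically and in isolation.
bump(PidRef, CounterPos, N) ->
    ets:update_counter(csrt_tracker, PidRef, {CounterPos, N}).
```

Because every write is an `ets:update_counter` on a key owned by exactly one process, decentralized counters and write concurrency can deliver their best-case behavior with no lock contention between tracked processes.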
+ +CSRT started as an extension of the Mango execution stats logic to `_changes` +feeds to get proper visibility into the quantity of docs read and filtered per +changes request, but then the focus inverted with the realization that we should +instead use the existing stats tracking mechanisms that have already been deemed +critical information to track, which then also allows for the real time tracking +and aggregate query capabilities. The Mango execution stats can be ported into +CSRT itself and just become one subset of the stats tracked as a whole, and +similarly, any additional desired stats tracking can be easily added and will +be picked up in the RPC deltas and process lifetime reports. + +## A Simple Example + +Take a database `foo` with 11k documents, each containing a `doc.value` field +holding an integer that a design doc filter can partition into even and odd. If +we instantiate a series of while loops in parallel making requests of the form: + +> GET /foo/_changes?filter=bar/even&include_docs=true + +we can generate a good chunk of load on a local laptop dev setup, resulting in +requests that take a few seconds to load through the changes feed, fetch all 11k +docs, and then funnel them through the Javascript engine to filter for even +valued docs; this allows us time to query these heavier requests live and see +them in progress with the real time stats tracking and querying capabilities of +CSRT. + +For example, let's use `couch_srt:proc_window/3` as one would do with +`recon:proc_window/3` to get an idea of the heavy active processes on the +system: + +``` +([email protected])2> rp([{PR, couch_srt:to_json(couch_srt:get_resource(PR))} || {PR, _, _} <- couch_srt:proc_window(ioq_calls, 3, 1000)]).
+[{{<0.5090.0>,#Ref<0.2277656623.605290499.37969>}, + #{changes_returned => 3962,db_open => 10,dbname => <<"foo">>, + docs_read => 7917,docs_written => 0,get_kp_node => 54, + get_kv_node => 1241,ioq_calls => 15834,js_filter => 7917, + js_filtered_docs => 7917,nonce => <<"cc5a814ceb">>, + pid_ref => + <<"<0.5090.0>:#Ref<0.2277656623.605290499.37969>">>, + rows_read => 7917, + started_at => <<"2025-07-21T17:25:08.784z">>, + type => + <<"coordinator-{chttpd_db:handle_changes_req}:GET:/foo/_changes">>, + updated_at => <<"2025-07-21T17:25:13.051z">>, + username => <<"adm">>}}, + {{<0.5087.0>,#Ref<0.2277656623.606601217.92191>}, + #{changes_returned => 4310,db_open => 10,dbname => <<"foo">>, + docs_read => 8624,docs_written => 0,get_kp_node => 58, + get_kv_node => 1358,ioq_calls => 17248,js_filter => 8624, + js_filtered_docs => 8624,nonce => <<"0e625c723a">>, + pid_ref => + <<"<0.5087.0>:#Ref<0.2277656623.606601217.92191>">>, + rows_read => 8624, + started_at => <<"2025-07-21T17:25:08.424z">>, + type => + <<"coordinator-{chttpd_db:handle_changes_req}:GET:/foo/_changes">>, + updated_at => <<"2025-07-21T17:25:13.051z">>, + username => <<"adm">>}}, + {{<0.5086.0>,#Ref<0.2277656623.605290499.27728>}, + #{changes_returned => 4285,db_open => 10,dbname => <<"foo">>, + docs_read => 8569,docs_written => 0,get_kp_node => 57, + get_kv_node => 1349,ioq_calls => 17138,js_filter => 8569, + js_filtered_docs => 8569,nonce => <<"962cda1645">>, + pid_ref => + <<"<0.5086.0>:#Ref<0.2277656623.605290499.27728>">>, + rows_read => 8569, + started_at => <<"2025-07-21T17:25:08.406z">>, + type => + <<"coordinator-{chttpd_db:handle_changes_req}:GET:/foo/_changes">>, + updated_at => <<"2025-07-21T17:25:13.051z">>, + username => <<"adm">>}}] +ok +``` + +This shows us the top 3 most active processes (being tracked in CSRT) over the +next 1000 milliseconds, sorted by number of `ioq_calls` induced! 
All three +of these processes are incurring heavy usage, reading many thousands of docs +with 15k+ IOQ calls and heavy JS filter usage, exactly the types of requests +you want to be alerted to. CSRT's proc window logic is built on top of Recon's, +which doesn't return the process info itself, so you'll need to fetch the +process status with `couch_srt:get_resource/1` and then pretty print it with +`couch_srt:to_json/1`. Review Comment: Thanks! Then you'll appreciate the even further extended example and the additional examples I added into the main `csrt/index.rst` documentation page! :D -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
