gortiz commented on issue #16801:
URL: https://github.com/apache/pinot/issues/16801#issuecomment-3421250258
I'm sorry for not getting back to you sooner. I think your concerns are
valid. I firmly think the idea of QueryLogger is an antipattern by definition
because it goes against the Log4j2 logging principles. Instead we should
1. Classify logs correctly: Although recently we improved this by marking
stable CID for queries, our classification is very poor. Basically, we only
classify by CID and by logger class. Instead we should add other _classes_ like
using markers for logs that may be spammy
2. Do not log queries as part of the message. Include the SQL in the logger
context instead.
# Better classification using Slf4J markers and context
We should never have something like:
```
LOGGER.info("SQL query for request {}: {}", requestId, query);
```
or
```
public void log(QueryLogParams params) {
...
// always log the query last - don't add this to the QueryLogEntry enum
queryLogBuilder.append("query=")
.append(StringUtils.substring(params._requestContext.getQuery(), 0,
_maxQueryLengthToLog));
_logger.info(queryLogBuilder.toString());
}
```
Instead we should log as follows:
```
LoggerConstants.SQL_KEY.registerInMdc(params._requestContext.getQuery());
try {
LOGGER.info(QUERY_RECEIVED_MARKER, "SQL query started");
} finally {
LoggerConstants.SQL_KEY.unregisterFromMdc();
}
```
and
```
LoggerConstants.SQL_KEY.registerInMdc(params._requestContext.getQuery());
try {
// don't add the query to the message, obtain the actual SQL from
QUERY_RECEIVED_MARKER if enabled
// or include in your pattern using %X{pinot.query.sql}
_logger.info(QUERY_FINISHED_MARKER, queryLogBuilder.toString());
} finally {
LoggerConstants.SQL_KEY.unregisterFromMdc();
}
}
```
Where `QUERY_RECEIVED_MARKER` and `QUERY_FINISHED_MARKER` are SLF4J markers
with, for example, the `pinot.sql.received` and `pinot.sql.finished` values.
# Decide what to log with Log4J
Then each user can decide whether these logs should be printed and whether
the query should be printed with them.
For example, in log4j we could decide whether we want to print the initial
SQL message with a filter like:
```
<MarkerFilter marker="pinot.query.received" onMatch="ALLOW"
onMismatch="NEUTRAL"/>
```
Or disable that log with
```
<MarkerFilter marker="pinot.query.received" onMatch="DENY"
onMismatch="NEUTRAL"/>
```
We can also send the log to a pattern layout that includes
`%X{pinot.query.sql}`, which will print the query, or to one that doesn't
include that conversion, in which case the query won't be printed. We could
even send it to a pattern layout like:
```
[cid=%X{pinot.query.cid}] %m [query=%maxLen{%X{pinot.query.query}}{300}]
```
So we would have a log like:
```
[cid=123123123] SQL query started [query=select ...]
```
Where the CID is unique for the SQL query, and the query will be limited to
300 characters.
Then the log printed when queries finish can be redirected to a pattern like
```
[cid=%X{pinot.query.cid}] %m
```
So the log will be:
```
[cid=123123123] table=whatever, timeMs=whatever...
```
But the query _won't_ be printed. Instead, we can pair that log with the
previous one by CID. If instead we want to keep the same previous log, we can
use a pattern like:
```
[cid=%X{pinot.query.cid}] %m, query=%maxLen{%X{pinot.query.query}}{300}
```
# Limit the log spam
Instead of using the very opinionated QueryLogger, we can delegate to the
standard BurstFilter. We can either send to that filter all the logs or all the
logs related to queries. How can we do the latest? Either by sending them all
the logs from QueryLoader (or the equivalent loggers if we end up removing this
class)
# Different Log4j2 sample configs
Most users don't want to spend their time fine-tuning log4j2 config.
Instead, we can provide several log4j2 sample files and modify the bash script,
adding an argument to choose between them.
# Cons
The main problems I see with this approach are:
- Keeping backward compatibility will be difficult. With that I mean to
print the same logs.
- Moving logic to log4j2 files means logging is more difficult to test
# Advantages
- Each customer can customize their logs as much as they want. They can even
change the configuration without a restart (see
https://logging.apache.org/log4j/2.x/manual/configuration.html#configuration-attribute-monitorInterval)
- We can apply the same technique to spammy logs beyond the query.
- We can go further and classify queries by a query option, so users could
mark queries as important (and therefore always log them) or not important
(which may use more constraint burst filter)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]