cyrilou242 opened a new issue #6157: URL: https://github.com/apache/incubator-pinot/issues/6157
## System

ThirdEye built from 5f309cff34da802616d92ed361284d95b35ed9b9. BigQuery as a data source (should not have any impact here).

## Context

My data ingestion system has a 3-hour delay and a daily granularity, so I wanted to check how the timeframes (especially the upper limit) are computed for anomaly detection. In my case below, the upper date limit seems to be wrong.

### Configuration

Dataset config (shortened):

```yaml
"dataset": "dataset",
"dimensions": ["dimension1"],
"timeColumn": "date",
"timeUnit": "DAYS",
"timeDuration": 1,
"timeFormat": "yyyyMMdd",
"timezone": "UTC",
"dataSource": "SqlThirdEyeDataSource",
"realtime": false,
"expectedDelay": {
  "size": 24,
  "unit": "HOURS"
}
```

Detection config:

```yaml
detectionName: 'detection1'
description: '...'
metric: metric1
dataset: dataset
cron: "0 0 18 ? * * *"
dimensionExploration:
  dimensions:
    - dimension1
rules:
- detection:
  - name: 30percentchange
    type: PERCENTAGE_RULE
    params:
      offset: wo1w
      percentageChange: 0.30
      pattern: UP_OR_DOWN
```

## Detection job run

The following queries were run for the detection job, around October 13th, 18:00 UTC.

On 2020-10-13 18:00:23.019159 UTC:

```sql
SELECT MAX(date) FROM dataset
```

I guess the objective of this query is to get the most recent data time in the db. Because my db has a 3-hour delay, the result of the query is `2020-10-13`.

On 2020-10-13 18:00:25.735550 UTC:

```sql
SELECT dimension1, COUNT(metric1) FROM dataset WHERE UNIX_SECONDS(date) BETWEEN 1602007200 AND 1602612000 GROUP BY dimension1 LIMIT 100000
```

I guess the objective of this query is to get the distinct values of `dimension1`, because my detection is configured to split by `dimension1`. The result of the query is [`value1`, `value2`, `value3`].

Notice that here the endTime is 1602612000 = 13 October 2020 18:00:00, which seems to correspond to the cron time.

Finally, on 2020-10-13 18:00:30.041758 UTC:

```sql
SELECT date, COUNT(metric1) FROM dataset WHERE UNIX_SECONDS(date) BETWEEN 1594684800 AND 1602633600 AND metric1 IN ('value1') GROUP BY date LIMIT 100000
```

The objective of this query is to get the data for the anomaly detection. Here the endTime is 1602633600 = 14 October 2020 00:00:00!

### Problem

- 14 October is later than the observed max date.
- 13 October would not be correct either: at the time the job runs (18:00 on 13 October), the data for 13 October is not yet complete, so it does not make sense to compare it to previous, complete daily data (BETWEEN is inclusive).

--> The correct value I was expecting was 12 October.

This problem (if the implementation is the same) may not be specific to DAYS granularity or to a date-precision-only column:

- for HOURS granularity: computing the 16:00 metric at 16:30 will have the same problem;
- for a datetime column: in my example, most_recent_data_time would have been 2020-10-13 15:00:00 --> same problem after flooring, I guess.

The behavior I would expect is something like `unix_millis(floorBy(most_recent_data_time, timeUnit)) - 1`, with `floorBy` a function rounding down to the closest smaller timeUnit, and `-1` because the BETWEEN clause is inclusive (see the sketch at the end of this issue):

- eg1: most_recent_data_time = 13 October, granularity is DAYS --> return 12 October 23:59:59.999
- eg2: most_recent_data_time = 13 October 16:30, granularity is HOURS --> return 13 October 15:59:59.999

Did I miss something in the config?
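To make the expected behavior concrete, here is a minimal, self-contained Java sketch of the `floorBy(...) - 1` computation, along with a quick check of the epoch values observed in the queries above. This is my own illustration, not ThirdEye's actual code; `expectedEndMillis` is a hypothetical helper name.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class DetectionEndTime {

    // Hypothetical helper: floor the most recent data time to the dataset
    // granularity (UTC, per the dataset config), then subtract 1 ms because
    // the BETWEEN clause on the time column is inclusive.
    static long expectedEndMillis(Instant mostRecentDataTime, ChronoUnit granularity) {
        return mostRecentDataTime.truncatedTo(granularity).toEpochMilli() - 1;
    }

    public static void main(String[] args) {
        // eg1: most_recent_data_time = 13 October, granularity DAYS
        // prints 2020-10-12T23:59:59.999Z
        System.out.println(Instant.ofEpochMilli(
                expectedEndMillis(Instant.parse("2020-10-13T00:00:00Z"), ChronoUnit.DAYS)));

        // eg2: most_recent_data_time = 13 October 16:30, granularity HOURS
        // prints 2020-10-13T15:59:59.999Z
        System.out.println(Instant.ofEpochMilli(
                expectedEndMillis(Instant.parse("2020-10-13T16:30:00Z"), ChronoUnit.HOURS)));

        // Quick check of the end times seen in the actual queries:
        // prints 2020-10-13T18:00:00Z (matches the cron fire time)
        System.out.println(Instant.ofEpochSecond(1602612000L));
        // prints 2020-10-14T00:00:00Z (the suspicious end time)
        System.out.println(Instant.ofEpochSecond(1602633600L));
    }
}
```

With this floor-and-subtract rule, the detection window would stop at the last complete bucket (12 October in my daily case) instead of overshooting to 14 October.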