cyrilou242 opened a new issue #6157:
URL: https://github.com/apache/incubator-pinot/issues/6157


   ## System
   ThirdEye built from commit 5f309cff34da802616d92ed361284d95b35ed9b9.
   BigQuery as a data source (this should not have any impact here).
   
   ## Context
   My data ingestion system has a 3-hour delay and a daily granularity, so I wanted to check how the time frames (especially the upper limit) are computed for anomaly detection.
   In my case below, the upper date limit seems to be wrong.
   
   ### Configuration
   Dataset config (shortened):
   ```json
       "dataset": "dataset",
       "dimensions": ["dimension1"],
       "timeColumn": "date",
       "timeUnit": "DAYS",
       "timeDuration": 1,
       "timeFormat": "yyyyMMdd",
       "timezone": "UTC",
       "dataSource": "SqlThirdEyeDataSource",
       "realtime": false,
       "expectedDelay": {
           "size": 24,
           "unit": "HOURS"
       }
   ```
   
   Detection config:
   ```yaml
   detectionName: 'detection1'
   description: '...'
   metric: metric1
   dataset: dataset
   cron: "0 0 18 ? * * *"
   dimensionExploration:
       dimensions: 
       - dimension1
   rules:
   - detection:
     - name: 30percentchange      
       type: PERCENTAGE_RULE      
       params:                    
         offset: wo1w             
         percentageChange: 0.30   
         pattern: UP_OR_DOWN
   ```
   
   ## Detection job run
   The following queries were run for the detection job, around October 13th at 18:00 UTC:
   On 2020-10-13 18:00:23.019159 UTC:
   ```sql
   SELECT MAX(date)  FROM dataset
   ```
   I guess the objective of this query is to get the most recent data time in the database.
   Because my database has a 3-hour delay, the result of the query is `2020-10-13`.
   
   On 2020-10-13 18:00:25.735550 UTC:
   ```sql
   SELECT dimension1, COUNT(metric1) 
   FROM dataset 
   WHERE  UNIX_SECONDS(date) BETWEEN 1602007200 AND 1602612000 
   GROUP BY dimension1 
   LIMIT 100000
   ```
   I guess the objective of this query is to get the distinct values of `dimension1`, because my detection is configured to split by `dimension1`.
   The result of the query is [`value1`, `value2`, `value3`].
   Notice that here the endTime is 1602612000 = 13 October 2020 18:00:00 UTC, which seems to correspond to the cron time.
   
   Finally on 2020-10-13 18:00:30.041758 UTC:
   ```sql
   SELECT date, COUNT(metric1) 
   FROM dataset 
   WHERE UNIX_SECONDS(date) BETWEEN 1594684800 AND 1602633600 
   AND metric1 IN ('value1') 
   GROUP BY date 
   LIMIT 100000
   ```
   The objective of this query is to get the data for the anomaly detection.
   Here the endTime is 1602633600 = 14 October 2020 00:00:00 UTC!
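   Converting the two end bounds seen above with plain java.time (just a sanity check of the epoch values, not ThirdEye code) confirms them:
   ```java
   import java.time.Instant;

   public class QueryEndBounds {
     public static void main(String[] args) {
       // end bound of the dimension-exploration query (the cron fire time)
       System.out.println(Instant.ofEpochSecond(1602612000L)); // 2020-10-13T18:00:00Z
       // end bound of the detection query
       System.out.println(Instant.ofEpochSecond(1602633600L)); // 2020-10-14T00:00:00Z
     }
   }
   ```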
   
   ### Problem
   - 14 October is later than the observed max date
   - 13 October would not be correct either, because at the time the job is running (18:00 on 13 October) the data for 13 October is not complete, so it does not make sense to compare it to previous daily data (BETWEEN is inclusive)
   
   --> The correct value I was expecting was 12 October.
   
   This problem (if the implementation is the same) may not be specific to DAYS granularity or to a date-precision-only column:
   - for HOURS granularity: computing the 16:00 metric at 16:30 would hit the same problem
   - for a datetime column: in my example, most_recent_data_time would have been 2020-10-13 15:00:00 --> same problem afterwards, I guess
   
   Behavior I would expect:
   something like `unix_millis(floorBy(most_recent_data_time, timeUnit)) - 1`,
   with `floorBy` a function rounding down to the nearest timeUnit boundary, and -1 because the BETWEEN clause is inclusive.
   eg1:
   most_recent_data_time = 13 October, granularity is DAYS:
   --> return 12 October 23:59:59.999
   eg2:
   most_recent_data_time = 13 October 16:30, granularity is HOURS:
   --> return 13 October 15:59:59.999
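   A minimal sketch of that expectation with java.time (illustrative only, assuming `granularity` matches the dataset's timeUnit; this is not the actual ThirdEye implementation):
   ```java
   import java.time.Instant;
   import java.time.temporal.ChronoUnit;

   public class ExpectedUpperBound {

     // Floor the most recent data time to the granularity, then subtract 1 ms
     // because the BETWEEN clause is inclusive.
     static long upperBoundMillis(Instant mostRecentDataTime, ChronoUnit granularity) {
       return mostRecentDataTime.truncatedTo(granularity).toEpochMilli() - 1;
     }

     public static void main(String[] args) {
       // eg1: most_recent_data_time = 13 October, DAYS granularity
       System.out.println(Instant.ofEpochMilli(
           upperBoundMillis(Instant.parse("2020-10-13T00:00:00Z"), ChronoUnit.DAYS)));
       // -> 2020-10-12T23:59:59.999Z

       // eg2: most_recent_data_time = 13 October 16:30, HOURS granularity
       System.out.println(Instant.ofEpochMilli(
           upperBoundMillis(Instant.parse("2020-10-13T16:30:00Z"), ChronoUnit.HOURS)));
       // -> 2020-10-13T15:59:59.999Z
     }
   }
   ```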
   
   Did I miss something in the config?

