marising opened a new pull request #3994: URL: https://github.com/apache/incubator-doris/pull/3994
#2581 ## Solutions This cache give priority to ensuring data consistency. On this basis, it refines the cache granularity and improves the hit rate. Therefore, it has the following characteristics: * User don't need to worry about data consistency. Cache invalidation is controlled by version. The cached data is consistent with the data queried from be * Without additional components and costs, the cache results are stored in the memory of be, and user can adjust the cache memory size according to their needs * Two caching strategies are implemented, sql and partition cache, which are more granular * The cache algorithm in be is an improved LRU ## Two cache mode ### SQLCache Sql cache stores and fetches the cache according to the **SQL signature**, **partition ID** of the query table, and the **latest version** of the partition. The combination of the three determines a cache dataset. If any one of them changes, such as SQL changes, query fields or conditions are not the same, or the version after data update changes, the cache will not be hit. If multiple tables are joined, the latest partition ID and the latest version number are used. If one of the tables is updated, the partition ID or version number will be different, and the cache will not be hit. Sql cache is more suitable for the scenario of T + 1 update. When the data is updated in the morning, the results of the first query are obtained from be and put into the cache, and the subsequent same query is obtained from the cache. Real time update data can also be used, but there may be a low hit rate. Please refer to the following partitioncache. ### PartitionCache Query the number of users per day in the last 7 days, such as partitioning by date, data is only written to the current partition, and the data of other partitions other than that day are fixed. Under the same query SQL, query a partition that does not update The indicators are fixed. As follows, the number of users in the 7 days before the query on 2020-03-09, the data from 2020-03-03 to 2020-03-07 comes from the cache, the first query from 2020-03-08 comes from the partition, and the subsequent queries come from the cache , 2020-03-09 because of the non-stop writing that day, so from the partition. Therefore, querying the data of N days, the latest D days of the data update, each day is only a query with a similar date range, only need to query D partitions, the other parts are all from the cache, which can effectively reduce the cluster load and reduce the query time. ``` MySQL [(none)]> SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-03" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate; +------------+-----------------+ | eventdate | count(`userid`) | +------------+-----------------+ | 2020-03-03 | 15 | //From cache | 2020-03-04 | 20 | ... | 2020-03-05 | 25 | | 2020-03-06 | 30 | | 2020-03-07 | 35 | | 2020-03-08 | 40 | //From cache | 2020-03-09 | 25 | //From disk +------------+-----------------+ 7 rows in set (0.02 sec) ``` ## Reference For more information, please read [partition_cache.md](https://github.com/apache/incubator-doris/compare/master...marising:partition_cache_0.3?expand=1#diff-bfa91ffdf5960b0af00c5775fa9a2db2) ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org