[GitHub] [incubator-doris] marising opened a new pull request #3994: SQL result cache and partition cache

GitBox Wed, 01 Jul 2020 00:59:54 -0700


marising opened a new pull request #3994:
URL: https://github.com/apache/incubator-doris/pull/3994



   #2581
   
   ## Solutions
   
   This cache give priority to ensuring data consistency. On this basis, it 
refines the cache granularity and improves the hit rate. Therefore, it has the 
following characteristics:
   
   * User don't need to worry about data consistency. Cache invalidation is 
controlled by version. The cached data is consistent with the data queried from 
be
   * Without additional components and costs, the cache results are stored in 
the memory of be, and user can adjust the cache memory size according to their 
needs
   * Two caching strategies are implemented, sql and partition cache, which are 
more granular
   * The cache algorithm in be is an improved LRU
   
   ## Two cache mode
   
   ### SQLCache
   Sql cache stores and fetches the cache according to the **SQL signature**, 
**partition ID** of the query table, and the **latest version** of the 
partition. 
   
   The combination of the three determines a cache dataset. If any one of them 
changes, such as SQL changes, query fields or conditions are not the same, or 
the version after data update changes, the cache will not be hit.
   
   If multiple tables are joined, the latest partition ID and the latest 
version number are used. If one of the tables is updated, the partition ID or 
version number will be different, and the cache will not be hit.
   
   Sql cache is more suitable for the scenario of T + 1 update. When the data 
is updated in the morning, the results of the first query are obtained from be 
and put into the cache, and the subsequent same query is obtained from the 
cache. Real time update data can also be used, but there may be a low hit rate. 
Please refer to the following partitioncache.
   
   ### PartitionCache
   Query the number of users per day in the last 7 days, such as partitioning 
by date, data is only written to the current partition, and the data of other 
partitions other than that day are fixed. Under the same query SQL, query a 
partition that does not update The indicators are fixed. As follows, the number 
of users in the 7 days before the query on 2020-03-09, the data from 2020-03-03 
to 2020-03-07 comes from the cache, the first query from 2020-03-08 comes from 
the partition, and the subsequent queries come from the cache , 2020-03-09 
because of the non-stop writing that day, so from the partition.
   
   Therefore, querying the data of N days, the latest D days of the data 
update, each day is only a query with a similar date range, only need to query 
D partitions, the other parts are all from the cache, which can effectively 
reduce the cluster load and reduce the query time.
   
   ```
   MySQL [(none)]> SELECT eventdate,count(userid) FROM testdb.appevent WHERE 
eventdate>="2020-03-03" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY 
eventdate;
   +------------+-----------------+
   | eventdate  | count(`userid`) |
   +------------+-----------------+
   | 2020-03-03 |              15 | //From cache
   | 2020-03-04 |              20 | ...
   | 2020-03-05 |              25 |
   | 2020-03-06 |              30 |
   | 2020-03-07 |              35 |
   | 2020-03-08 |              40 | //From cache
   | 2020-03-09 |              25 | //From disk
   +------------+-----------------+
   7 rows in set (0.02 sec)
   ```
   
   ## Reference
   For more information, please read 
[partition_cache.md](https://github.com/apache/incubator-doris/compare/master...marising:partition_cache_0.3?expand=1#diff-bfa91ffdf5960b0af00c5775fa9a2db2)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org

[GitHub] [incubator-doris] marising opened a new pull request #3994: SQL result cache and partition cache

Reply via email to