Re: [D] [Feature] Dynamic Tables [cloudberry]

via GitHub Fri, 06 Dec 2024 04:51:12 -0800


GitHub user avamingli edited a comment on the discussion: [Feature] Dynamic 
Tables


I have been coding some functions, here are design details:

## DDL:

Provide a base table 
```sql
CREATE TABLE t1(a int, b int, c int) distributed by (b);
INSERT INTO t1 SELECT i, i + 1, i + 2 FROM GENERATE_SERIES(1, 10) i;
INSERT INTO t1 SELECT i, i + 1, i + 2 FROM GENERATE_SERIES(1, 5) i;
```
### Create  Dynamic Table:

```sql
CREATE DYNAMIC TABLE dt0 SCHEDULE '5 * * * *' AS
  SELECT a, b, sum(c) FROM t1 GROUP BY a, b WITH NO DATA DISTRIBUTED BY(b);
CREATE DYNAMIC TABLE

\d
                 List of relations
 Schema | Name |     Type      |  Owner  | Storage
--------+------+---------------+---------+---------
 public | dt0  | dynamic table | gpadmin | heap
 public | t1   | table         | gpadmin | heap
(2 rows)
```

CREATE DYNAMIC TABLE xxx AS `Query`
The `Query` allows any valid SELECT SQL of Materialized Views: from single or 
multiple relations,  base tables, materialized views,  and dynamic tables as 
well,  joins, subquery, aggregation, group by and etc.
However, if you want to use it to Answer Query, that is limited by AQUMV: 
currently we allow Select from single base table, aggregation on it or 
aggregation SQL replace directly #705 

#### SCHEDULE: 
A string used to schedule background job which auto-refreshes the dynamic table.
We follow the valid string of pg_cron extension which supports linux crontab, 
refer  https://crontab.guru/ .
```text
 ┌───────────── min (0 - 59)
 │ ┌────────────── hour (0 - 23)
 │ │ ┌─────────────── day of month (1 - 31) or last day of the month ($)
 │ │ │ ┌──────────────── month (1 - 12)
 │ │ │ │ ┌───────────────── day of week (0 - 6) (0 to 6 are Sunday to
 │ │ │ │ │                  Saturday, or use names; 7 is also Sunday)
 │ │ │ │ │
 │ │ │ │ │
 * * * * *
```

You can also use '[1-59] seconds' to schedule a job based on an interval.
The example creates a cron job refreshing the dynamic table at minute 5 of each 
hour.

User don't need to consider the auto-refresh job, however query on pg_task 
catalog if we want to see the task:
```sql
SELECT * FROM pg_task WHERE jobname LIKE 'gp_dynamic_table_refresh%' AND 
command LIKE '%dt0';
 jobid | schedule  |             command              | nodename  | nodeport | 
database | username | active |
   jobname
-------+-----------+----------------------------------+-----------+----------+----------+----------+--------+---------
-----------------------
 17398 | 5 * * * * | REFRESH DYNAMIC TABLE public.dt0 | 127.0.0.1 |     9000 | 
gpadmin  | gpadmin  | t      | gp_dynam
ic_table_refresh_17394
(1 row)
```
And a function **pg_get_dynamic_table_schedule** is provided for users to see 
the SCHEDULE info easily:

```sql
CREATE DYNAMIC TABLE dt_schedule SCHEDULE '1 2 3 4 5' AS SELECT * FROM t2;
SELECT pg_catalog.pg_get_dynamic_table_schedule('dt_schedule'::regclass::oid);
 pg_get_dynamic_table_schedule 
-------------------------------
 1 2 3 4 5
(1 row)
```

As Snowflake, Dynamic Tables should always have a auto-refresh process.
However, for convenience, I make SCHEDULE optional. If user didn't specific it, 
a default schedule is provided: maybe at every 5th minute(snowflake limit at 
most 5 minutes for dynamic table auto-refresh, not sure)?

#### WITH NO DATA: 
Same as Materialized View, will create an empty Dynamic Table if specified.

#### DISTRIBUTED BY:
Same as normal tables in CBDB, Dynamic Tables could support distribution keys 
as materialized views.
Use \d+ to see the distribution keys and the Query SQL of Dynamic Tables.
```sql
\d+ dt0;
                                       Dynamic table "public.dt0"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | 
Stats target | Description
--------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
 a      | integer |           |          |         | plain   |             |    
          |
 b      | integer |           |          |         | plain   |             |    
          |
 sum    | bigint  |           |          |         | plain   |             |    
          |
View definition:
 SELECT t1.a,
    t1.b,
    sum(t1.c) AS sum
   FROM t1
  GROUP BY t1.a, t1.b;
Distributed by: (b)
Access method: heap
```
#### Refresh Dynamic Table
As seen in pg_task, we put a command to auto-refresh dynamic tables. 
However, if users want to do a REFRESH manually, exec command `REFRESH DYNAMIC 
TABLE` is also supported.
```sql
REFRESH DYNAMIC TABLE dt0;
REFRESH DYNAMIC TABLE
```
#### REFRESH WITH NO DATA;
Same as Materialized Views, Refresh with no data will truncate the Dynamic 
Table and make it unpopulated status.
```sql
REFRESH DYNAMIC TABLE dt0 WITH NO DATA;
REFRESH DYNAMIC TABLE
```

#### Drop Dynamic Table:
```sql
DROP DYNAMIC TABLE dt0;
DROP DYNAMIC TABLE
```
Drop a Dynamic Table will drop its scheduler job automatically.
```sql
SELECT * FROM pg_task WHERE jobname LIKE 'gp_dynamic_table_refresh%' AND 
command LIKE '%dt0';
 jobid | schedule | command | nodename | nodeport | database | username | 
active | jobname
-------+----------+---------+----------+----------+----------+----------+--------+---------
(0 rows)
```
## Privileges
Same as Materialized Views in CBDB:
```sql
\z
                                Access privileges
 Schema | Name |     Type      | Access privileges | Column privileges | 
Policies
--------+------+---------------+-------------------+-------------------+----------
 public | dt1  | dynamic table |                   |                   |
 public | t1   | table         |                   |                   |
(2 rows)
```

## Use Dynamic Tables to answer query

Like Materialized Views, Dynamic Tables could be  used to answer query too:
```sql
CREATE DYNAMIC TABLE dt1  AS
  SELECT * FROM t1 WHERE a = 1 DISTRIBUTED BY(b);
ANALYZE dt1;
SELECT 2
```
```sql
SET enable_answer_query_using_materialized_views = ON;
EXPLAIN(COSTS OFF, VERBOSE)
SELECT * FROM t1 WHERE a = 1;
SELECT * FROM t1 WHERE a = 1;
SET
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: a, b, c
   ->  Seq Scan on public.dt1
         Output: a, b, c
 Settings: enable_answer_query_using_materialized_views = 'on', optimizer = 
'off'
 Optimizer: Postgres query optimizer
(6 rows)

 a | b | c
---+---+---
 1 | 2 | 3
 1 | 2 | 3
(2 rows)
```

> I think we can leverage the combination of these techniques, including 
> materialized views for external tables, dynamic tables (auto refreshing 
> materialized views) and AQUMV to solve the problem often raised by customers 
> who are big fans of a lakehouse architecture: how can we run queries on 
> external tables as fast as internal tables?

We will enable that after #702 is merged, with Dynamic Tables feature.





GitHub link: 
https://github.com/apache/cloudberry/discussions/706#discussioncomment-11316615

----
This is an automatically sent email for dev@cloudberry.apache.org.
To unsubscribe, please send an email to: dev-unsubscr...@cloudberry.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cloudberry.apache.org
For additional commands, e-mail: dev-h...@cloudberry.apache.org

Re: [D] [Feature] Dynamic Tables [cloudberry]

Reply via email to