GitHub user my-ship-it closed a discussion: Enable answer query using 
Materialized View for external table.

### Description

Currently, for inner table,  planner knows to use MV to speed up query, for 
example:

```
create table tbl_heap(c1 int, c2 int);
insert into tbl_heap select generate_series(1,1000000);
create materialized view mv as select c1, c2 from tbl_heap where c1 > 10 and c2 
< 20;
analyze mv;
set optimizer = off;
set enable_answer_query_using_materialized_views = on;

postgres=# explain select c1, c2 from tbl_heap where c1 > 10 and c2 < 20;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1.03 rows=1 width=8)
   ->  Seq Scan on mv  (cost=0.00..1.01 rows=1 width=8)
 Optimizer: Postgres query optimizer
(3 rows)

```

But for external table, it doesn't work
```
create external table tbl_ext(c1 int, c2 int) 
location('gpfdist://localhost:8080/tbl_ext') format 'csv';
create materialized view mv_ext as select c1, c2 from tbl_ext where c1 > 10 and 
c2 < 20;
analyze mv_ext;
set optimizer = off;
set enable_answer_query_using_materialized_views = on;

postgres=# explain select c1, c2 from tbl_ext where c1 > 10 and c2 < 20;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..246940.11 rows=17320508 
width=8)
   ->  Foreign Scan on tbl_ext  (cost=0.00..16000.00 rows=5773503 width=8)
         Filter: ((c1 > 10) AND (c2 < 20))
 Optimizer: Postgres query optimizer
(4 rows)

```

The reason why the replacement for external doesn't work is that, if external 
table changes, planner doesn't know that, and replacement of MV would produce 
wrong results.

But for some cases, access to external table(for example, Iceberg, Hudi, and 
etc) is much slower than inner table(maybe 10X slower or more). If user knows 
there is no data changes for external table, we could speed up query against 
external table, which is a big win in this scenario. Another approach is that 
user load external table into inner table for analysis, but there is latency, 
and loading process is also time consuming.

How about we introduce a GUC to control the behavior? If the user is 100% sure 
that the external table data has not been modified, they could turn on the GUC 
and speed up query.

### Use case/motivation

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

GitHub link: https://github.com/apache/cloudberry/discussions/693

----
This is an automatically sent email for dev@cloudberry.apache.org.
To unsubscribe, please send an email to: dev-unsubscr...@cloudberry.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cloudberry.apache.org
For additional commands, e-mail: dev-h...@cloudberry.apache.org

Reply via email to