Re: Separate 100 M spatial data in 100 tables VS one big table

2024-03-05 Thread Marc Millas
Hi Kimaidou,
why not a partitioned table with the department as the partitioning key?
Each year, just detach the obsolete data department by department (i.e.
detach the partition, which is almost instantaneous) and then drop or keep
the old data. No DELETE, and quite easy to maintain. For each index created
on the partitioned table, Postgres creates one index per partition, and
detaches it along with the partition when you detach a department.
So when importing, first create an appropriate table, load the data into
it, and attach it to the main partitioned table. Postgres will
automatically create all the necessary indexes.
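
Roughly something like this (just a sketch; the table name, column names and
SRID are made up, adapt to your schema):

    CREATE TABLE parcels (
        parcel_id  bigint NOT NULL,
        department text   NOT NULL,
        geom       geometry(MultiPolygon, 2154),
        PRIMARY KEY (department, parcel_id)
    ) PARTITION BY LIST (department);

    CREATE INDEX parcels_geom_gist ON parcels USING gist (geom);

    -- one partition per department, e.g.:
    CREATE TABLE parcels_d34 PARTITION OF parcels FOR VALUES IN ('34');

    -- yearly reload of a department:
    ALTER TABLE parcels DETACH PARTITION parcels_d34;   -- almost instantaneous
    DROP TABLE parcels_d34;                             -- or keep it as an archive

    CREATE TABLE parcels_d34_new (LIKE parcels INCLUDING DEFAULTS);
    ALTER TABLE parcels_d34_new ADD CHECK (department = '34');  -- lets ATTACH skip the validation scan
    -- COPY / load the new data into parcels_d34_new, then:
    ALTER TABLE parcels ATTACH PARTITION parcels_d34_new FOR VALUES IN ('34');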

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com



On Tue, Mar 5, 2024 at 8:45 AM kimaidou  wrote:

> Hi list,
>
> In France, the total number of cadastral parcels is around 100 000 000
>
> The data can be heavy, because each parcel stores a geometry (PostGIS
> geometry data type inside a geom column).
> Indexes must be created to increase performance of day-to-day requests:
>
> * GIST index on geom for spatial filtering and intersection with other
> geometries (other tables)
> * Primary key and probably another unique code to index
> * one index on the "department" field. There are around 100 "departments"
> (admin boundaries) in France, and the parcels are evenly distributed
> (~1M parcels per "department")
>
> The initial import of this data is done one department at a time (the data
> source is distributed per department by the French authorities). Each year
> the data must be deleted and reimported (the data changes every year), and
> this is also often done one department at a time.
>
> * Sometimes requests are made with a department filter (for example WHERE
> department IN ('2A', '34', '30', '48') )
> * Sometimes other database clients must be able to get data from the whole
> dataset (for example, get the parcels for a list of known IDs)
>
> I would like to ask the list about the following 2 strategies to maintain
> such data:
>
> 1/ Put the whole dataset into one big table
> 2/ Create one table per department, and create a VIEW with 100 UNION ALL
> to gather all the parcels
>
> 1/ Seems simpler for the database clients, but it seems to me it can be a
> pain to maintain. For example, each time we need to replace last year's
> data for one department with the new data, we will have to delete 1M rows,
> reimport the new 1M rows, and run VACUUM FULL to regain the space.
> Indexes will be huge, and I may run into issues like this one:
>
> https://www.postgresql.org/message-id/CAMKXKO7yXmduSs4zzMfdRaPUn2kOKtQ6KMnDe1GxEr56Vr8hxA%40mail.gmail.com
> I often need to use pg_repack to regain space on this kind of table;
> VACUUM FULL cannot be used because it locks the table and takes a long
> time (!)
>
> 2/ Seems more KISS, but only if queries on the UNION ALL view are able to
> use the per-table indexes (geom, department) and perform as well as the
> big table.
>
>
> Any hint appreciated !
> Regards
>
> Kimaidou
>
>
>


Re: Optimizing count(), but Explain estimates wildly off

2024-03-05 Thread Chema
On Mon, Mar 4, 2024 at 7:50 PM Greg Sabino Mullane (
[email protected]) wrote:

> On Mon, Mar 4, 2024 at 2:14 PM Chema  wrote:
>
>> There's one JSON column in each table with a couple of fields, and a
>> column with long texts in Items.
>
> and earlier indicated the query was:
>
>> Select * from tenders inner join items
>
>
> You do not want to do a "select star" on both tables unless you 100% need
> every single column and plan to actively do something with it. Especially
> true for large text and json columns. Also, use jsonb not json.
>
Tuples aren't really that long on average (300 bytes for Tenders, twice as
much for Items).  In any case, the Select * was to be used with Explain to
obtain an estimated row count instantly from the stats, as described in my
first email, but even raising the statistics target to 5k on the relevant
columns has not improved the planner's estimates, which are off by almost
1M, and there's been no suggestion of what could cause that.
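
(For reference, the "instant estimate" is basically the classic trick of
parsing the planner's row estimate out of EXPLAIN; a rough sketch, with the
join condition in the example only assumed:)

    CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint
    LANGUAGE plpgsql AS $$
    DECLARE
        rec record;
        n   bigint;
    BEGIN
        -- take the row estimate from the top node of the text plan
        FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
            n := substring(rec."QUERY PLAN" FROM 'rows=([[:digit:]]+)');
            EXIT WHEN n IS NOT NULL;
        END LOOP;
        RETURN n;
    END;
    $$;

    -- SELECT count_estimate('SELECT * FROM tenders JOIN items
    --     ON items.tender_transaction_id = tenders.transaction_id');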

Googling once again, though, this SO answer implies that this might
actually be the norm for anything but the simplest queries:

Depending on the complexity of your query, this number may become less and
less accurate. In fact, in my application, as we added joins and complex
conditions, it became so inaccurate it was completely worthless, even to
know within a power of 100 how many rows we'd have returned, so we had to
abandon that strategy.


But if your query is simple enough that Pg can predict within some
reasonable margin of error how many rows it will return, it may work for
you.


Re: Optimizing count(), but Explain estimates wildly off

2024-03-05 Thread Greg Sabino Mullane
> columns has not improved the planner's estimates, which are off by almost
> 1M, and there's been no suggestion of what could cause that.

You are asking a lot of the planner - how would it know that the average
number of items is much higher for ids derived indirectly from "Mexico"
versus ids derived from "Colombia"?

One thing you could try just as a general performance gain is index-only
scans, by creating an index like this:

create index tenders_date_country_id on tenders (country, "date") include
(transaction_id);

>>  Parallel Seq Scan on pricescope_items  (cost=0.00..1027794.01 rows=3277101 width=522)
>>    (actual time=0.753..41654.507 rows=2621681 loops=3)
> Why does it take over 41 seconds to read a table with less than 3 million
> rows?

Good question. I still maintain it's because you are doing a 'select star'
on large, toasted rows.

I made two tables with approximately the same number of rows, and ran the
query. It returned a hash join containing:

->  Parallel Seq Scan on items  (cost=0.00..69602.93 rows=3375592 width=8)
 (actual time=0.015..185.414 rows=2700407 loops=3)

Then I boosted the width by a lot by adding some filled text columns, and
it returned the same number of rows, but much slower:

->  Parallel Seq Scan on items  (cost=0.00..1729664.15 rows=3370715 width=1562)
 (actual time=0.027..36693.986 rows=2700407 loops=3)

A second run with everything in cache was better, but still an order of
magnitude worse than the small-row case:

->  Parallel Seq Scan on items  (cost=0.00..1729664.15 rows=3370715 width=1562)
 (actual time=0.063..1565.486 rows=2700407 loops=3)

Best of all was a "SELECT 1" which switched the entire plan to a much
faster merge join, resulting in:

-> Parallel Index Only Scan using items_tender_transaction_id_index on items
 (cost=0.43..101367.60 rows=3372717 width=4)
 (actual time=0.087..244.878 rows=2700407 loops=3)

Yours will be different, as I cannot exactly duplicate your schema or data
distribution, but give "SELECT 1" a try. This was on Postgres 16, FWIW,
with a default_statistics_target of 100.
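
Concretely, something along these lines for the count (the column names are
guesses pieced together from the thread, so adjust to your schema):

    SELECT count(*)
    FROM (
        SELECT 1
        FROM tenders t
        JOIN items i ON i.tender_transaction_id = t.transaction_id
        WHERE t.country = 'Mexico'
          AND t."date" >= '2023-01-01'
    ) AS s;
    -- with the (country, "date") INCLUDE (transaction_id) index above, the
    -- tenders side can become an index-only scan, and no wide, toasted
    -- columns get dragged through the plan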

Cheers,
Greg


Re: Separate 100 M spatial data in 100 tables VS one big table

2024-03-05 Thread Tomas Vondra
On 3/5/24 13:47, Marc Millas wrote:
> Hi Kimaidou,
> why not a partitioned table with the department as the partitioning key?
> Each year, just detach the obsolete data department by department (i.e.
> detach the partition, which is almost instantaneous) and then drop or keep
> the old data. No DELETE, and quite easy to maintain. For each index created
> on the partitioned table, Postgres creates one index per partition, and
> detaches it along with the partition when you detach a department.
> So when importing, first create an appropriate table, load the data into
> it, and attach it to the main partitioned table. Postgres will
> automatically create all the necessary indexes.
> 

Yes, a table partitioned like this is certainly a valid option - and
it's much better than the view with a UNION of all the per-department
tables. The optimizer has very little insight into the view, which
limits how it can optimize queries. For example if the query has a
condition like

   WHERE department = 'X'

with the declarative partitioning the planner can eliminate all other
partitions (and just ignore them), while with the view it will have to
scan all of them.
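
You can see the difference with a simple EXPLAIN (the names here are just
an example):

    EXPLAIN SELECT count(*)
      FROM parcels
     WHERE department IN ('2A', '34', '30', '48');
    -- with a table partitioned by department, only those four partitions
    -- show up in the plan; with the UNION ALL view, every per-department
    -- table appears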

But is partitioning a good choice? Who knows - it makes some operations
simpler (e.g. you can detach/drop a partition instead of deleting the
rows), but it also makes other operations less efficient. For example a
query that can't eliminate partitions has to do more stuff during execution.

So to answer this we'd need to know how often stuff like bulk deletes /
reloads happen, what queries will be executed, and so on. Both options
(non-partitioned and partitioned table) are valid, but you have to try.

Also, a partitioned table may not support / allow some features - for
example, unique keys that don't contain the partition key. We're
improving this in every release, but there will always be a gap.
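
For example (names are illustrative, the "unique code" column is assumed):

    -- works, because the partition key is part of the constraint:
    ALTER TABLE parcels ADD UNIQUE (department, cadastral_code);

    -- rejected, because it does not include the partition key:
    -- ALTER TABLE parcels ADD UNIQUE (cadastral_code);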

I personally would start with a non-partitioned table, because that's
the simplest option. And once I get a better idea of how often the
reloads happen, I'd consider whether that's worth the extra complexity
of partitioning the data. If it happens only occasionally (a couple of
times a year), it probably is not. You'll just delete the data and reuse
the space for new data.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company