Re: Optimizing count(), but Explain estimates wildly off

2024-03-04 Thread Chema
>
> > ->  Parallel Seq Scan on pricescope_items  (cost=0.00..1027794.01
> rows=3277101 width=522) (actual time=0.753..41654.507 rows=2621681 loops=3)
>
> Why does it take over 41 seconds to read a table with less than
> 3 million rows?  Are the rows so large?  Is the table bloated?
> What is the size of the table as measured with pg_relation_size()
> and pg_table_size()?

There's one JSON column in each table with a couple fields, and a column
with long texts  in Items.

-- pg_table_size, pg_relation_size, pg_indexes_size, rows
name      table_size      relation_size   index_size   row_estimate
tenders   1,775,222,784   1,630,461,952                3,815,567
items     8,158,773,248   6,052,470,784                7,865,043
check_postgres gave a 1.4 bloat score to tenders and 1.9 to items.  I had a
duplicate index on transaction_id (one hand-made, the other from the unique
constraint) and other text-column indexes with 0.3-0.5 bloat scores.  After
VACUUM FULL ANALYZE, sizes are greatly reduced, especially Items:

-- pg_table_size, pg_relation_size, pg_indexes_size, rows
name      table_size      relation_size   index_size      row_estimate
tenders   1,203,445,760   1,203,421,184     500,482,048   3,815,567
items     4,436,189,184   4,430,790,656   2,326,118,400   7,865,043
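
For reference, a sketch of how the numbers above can be gathered in one go
(table names as in this thread; row_estimate is the planner's estimate from
the statistics, not an exact count):

    SELECT relname               AS name,
           pg_table_size(oid)    AS table_size,
           pg_relation_size(oid) AS relation_size,
           pg_indexes_size(oid)  AS index_size,
           reltuples::bigint     AS row_estimate
    FROM pg_class
    WHERE relname IN ('tenders', 'items');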

There were a couple of mass deletions which probably caused the bloating.
Autovacuum is on defaults, but I guess it doesn't take care of that.
Still, performance seems about the same.

The planner is now using an Index Scan for Colombia without the subselect
hack, but the subselect takes ~200 ms less on average, so I might as well
keep doing it.

The row estimate is still off by more than 1M, so I still can't use it, but
at least now it takes less than 10 seconds to get the exact count with all
countries.


Re: Optimizing count(), but Explain estimates wildly off

2024-03-04 Thread Greg Sabino Mullane
On Mon, Mar 4, 2024 at 2:14 PM Chema  wrote:

> There's one JSON column in each table with a couple fields, and a column
> with long texts  in Items.

and earlier indicated the query was:

> Select * from tenders inner join items


You do not want to do a "select star" on both tables unless you 100% need
every single column and plan to actively do something with it. Especially
true for large text and json columns. Also, use jsonb not json.
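
For example (a sketch only; the join column is taken from earlier in the
thread, the other column names are placeholders), select just the columns
you actually use, and convert json columns to jsonb once:

    -- Select only the columns the application really needs:
    SELECT t.transaction_id, t.country, i.price
    FROM tenders t
    INNER JOIN items i ON i.transaction_id = t.transaction_id;

    -- One-time conversion of a json column to jsonb (rewrites the table):
    ALTER TABLE items
        ALTER COLUMN extra_data TYPE jsonb
        USING extra_data::jsonb;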

Cheers,
Greg


Re: Is partition pruning impacted by data type

2024-03-04 Thread sud
Hello,
Has anybody got experience with range-partitioned tables using a
timestamptz or "timestamp without time zone" column, and seen any such
known issues with pruning?
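
For reference, a minimal sketch of the setup in question (daily range
partitions on a timestamptz column, plus an EXPLAIN to check that pruning
happens); the table name and partition bounds are placeholders, while the
column name is taken from the thread below:

    CREATE TABLE txn (
        txn_id bigint,
        transaction_create_date timestamptz NOT NULL
    ) PARTITION BY RANGE (transaction_create_date);

    CREATE TABLE txn_2024_03_04 PARTITION OF txn
        FOR VALUES FROM ('2024-03-04 00:00+00') TO ('2024-03-05 00:00+00');
    CREATE TABLE txn_2024_03_05 PARTITION OF txn
        FOR VALUES FROM ('2024-03-05 00:00+00') TO ('2024-03-06 00:00+00');

    -- With constant (or stable) bounds, the plan should touch one partition:
    EXPLAIN (COSTS OFF)
    SELECT * FROM txn
    WHERE transaction_create_date >= '2024-03-04 00:00+00'
      AND transaction_create_date <  '2024-03-05 00:00+00';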



> On Tue, 5 Mar, 2024, 1:09 am sud,  wrote:
>
>> Hi,
>> We are designing an application which is currently restricted to one
>> time zone's users but may go global in the future. Some of the
>> transaction tables are going to be daily range partitioned on the
>> transaction_create_date column. But the "date" data type has no time
>> component, so we are thinking of making it a timestamp data type
>> (timestamptz(6)), which will help us in two ways.
>>
>> Firstly, in the current use cases the majority of the queries are going
>> to cover a day or multiple days of transactions. But if we have any use
>> case which needs lower granularity, such as hourly ranges, then having a
>> "timestamp" data type with an index created on it will help. And in the
>> future, if we plan to partition on a lower granularity such as hourly,
>> that can be accommodated easily with a "timestamp" data type.
>>
>> However, the questions we have are:
>>
>> 1) Is there any downside to having the partition key be "timestamp with
>> time zone"? Will it impact partition pruning in any way, by appending a
>> run-time "time zone" conversion function during query planning/execution?
>>
>> 2) As it will take the default server time, the server time will change
>> during daylight saving; can that cause any unforeseen issue?
>>
>> 3) Will this cause the data to be spread unevenly across partitions and
>> make the partitions unevenly sized? If we go for UTC/GMT as the db time,
>> a user's one-day transactions might span two daily partitions.
>>
>> Thanks and Regards
>> Sud


Separate 100 M spatial data in 100 tables VS one big table

2024-03-04 Thread kimaidou
Hi list,

In France, the total number of cadastral parcels is around 100 000 000.

The data can be heavy, because each parcel stores a geometry (PostGIS
geometry data type inside a geom column).
Indexes must be created to increase performance of day-to-day requests:

* GIST index on geom for spatial filtering and intersection with other
geometries (other tables)
* Primary key and probably another unique code to index
* one index on the "department" field. There are around 100 "departments"
(admin boundaries) in France, and the parcels are homogeneously distributed
(~1M parcels per "department")

The initial import of this data is done one department at a time (the data
source is distributed per department by the French authorities). And each
year the data must be deleted and reimported (the data changes each year),
and this is also often done one department at a time.

* Sometimes requests are made with a department filter (for example WHERE
department IN ('2A', '34', '30', '48'))
* Sometimes other database clients must be able to get data from the whole
dataset (for example, getting the parcels for a list of known IDs)

I would like to ask the list about the following 2 strategies for
maintaining such data:

1/ Put the whole dataset into one big table
2/ Create one table per department, and create a VIEW with 100 UNION ALL to
gather all the parcels

1/ Seems simpler for the database clients, but it seems to me this can be a
pain to maintain. For example, each time we need to replace last year's data
for one department with the new data, we will need to delete 1M rows,
reimport the new 1M rows, and VACUUM FULL to regain space.
Indexes will be huge, and I can run into issues like:
https://www.postgresql.org/message-id/CAMKXKO7yXmduSs4zzMfdRaPUn2kOKtQ6KMnDe1GxEr56Vr8hxA%40mail.gmail.com
I often need to use pg_repack to regain space on this kind of table.
VACUUM FULL cannot be used because it locks the table, and it takes time (!)

2/ Seems more KISS, but only if queries on the UNION ALL view are able to
use the per-table indexes (geom, department) and perform as well as on the
big table.
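
For illustration, a minimal sketch of option 2/ with only two departments
shown (column names and the SRID are placeholders; assumes PostGIS is
installed):

    CREATE TABLE parcels_d30 (
        id         bigint PRIMARY KEY,
        code       text UNIQUE,
        department text NOT NULL,
        geom       geometry(MultiPolygon, 2154)
    );
    CREATE INDEX ON parcels_d30 USING gist (geom);
    CREATE INDEX ON parcels_d30 (department);

    -- Same structure (including indexes) for each of the ~100 departments:
    CREATE TABLE parcels_d34 (LIKE parcels_d30 INCLUDING ALL);

    CREATE VIEW parcels AS
        SELECT * FROM parcels_d30
        UNION ALL
        SELECT * FROM parcels_d34;

Whether the planner can push filters into the branches and use each table's
indexes is exactly what would need to be verified with EXPLAIN on this view.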


Any hint appreciated !
Regards

Kimaidou