Re: MAT. VIEW security problems and PG 10-11 versions?

2024-02-13 Thread Daniel Gustafsson
> On 13 Feb 2024, at 08:56, Durumdara  wrote:

> But maybe that's because PG 10 and 11 are no longer supported - and not 
> because they aren't affected by the issues.

EOL versions do not receive security updates and are not verified during
security analysis and fixing, hence they aren't listed.  Anyone still
running 10 or 11 is encouraged to upgrade to a supported version.

--
Daniel Gustafsson





FOR UPDATE SKIP LOCKED and get locked row/avoid updating other row(s)

2024-02-13 Thread Wiwwo Staff
Hi!
I am implementing a queue using PostgreSQL.
I am of course using "FOR UPDATE SKIP LOCKED".

Is there any way I can tell PostgreSQL to only "operate" on the locked row,
and/or a way to reference it?

Some explanations of what I mean:

   - I have a table with N rows
   - I lock row X with a PG Function, in transaction 1
   - I do something in my code, using transaction 2
   - I update the status of row X with a PG Function, in transaction 1

In the last step, I update the status of row X, passing my function the ID of
this row X.
But nothing stops me from updating row Y in the queue table instead, for
whatever reason.

My question again: is there any way to force a "just update the row X you
locked before, and nothing else/more"?

Thanks!
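For context, a minimal sketch of the checkout step described above, assuming a
hypothetical queue table with id and status columns (names are not from the
thread):

```sql
-- Transaction 1: lock one available job, skipping rows other workers hold
SELECT id
FROM queue
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;
```

The row stays locked until transaction 1 commits or rolls back; nothing in
this statement alone prevents a later UPDATE from targeting a different row,
which is exactly the gap the poster is asking about.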


Re: Compressing large column by moving it to a unique table

2024-02-13 Thread Adrian Garcia Badaracco
Thank you for the reply Ron.

Yes, new_table has many fewer rows (<1% of the number in the original table).

Thanks for making me think of normalization, I hadn’t seen it that way.
Although there is no theoretical relationship between the other columns in
the original table and the attributes column, in practice there is a strong
correlation, so I guess what I am trying to capture here is taking advantage
of that correlation without depending on it completely, because it can be
broken.

In any case, whatever theoretical framework is put around this solution, I am
also interested in the practical aspects, in particular the case of selecting
a subset of columns from the view that I know doesn’t need the join but the
query planner thinks does.

On Tue, Feb 13, 2024 at 3:16 AM Ron Johnson  wrote:

> On Mon, Feb 12, 2024 at 10:12 PM Adrian Garcia Badaracco <
> adr...@adriangb.com> wrote:
>
>> I am using Timescale so I'll be mentioning some timestamp stuff but I
>> think this is a general postgres question for the most part.
>>
>> I have a table with some fixed, small columns (id, timestamp, etc) and a
>> large JSONB column (let's call it `attributes`). `attributes` has 1000s of
>> schemas, but given a schema, there's a lot of duplication. Across all rows,
>> more than 99% of the data is duplicated (as measured by `count(attributes)`
>> vs `count(distinct attributes)`.
>>
>> I can't normalize `attributes` into real columns because it is quite
>> variable (remember 1000s of schemas).
>>
>> My best idea is to make a table like `(day timestamptz, hash text,
>> attributes jsonb)` and then in my original table replace `attributes` with
>> a reference to `new_table`.
>>
>
> Meaning that there are many fewer rows in new_table?
>
>
>> I can then make a view that joins them `select original_table.timestamp,
>> new_table.attributes from original join new_table on (time_bucket('1 day',
>> timestamp) = day AND original.hash = new_table.hash)` or something like
>> that. The idea of time bucketing into 1 day is to balance write and read
>> speed (by relying on timescale to do efficient time partitioning, data
>> retention, etc.).
>>
>
>> I recognize this is essentially creating a key-value store in postgres
>> and also janky compression, so I am cautious about it.
>>
>
> If my interpretation (that there are many fewer rows in new_table) is
> correct, then you've stumbled into the Second Normal Form of database
> design: https://en.wikipedia.org/wiki/Second_normal_form#Example
>
>


Re: FOR UPDATE SKIP LOCKED and get locked row/avoid updating other row(s)

2024-02-13 Thread Laurenz Albe
On Tue, 2024-02-13 at 11:17 +, Wiwwo Staff wrote:
> I am implementing a queue using PostgreSQL.
> I am of course using "FOR UPDATE SKIP LOCKED".
> 
> Is there any way I can tell PostgreSQL to only "operate" on the locked row, 
> and/or a way to reference it?
> 
> Some explanations of what I mean:
>  * I have a table with N rows
>  * I lock row X with a PG Function, in transaction 1
>  * I do something in my code, using transaction 2
>  * I update the status of row X with a PG Function, in transaction 1
> In last step, I update the row X status, passing my function the ID of this 
> row X.
> But, nothing stops me from updating row Y in the queue table, for whatever 
> reason.
> 
> My question again: any way to force a "Just update the row X you locked 
> before, and nothing else/more?"

I don't think there is a way to enforce that.  Your application code
has to do the right thing.

Yours,
Laurenz Albe




Re: Compressing large column by moving it to a unique table

2024-02-13 Thread Ron Johnson
1. Show us the PG version, view definition, the exact query that's slow,
and the EXPLAIN (ANALYZE).
2. Presumably there's an index on each table's *_hash column?

On Tue, Feb 13, 2024 at 8:48 AM Adrian Garcia Badaracco 
wrote:

> Thank you for the reply Ron.
>
> Yes there are many fewer (<1%) the number of rows in new_table.
>
> Thanks for making me think of normalization, I hadn’t seen it that way.
> Although there is no theoretical relationship between the rows in the other
> columns in the original table and the attributes column, in practice there
> is a strong correlation, so I guess what I am trying to capture here is
> taking advantage of that correlation, while not completely depending on it
> because it can be broken.
>
> In any case, whatever theoretical framework is put around this solution, I
> am also interested in the practical aspects, in particular that case of
> selecting a subset of columns from the view that I know doesn’t need the
> join but the query planner thinks does.
>
> On Tue, Feb 13, 2024 at 3:16 AM Ron Johnson 
> wrote:
>
>> On Mon, Feb 12, 2024 at 10:12 PM Adrian Garcia Badaracco <
>> adr...@adriangb.com> wrote:
>>
>>> I am using Timescale so I'll be mentioning some timestamp stuff but I
>>> think this is a general postgres question for the most part.
>>>
>>> I have a table with some fixed, small columns (id, timestamp, etc) and a
>>> large JSONB column (let's call it `attributes`). `attributes` has 1000s of
>>> schemas, but given a schema, there's a lot of duplication. Across all rows,
>>> more than 99% of the data is duplicated (as measured by `count(attributes)`
>>> vs `count(distinct attributes)`.
>>>
>>> I can't normalize `attributes` into real columns because it is quite
>>> variable (remember 1000s of schemas).
>>>
>>> My best idea is to make a table like `(day timestamptz, hash text,
>>> attributes jsonb)` and then in my original table replace `attributes` with
>>> a reference to `new_table`.
>>>
>>
>> Meaning that there are many fewer rows in new_table?
>>
>>
>>> I can then make a view that joins them `select original_table.timestamp,
>>> new_table.attributes from original join new_table on (time_bucket('1 day',
>>> timestamp) = day AND original.hash = new_table.hash)` or something like
>>> that. The idea of time bucketing into 1 day is to balance write and read
>>> speed (by relying on timescale to do efficient time partitioning, data
>>> retention, etc.).
>>>
>>
>>> I recognize this is essentially creating a key-value store in postgres
>>> and also janky compression, so I am cautious about it.
>>>
>>
>> If my interpretation (that there are many fewer rows in new_table) is
>> correct, then you've stumbled into the Second Normal Form of database
>> design: https://en.wikipedia.org/wiki/Second_normal_form#Example
>>
>>


Re: MAT. VIEW security problems and PG 10-11 versions?

2024-02-13 Thread Ron Johnson
On Tue, Feb 13, 2024 at 3:44 AM Daniel Gustafsson  wrote:

> > On 13 Feb 2024, at 08:56, Durumdara  wrote:
>
> > But maybe that's because PG 10 and 11 are no longer supported - and not
> because they aren't affected by the issues.
>
> EOL versions do not receive security updates and are not verified during
> security analysis and fixing, hence they aren't listed.  Anyone still
> running 10 or 11 is encouraged to upgrade to a supported version.
>

I wonder if MSFT and Oracle test EOL versions of their software for
discovered bugs.   Maybe, but I doubt it, for the very same reason the PGDG
doesn't do it.

But even if they do... how far back would they go?  Unless they go back to
Windows 1.0, and Oracle 1.0, _someone_ would complain.


Re: How to do faster DML

2024-02-13 Thread Peter J. Holzer
On 2024-02-12 11:28:41 -0500, Greg Sabino Mullane wrote:
> On Mon, Feb 12, 2024 at 1:50 AM veem v  wrote:
> 
> So we were thinking that adding many columns to a table should be fine in
> postgres (as here we have a use case in which the total number of columns
> may go up to ~500+). But then, accessing columns towards the end of a row
> is going to take more time compared to columns at the start of the row. As
> you mentioned, accessing the 100th column may cost 4 to 5 times more than
> accessing the first column. So, is it advisable to take a similar approach
> of breaking the table into two if the total number of columns reaches a
> certain number/threshold?
> 
> 
> I'm not sure of what Peter was testing exactly to get those 4-5x figures,

Sorry, I should have included my benchmark code (it's short enough - see
below).

What I was actually timing was

select count(*) from t_postgresql_column_bench where v{i} = 'a'

for various i.

> but I presume that is column access time,

That was the goal. Of course there is always some overhead but I figured
that by counting rows where a column has a constant value the overhead
is minimal or at least constant.

> which would not mean a direct effect on your total query time of 4-5x.

Right. In any real application the column access time is only a part of
the total processing time and probably a small part, so the effect on
total processing time is correspondingly smaller.

hp


#!/usr/bin/python3

import random
import time
import psycopg2

n_cols = 100
n_rows = 10

db = psycopg2.connect("")
csr = db.cursor()

csr.execute("drop table if exists t_postgresql_column_bench")
q = "create table t_postgresql_column_bench ("
q += ", ".join(f"v{i} text" for i in range(n_cols))
q += ")"
csr.execute(q)
q = "insert into t_postgresql_column_bench values("
q += ", ".join("%s" for i in range(n_cols))
q += ")"

for j in range(n_rows):
    v = [ chr(random.randint(96+1, 96+26)) for i in range(n_cols)]
    csr.execute(q, v)
db.commit()

for i in range(n_cols):
    q = f"select count(*) from t_postgresql_column_bench where v{i} = 'a'"
    t0 = time.clock_gettime(time.CLOCK_MONOTONIC)
    csr.execute(q)
    r = csr.fetchall()
    print(r)
    t1 = time.clock_gettime(time.CLOCK_MONOTONIC)
    print(i, t1 - t0)
db.commit()


-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: FOR UPDATE SKIP LOCKED and get locked row/avoid updating other row(s)

2024-02-13 Thread David G. Johnston
On Tuesday, February 13, 2024, Wiwwo Staff  wrote:

> Hi!
> I am implementing a queue using PostgreSQL.
> I am of course using "FOR UPDATE SKIP LOCKED".
>
> Is there any way I can tell PostgreSQL to only "operate" on the locked
> row, and/or a way to reference it?
>
> Some explanations of what I mean:
>
>- I have a table with N rows
>- I lock row X with a PG Function, in transaction 1
>- I do something in my code, using transaction 2
>- I update the status of row X with a PG Function, in transaction 1
>
> In last step, I update the row X status, passing my function the ID of
> this row X.
> But, nothing stops me from updating row Y in the queue table, for whatever
> reason.
>
> My question again: any way to force a "Just update the row X you locked
> before, and nothing else/more?"
>
>
Holding locks and open transactions while doing queue job processing is
generally not a good idea anyway, so the lack of this ability doesn’t seem
too problematic - but a cursor can probably get you close.  You can abstract
interactions with the queue table through a functional API to implement a
server-enforced policy, removing the ability for clients to do arbitrary
queries on the underlying tables.  The checkout function can tag who got
the job and the completion function can validate that the input arguments
supplied by the client belong to a job they checked out.

David J.
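A hedged sketch of the server-enforced policy described above; the table,
column, and function names are assumptions, not from the thread. Clients
would be granted EXECUTE on these functions but no direct privileges on the
queue table:

```sql
-- Checkout: claim one pending job for a worker and record who took it
CREATE FUNCTION checkout_job(p_worker text)
RETURNS bigint
LANGUAGE sql AS $$
    UPDATE queue
       SET status = 'in_progress', worker = p_worker
     WHERE id = (SELECT id
                   FROM queue
                  WHERE status = 'pending'
                  ORDER BY id
                  LIMIT 1
                  FOR UPDATE SKIP LOCKED)
    RETURNING id;
$$;

-- Completion: only accept ids that this worker actually checked out
CREATE FUNCTION complete_job(p_id bigint, p_worker text)
RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    UPDATE queue
       SET status = 'done'
     WHERE id = p_id
       AND worker = p_worker
       AND status = 'in_progress';
    IF NOT FOUND THEN
        RAISE EXCEPTION 'job % is not checked out by %', p_id, p_worker;
    END IF;
END;
$$;
```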


Re: How to do faster DML

2024-02-13 Thread Peter J. Holzer
On 2024-02-13 01:53:25 +0530, veem v wrote:
> On Mon, 12 Feb 2024 at 03:40, Peter J. Holzer  wrote:
> 
> The fixed width types are those that the CPU can directly process:
> Integers with 16, 32 and 64 bits, floating point numbers with 32 and 64
> bits. The CPU can read and write them with a single memory access, it
> can do arithmetic with a single instruction, etc.
> 
> Number/Numeric are not native types on any CPU. To read them the CPU
> needs several memory accesses (probably one per byte unless you get
> really clever) and then it can't do any calculations with them
> directly, instead it has to run a subroutine which does operations on
> little chunks and then puts those chunks together again - basically the
> same as you do when you're doing long addition or multiplication on
> paper. So that's not very efficient.
> 
> 
> So it looks like fixed-length data types (like integer and float) should be
> the first choice when choosing the data type of an attribute wherever
> possible, as these are native types (like choosing "Integer/float" over
> "Numeric", "Char" over "Varchar" etc.).

Please do not conflate "char(n)" with native machine types like int or
float. These are very different things. A char(n) is a string of fixed but
arbitrary length. This is not something a CPU can process in a single
instruction. It has to go over it character by character.

There is almost never a reason to use char(n). Just use varchar(n) or in
the case of PostgreSQL just varchar or text.

> However I do see even in Oracle databases, we have Integer type too,

Not really. INTEGER is just an alias for NUMBER(38) in Oracle (see for
example
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/Data-Types.html).
It's not the same as an INTEGER in PostgreSQL.
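A quick illustration of the difference (a made-up example, not from the
thread): char(n) is blank-padded and ignores trailing spaces in comparisons,
while an integer is a fixed 4-byte machine value.

```sql
SELECT 'a'::char(5) = 'a  '::char(5)   AS trailing_spaces_ignored, -- true
       pg_column_size('a'::char(5))    AS char5_bytes,   -- padded to 5 chars plus header
       pg_column_size('a'::varchar(5)) AS varchar_bytes, -- only the 1 char plus header
       pg_column_size(42::integer)     AS int_bytes;     -- always 4
```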

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How should we design our tables and indexes

2024-02-13 Thread Peter J. Holzer
On 2024-02-12 11:46:35 -0500, Greg Sabino Mullane wrote:
> If PR_ID is a must in the join criteria between these tables table1 and
> table2 in all the queries, then is it advisable to have composite indexes
> like (pr_id, mid), (pr_id, cid) etc. rather than having indexes on the
> individual columns?
> 
> 
> No - individual indexes are better, and Postgres has no problem combining them
> when needed.

I'm a bit unsure if I should mention this as veem probably benefits more
from hard and simple rules than more nuanced answers, but that really
depends on the type of query.

For some kinds of queries a composite index can be dramatically faster.
While Postgres can combine indexes that means scanning both indexes and
combining the result, which may need a lot more disk I/O than scanning a
composite index. Indeed, in the cases where a composite index would be
useful but doesn't exist, PostgreSQL usually just chooses the best of
the single column indexes and ignores the rest.

That said, my rule of thumb is to create just single column indexes at
first and only create composite indexes if they are necessary.
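A minimal illustration of the trade-off described above; the table and
columns are made up, and whether the planner actually picks the composite
index depends on the data and statistics:

```sql
CREATE TABLE orders (customer_id int, order_date date, amount numeric);
CREATE INDEX ON orders (customer_id);              -- single-column
CREATE INDEX ON orders (order_date);               -- single-column
CREATE INDEX ON orders (customer_id, order_date);  -- composite

-- A query filtering on both columns can be answered by a single scan of the
-- composite index; with only the single-column indexes the planner must
-- either combine two bitmap scans or use just the more selective one.
EXPLAIN SELECT * FROM orders
 WHERE customer_id = 42 AND order_date = '2024-02-13';
```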

hp


-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Compressing large column by moving it to a unique table

2024-02-13 Thread Adrian Garcia Badaracco
I'm using PostgreSQL 15.5.

Here's a self-contained example. I included links to public pgMustard query
plans.

Gist link: https://gist.github.com/adriangb/05a01cca99a438de2a286945903c44f4

Also copied below for archiving:

```sql
CREATE OR REPLACE FUNCTION random_bytes(length integer)
RETURNS bytea AS $$
DECLARE
  bytes bytea := '';
  i integer := 0;
BEGIN
  -- generate 1024 bytes at a time using gen_random_bytes(1024)
  WHILE i < length LOOP
bytes := bytes || gen_random_bytes(least(1024, length - i));
i := i + 1024;
  END LOOP;

  RETURN bytes;
END;
$$ LANGUAGE plpgsql;

DROP TABLE IF EXISTS original, original_hashed, dictionary_table CASCADE;
DROP VIEW IF EXISTS joined_view;

CREATE TABLE original AS (
  WITH large_random_bytes AS (
SELECT n AS id, random_bytes(4096 + n) AS attributes
FROM generate_series(1, 1000) n
  )
  SELECT
-- An incrementing timestamp
'2024-01-01 00:00:00.00+00'::timestamptz + (n || ' seconds')::interval
AS start_timestamp,
-- Another relatively small column, let's just make it a random string
-- including `n` to make it unique
'random_string_' || (n % 100) AS event_name,
-- The attributes column
lrb.attributes AS attributes
FROM generate_series(0, 100) n
JOIN large_random_bytes lrb ON n % 1000 = lrb.id
);
CREATE INDEX ON original(start_timestamp);

CREATE TABLE dictionary_table AS (
  SELECT DISTINCT time_bucket('1 day', start_timestamp) AS
start_timestamp_range, attributes, md5(attributes) AS hash
  FROM original
);
CREATE INDEX ON dictionary_table (start_timestamp_range, hash);

CREATE TABLE original_hashed AS (
  SELECT
start_timestamp,
event_name,
md5(attributes) AS hash
  FROM original
);
CREATE INDEX ON original_hashed (start_timestamp, time_bucket('1 day',
start_timestamp), hash);


CREATE VIEW joined_view AS (
  SELECT
original_hashed.start_timestamp,
original_hashed.event_name,
dictionary_table.attributes
  FROM original_hashed
  LEFT JOIN dictionary_table ON (
time_bucket('1 day', original_hashed.start_timestamp) =
dictionary_table.start_timestamp_range
AND
original_hashed.hash = dictionary_table.hash
  )
);

-- Select all data
explain (analyze, buffers, verbose, settings, format json)
SELECT *
FROM joined_view
WHERE (
  start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1
day')
  AND
  start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2
day')
);
-- https://app.pgmustard.com/#/explore/f8b3b7d5-c236-4326-9a67-20fa4cdff4bc

explain (analyze, buffers, verbose, settings, format json)
SELECT *
FROM original
WHERE (
  start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1
day')
  AND
  start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2
day')
);
-- https://app.pgmustard.com/#/explore/bcad3767-b8aa-4a37-acc7-7667266f29be

-- A relatively selective aggregation query
explain (analyze, buffers, verbose, settings, format json)
SELECT count(*)
FROM joined_view
WHERE (
  get_byte(attributes, 4) < 100
  AND
  start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1
day')
  AND
  start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2
day')
);
-- https://app.pgmustard.com/#/explore/c5b68839-4c66-4b14-8de5-757bfc22fdb3

explain (analyze, buffers, verbose, settings, format json)
SELECT count(*)
FROM original
WHERE (
  get_byte(attributes, 4) < 100
  AND
  start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1
day')
  AND
  start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2
day')
);
-- https://app.pgmustard.com/#/explore/831d9c4c-e911-4caf-a6c3-eee5343e44d8

-- Select only some columns
explain (analyze, buffers, verbose, settings, format json)
SELECT start_timestamp
FROM joined_view
WHERE (
  start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1
day')
  AND
  start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2
day')
);
-- https://app.pgmustard.com/#/explore/5f617bfa-7816-4f8d-9fd7-24388c08978c

explain (analyze, buffers, verbose, settings, format json)
SELECT start_timestamp
FROM original
WHERE (
  start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1
day')
  AND
  start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2
day')
);
-- https://app.pgmustard.com/#/explore/f05de3e8-ca8d-4404-ad33-69f103a3b85a


-- A relatively selective query
explain (analyze, buffers, verbose, settings, format json)
SELECT *
FROM joined_view
WHERE (
  get_byte(attributes, 4) < 100
  AND
  start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1
day')
  AND
  start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2
day')
);
-- https://app.pgmustard.com/#/explore/a75b20bd-07d5-4402-a0d8-22419682307a

explain (analyze, buffers, verbose, settings, format json)
SELECT *
FROM original
WHERE (
  get_byte(attributes, 4) < 100
  AND
  start_timestamp > ('2024-01-01

Re: Compressing large column by moving it to a unique table

2024-02-13 Thread Ron Johnson
On original_hashed, I think I'd try moving start_timestamp into its own
index.
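One possible reading of that suggestion, reusing the expression already
indexed in the posted gist (a sketch, not tested against the posted plans):

```sql
-- Give start_timestamp its own index for the range predicate,
-- and keep a narrower index for the bucket/hash join keys
CREATE INDEX ON original_hashed (start_timestamp);
CREATE INDEX ON original_hashed (time_bucket('1 day', start_timestamp), hash);
```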

On Tue, Feb 13, 2024 at 12:02 PM Adrian Garcia Badaracco <
adr...@adriangb.com> wrote:

> I'm using PostgreSQL 15.5.
>
> Here's a self-contained example. I included links to public pgMustard
> query plans.
>
> Gist link:
> https://gist.github.com/adriangb/05a01cca99a438de2a286945903c44f4
>
> Also copied below for archiving:
>
> ```sql
> CREATE OR REPLACE FUNCTION random_bytes(length integer)
> RETURNS bytea AS $$
> DECLARE
>   bytes bytea := '';
>   i integer := 0;
> BEGIN
>   -- generate 1024 bytes at a time using gen_random_bytes(1024)
>   WHILE i < length LOOP
> bytes := bytes || gen_random_bytes(least(1024, length - i));
> i := i + 1024;
>   END LOOP;
>
>   RETURN bytes;
> END;
> $$ LANGUAGE plpgsql;
>
> DROP TABLE IF EXISTS original, original_hashed, dictionary_table CASCADE;
> DROP VIEW IF EXISTS joined_view;
>
> CREATE TABLE original AS (
>   WITH large_random_bytes AS (
> SELECT n AS id, random_bytes(4096 + n) AS attributes
> FROM generate_series(1, 1000) n
>   )
>   SELECT
> -- An incrementing timestamp
> '2024-01-01 00:00:00.00+00'::timestamptz + (n || ' seconds')::interval
> AS start_timestamp,
> -- Another relatively small column, let's just make it a random string
> -- including `n` to make it unique
> 'random_string_' || (n % 100) AS event_name,
> -- The attributes column
> lrb.attributes AS attributes
> FROM generate_series(0, 100) n
> JOIN large_random_bytes lrb ON n % 1000 = lrb.id
> );
> CREATE INDEX ON original(start_timestamp);
>
> CREATE TABLE dictionary_table AS (
>   SELECT DISTINCT time_bucket('1 day', start_timestamp) AS
> start_timestamp_range, attributes, md5(attributes) AS hash
>   FROM original
> );
> CREATE INDEX ON dictionary_table (start_timestamp_range, hash);
>
> CREATE TABLE original_hashed AS (
>   SELECT
> start_timestamp,
> event_name,
> md5(attributes) AS hash
>   FROM original
> );
> CREATE INDEX ON original_hashed (start_timestamp, time_bucket('1 day',
> start_timestamp), hash);
>
>
> CREATE VIEW joined_view AS (
>   SELECT
> original_hashed.start_timestamp,
> original_hashed.event_name,
> dictionary_table.attributes
>   FROM original_hashed
>   LEFT JOIN dictionary_table ON (
> time_bucket('1 day', original_hashed.start_timestamp) =
> dictionary_table.start_timestamp_range
> AND
> original_hashed.hash = dictionary_table.hash
>   )
> );
>
> -- Select all data
> explain (analyze, buffers, verbose, settings, format json)
> SELECT *
> FROM joined_view
> WHERE (
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> --
> https://app.pgmustard.com/#/explore/f8b3b7d5-c236-4326-9a67-20fa4cdff4bc
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT *
> FROM original
> WHERE (
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> --
> https://app.pgmustard.com/#/explore/bcad3767-b8aa-4a37-acc7-7667266f29be
>
> -- A relatively selective aggregation query
> explain (analyze, buffers, verbose, settings, format json)
> SELECT count(*)
> FROM joined_view
> WHERE (
>   get_byte(attributes, 4) < 100
>   AND
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> --
> https://app.pgmustard.com/#/explore/c5b68839-4c66-4b14-8de5-757bfc22fdb3
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT count(*)
> FROM original
> WHERE (
>   get_byte(attributes, 4) < 100
>   AND
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> --
> https://app.pgmustard.com/#/explore/831d9c4c-e911-4caf-a6c3-eee5343e44d8
>
> -- Select only some columns
> explain (analyze, buffers, verbose, settings, format json)
> SELECT start_timestamp
> FROM joined_view
> WHERE (
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> --
> https://app.pgmustard.com/#/explore/5f617bfa-7816-4f8d-9fd7-24388c08978c
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT start_timestamp
> FROM original
> WHERE (
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval
> '2 day')
> );
> --
> https://app.pgmustard.com/#/explore/f05de3e8-ca8d-4404-ad33-69f103a3b85a
>
>
> -- A relatively selective query
> explain (analyze, buffers, verbose, settings, format j

Re: FOR UPDATE SKIP LOCKED and get locked row/avoid updating other row(s)

2024-02-13 Thread Wiwwo Staff
On Tue, 13 Feb 2024 at 14:49, David G. Johnston 
wrote:

> On Tuesday, February 13, 2024, Wiwwo Staff  wrote:
>
>> Hi!
>> I am implementing a queue using PostgreSQL.
>> I am of course using "FOR UPDATE SKIP LOCKED".
>>
>> Is there any way I can tell PostgreSQL to only "operate" on the locked
>> row, and/or a way to reference it?
>>
>> Some explanations of what I mean:
>>
>>- I have a table with N rows
>>- I lock row X with a PG Function, in transaction 1
>>- I do something in my code, using transaction 2
>>- I update the status of row X with a PG Function, in transaction 1
>>
>> In last step, I update the row X status, passing my function the ID of
>> this row X.
>> But, nothing stops me from updating row Y in the queue table, for
>> whatever reason.
>>
>> My question again: any way to force a "Just update the row X you locked
>> before, and nothing else/more?"
>>
>>
> Holding locks and open transactions while doing queue job processing is
> generally not a good idea anyway so the lack of this ability doesn’t seem
> too problematic - but a cursor can probably get you close  You can abstract
> interactions with the queue table through a functional API to implement a
> server-enforced policy, removing the ability for clients to do arbitrary
> queries on the underlying tables.  The checkout function can tag who got
> the job and the completion function can validate the input arguments
> supplied by the client belong to a job they checked out.
>
> David J.
>
>
Hi David,
thanks for your answer.
You are absolutely right, it is not a good idea. The reasoning behind it is
1) everything is controlled from Python and 2) - more importantly - the DB
user in charge of consuming the queue has grants on just that part, and
absolutely nothing else. Hence the two sessions and the process separation.

Cosimo


Re: How should we design our tables and indexes

2024-02-13 Thread veem v
On Tue, 13 Feb 2024 at 20:59, Peter J. Holzer  wrote:

> For some kinds of queries a composite index can be dramatically faster.
> While Postgres can combine indexes that means scanning both indexes and
> combining the result, which may need a lot more disk I/O than scanning a
> composite index. Indeed, in the cases where a composite index would be
> useful but doesn't exist, PostgreSQL usually just chooses the best of
> the single column indexes and ignores the rest.
>
> That said, my rule of thumb is to create just single column indexes at
> first and only create composite indexes if they are necessary.
>
>
Thank you so much. As I understand it, the optimizer uses indexed columns as
"access criteria" and the rest of the predicate as "filter criteria" while
evaluating the query predicate. And if the majority of the rows are eliminated
in the filter step, that means adding the filter-criteria column to the index
could give us better performance.

So I was trying to understand, say in the below query with TABLE1 as the
driving table (ignoring column selectivity for a moment):

Can the optimizer only scan TABLE1 using the access criteria "TABLE1.MID
in ()" or "TABLE1.CID in ()", which would be served by two different
indexes, i.e. one index on column "MID" and the other on column "CID"?
OR
Can it use the other columns from the join conditions, like MID and PR_ID, as
access criteria, in which case composite indexes on the columns (CID, PR_ID)
and (MID, PR_ID) would provide better selectivity and faster access?

Similarly for TABLE2, a composite index on (ACN_NBR, PR_ID, MID) or just an
index on (ACN_NBR)?

select  ...
from   TABLE1
Left join schema1.TABLE2  on TABLE2.PR_ID = TABLE1.PR_ID  and
TABLE2.MID = TABLE1.MID
and TABLE2.processing_date=TABLE1.processing_date
where TABLE1.processing_date between '2023-04-20' and '2023-05-21'
-- Considering processing_date here as partition key.
   and TABLE2.ACN_NBR = ''
and ( TABLE1.MID in () OR TABLE1.CID in ())
order by   TABLE1.PR_TIME DESC
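For reference, the candidate indexes being compared in the question, written
out as DDL (purely illustrative; which variant wins depends on selectivity
and the partitioning scheme):

```sql
-- Composite variants
CREATE INDEX ON table1 (mid, pr_id);
CREATE INDEX ON table1 (cid, pr_id);
CREATE INDEX ON table2 (acn_nbr, pr_id, mid);

-- Single-column variants
CREATE INDEX ON table1 (mid);
CREATE INDEX ON table1 (cid);
CREATE INDEX ON table2 (acn_nbr);
```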


Re: How to do faster DML

2024-02-13 Thread veem v
On Tue, 13 Feb 2024 at 20:32, Peter J. Holzer  wrote:

> Please do not conflate "char(n)" with native machine types like int or
> float. These are very different things. A char(n) is string of fixed but
> arbitrary length. This is not something a CPU can process in a single
> instruction. It has to go over it character by character.
>
> There is almost never a reason to use char(n). Just use varchar(n) or in
> the case of PostgreSQL just varchar or text.
>
> > However I do see even in Oracle databases, we have Integer type too,
>
> Not really. INTEGER is just an alias for NUMBER(38) in Oracle (see for
> example
>
> https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/Data-Types.html
> ).
> It's not the same as an INTEGER in PostgreSQL.
>
>
Thank you so much for the clarification.

Basically, as I understood it, we can follow the steps below:

Step-1) Wherever possible, use smallint, integer, bigint or float data types
rather than numeric. This will give better performance.

Step-2) Put the most frequently queried columns first and the least
frequently queried columns towards the end of the row when creating the
table. This too is intended for better performance.

Step-3) Define the columns in descending order of typlen, as per the query
below (the "column tetris" trick). This is for better storage space
utilization.

SELECT a.attname, t.typname, t.typalign, t.typlen
  FROM pg_class c
  JOIN pg_attribute a ON (a.attrelid = c.oid)
  JOIN pg_type t ON (t.oid = a.atttypid)
 WHERE c.relname = 'user_order'
   AND a.attnum >= 0
 ORDER BY t.typlen DESC;
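As an illustration of Step-3, a hypothetical table laid out with the
widest-alignment columns first (the typlen values noted are the standard
ones for these types; variable-length columns go last):

```sql
CREATE TABLE user_order (
    order_id     bigint,        -- typlen 8
    order_ts     timestamptz,   -- typlen 8
    amount       integer,       -- typlen 4
    status_code  smallint,      -- typlen 2
    is_active    boolean,       -- typlen 1
    note         text           -- typlen -1 (varlena), ordered last
);
```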

One question here: if we have defined a column as the fixed-length data type
"integer" and we slowly notice the data keeps growing (in the case of a
sequence-generated PK column, which will keep increasing), and we now want to
alter the column to "bigint" - in such a scenario, will it append/pad
additional space to all the existing values that were already stored with the
integer type in the table? That would then increase the existing table's
storage. Please correct me if I'm wrong.

Regards
Veem


Re: How to do faster DML

2024-02-13 Thread Ron Johnson
On Tue, Feb 13, 2024 at 4:17 PM veem v  wrote:
[snip]

> One question here, if we have defined one column as a fixed length data
> type "integer" and slowly we noticed the length of data keeps increasing
> (in case of a sequence generated PK column which will keep increasing), and
> we want to alter the column to "bigint" now. In such scenario, will it
> append/pad the additional spaces to all the existing values which were
> already stored  with integer type initially in the table. And that would be
> then an increase to the existing table storage. Please correct me if I'm
> wrong.
>

ALTER TABLE foo ALTER COLUMN bar TYPE BIGINT; rewrites the whole table.  Do
yourself a favor, and start with BIGINT.
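A small way to see the rewrite happen (assuming a table named foo with an
integer column bar, as in the example above): the table's relfilenode changes
when the ALTER rewrites it.

```sql
SELECT relfilenode FROM pg_class WHERE relname = 'foo';
ALTER TABLE foo ALTER COLUMN bar TYPE bigint;
SELECT relfilenode FROM pg_class WHERE relname = 'foo';  -- different value after the rewrite
```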