Is there a bug in psql? (SELECT ''';)

2023-05-31 Thread Wen Yi
Hi team,
when I was learning PostgreSQL, I tried to store the ' character into the database,

but something unexpected happened.


postgres=# CREATE TABLE test (str varchar);
CREATE TABLE
postgres=# INSERT INTO test values (''');
postgres'# SELECT * FROM test;
postgres'# exit
Use control-D to quit.
postgres'# \q
Use control-D to quit.
postgres'# 
\q
bash-5.1$ 



And I did another test


postgres=# SELECT ''';
postgres'# exit
Use control-D to quit.
postgres'# \q
Use control-D to quit.
postgres'# 
\q
bash-5.1$ 



Is there a bug in psql?


Yours,
Wen Yi

AW: Is there a bug in psql? (SELECT ''';)

2023-05-31 Thread Marco Lechner
You have to escape the single quote:
https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS
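
For example (a minimal sketch of the two options described on that page,
reusing the table from your mail):

INSERT INTO test values ('''');   -- double the quote inside the literal
INSERT INTO test values ($$'$$);  -- or use dollar quoting
SELECT '''';                      -- likewise for a bare SELECT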

From: Wen Yi <896634...@qq.com>
Sent: Wednesday, May 31, 2023 09:17
To: pgsql-general 
Subject: Is there a bug in psql? (SELECT ''';)

Hi team,
when I learn the postgres, I try to store the ' into the database,
but something unexpected happend.

postgres=# CREATE TABLE test (str varchar);
CREATE TABLE
postgres=# INSERT INTO test values (''');
postgres'# SELECT * FROM test;
postgres'# exit
Use control-D to quit.
postgres'# \q
Use control-D to quit.
postgres'#
\q
bash-5.1$

And I do another test

postgres=# SELECT ''';
postgres'# exit
Use control-D to quit.
postgres'# \q
Use control-D to quit.
postgres'#
\q
bash-5.1$

Is there a bug in psql?

Yours,
Wen Yi



Re: Pg 16: will pg_dump & pg_restore be faster?

2023-05-31 Thread Bruce Momjian
On Tue, May 30, 2023 at 10:28:58PM -0400, Bruce Momjian wrote:
> On Wed, May 31, 2023 at 02:18:25PM +1200, David Rowley wrote:
> > On Wed, 31 May 2023 at 14:11, Bruce Momjian  wrote:
> > >
> > > On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote:
> > > > "Relation extensions have been improved allowing faster bulk loading
> > > > of data using COPY. These improvements are more significant when
> > > > multiple processes are concurrently loading data into the same table."
> > >
> > > The release notes don't normally get into details on the magnitude of
> > > the improvement in various circumstances.
> > 
> > Sorry, I meant the release announcement rather than the release notes here.
> 
> Oh, yeah, that gets into more details, sure.
> 
> There is also the major features list at the top of the release notes
> --- that needs adjustment based on release text I have updated recently,
> but I don't feel I control that list.

Oh, I now remember I added that and reworded it when I did, so I have
now adjusted it to match the new OUTER parallelism text.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Is there a bug in psql? (SELECT ''';)

2023-05-31 Thread hubert depesz lubaczewski
On Wed, May 31, 2023 at 03:17:14PM +0800, Wen Yi wrote:
> Hi team,
> when I learn the postgres, I try to store the ' into the database, 
> 
> but something unexpected happend.
> 
> 
> postgres=# CREATE TABLE test (str varchar);
> CREATE TABLE
> postgres=# INSERT INTO test values (''');
> postgres'# SELECT * FROM test;

Here, in the prompt, you can see that it's waiting for you to end your
'-quoted string.

You seem to expect
INSERT INTO test values (''');
to do something, which it won't, as it's not a full query.

What do you think this should be doing?

Best regards,

depesz





Hash Index on Partitioned Table

2023-05-31 Thread peter.boris...@kartographia.com

Dear PostgreSQL Community,
 
I have a rather large database with ~250 billion records in a partitioned 
table. The database has been running and ingesting data continuously for about 
3 years.
 
I had a "regular" BTree index on one of the fields (a unique bigint column) but 
it was getting too big for the disk it was on. The index was consuming 6.4 TB 
of disk space.
 
I created a new disk with double the size, dropped the original index, and 
started to generate a new index.
 
After doing some research I decided to try to create a hash index instead of a 
BTree. For my purposes, the index is only used to find specific numbers ("=" 
and "IN" queries). From what I read, the hash index should run a little faster 
than btree for my use case and should use less disk space.
 
After 115 hours, the hash index is still generating and is using significantly 
more disk space than the original BTree index (8.4 TB vs 6.4 TB). I don't know 
how to check the status of the index creation task so I can't really estimate 
how much longer it will take or how much disk space it will consume.
 
Questions:
 
(1) Why is the hash index consuming more disk space than the btree index? Is it 
because the hash of a bigint value is larger than storing the bigint itself in 
the btree?
(2) Are there any known issues having a hash index on this many records?
(3) Are there any known issues having a hash index on partitioned tables?
(4) Is there any way to estimate when the index process will complete?
 
Server info:
 - PostgreSQL 13
 - Ubuntu 20.04.6 LTS
 - 64 cores (only 1 is ever used during index creation)
 - Memory usage is steady at 58GB/188GB
 - All disks are high speed NVMe drives
 - 1,686 partitions in the partitioned table
 
Thanks in advance,
Peter

Re: Pg 16: will pg_dump & pg_restore be faster?

2023-05-31 Thread Andres Freund
Hi,

On 2023-05-30 21:13:08 -0400, Bruce Momjian wrote:
> On Wed, May 31, 2023 at 09:14:20AM +1200, David Rowley wrote:
> > On Wed, 31 May 2023 at 08:54, Ron  wrote:
> > > https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/
> > > says "PostgreSQL 16 can also improve the performance of concurrent bulk
> > > loading of data using COPY up to 300%."
> > >
> > > Since pg_dump & pg_restore use COPY (or something very similar), will the
> > > speed increase translate to higher speeds for those utilities?
> > 
> > I think the improvements to relation extension only help when multiple
> > backends need to extend the relation at the same time.  pg_restore can
> > have multiple workers, but the tasks that each worker performs are
> > only divided as far as an entire table, i.e. 2 workers will never be
> > working on the same table at the same time. So there is no concurrency
> > in terms of 2 or more workers working on loading data into the same
> > table at the same time.
> > 
> > It might be an interesting project now that we have TidRange scans, to
> > have pg_dump split larger tables into chunks so that they can be
> > restored in parallel.
> 
> Uh, the release notes say:
> 
>   Allow more efficient addition of heap and index pages (Andres Freund)
> 
> There is no mention of concurrency being a requirement.  Is it wrong?  I
> think there was a question of whether you had to add _multiple_ blocks
> to get a benefit, not if concurrency was needed.  This email about the
> release notes didn't mention the concurrent requirement:

>   
> https://www.postgresql.org/message-id/20230521171341.jjxykfsefsek4kzj%40awork3.anarazel.de

There are multiple improvements that work together to get the overall gain.
One part of that is filesystem interactions, another is holding
the relation extension lock for a *much* shorter time. The former helps
regardless of concurrency, the latter only with concurrency.
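
To illustrate the concurrent case (just a sketch; the table, file names and
connection variable below are made up, not from this thread), think of several
sessions COPYing into the same table at once:

for f in chunk_1.csv chunk_2.csv chunk_3.csv chunk_4.csv; do
    # each psql runs in its own backend, so the COPYs extend the
    # same relation concurrently
    psql "$DB_URL" -c "\copy big_table from '$f' with (format csv)" &
done
wait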

Regards,

Andres




Re: Hash Index on Partitioned Table

2023-05-31 Thread Tom Lane
"peter.boris...@kartographia.com"  writes:
> I have a rather large database with ~250 billion records in a partitioned 
> table. The database has been running and ingesting data continuously for 
> about 3 years.
 
> I had a "regular" BTree index on one of the fields (a unique bigint column) 
> but it was getting too big for the disk it was on. The index was consuming 
> 6.4 TB of disk space.

That's ... really about par for the course.  Each row requires an 8-byte
index entry, plus 12 bytes index overhead.  If I'm doing the math right
then the index is physically about 78% full which is typical to good for
a b-tree.  Reindexing would remove most of the extra space, but only
temporarily.
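
(Spelling that out with the numbers above, as a quick sanity check:)

-- 250e9 rows * (8-byte entry + 12 bytes overhead) = 5.0e12 bytes = 5.0 TB
-- 5.0 TB of entries / 6.4 TB on disk  =>  ~78% physically full
SELECT round(100 * (250e9 * (8 + 12)) / 6.4e12) AS pct_full;   -- 78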

> After doing some research I decided to try to create a hash index instead of 
> a BTree. For my purposes, the index is only used to find specific numbers 
> ("=" and "IN" queries). From what I read, the hash index should run a little 
> faster than btree for my use case and should use less disk space.

I'm skeptical.  The thing to bear in mind is that btree is the mainstream
use-case and has been refined and optimized far more than the hash index
logic.

> (1) Why is the hash index consuming more disk space than the btree index? Is 
> it because the hash of the bigint values larger than the storing the bigints 
> in the btree?

From memory, the index entries will be the same size in this case,
but hash might have more wasted space.

> (4) Is there any way to estimate when the index process will complete?

An index on a partitioned table isn't a single object, it's one index per
partition.  So you should be able to look at how many partitions have
indexes so far.  You might have to drill down to the point of counting how
many files in the database's directory, if the individual indexes aren't
showing up as committed catalog entries yet.
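
On PG 12 and later (so the PG 13 mentioned above qualifies) a sketch of doing
both checks from SQL; the progress view carries per-partition counters, while
the catalog count only reflects already-committed child indexes, and the index
name below is hypothetical:

-- progress of the running CREATE INDEX, including per-partition counters
SELECT pid, relid::regclass, phase, partitions_total, partitions_done
FROM pg_stat_progress_create_index;

-- child indexes already attached to the partitioned index (committed only)
SELECT count(*)
FROM pg_inherits
WHERE inhparent = 'my_partitioned_hash_idx'::regclass;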

regards, tom lane




Help needed to understand query planner regression with incremental sort

2023-05-31 Thread Henrik Peinar (nodeSWAT.com)
*Quick description: *After upgrading our Aurora PostgreSQL engine from v11
to v15, one of our often-run SQL queries started taking tens of seconds instead
of running in under 100ms. EXPLAIN ANALYZE showed that the query planner had
switched to using an incremental sort. Running ANALYZE on the table fixed the
issue temporarily, but it came back after a few days.

I'm looking to dig a bit deeper so I can submit a bug report to the
PostgreSQL team and maybe something can be done to fix this, as this
definitely feels like a bug (the query running times differ by 100x),
but I need a bit of help: I'm new to this kind of deep PostgreSQL analysis and
I'm struggling to find resources on how to proceed in such a case. And maybe
it's an issue with my own setup or database configuration instead (i.e. me
just being stupid).

I know that I can solve the issue for my own environments by turning
incremental sort off in the DB parameters, but I feel this might be worth
putting a bit of time into, to find the root cause and to help me
understand the planner a bit better.
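
(Concretely, that parameter is enable_incremental_sort; a minimal sketch of
turning it off, with a hypothetical database name:)

SET enable_incremental_sort = off;                       -- current session only
ALTER DATABASE mydb SET enable_incremental_sort = off;   -- hypothetical database name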


*More detailed description:*
Note: I don't have reproduction steps at this time; this is one of the
reasons I'm asking for help figuring out what I could try in order to start
reproducing this in a local setup. The info below just gives as
much detail as I can to help understand the setup I have.

The query in question is rather simple, at least as far as the SQL itself goes
(attached: original_query.sql):


select * from "products"
inner join "shops" on "shops"."id" = "products"."shop_id"
where "products"."status" = 'published'
  and products.desc_tsv @@ to_tsquery('simple', 'nike:*')
order by "products"."pinned_at" ASC, "products"."listed_at" DESC
limit 61;

With normal running conditions this query produces the following query
plan: https://explain.dalibo.com/plan/af8ch7a59ch6459d (attached:
good_planner_path.txt)

With incremental sorting path, it produces the following
https://explain.dalibo.com/plan/428564152aa3ba37 (attached:
incremental_sort_planner_path.txt)

I've tried to include as slim a table / index setup as possible, but the
tables are actually bigger and there are a lot more indexes present; I
don't know how other indexes and/or columns might affect the planner path
in this case.
https://dbdiagram.io/d/64771e48722eb7749422715e (attached: schema_setup.sql)

In the production DB the products table has around 5,452,840 rows.
According to information_schema, the products table is
around 8.1 GB.

If the incremental sort query plan starts happening, running ANALYZE
products; fixes the issue and the planner switches back to the original
plan.


*What am I asking for?*
Any ideas / help / queries that would get me closer to understanding why
the planner chooses the much slower plan in this case, or ideas on how to start
reproducing this locally, as it seems to happen intermittently in the
production environment and I'm unsure whether I can replicate that type of load
locally. Maybe someone will spot obvious mistakes / issues with the table
/ index / query setup right away that I've missed.


Any help appreciated,
Henrik.
QUERY PLAN
Limit  (cost=2285.01..5180.32 rows=61 width=2158) (actual time=5351.382..5351.399 rows=61 loops=1)
  ->  Incremental Sort  (cost=2285.01..1242947.06 rows=26139 width=2158) (actual time=5351.381..5351.393 rows=61 loops=1)
        Sort Key: products.pinned_at, products.listed_at DESC
        Presorted Key: products.pinned_at
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 266kB  Peak Memory: 266kB
        Pre-sorted Groups: 1  Sort Method: top-N heapsort  Average Memory: 362kB  Peak Memory: 362kB
        ->  Nested Loop  (cost=0.86..1241366.91 rows=26139 width=2158) (actual time=5.292..5308.158 rows=18411 loops=1)
              ->  Index Scan Backward using products_pinned_at_ix on products  (cost=0.43..1172249.47 rows=26139 width=1460) (actual time=5.263..5203.180 rows=18411 loops=1)
                    Filter: ((desc_tsv @@ '''nike'':*'::tsquery) AND (status = 'published'::prod_status))
                    Rows Removed by Filter: 5415895
              ->  Memoize  (cost=0.43..3.64 rows=1 width=698) (actual time=0.004..0.004 rows=1 loops=18411)
                    Cache Key: products.shop_id
                    Cache Mode: logical
                    Hits: 10434  Misses: 7977  Evictions: 505  Overflows: 0  Memory Usage: 8194kB
                    ->  Index Scan using shops_pkey on shops  (cost=0.42..3.63 rows=1 width=698) (actual time=0.005..0.005 rows=1 loops=7977)
                          Index Cond: (id = products.shop_id)
Planning Time: 0.813 ms
Execution Time: 5353.360 ms


original_query.sql
Description: Binary data
QUERY PLAN
Limit  (cost=0.85..9258.41 rows=61 width=2162) (actual time=2.352..33.389 rows=61 loops=1)
  ->  Nested Loop  (cost=0.85..4025217.24 rows=26523 width=2162) (actual time=2.351..33.376 rows=61 loops=1)
        ->  Index Scan using products_pinned_at_listed_at_ix o

Re: Is there a bug in psql? (SELECT ''';)

2023-05-31 Thread Adrian Klaver

On 5/31/23 00:17, Wen Yi wrote:

Hi team,
when I learn the postgres, I try to store the ' into the database,
but something unexpected happend.

postgres=# CREATE TABLE test (str varchar);
CREATE TABLE
postgres=# INSERT INTO test values (''');
postgres'# SELECT * FROM test;
postgres'# exit
Use control-D to quit.
postgres'# \q
Use control-D to quit.
postgres'#
\q
bash-5.1$

And I do another test

postgres=# SELECT ''';
postgres'# exit
Use control-D to quit.
postgres'# \q
Use control-D to quit.
postgres'#
\q
bash-5.1$

Is there a bug in psql?


See

https://www.postgresql.org/docs/current/sql-syntax-lexical.html

4.1.2.4. Dollar-Quoted String Constants

"While the standard syntax for specifying string constants is usually 
convenient, it can be difficult to understand when the desired string 
contains many single quotes or backslashes, since each of those must be 
doubled. To allow more readable queries in such situations, PostgreSQL 
provides another way, called “dollar quoting”, to write string 
constants.  ... "


Read the entire section for the full story.

As example:

CREATE TABLE test (str varchar);

The hard way:

INSERT INTO test values ('''');
INSERT 0 1

The easier way:

INSERT INTO test values ($$'$$);
INSERT 0 1

select * from test;
 str
-----
 '
 '





Yours,
Wen Yi



--
Adrian Klaver
adrian.kla...@aklaver.com





Re: speed up full table scan using psql

2023-05-31 Thread Adrian Klaver

On 5/30/23 21:25, Lian Jiang wrote:

> hi,
>
> I am using psql to periodically dump the postgres tables into json files
> which are imported into snowflake. For large tables (e.g. 70M rows), it

The command you are using is?

Postgres version?

> takes hours for psql to complete. Using spark to read the postgres table
> seems not to work as the postgres read only replication is the
> bottleneck so spark cluster never uses >1 worker node and the working
> node timeout or out of memory.
>
> Will vertical scaling the postgres db speed up psql? Or any thread
> related parameter of psql can help? Thanks for any hints.
>
> Regards
> Lian


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Help needed to understand query planner regression with incremental sort

2023-05-31 Thread Tom Lane
"Henrik Peinar (nodeSWAT.com)"  writes:
> *Quick description: *After upgrading our Aurora PostgreSQL engine from v11
> to v15, one of our often run SQL's started taking tens of seconds instead
> of running sub 100ms. Explain analyze showed that the query planner had
> switched to using incremental sort. Running ANALYZE on the table fixed the
> issue temporarily, but it came back after few days.

Hmm.  I think it's quite accidental that you get one plan over the other,
because it looks like the key difference is something the planner doesn't
account for.  In the fast case you have

->  Index Scan using products_pinned_at_listed_at_ix on products  (cost=0.43..3929423.12 rows=26523 width=1463) (actual time=2.325..32.872 rows=61 loops=1)
      Filter: ((desc_tsv @@ '''nike'':*'::tsquery) AND (status = 'published'::prod_status))
      Rows Removed by Filter: 3376

The index is only being used to produce ordered output here: the filter
condition isn't related to the index.  And what we see is that the
query is fast because the desired rows are found in the first 3376+61
rows visited in this direction.  Meanwhile in the slow case you have

  ->  Index Scan Backward using products_pinned_at_ix on products  (cost=0.43..1172249.47 rows=26139 width=1460) (actual time=5.263..5203.180 rows=18411 loops=1)
        Filter: ((desc_tsv @@ '''nike'':*'::tsquery) AND (status = 'published'::prod_status))
        Rows Removed by Filter: 5415895

Again, the filter condition isn't exploiting the index, we're just using
the index to (partially) satisfy the ORDER BY.  This one takes a long time
because it has to trawl through 5415895+61 rows before reaching the LIMIT.

So AFAICS, the runtime differential is mostly/entirely because the rows
satisfying the filter condition are located near one end of the range of
pinned_at.  That is not a correlation that the planner knows anything
about, so it's unable to see that these two ways of scanning the table
will have significantly different costs.  Moreover, I think you'd be
mistaken to draw any great conclusions from this specific example,
because with some other search term(s) the results might be totally
different due to the required rows not falling in the same place.

What I'd think about if this type of query is important is to set up
an index that can actually be used to satisfy the filter condition,
so that you're not forcing it into "scan the whole table till you
find the rows you want".  It looks like you already have such an
index, ie a GIN index on the desc_tsv column, although I don't trust
that your schema attachment is actually accurate because if it is
then you have a bunch of duplicative indexes.  You might try
dropping the other indexes to see if you can coerce the planner
into using that one, and then seeing what the cost estimate is.
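
For concreteness, the kind of index I mean would look something like this
(index names are illustrative; per your schema attachment something similar
may already exist):

CREATE INDEX products_desc_tsv_gin ON products USING gin (desc_tsv);

-- or, since this query always filters on status, a partial variant
CREATE INDEX products_published_desc_tsv_gin ON products
    USING gin (desc_tsv)
    WHERE status = 'published';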

regards, tom lane




Re: Hash Index on Partitioned Table

2023-05-31 Thread peter.boris...@kartographia.com

Hi Tom,
Thanks so much for your quick response. As luck would have it, the index 
FINALLY finished about an hour ago. For a size comparison:
 
BTree: 6,433 GB
Hash: 8,647 GB
 
Although I don't have a proper benchmark to compare performance, I can say the 
hash is working as well as, if not faster than, the BTree for my use case (web 
application).
 
I guess I was getting a little nervous waiting for the index to complete and 
seeing such a huge difference in file size but I'm ok now :-)
 
Thanks again,
Peter
 
 
-Original Message-
From: "Tom Lane" 
Sent: Wednesday, May 31, 2023 10:07am
To: "peter.boris...@kartographia.com" 
Cc: pgsql-general@lists.postgresql.org
Subject: Re: Hash Index on Partitioned Table



"peter.boris...@kartographia.com"  writes:
> I have a rather large database with ~250 billion records in a partitioned 
> table. The database has been running and ingesting data continuously for 
> about 3 years.

> I had a "regular" BTree index on one of the fields (a unique bigint column) 
> but it was getting too big for the disk it was on. The index was consuming 
> 6.4 TB of disk space.

That's ... really about par for the course. Each row requires an 8-byte
index entry, plus 12 bytes index overhead. If I'm doing the math right
then the index is physically about 78% full which is typical to good for
a b-tree. Reindexing would remove most of the extra space, but only
temporarily.

> After doing some research I decided to try to create a hash index instead of 
> a BTree. For my purposes, the index is only used to find specific numbers 
> ("=" and "IN" queries). From what I read, the hash index should run a little 
> faster than btree for my use case and should use less disk space.

I'm skeptical. The thing to bear in mind is that btree is the mainstream
use-case and has been refined and optimized far more than the hash index
logic.

> (1) Why is the hash index consuming more disk space than the btree index? Is 
> it because the hash of the bigint values larger than the storing the bigints 
> in the btree?

From memory, the index entries will be the same size in this case,
but hash might have more wasted space.

> (4) Is there any way to estimate when the index process will complete?

An index on a partitioned table isn't a single object, it's one index per
partition. So you should be able to look at how many partitions have
indexes so far. You might have to drill down to the point of counting how
many files in the database's directory, if the individual indexes aren't
showing up as committed catalog entries yet.

 regards, tom lane

Re: Question - Does PostgreSQL have an Evaluation Assurance Level?

2023-05-31 Thread Laurenz Albe
On Tue, 2023-05-30 at 13:48 +, Mayer, Nicholas J wrote:
> My name is Nick Mayer, and I had a question concerning PostgreSQL’s EAL. Has 
> PostgreSQL
> been put through any audit/security testing, and does it have an EAL? If so, 
> would I be
> able to get this information? I would appreciate any assistance you are able 
> to provide for this.

I have never heard of that, but I'll reply on the -general list, where the 
question is
more likely to reach the people who know.

Yours,
Laurenz Albe




Re: EXTERNAL: Re: Question - Does PostgreSQL have an Evaluation Assurance Level?

2023-05-31 Thread Laurenz Albe
On Wed, 2023-05-31 at 19:51 +, Mayer, Nicholas J wrote:
> We found out that while PostgreSQL does not have an EAL, 'Crunchy Data' does 
> have an EAL of 2.

I see.  I guess you are aware that a closed source fork of PostgreSQL is 
probably no more
secure than the original.  But this is more about ticking off checkboxes, right?

Yours,
Laurenz Albe




Re: speed up full table scan using psql

2023-05-31 Thread Lian Jiang
The command is: psql $db_url -c "copy (select row_to_json(x_tmp_uniq) from
public.mytable x_tmp_uniq) to stdout"
postgres version:  14.7
Does this mean COPY and java CopyManager may not help since my psql command
already uses copy?

Regarding pg_dump, it does not support json format which means extra work
is needed to convert the supported format to jsonl (or parquet) so that
they can be imported into snowflake. Still exploring but want to call it
out early. Maybe 'custom' format can be parquet?


Thanks
Lian


Re: speed up full table scan using psql

2023-05-31 Thread Adrian Klaver

On 5/31/23 13:57, Lian Jiang wrote:
> The command is: psql $db_url -c "copy (select row_to_json(x_tmp_uniq)
> from public.mytable x_tmp_uniq) to stdout"
>
> postgres version:  14.7
> Does this mean COPY and java CopyManager may not help since my psql
> command already uses copy?


I don't think the issue is COPY itself but row_to_json(x_tmp_uniq).

This:

https://towardsdatascience.com/spark-essentials-how-to-read-and-write-data-with-pyspark-5c45e29227cd

indicates Spark can use CSV as an input source.

Given that I would just COPY the data out as CSV.
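
Something along these lines (a sketch reusing the command shape from your
mail; the output file name is made up):

psql $db_url -c "copy (select * from public.mytable) to stdout with (format csv, header)" > mytable.csv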



> Regarding pg_dump, it does not support json format which means extra
> work is needed to convert the supported format to jsonl (or parquet) so
> that they can be imported into snowflake. Still exploring but want to
> call it out early. Maybe 'custom' format can be parquet?
>
> Thanks
> Lian


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: speed up full table scan using psql

2023-05-31 Thread Adrian Klaver

On 5/31/23 13:57, Lian Jiang wrote:
> The command is: psql $db_url -c "copy (select row_to_json(x_tmp_uniq)
> from public.mytable x_tmp_uniq) to stdout"
>
> postgres version:  14.7
> Does this mean COPY and java CopyManager may not help since my psql
> command already uses copy?
>
> Regarding pg_dump, it does not support json format which means extra
> work is needed to convert the supported format to jsonl (or parquet) so
> that they can be imported into snowflake. Still exploring but want to
> call it out early. Maybe 'custom' format can be parquet?


Oops I read this:

'...Using spark to read the postgres table...'

and missed that you are trying to load into Snowflake.

It seems Snowflake supports CSV as well:

https://docs.snowflake.com/en/user-guide/data-load-prepare

So the previous advice should still hold.





> Thanks
> Lian


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: speed up full table scan using psql

2023-05-31 Thread Thorsten Glaser
On Wed, 31 May 2023, Adrian Klaver wrote:

> Given that I would just COPY the data out as CSV.

I recently did something similar. I found the JSON functions not quite
satisfying and the extra spaces redundant, but it turns out that, for
a numerical table, exporting as CSV, loading that via AJAX then (on
xhr.responseText) substituting newlines with '],[' and prepending '[['
and appending ']]' was enough to let JSON.parse eat it.

With strings this is more complex ofc (though partial use of JSON
functions, e.g. to convert strings to JSONString already, might help).
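
(On the SQL side, the partial use of JSON functions can be as small as
pre-encoding the string columns before the CSV export; a sketch with
made-up table/column names:)

-- to_json(text) emits an already-quoted, escaped JSON string token
COPY (SELECT id, to_json(name), price FROM mytable) TO STDOUT WITH (FORMAT csv);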

bye,
//mirabilos
-- 
Infrastrukturexperte • tarent solutions GmbH
Am Dickobskreuz 10, D-53121 Bonn • http://www.tarent.de/
Telephon +49 228 54881-393 • Fax: +49 228 54881-235
HRB AG Bonn 5168 • USt-ID (VAT): DE122264941
Geschäftsführer: Dr. Stefan Barth, Kai Ebenrett, Boris Esser, Alexander Steeg


/⁀\ The UTF-8 Ribbon
╲ ╱ Campaign against  Mit dem tarent-Newsletter nichts mehr verpassen:
 ╳  HTML eMail! Also, https://www.tarent.de/newsletter
╱ ╲ header encryption!





Re: speed up full table scan using psql

2023-05-31 Thread Adrian Klaver

On 5/31/23 13:57, Lian Jiang wrote:
> The command is: psql $db_url -c "copy (select row_to_json(x_tmp_uniq)
> from public.mytable x_tmp_uniq) to stdout"


What is taking the stdout and what is it doing?


> postgres version:  14.7
> Does this mean COPY and java CopyManager may not help since my psql
> command already uses copy?
>
> Regarding pg_dump, it does not support json format which means extra
> work is needed to convert the supported format to jsonl (or parquet) so
> that they can be imported into snowflake. Still exploring but want to
> call it out early. Maybe 'custom' format can be parquet?
>
> Thanks
> Lian


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: speed up full table scan using psql

2023-05-31 Thread Lian Jiang
The whole command is:

psql %(pg_uri)s -c %(sql)s | %(sed)s | %(pv)s | %(split)s) 2>&1 | %(tr)s

where:
sql is "copy (select row_to_json(x_tmp_uniq) from public.mytable
x_tmp_uniq) to stdout"
sed, pv, split, tr together format and split the stdout into jsonl files.

Hope this helps.


On Wed, May 31, 2023 at 9:16 PM Adrian Klaver 
wrote:

> On 5/31/23 13:57, Lian Jiang wrote:
> > The command is: psql $db_url -c "copy (select row_to_json(x_tmp_uniq)
> > from public.mytable x_tmp_uniq) to stdout"
>
> What is taking the stdout and what it is it doing?
>
> > postgres version:  14.7
> > Does this mean COPY and java CopyManager may not help since my psql
> > command already uses copy?
> >
> > Regarding pg_dump, it does not support json format which means extra
> > work is needed to convert the supported format to jsonl (or parquet) so
> > that they can be imported into snowflake. Still exploring but want to
> > call it out early. Maybe 'custom' format can be parquet?
> >
> >
> > Thanks
> > Lian
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>
>
