create a new GIN index for my own type

2021-10-04 Thread huangning...@yahoo.com
Hi:  I created a new data type, and then I wanted to create a GIN index for it, 
but when I created the index, the program would crash. The version of 
PostgreSQL is 9.6.
The following is part of the code; I also referred to the code of intarray.

```sql
CREATE OR REPLACE FUNCTION geomgrid_in(cstring) RETURNS geomgrid
AS '$libdir/module-1.0', 'geomgrid_in' LANGUAGE 'c' NOT FENCED IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION geomgrid_out(geomgrid) RETURNS cstring
AS '$libdir/module-1.0', 'geomgrid_out' LANGUAGE 'c' NOT FENCED IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION geomgrid_recv(internal) RETURNS geomgrid
AS '$libdir/module-1.0', 'geomgrid_recv' LANGUAGE 'c' NOT FENCED IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION geomgrid_send(geomgrid) RETURNS bytea
AS '$libdir/module-1.0', 'geomgrid_send' LANGUAGE 'c' NOT FENCED IMMUTABLE STRICT;

CREATE TYPE geomgrid (
    internallength = 8,
    input = geomgrid_in,
    output = geomgrid_out,
    send = geomgrid_send,
    receive = geomgrid_recv,
    alignment = double,
    PASSEDBYVALUE = true,
    storage = plain
);

CREATE OPERATOR CLASS gin_grid_ops DEFAULT FOR TYPE _geomgrid USING gin AS
    OPERATOR 3 &&,
    OPERATOR 6 = (anyarray, anyarray),
    OPERATOR 7 @>,
    OPERATOR 8 <@,
    FUNCTION 1 grid_cmp(geomgrid, geomgrid),
    FUNCTION 2 gridarray_extract(anyarray, internal, internal),
    FUNCTION 3 gridarray_queryextract(geomgrid, internal, int2, internal, internal, internal, internal),
```
```c
Datum geomgrid_in(PG_FUNCTION_ARGS) {
    char *input = PG_GETARG_CSTRING(0);
    int len = strlen(input);
    if (len != 16)
        PG_RETURN_NULL();
    char *data = palloc(len / 2);
    for (int i = 0, j = 7; i < len; i += 2, j--)
        data[j] = Char2Hex(input + i);
    int64_t *return_data = (int64_t *) data;
    PG_RETURN_INT64(*return_data);
}

Datum geomgrid_out(PG_FUNCTION_ARGS) {
    int64_t out_data = PG_GETARG_INT64(0);
    char *buf_data = (char *) &out_data;
    unsigned char dst[2] = {0};
    char *result = palloc(16 + 1);
    memset(result, 0, 16 + 1);
    for (int i = 7, j = 0; i >= 0; i--, j++) {
        Hex2Char((unsigned char) buf_data[i], dst);
        result[j * 2 + 1] = dst[0];
        result[j * 2] = dst[1];
    }
    PG_RETURN_CSTRING(result);
}
```
```c
Datum gridarray_extract(PG_FUNCTION_ARGS) {
    ArrayType *array = PG_GETARG_ARRAYTYPE_P_COPY(0);
    int size = VARSIZE(array);
    int32 *nkeys = (int32 *) PG_GETARG_POINTER(1);
    bool **nullFlags = (bool **) PG_GETARG_POINTER(2);

    if (array == NULL || nkeys == NULL || nullFlags == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("Invalid arguments for function gridarray_extract")));

    int16 elmlen;  bool elmbyval = false;  char elmalign;
    Datum *elems = NULL;  bool *nulls = NULL;  int nelems;

    get_typlenbyvalalign(ARR_ELEMTYPE(array), &elmlen, &elmbyval, &elmalign);
    deconstruct_array(array, ARR_ELEMTYPE(array), elmlen, elmbyval, elmalign,
                      &elems, &nulls, &nelems);

    *nkeys = nelems;
    *nullFlags = nulls;
    PG_RETURN_POINTER(elems);
}
```
 Best Regards!

Problem in pg_basebackup of 1.7 TB database while replicating in PostgreSQL 9.4.( Master and slave both are in postgreSQL 9.4)

2021-10-04 Thread Amal Chakravarty
Hi all. I am working with a database of 1.7 TB in size which is in PostgreSQL
9.4. Kindly suggest how to take a pg_basebackup of such a huge database while
replicating from master to slave.

Regards,
Amal Chakravarty.


Why would Postgres 11 suddenly ignore the recovery.conf file?

2021-10-04 Thread Dennis Jacobfeuerborn

Hi,
I just stopped and restarted a Postgres 11 docker container which so far 
has happily streamed the WAL from its primary but after the restart it 
now seems to completely ignore the recovery.conf file and just outputs this:


 2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv4 address 
"0.0.0.0", port 5432
2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv6 address "::", 
port 5432
2021-10-04 10:14:19.123 UTC [1] LOG:  listening on Unix socket 
"/var/run/postgresql/.s.PGSQL.5432"
2021-10-04 10:14:19.304 UTC [21] LOG:  database system was shut down in 
recovery at 2021-10-04 10:14:11 UTC

2021-10-04 10:14:19.305 UTC [21] LOG:  entering standby mode
2021-10-04 10:14:19.313 UTC [21] LOG:  could not signal for checkpoint: 
checkpointer is not running

2021-10-04 10:14:19.338 UTC [21] LOG:  redo starts at 2F01/B9044C30
2021-10-04 10:15:35.889 UTC [21] LOG:  consistent recovery state reached 
at 2F02/A2E1A108
2021-10-04 10:15:35.893 UTC [1] LOG:  database system is ready to accept 
read only connections


Notice that there is apparently no attempt being made to connect to the 
primary any more. The recovery.conf file is still present and correct so 
I don't understand why Postgres seems to ignore it completely?


Regards,
  Dennis




How to set up temporary path for starting up psql in any folder?

2021-10-04 Thread Shaozhong SHI
How to set up temporary path for starting up psql in any folder?

I do not want to disrupt existing settings of paths.

Regards,

David


Re: Why would Postgres 11 suddenly ignore the recovery.conf file?

2021-10-04 Thread Josef Šimánek
Isn't this related to the merge of recovery.conf into postgresql.conf
(https://www.postgresql.org/docs/current/recovery-config.html)?

On Mon, Oct 4, 2021 at 12:35, Dennis Jacobfeuerborn wrote:
>
> Hi,
> I just stopped and restarted a Postgres 11 docker container which so far
> has happily streamed the WAL from its primary but after the restart it
> now seems to completely ignore the recovery.conf file and just outputs this:
>
>   2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv4 address
> "0.0.0.0", port 5432
> 2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv6 address "::",
> port 5432
> 2021-10-04 10:14:19.123 UTC [1] LOG:  listening on Unix socket
> "/var/run/postgresql/.s.PGSQL.5432"
> 2021-10-04 10:14:19.304 UTC [21] LOG:  database system was shut down in
> recovery at 2021-10-04 10:14:11 UTC
> 2021-10-04 10:14:19.305 UTC [21] LOG:  entering standby mode
> 2021-10-04 10:14:19.313 UTC [21] LOG:  could not signal for checkpoint:
> checkpointer is not running
> 2021-10-04 10:14:19.338 UTC [21] LOG:  redo starts at 2F01/B9044C30
> 2021-10-04 10:15:35.889 UTC [21] LOG:  consistent recovery state reached
> at 2F02/A2E1A108
> 2021-10-04 10:15:35.893 UTC [1] LOG:  database system is ready to accept
> read only connections
>
> Notice that there is apparently no attempt being made to connect to the
> primary any more. The recovery.conf file is still present and correct so
> I don't understand why Postgres seems to ignore it completely?
>
> Regards,
>Dennis
>
>




Re: Why would Postgres 11 suddenly ignore the recovery.conf file?

2021-10-04 Thread Dennis Jacobfeuerborn
That happened with Postgres 12, I think, so it shouldn't affect this 
installation.


On 10/4/21 12:40, Josef Šimánek wrote:

Isn't this related to the merge of recovery.conf into postgresql.conf
(https://www.postgresql.org/docs/current/recovery-config.html)?

On Mon, Oct 4, 2021 at 12:35, Dennis Jacobfeuerborn wrote:


Hi,
I just stopped and restarted a Postgres 11 docker container which so far
has happily streamed the WAL from its primary but after the restart it
now seems to completely ignore the recovery.conf file and just outputs this:

   2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv4 address
"0.0.0.0", port 5432
2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv6 address "::",
port 5432
2021-10-04 10:14:19.123 UTC [1] LOG:  listening on Unix socket
"/var/run/postgresql/.s.PGSQL.5432"
2021-10-04 10:14:19.304 UTC [21] LOG:  database system was shut down in
recovery at 2021-10-04 10:14:11 UTC
2021-10-04 10:14:19.305 UTC [21] LOG:  entering standby mode
2021-10-04 10:14:19.313 UTC [21] LOG:  could not signal for checkpoint:
checkpointer is not running
2021-10-04 10:14:19.338 UTC [21] LOG:  redo starts at 2F01/B9044C30
2021-10-04 10:15:35.889 UTC [21] LOG:  consistent recovery state reached
at 2F02/A2E1A108
2021-10-04 10:15:35.893 UTC [1] LOG:  database system is ready to accept
read only connections

Notice that there is apparently no attempt being made to connect to the
primary any more. The recovery.conf file is still present and correct so
I don't understand why Postgres seems to ignore it completely?

Regards,
Dennis








Re: Why would Postgres 11 suddenly ignore the recovery.conf file?

2021-10-04 Thread hubert depesz lubaczewski
On Mon, Oct 04, 2021 at 12:35:48PM +0200, Dennis Jacobfeuerborn wrote:
> Hi,
> I just stopped and restarted a Postgres 11 docker container which so far has
> happily streamed the WAL from its primary but after the restart it now seems
> to completely ignore the recovery.conf file and just outputs this:

If this is 11, then it would mean that at some point in time Pg was told
to finish recovery, using a trigger file, pg_ctl promote, or some other
means.

But perhaps it's Pg 12 (or newer)? If so, then recovery.conf is no longer
parsed.
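
As a quick way to tell which case applies (a sketch, run on the standby; the
second query only returns rows on Pg 12 or newer, where the old recovery.conf
parameters became ordinary settings):

```sql
-- Is the server still in recovery, i.e. still acting as a standby?
SELECT pg_is_in_recovery();

-- On Pg 12+ the former recovery.conf parameters show up in pg_settings:
SELECT name, setting
FROM pg_settings
WHERE name IN ('primary_conninfo', 'recovery_target_timeline');
```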

Best regards,

depesz





Re: Why would Postgres 11 suddenly ignore the recovery.conf file?

2021-10-04 Thread Dennis Jacobfeuerborn

On 10/4/21 12:35, Dennis Jacobfeuerborn wrote:

Hi,
I just stopped and restarted a Postgres 11 docker container which so far 
has happily streamed the WAL from its primary but after the restart it 
now seems to completely ignore the recovery.conf file and just outputs 
this:


  2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv4 address 
"0.0.0.0", port 5432
2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv6 address "::", 
port 5432
2021-10-04 10:14:19.123 UTC [1] LOG:  listening on Unix socket 
"/var/run/postgresql/.s.PGSQL.5432"
2021-10-04 10:14:19.304 UTC [21] LOG:  database system was shut down in 
recovery at 2021-10-04 10:14:11 UTC

2021-10-04 10:14:19.305 UTC [21] LOG:  entering standby mode
2021-10-04 10:14:19.313 UTC [21] LOG:  could not signal for checkpoint: 
checkpointer is not running

2021-10-04 10:14:19.338 UTC [21] LOG:  redo starts at 2F01/B9044C30
2021-10-04 10:15:35.889 UTC [21] LOG:  consistent recovery state reached 
at 2F02/A2E1A108
2021-10-04 10:15:35.893 UTC [1] LOG:  database system is ready to accept 
read only connections


Notice that there is apparently no attempt being made to connect to the 
primary any more. The recovery.conf file is still present and correct so 
I don't understand why Postgres seems to ignore it completely?


Ok, so just as I sent the mail the log now showed this:

2021-10-04 10:37:53.841 UTC [45] LOG:  started streaming WAL from 
primary at 2F14/1300 on timeline 1


Notice that it took over 20 minutes for the system to start streaming. 
Could this be a connection-related issue? If so, is there a way to see 
that Postgres is trying to connect to the primary but hasn't succeeded 
yet? From the looks of it the log stays silent until it actually has 
started streaming, and there is no "initiating recovery" message before 
the connection actually succeeds.


Regards,
  Dennis




Re: PostgreSQL - Ordering Table based of Foreign Key

2021-10-04 Thread FOUTE K . Jaurès
Hello,

Thanks for all the feedback.
After googling, I found this link, which may be a way I can go:
https://stackoverflow.com/questions/51279588/sort-tables-in-order-of-dependency-postgres


On Sun, Oct 3, 2021 at 22:33, David G. Johnston wrote:

>
>
> On Sun, Oct 3, 2021, 00:48 FOUTE K. Jaurès  wrote:
>
>> Hello,
>>
>> I want to order tables based on the foreign key so that I can delete
>> tables one by one without facing "ERROR: update or delete on table
>> "table" violates foreign key constraint. DETAIL: Key is still referenced
>> from table"
>>
>
> As others have said this is why on delete cascade exists.  Unfortunately
> this does require some advanced planning as changing it on the fly doesn't
> really work.
>
> I do not believe there is a built-in way to return an ordered listing of
> dependent relations given a base relation as an input.  But the catalog
> entries do exist should you wish to build such yourself.
>
> That said maybe deferred constraint evaluation will work so that at least
> the order doesn't matter.  But you still would need to know which tables to
> write delete commands for.
>
> There is some recent discussion on making this work in a more
> user-friendly way but that would be only available in v15 at best.
>
> David J.
>
>
>>
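
As a sketch of the deferrable-constraint approach mentioned above (table and
column names here are made up for illustration, not from the thread):

```sql
-- With a deferrable FK, the check runs at COMMIT, so the order of the
-- DELETEs inside the transaction no longer matters.
CREATE TABLE parent (id int PRIMARY KEY);
CREATE TABLE child (
    id        int PRIMARY KEY,
    parent_id int REFERENCES parent (id) DEFERRABLE INITIALLY DEFERRED
);

BEGIN;
DELETE FROM parent WHERE id = 1;  -- would normally fail while child rows exist
DELETE FROM child WHERE parent_id = 1;
COMMIT;                           -- the FK is checked here
```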

-- 
Jaurès FOUTE


Re: How to set up temporary path for starting up psql in any folder?

2021-10-04 Thread Francisco Olarte
On Mon, 4 Oct 2021 at 12:40, Shaozhong SHI  wrote:
> How to set up temporary path for starting up psql in any folder?
> I do not want to disrupt existing settings of paths.

You will need to provide a lot more details (like your OS, and whether
you just want to open the "psql" binary or modify the paths for other
things).

In simple OSs like mine, plain Debian Linux, you can start psql in the
folder $FOLDER by just doing "$FOLDER/psql". To launch it using
temporary override of PATH you will need to do "PATH=$FOLDER:$PATH
psql".

But anyway, this is a shell-101 question, not much to do with
postgres, you'll have to ask someone who knows your
os/shell/whatever.

Francisco Olarte.




Re: create a new GIN index for my own type

2021-10-04 Thread Tomas Vondra

On 10/4/21 8:30 AM, huangning...@yahoo.com wrote:

Hi:
I created a new data type, and then I wanted to create a GIN index for 
it, but when I created the index, the program would crash 。

The version of postgresql is 9.6。

The following is part of the code, and I also refer to the code of intarray.



I doubt anyone is going to investigate this unless you provide a more 
complete example - something like an extension where people can do "make 
install" without having to fill in various pieces of code.


To investigate the crash, you need to attach a debugger to the backend 
and run the CREATE INDEX (or whatever triggers the crash). The debugger 
should catch the segfault and you'll be able to identify where exactly 
it crashes and why (and investigate).


1) first get PID of the backend

  SELECT pg_backend_pid();

2) then attach a debugger to the backend

  gdb -p $PID
  (gdb) c

3) run the CREATE INDEX query

4) get backtrace from the debugger

  (gdb) bt


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Problem in pg_basebackup of 1.7 TB database while replicating in PostgreSQL 9.4.( Master and slave both are in postgreSQL 9.4)

2021-10-04 Thread Tomas Vondra

On 10/4/21 9:28 AM, Amal Chakravarty wrote:
Hi all. I am working with a database of 1.7 TB size which is in 
PostgreSQL 9.4. Kindly suggest how to take the pg_basebackup of such a 
huge data while replicating from master to slave.




And what exactly is the issue? Does it fail in some way or are you 
looking for a faster / more efficient way to transfer the data?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: create a new GIN index for my own type

2021-10-04 Thread Tomas Vondra

On 10/4/21 3:32 PM, huangning...@yahoo.com wrote:
I have already debugged the program according to this step, but I found 
that in the DataCopy function, the variable typlen should be 8, but it 
is -1,




Well, if you have debugged this, it'd be nice if you could share more 
information (e.g. backtraces, etc.) otherwise others can just guess what 
you saw. And that makes it much harder to help you.


I see you defined the data type as PASSEDBYVALUE, but you haven't 
specified INTERNALLENGTH, so it's -1 (i.e. variable length). Obviously, 
that can't be passed by value - not sure if this is intentional or just 
a case of CREATE TYPE not checking it.
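
One way to verify what actually got recorded for the type (a suggestion, not
something from the thread) is to look at the catalog; typlen = -1 means a
varlena, and typbyval = t means pass-by-value:

```sql
SELECT typname, typlen, typbyval, typalign, typstorage
FROM pg_type
WHERE typname IN ('geomgrid', '_geomgrid');
```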



BTW it's customary not to top post - inline replies are much easier to 
follow, as it makes clearer which parts you respond to. And please make 
sure that you're responding to the mailing list, not just directly to 
the other person.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Problem in pg_basebackup of 1.7 TB database while replicating in PostgreSQL 9.4.( Master and slave both are in postgreSQL 9.4)

2021-10-04 Thread Ron

On 10/4/21 2:28 AM, Amal Chakravarty wrote:
Hi all. I am working with a database of 1.7 TB size which is in PostgreSQL 
9.4. Kindly suggest how to take the pg_basebackup of such a huge data 
while replicating from master to slave.


Obligatory "9.4 is EOL" comment.

--
Angular momentum makes the world go 'round.




Re: Problem in Storing Satellite imagey(Geotiff file)

2021-10-04 Thread Adrian Klaver

On 10/4/21 2:19 AM, Vijay Garg wrote:

Hi Adrian,

I am using Postgres 13.4 and PostGIS 3.1. I have used the 
following command to upload the data.


raster2pgsql -I -C -s 4326  C:\Users\Admin\Desktop\dem.tif raster.dem | 
psql -U postgres -d India -h localhost -p 5432


I am able to upload the DEM file into Postgres, but I am not able to check any 
table.


The image shows that you can check (select from) the table. What you are 
seeing is the image data stored in a binary format. You will not see the 
actual image. To get that see this section of PostGIS docs:


https://postgis.net/docs/using_raster_dataman.html#RT_Raster_Applications
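
A quick way to confirm the load without looking at the binary data is to query
the raster metadata (a sketch; with the command above the table should be
raster.dem and the default raster column rast, but adjust if yours differ):

```sql
-- One row of metadata per raster tile: size, pixel scale, SRID, band count, etc.
SELECT rid, (ST_MetaData(rast)).*
FROM raster.dem
LIMIT 5;
```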



I am attaching the screenshot in this email.

image.png

--
*Thanks & Regards*
Vijay Garg
Contact Number: 6283101848
Sabudh Foundation
New Delhi




On Sun, Oct 3, 2021 at 2:18 AM Adrian Klaver 


--
Adrian Klaver
adrian.kla...@aklaver.com




Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

2021-10-04 Thread Shaozhong SHI
Has anyone tested this one?
A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis
Valentiner


I tried psql_insert_copy method, but I got the following error message.

to_sql() got an unexpected keyword argument 'method'


Can anyone shed light on this?


Regards,


David


Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

2021-10-04 Thread Adrian Klaver

On 10/4/21 8:44 AM, Shaozhong SHI wrote:


Has anyone tested this one?
A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis 
Valentiner 



I tried psql_insert_copy method, but I got the following error message.

to_sql() got an unexpected keyword argument 'method'


Can anyone shed light on this?


method is a legitimate keyword:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html?highlight=to_sql#pandas.DataFrame.to_sql

So there must be something about how you used it.

Pandas version?

The actual code you used when the error occurred?





Regards,


David




--
Adrian Klaver
adrian.kla...@aklaver.com




Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

2021-10-04 Thread Shaozhong SHI
Hello, Adrian Klaver,

Pandas version is 0.23.0.

I used the following code:

def psql_insert_copy(table, conn, keys, data_iter):
# gets a DBAPI connection that can provide a cursor
dbapi_conn = conn.connection
with dbapi_conn.cursor() as cur:
s_buf = StringIO()
writer = csv.writer(s_buf)
writer.writerows(data_iter)
s_buf.seek(0)

columns = ', '.join('"{}"'.format(k) for k in keys)
if table.schema:
table_name = '{}.{}'.format(table.schema, table.name)
else:
table_name = table.name

sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
table_name, columns)
cur.copy_expert(sql=sql, file=s_buf)

engine = create_engine('postgresql+psycopg2://:5432/postgres')
try:
df.to_sql('test1', engine, schema='public', if_exists='append',
index=False, method=psql_insert_copy)

I could not find obvious reasons.

Regards,

David

On Mon, 4 Oct 2021 at 17:06, Adrian Klaver 
wrote:

> On 10/4/21 8:44 AM, Shaozhong SHI wrote:
> >
> > Has anyone tested this one?
> > A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis
> > Valentiner
> > <
> https://ellisvalentiner.com/post/a-fast-method-to-insert-a-pandas-dataframe-into-postgres/
> >
> >
> > I tried psql_insert_copy method, but I got the following error message.
> >
> > to_sql() got an unexpected keyword argument 'method'
> >
> >
> > Can anyone shed light on this?
>
> method is a legitimate keyword:
>
>
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html?highlight=to_sql#pandas.DataFrame.to_sql
>
> So there must be something about how you used it.
>
> Pandas version?
>
> The actual code you used when the error occurred?
>
>
> >
> >
> > Regards,
> >
> >
> > David
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Growth planning

2021-10-04 Thread Israel Brewster
A couple of months ago I was tasked with setting up a new database to hold the 
results of some new data processing scripts a colleague of mine was developing. 
As I knew this would be a decent amount of data, I did my best to set up a 
system that would perform well, assigning the machine 20 processor cores to 
allow for parallel processing workflows when retrieving/processing the data, 
128GB of RAM, and 6TB of NVMe storage. On this I installed the latest 
PostgreSQL (13), did some performance tuning to the settings according to 
various guides I found online, set up a table to hold the data, and created a 
number of indexes that seemed appropriate for the queries I anticipated.

At this point, performance is fine, though not spectacular, with a “normal” 
query of the currently 400GB database taking a couple of seconds. “Normal” in 
this case being defined as the most common SELECT query run against the data to 
plot the current and historical trends. Users are few in number (and will 
likely remain that way), and it can keep up with the insert rate without 
difficulty - data is processed and dumped to the database in 10 minute 
“chunks", and generally the run is only taking around 2 minutes, so there is a 
fair amount of overhead available there.

However, this database is growing at a rate of 14GB/day (as measured via 
looking at df stats), and the script owner is talking about wanting to process 
and pull in “all the historical data we have access to”, which would go back 
several years, not to mention the probable desire to keep things running into 
the foreseeable future. This amounts to a volume of data that is WAY beyond 
anything I have any experience with, especially since I am primarily a software 
engineer, not a DB admin (though administering a DB is often involved with my 
role, as the software I engineer often needs a data source). As such, I am 
looking for advice from people who have dealt with such large volumes of data 
as far as how I should architect things now, before it gets too out of hand, to 
best ensure optimal performance in the future.

Some additional data that may or may not be relevant:

- The workload is largely insert intensive. Every 10 minutes, 1Hz records are 
inserted for up to three channels from about 118 stations - so up to around 
212,000 new records inserted every 10 minutes. In practice, the number is 
generally somewhat lower as not all stations actually have three channels of 
data, but the majority do.

- The largest SELECT workflow currently is a script that pulls all available 
data for ONE channel of each station (currently, I suspect that will change to 
all channels in the near future), and runs some post-processing machine 
learning algorithms on it. This script (written in R, if that makes a 
difference) currently takes around half an hour to run, and is run once every 
four hours. I would estimate about 50% of the run time is data retrieval and 
the rest doing its own thing. I am only responsible for integrating this script 
with the database, what it does with the data (and therefore how long that 
takes, as well as what data is needed), is up to my colleague. I have this 
script running on the same machine as the DB to minimize data transfer times.

- Other than the above processing script, workload is fairly light, with only 
one or two users occasionally pulling up graphs of the data for a handful of 
channels on a handful of stations at most (single station/channel at a time). 
The time range of data needed for these graphs may vary from a few seconds 
(looking at a specific event, which may or may not be recent) to several years 
(looking at historical trends). As such, full-resolution data needs to be 
available quickly - that is, on a user-friendly time scale - for any period of 
time for which we have data.


- Some stats from the pg_stat_user_tables:

volcano_seismology=# SELECT relname, last_vacuum, 
last_analyze,last_autovacuum,last_autoanalyze, autovacuum_count, 
autoanalyze_count, 
n_dead_tup,idx_scan,idx_tup_fetch,n_tup_ins,n_mod_since_analyze,n_ins_since_vacuum
 FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]---+--
relname | data
last_vacuum | 2021-09-29 17:33:19.269922+00
last_analyze| 2021-09-29 17:33:32.281416+00
last_autovacuum | 2021-10-04 12:28:38.250069+00
last_autoanalyze| 2021-10-04 15:05:29.745062+00
autovacuum_count| 30
autoanalyze_count   | 37
n_dead_tup  | 122031
idx_scan| 1584854
idx_tup_fetch   | 44873856136
n_tup_ins   | 245597916
n_mod_since_analyze | 1901231
n_ins_since_vacuum  | 5958840

- and the data table definition:

 Column |   Type   | Collation | Nullable | 
Default  
+--+---+--+--
 id | bigint   |   | not null | 
nextval('data_id_

Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

2021-10-04 Thread Rob Sargent


> On Oct 4, 2021, at 10:20 AM, Shaozhong SHI  wrote:
> 
> Hello, Adrian Klaver,
> 
> Pandas version is 0.23.0.
> 
> I used the following code:
> 
> def psql_insert_copy(table, conn, keys, data_iter):
> # gets a DBAPI connection that can provide a cursor
> dbapi_conn = conn.connection
> with dbapi_conn.cursor() as cur:
> s_buf = StringIO()
> writer = csv.writer(s_buf)
> writer.writerows(data_iter)
> s_buf.seek(0)
> 
> columns = ', '.join('"{}"'.format(k) for k in keys)
> if table.schema:
> table_name = '{}.{}'.format(table.schema, table.name 
> )
> else:
> table_name = table.name 
> 
> sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
> table_name, columns)
> cur.copy_expert(sql=sql, file=s_buf)
> 
> engine = create_engine('postgresql+psycopg2://:5432/postgres')
> try:
> df.to_sql('test1', engine, schema='public', if_exists='append', 
> index=False, method=psql_insert_copy) 

you need to quote 'psql_insert_copy'



DELETE ... USING LATERAL

2021-10-04 Thread Nikhil Benesch
Is it intentional that LATERAL elements in a USING clause of a DELETE
statement can't reference the table declared in the FROM clause?

Here's a somewhat contrived example. Suppose I have a table with one
jsonb column:

create table int_arrays (int_array jsonb);
insert into int_arrays values ('[1]'), ('[1, 2]'), ('[3, 4, 5]'),
('[1, 1, 1]');

If I want to delete every row whose array contains a value greater
than one, I would expect the following query to work:

delete from int_arrays using jsonb_array_each(int_array) _ (val)
where val::integer > 1;

But that fails with:

ERROR:  invalid reference to FROM-clause entry for table "int_arrays"
LINE 1: delete from int_arrays using jsonb_array_each(int_array) _ (...
  ^
HINT:  There is an entry for table "int_arrays", but it cannot be
referenced from this part of the query.

So, ok, fine, the FROM and USING clauses are different scopes or
something. Except that doesn't quite explain the situation, because
you can't reuse the FROM table name in the USING clause:

# delete from int_arrays using int_arrays;
ERROR:  table name "int_arrays" specified more than once

Can someone shed some light on the situation here? Is there a reason
that LATERAL elements in the USING clause must be prevented from
accessing the FROM table or is the restriction just emergent behavior?

Nikhil




Re: Growth planning

2021-10-04 Thread Rob Sargent


> On Oct 4, 2021, at 10:22 AM, Israel Brewster  wrote:
> 
> 
> - and the data table definition:
> 
>  Column |   Type   | Collation | Nullable |   
>   Default  
> +--+---+--+--
>  id | bigint   |   | not null | 
> nextval('data_id_seq'::regclass)
>  datetime   | timestamp with time zone |   | not null | 
>  freq_max1  | double precision |   |  | 
>  freq_max5  | double precision |   |  | 
>  freq_max10 | double precision |   |  | 
>  freq_max20 | double precision |   |  | 
>  freq_max30 | double precision |   |  | 
>  freq_max40 | double precision |   |  | 
>  freq_max50 | double precision |   |  | 
>  freq_max100| double precision |   |  | 
>  sd_freq_max5   | double precision |   |  | 
>  sd_freq_max10  | double precision |   |  | 
>  sd_freq_max20  | double precision |   |  | 
>  sd_freq_max30  | double precision |   |  | 
>  sd_freq_max40  | double precision |   |  | 
>  sd_freq_max50  | double precision |   |  | 
>  sd_freq_max100 | double precision |   |  | 
>  ssa_max1   | double precision |   |  | 
>  ssa_max5   | double precision |   |  | 
>  ssa_max10  | double precision |   |  | 
>  ssa_max20  | double precision |   |  | 
>  ssa_max30  | double precision |   |  | 
>  ssa_max40  | double precision |   |  | 
>  ssa_max50  | double precision |   |  | 
>  ssa_max100 | double precision |   |  | 
>  sd_ssa_max5| double precision |   |  | 
>  sd_ssa_max10   | double precision |   |  | 
>  sd_ssa_max20   | double precision |   |  | 
>  sd_ssa_max30   | double precision |   |  | 
>  sd_ssa_max40   | double precision |   |  | 
>  sd_ssa_max50   | double precision |   |  | 
>  sd_ssa_max100  | double precision |   |  | 
>  station| smallint |   | not null | 
>  channel| character varying(6) |   | not null | 
>  epoch  | integer  |   |  | 
>  rsam   | double precision |   |  | 
>  sd_rsam| double precision |   |  | 
> Indexes:
> "data_pkey" PRIMARY KEY, btree (id)
> "date_station_channel_idx" UNIQUE, btree (datetime, station, channel)
> "station_channel_epoch_idx" UNIQUE, btree (station, channel, epoch)
> "data_station_channel_idx" btree (station, channel)
> "station_data_idx" btree (station)
> "station_date_idx" btree (station, datetime)
> Foreign-key constraints:
> "data_station_fkey" FOREIGN KEY (station) REFERENCES stations(id)
> Triggers:
> update_epoch BEFORE INSERT OR UPDATE OF datetime ON data FOR EACH ROW 
> EXECUTE FUNCTION store_epoch()
> 
Guessing the “sd” is "standard deviation”?  Any chance those stddevs are easily 
calculable from base data?  Could cut your table size in half (and put those 20 
cores to work on the reporting).  And I wonder if the last three indices are 
strictly necessary? They take disc space too.

But my bet is you’re headed for partitioning on datetime or perhaps station.



Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

2021-10-04 Thread Adrian Klaver

On 10/4/21 9:20 AM, Shaozhong SHI wrote:

Hello, Adrian Klaver,

Pandas version is 0.23.0.


The reason the below does not work is method did not show up until 
pandas 0.24.0.




I used the following code:

def psql_insert_copy(table, conn, keys, data_iter):
     # gets a DBAPI connection that can provide a cursor
     dbapi_conn = conn.connection
     with dbapi_conn.cursor() as cur:
         s_buf = StringIO()
         writer = csv.writer(s_buf)
         writer.writerows(data_iter)
         s_buf.seek(0)

         columns = ', '.join('"{}"'.format(k) for k in keys)
         if table.schema:
             table_name = '{}.{}'.format(table.schema, table.name 
)

         else:
             table_name = table.name 

         sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
             table_name, columns)
         cur.copy_expert(sql=sql, file=s_buf)
engine = create_engine('postgresql+psycopg2://:5432/postgres')
try:
     df.to_sql('test1', engine, schema='public', if_exists='append', 
index=False, method=psql_insert_copy)


I could not find obvious reasons.





Regards,

David




--
Adrian Klaver
adrian.kla...@aklaver.com




Re: Growth planning

2021-10-04 Thread Israel Brewster
> On Oct 4, 2021, at 8:46 AM, Rob Sargent  wrote:
> 
>> On Oct 4, 2021, at 10:22 AM, Israel Brewster > > wrote:
> Guessing the “sd” is "standard deviation”?  Any chance those stddevs are 
> easily calculable from base data?  Could cut your table size in half (and put 
> those 20 cores to work on the reporting).

Possible - I’d have to dig into that with the script author. I was just handed 
an R script (I don’t work with R…) and told here’s the data it needs, here’s 
the output we need stored in the DB. I then spent just enough time with the 
script to figure out how to hook up the I/O. The schema is pretty much just a 
raw dump of the output - I haven’t really spent any resources figuring out 
what, exactly, the data is. Maybe I should :-)

>  And I wonder if the last three indices are strictly necessary? They take 
> disc space too.

Not sure. Here’s the output from pg_stat_all_indexes:

volcano_seismology=# select * from pg_stat_all_indexes where relname='data';
 relid | indexrelid | schemaname | relname |   indexrelname| 
idx_scan | idx_tup_read | idx_tup_fetch 
---+++-+---+--+--+---
 19847 |  19869 | public | data| data_pkey |
0 |0 | 0
 19847 |  19873 | public | data| date_station_channel_idx  |   
811884 |  12031143199 |1192412952
 19847 |  19875 | public | data| station_channel_epoch_idx |
8 |   318506 |318044
 19847 |  19876 | public | data| station_data_idx  | 
9072 | 9734 |  1235
 19847 |  19877 | public | data| station_date_idx  |   
721616 |  10927533403 |   10908912092
 19847 |  20479 | public | data| data_station_channel_idx  |
47293 | 194422257262 |6338753379
(6 rows)

so they *have* been used (although not the station_data_idx so much), but this 
doesn’t tell me when it was last used, so some of those may be queries I was 
experimenting with to see what was fastest, but are no longer in use. Maybe I 
should keep an eye on this for a while, see which values are increasing.

> 
> But my bet is you’re headed for partitioning on datetime or perhaps station.

While datetime partitioning seems to be the most common, I’m not clear on how 
that would help here, as the most intensive queries need *all* the datetimes 
for a given station, and even the smaller queries would be getting an arbitrary 
time range potentially spanning several, if not all, partitions. Now partitioning 
on station seems to make sense - there are over 100 of those, and pretty much 
any query will only deal with a single station at a time. Perhaps, if more 
partitioning would be better, partition by both station and channel? The queries 
that need to be fastest will only be looking at a single channel of a single 
station.

I’ll look into this a bit more, maybe try some experimenting while I still have 
*relatively* little data. My main hesitation here is that in the brief look 
I’ve given partitioning so far, it looks to be a royal pain to get set up. Any 
tips for making that easier?

Thanks for the suggestion!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

2021-10-04 Thread Shaozhong SHI
Hello, Adrian Klaver,
What is the robust way to upgrade Pandas?
Regards,
David

On Monday, 4 October 2021, Adrian Klaver  wrote:

> On 10/4/21 9:20 AM, Shaozhong SHI wrote:
>
>> Hello, Adrian Klaver,
>>
>> Pandas version is 0.23.0.
>>
>
> The reason the below does not work is method did not show up until pandas
> 0.24.0.
>
>
>> I used the following code:
>>
>> def psql_insert_copy(table, conn, keys, data_iter):
>>  # gets a DBAPI connection that can provide a cursor
>>  dbapi_conn = conn.connection
>>  with dbapi_conn.cursor() as cur:
>>  s_buf = StringIO()
>>  writer = csv.writer(s_buf)
>>  writer.writerows(data_iter)
>>  s_buf.seek(0)
>>
>>  columns = ', '.join('"{}"'.format(k) for k in keys)
>>  if table.schema:
>>  table_name = '{}.{}'.format(table.schema, table.name <
>> http://table.name>)
>>  else:
>>  table_name = table.name 
>>
>>  sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
>>  table_name, columns)
>>  cur.copy_expert(sql=sql, file=s_buf)
>> engine = create_engine('postgresql+psycopg2://:5432/postgres')
>> try:
>>  df.to_sql('test1', engine, schema='public', if_exists='append',
>> index=False, method=psql_insert_copy)
>>
>> I could not find obvious reasons.
>>
>
>
>
>> Regards,
>>
>> David
>>
>>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: Growth planning

2021-10-04 Thread Rob Sargent

On 10/4/21 11:09 AM, Israel Brewster wrote:
On Oct 4, 2021, at 8:46 AM, Rob Sargent > wrote:


On Oct 4, 2021, at 10:22 AM, Israel Brewster > wrote:
Guessing the “sd” is "standard deviation”?  Any chance those stddevs 
are easily calculable from base data?  Could cut your table size in 
half (and put those 20 cores to work on the reporting).


Possible - I’d have to dig into that with the script author. I was 
just handed an R script (I don’t work with R…) and told here’s the 
data it needs, here’s the output we need stored in the DB. I then 
spent just enough time with the script to figure out how to hook up 
the I/O. The schema is pretty much just a raw dump of the output - I 
haven’t really spent any resources figuring out what, exactly, the 
data is. Maybe I should :-)


 And I wonder if the last three indices are strictly necessary? They 
take disc space too.


Not sure. Here’s the output from pg_stat_all_indexes:

volcano_seismology=# select * from pg_stat_all_indexes where 
relname='data';
 relid | indexrelid | schemaname | relname |   indexrelname        | 
idx_scan | idx_tup_read | idx_tup_fetch

---+++-+---+--+--+---
 19847 |      19869 | public     | data    | data_pkey           |    
    0 |            0 |             0
 19847 |      19873 | public     | data    | 
date_station_channel_idx  |   811884 |  12031143199 |   1192412952
 19847 |      19875 | public     | data    | station_channel_epoch_idx 
|        8 |       318506 |   318044
 19847 |      19876 | public     | data    | station_data_idx        
  |     9072 |         9734 |   1235
 19847 |      19877 | public     | data    | station_date_idx        
  |   721616 |  10927533403 |   10908912092
 19847 |      20479 | public     | data    | 
data_station_channel_idx  |    47293 | 194422257262 |   6338753379

(6 rows)

so they *have* been used (although not the station_data_idx so much), 
but this doesn’t tell me when it was last used, so some of those may 
be queries I was experimenting with to see what was fastest, but are 
no longer in use. Maybe I should keep an eye on this for a while, see 
which values are increasing.




But my bet is you’re headed for partitioning on datetime or perhaps 
station.


While datetime partitioning seems to be the most common, I’m not clear 
on how that would help here, as the most intensive queries need *all* 
the datetimes for a given station, and even the smaller queries would 
be getting an arbitrary time range potentially spanning several, if 
not all, partitions. Now partitioning on station seems to make sense - 
there are over 100 of those, and pretty much any query will only deal 
with a single station at a time. Perhaps, if more partitioning would be 
better, partition by both station and channel? The queries that need to 
be fastest will only be looking at a single channel of a single station.


I’ll look into this a bit more, maybe try some experimenting while I 
still have *relatively* little data. My main hesitation here is that 
in the brief look I’ve given partitioning so far, it looks to be a 
royal pain to get set up. Any tips for making that easier?



If no queries address multiple stations you could do a table per 
station.  Doesn't smell good but you have a lot of data and well, speed 
kills.


I think the date-station-channel could "take over" for the 
station-date.  Naturally the latter is chosen if you give just the two 
fields, but I would be curious to see how well the former performs given 
just its first two fields (when station-date doesn't exist).
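
A sketch of how that comparison could be run (the station number and time
range below are made up; the DROP INDEX is rolled back, but note that it
briefly takes an exclusive lock on the table):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT datetime, rsam
FROM data
WHERE station = 42
  AND datetime >= now() - interval '30 days';

BEGIN;
DROP INDEX station_date_idx;          -- hide it for the test only
EXPLAIN (ANALYZE, BUFFERS)
SELECT datetime, rsam
FROM data
WHERE station = 42
  AND datetime >= now() - interval '30 days';
ROLLBACK;                             -- the index comes back
```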




Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

2021-10-04 Thread Adrian Klaver

On 10/4/21 10:10 AM, Shaozhong SHI wrote:

Hello, Adrian Klaver,
What is the robust way to upgrade Pandas?


Carefully.

The most recent version is 1.3.3, which is approximately 5 versions 
ahead of where you are now. The big jump is when Pandas went from 0.25 
to 1.0. See docs here:


https://pandas.pydata.org/docs/whatsnew/v1.0.0.html?highlight=upgrade

So the process should be 0.24 -> 0.25, verify, 0.25 -> 1.0, verify. Then 
on to wherever you want to end up a step at a time.


Before each step spend time here:

https://pandas.pydata.org/docs/whatsnew/

to see what the gotcha's are.



Regards,
David





--
Adrian Klaver
adrian.kla...@aklaver.com




Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

2021-10-04 Thread Adrian Klaver

On 10/4/21 10:28 AM, Adrian Klaver wrote:

On 10/4/21 10:10 AM, Shaozhong SHI wrote:

Hello, Adrian Klaver,
What is the robust way to upgrade Pandas?


Carefully.

The most recent version is 1.3.3, which is approximately 5 versions 
ahead of where you are now. The big jump is when Pandas went from 0.25 
to 1.0. See docs here:


https://pandas.pydata.org/docs/whatsnew/v1.0.0.html?highlight=upgrade

So the process should be 0.24 -> 0.25, verify, 0.25 -> 1.0, verify. Then 
on to wherever you want to end up a step at a time.


Before each step spend time here:

https://pandas.pydata.org/docs/whatsnew/

to see what the gotcha's are.


Should have added:

If you are not already working in a virtualenv it would be a good idea 
to do the above in one or more.






Regards,
David








--
Adrian Klaver
adrian.kla...@aklaver.com




Re: Growth planning

2021-10-04 Thread Israel Brewster
> On Oct 4, 2021, at 9:22 AM, Rob Sargent  wrote:
> 
> On 10/4/21 11:09 AM, Israel Brewster wrote:
>>> On Oct 4, 2021, at 8:46 AM, Rob Sargent >> > wrote:
>>> 
 On Oct 4, 2021, at 10:22 AM, Israel Brewster >>> > wrote:
>>> Guessing the “sd” is "standard deviation”?  Any chance those stddevs are 
>>> easily calculable from base data?  Could cut your table size in half (and 
>>> put those 20 cores to work on the reporting).
>> 
>> Possible - I’d have to dig into that with the script author. I was just 
>> handed an R script (I don’t work with R…) and told here’s the data it needs, 
>> here’s the output we need stored in the DB. I then spent just enough time 
>> with the script to figure out how to hook up the I/O. The schema is pretty 
>> much just a raw dump of the output - I haven’t really spent any resources 
>> figuring out what, exactly, the data is. Maybe I should :-)
>> 
>>>  And I wonder if the last three indices are strictly necessary? They take 
>>> disc space too.
>> 
>> Not sure. Here’s the output from pg_stat_all_indexes:
>> 
>> volcano_seismology=# select * from pg_stat_all_indexes where relname='data';
>>  relid | indexrelid | schemaname | relname |   indexrelname| 
>> idx_scan | idx_tup_read | idx_tup_fetch 
>> ---+++-+---+--+--+---
>>  19847 |  19869 | public | data| data_pkey | 
>>0 |0 | 0
>>  19847 |  19873 | public | data| date_station_channel_idx  |   
>> 811884 |  12031143199 |1192412952
>>  19847 |  19875 | public | data| station_channel_epoch_idx | 
>>8 |   318506 |318044
>>  19847 |  19876 | public | data| station_data_idx  | 
>> 9072 | 9734 |  1235
>>  19847 |  19877 | public | data| station_date_idx  |   
>> 721616 |  10927533403 |   10908912092
>>  19847 |  20479 | public | data| data_station_channel_idx  |
>> 47293 | 194422257262 |6338753379
>> (6 rows)
>> 
>> so they *have* been used (although not the station_data_idx so much), but 
>> this doesn’t tell me when it was last used, so some of those may be queries 
>> I was experimenting with to see what was fastest, but are no longer in use. 
>> Maybe I should keep an eye on this for a while, see which values are 
>> increasing.
>> 
>>> 
>>> But my bet is you’re headed for partitioning on datetime or perhaps station.
>> 
>> While datetime partitioning seems to be the most common, I’m not clear on 
>> how that would help here, as the most intensive queries need *all* the 
>> datetimes for a given station, and even the smaller queries would be getting 
>> an arbitrary time range potentially spanning several, if not all, 
>> partitions. Now portioning on station seems to make sense - there are over 
>> 100 of those, and pretty much any query will only deal with a single station 
>> at a time. Perhaps if more partitioning would be better, portion by both 
>> station and channel? The queries that need to be fastest will only be 
>> looking at a single channel of a single station.
>> 
>> I’ll look into this a bit more, maybe try some experimenting while I still 
>> have *relatively* little data. My main hesitation here is that in the brief 
>> look I’ve given partitioning so far, it looks to be a royal pain to get set 
>> up. Any tips for making that easier?
>> 
>> 
> If no queries address multiple stations you could do a table per station.  
> Doesn't smell good but you have a lot of data and well, speed kills.

Indeed. Table per station as opposed to partitioning? The *most* I can 
reasonably envision needing is to query two stations, i.e. I could see 
potentially wanting to compare station a to some “baseline” station b. In 
general, though, the stations are independent, and it seems unlikely that we 
will need any multi-station queries. Perhaps query one station, then a second 
query for a second to display graphs for both side-by-side to look for 
correlations or something, but nothing like that has been suggested at the 
moment.

> 
> I think the date-station-channel could "take over" for the station-date.  
> Naturally the latter is chosen if you give just the two fields, but I would 
> be curious to see how well the former performs given just its first two 
> fields(when station-date doesn't exist).

Ah, that makes sense. I’ll try to run some benchmarks later today/tomorrow.

Thanks again!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

Re: DELETE ... USING LATERAL

2021-10-04 Thread Tom Lane
Nikhil Benesch  writes:
> Is it intentional that LATERAL elements in a USING clause of a DELETE
> statement can't reference the table declared in the FROM clause?

Hmm ... doesn't work for UPDATE, either.

My mental model of these things is that the target table is cross-joined
to the additional tables as though by a comma in FROM, so that what
you have here ought to work much like

select * from int_arrays, jsonb_array_each(int_array) _ (val)
where val::integer > 1;

Clearly it's not doing so as far as the LATERAL scoping is concerned.
Maybe we are adding the target table to the query after the additional
tables, not before them?

Not sure I'd call this a bug exactly, but maybe there's room for
improvement.  Or maybe there is an actual semantic issue that
I'm not seeing right away.

regards, tom lane




Re: DELETE ... USING LATERAL

2021-10-04 Thread Nikhil Benesch
On Mon, Oct 4, 2021 at 1:48 PM Tom Lane  wrote:
> My mental model of these things is that the target table is cross-joined
> to the additional tables as though by a comma in FROM [...]

Mine as well.

I just managed to dredge up some history here though. Turns out you
explicitly disabled this feature for 9.4 to make room for a future
feature to allow left-joining the target table [0]. Is support for
that feature still desired/planned? (If it's been permanently
abandoned for whatever reason, then maybe it's safe just to revert
158b7fa?)

[0]: 
https://github.com/postgres/postgres/commit/158b7fa6a34006bdc70b515e14e120d3e896589b

Nikhil




Re: DELETE ... USING LATERAL

2021-10-04 Thread Michael Lewis
On Mon, Oct 4, 2021, 10:30 AM Nikhil Benesch 
wrote:

> you can't reuse the FROM table name in the USING clause:
>
> # delete from int_arrays using int_arrays;
> ERROR:  table name "int_arrays" specified more than once


> Don't you need to use an alias for the table in the using clause?


Re: DELETE ... USING LATERAL

2021-10-04 Thread Tom Lane
Nikhil Benesch  writes:
> On Mon, Oct 4, 2021 at 1:48 PM Tom Lane  wrote:
>> My mental model of these things is that the target table is cross-joined
>> to the additional tables as though by a comma in FROM [...]

> Mine as well.

> I just managed to dredge up some history here though. Turns out you
> explicitly disabled this feature for 9.4 to make room for a future
> feature to allow left-joining the target table [0]. Is support for
> that feature still desired/planned? (If it's been permanently
> abandoned for whatever reason, then maybe it's safe just to revert
> 158b7fa?)

Ah-hah, I wondered whether we hadn't thought about this already,
but I'd not gotten around to researching it.

Not sure what to tell you about the state of the idea that the
target table could be re-specified in FROM/USING.  I'm hesitant
to close the door on it permanently, because people do periodically
wish to be able to left-join the target to something else.  But
the fact that no one's done anything about it for years suggests
that it's not that high on anyone's wish list.

regards, tom lane




Re: Postgres PANIC when it could not open file in pg_logical/snapshots directory

2021-10-04 Thread Alvaro Herrera
On 2021-Jun-22, Mike Yeap wrote:

> I have a Postgres version 11.11 configured with both physical replication
> slots (for repmgr) as well as some logical replication slots (for AWS
> Database Migration Service (DMS)). This morning, the server went panic with
> the following messages found in the log file:
> 
> 2021-06-22 04:56:35.314 +08 [PID=19457 application="[unknown]"
> user_name=dms database=** host(port)=**(48360)] PANIC:  could not open file
> "pg_logical/snapshots/969-FD606138.snap": Operation not permitted

Hmm, isn't this strange?  open(3) is not documented to return EPERM,
which is what this error string maps to.

You should definitely not mess with the PG data dir; if you do, it's
your problem when things break.  In this case you may not be messing
with the data dir yourself, but perhaps you kernel is buggy or you have
some security module that prevents the operation from working properly,
or something.
 
-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: DELETE ... USING LATERAL

2021-10-04 Thread Tom Lane
Michael Lewis  writes:
> On Mon, Oct 4, 2021, 10:30 AM Nikhil Benesch 
> wrote:
>> # delete from int_arrays using int_arrays;
>> ERROR:  table name "int_arrays" specified more than once

>> Don't you need to use an alias for the table in the using clause?

You could, but then you'd be creating a self-join on the target table
(and would need to add suitable WHERE clauses to constrain that join).
This might be the best near-term workaround, but it does seem ugly
and inefficient.
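
A sketch of that workaround against the earlier int_arrays example, using the
standard jsonb_array_elements_text() in place of the jsonb_array_each() from
the original query, and tying the alias back to the target row via ctid:

```sql
DELETE FROM int_arrays
USING int_arrays AS ia,
      jsonb_array_elements_text(ia.int_array) AS elem(val)
WHERE ia.ctid = int_arrays.ctid   -- constrain the self-join to the same row
  AND elem.val::integer > 1;
```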

regards, tom lane




Re: Growth planning

2021-10-04 Thread Alban Hertroys


> On 4 Oct 2021, at 18:22, Israel Brewster  wrote:

(…)

> the script owner is taking about wanting to process and pull in “all the 
> historical data we have access to”, which would go back several years, not to 
> mention the probable desire to keep things running into the foreseeable 
> future.

(…)

> - The largest SELECT workflow currently is a script that pulls all available 
> data for ONE channel of each station (currently, I suspect that will change 
> to all channels in the near future), and runs some post-processing machine 
> learning algorithms on it. This script (written in R, if that makes a 
> difference) currently takes around half an hour to run, and is run once every 
> four hours. I would estimate about 50% of the run time is data retrieval and 
> the rest doing its own thing. I am only responsible for integrating this 
> script with the database, what it does with the data (and therefore how long 
> that takes, as well as what data is needed), is up to my colleague. I have 
> this script running on the same machine as the DB to minimize data transfer 
> times.

I suspect that a large portion of the time is spent on downloading this data to the 
R script; would it help to rewrite it in PL/R and do (part of) the ML 
calculations on the DB side?

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.





Re: Growth planning

2021-10-04 Thread Ron

On 10/4/21 12:36 PM, Israel Brewster wrote:
[snip]
Indeed. Table per station as opposed to partitioning? The *most* I can 
reasonably envision needing is to query two stations, i.e. I could see 
potentially wanting to compare station a to some “baseline” station b. In 
general, though, the stations are independent, and it seems unlikely that 
we will need any multi-station queries. Perhaps query one station, then a 
second query for a second to display graphs for both side-by-side to look 
for correlations or something, but nothing like that has been suggested at 
the moment.




Postgresql partitions *are* tables.  What if you partition by station (or 
range of stations)?
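
A minimal sketch of that layout (the column list is abbreviated; partition
names and station values are illustrative):

```sql
CREATE TABLE data_part (
    id       bigint GENERATED BY DEFAULT AS IDENTITY,
    datetime timestamptz NOT NULL,
    station  smallint    NOT NULL,
    channel  varchar(6)  NOT NULL,
    rsam     double precision
    -- ... remaining measurement columns as in the existing "data" table ...
) PARTITION BY LIST (station);

-- One partition (itself a normal table) per station, or per small group
-- of stations:
CREATE TABLE data_s1 PARTITION OF data_part FOR VALUES IN (1);
CREATE TABLE data_s2 PARTITION OF data_part FOR VALUES IN (2);

-- An index created on the parent is created on every partition as well:
CREATE INDEX ON data_part (station, channel, datetime);
```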


--
Angular momentum makes the world go 'round.


Re: Growth planning

2021-10-04 Thread Israel Brewster
> On Oct 4, 2021, at 12:46 PM, Ron  wrote:
> 
> On 10/4/21 12:36 PM, Israel Brewster wrote:
> [snip]
>> Indeed. Table per station as opposed to partitioning? The *most* I can 
>> reasonably envision needing is to query two stations, i.e. I could see 
>> potentially wanting to compare station a to some “baseline” station b. In 
>> general, though, the stations are independent, and it seems unlikely that we 
>> will need any multi-station queries. Perhaps query one station, then a 
>> second query for a second to display graphs for both side-by-side to look 
>> for correlations or something, but nothing like that has been suggested at 
>> the moment.
>> 
> 
> Postgresql partitions are tables.  What if you partition by station (or range 
> of stations)?

Yeah, that’s what I thought, but Rob had said “Table per station”, so I wasn’t 
sure if he was referring to *not* using partitioning, but just making “plain” 
tables.

Regardless, I intend to try partitioning by station sometime this week, to see 
how performance compares to the “one big table” I currently have. Also to 
figure out how to get it set up, which from what I’ve seen appears to be a bit 
of a pain point.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145
> 
> -- 
> Angular momentum makes the world go 'round.



Query time related to limit clause

2021-10-04 Thread Shubham Mittal
Hi Team,

*I have shared execution times of two queries below:*

*I need to find only the first row matching the criteria, but LIMIT 1 is
taking more time than LIMIT 15 or more. Can anyone explain why, and how I
can achieve the same result in less time?*

 explain analyze SELECT * from abc where organisation_process_path =
cast('org' as ltree) and  abc_type='secondary' and
common_details->'commonDetails'->'nchBundle'->>'subGridName'='905811-22_MISCN_data'
and status <>
   'CLOSED' AND sub_product_type = 'Prepaid'
AND created_date >= cast('2021-03-23 00:00:00.000' AS TIMESTAMP)
AND created_date <= cast('2021-09-23 00:00:00.000' AS TIMESTAMP)
order by created_date asc *LIMIT 1*

"Limit  (cost=1.31..941.32 rows=1 width=6947) (actual
time=5117.039..5117.042 rows=1 loops=1)"
"  ->  Merge Append  (cost=1.31..4476296.09 rows=4762 width=6947) (actual
time=5117.036..5117.038 rows=1 loops=1)"
"Sort Key: abc_serv_nch_q1_2021.created_date"
"->  Index Scan using abc_serv_nch_q1_2021_created_date_idx on
abc_serv_nch_q1_2021  (cost=0.43..378412.39 rows=1005 width=7025) (actual
time=742.277..742.277 rows=0 loops=1)"
"  Index Cond: ((created_date >= '2021-03-23
00:00:00'::timestamp without time zone) AND (created_date <= '2021-09-23
00:00:00'::timestamp without time zone))"
"  Filter: (((status)::text <> 'CLOSED'::text) AND
(organisation_process_path = 'org'::ltree) AND ((abc_type)::text =
'secondary'::text) AND ((sub_product_type)::text = 'Prepaid'::text) AND
common_details -> 'commonDetails'::text) -> 'nchBundle'::text) ->>
'subGridName'::text) = '905811-22_MISCN_data'::text))"
"  Rows Removed by Filter: 558116"
"->  Index Scan using abc_serv_nch_q2_2021_created_date_idx on
abc_serv_nch_q2_2021  (cost=0.43..2674454.09 rows=3756 width=6928) (actual
time=2074.950..2074.950 rows=1 loops=1)"
"  Index Cond: ((created_date >= '2021-03-23
00:00:00'::timestamp without time zone) AND (created_date <= '2021-09-23
00:00:00'::timestamp without time zone))"
"  Filter: (((status)::text <> 'CLOSED'::text) AND
(organisation_process_path = 'org'::ltree) AND ((abc_type)::text =
'secondary'::text) AND ((sub_product_type)::text = 'Prepaid'::text) AND
common_details -> 'commonDetails'::text) -> 'nchBundle'::text) ->>
'subGridName'::text) = '905811-22_MISCN_data'::text))"
"  Rows Removed by Filter: 1743539"
"->  Index Scan using abc_serv_nch_q3_2021_created_date_idx on
abc_serv_nch_q3_2021  (cost=0.43..1423368.04 rows=1 width=6548) (actual
time=2299.805..2299.805 rows=0 loops=1)"
"  Index Cond: ((created_date >= '2021-03-23
00:00:00'::timestamp without time zone) AND (created_date <= '2021-09-23
00:00:00'::timestamp without time zone))"
"  Filter: (((status)::text <> 'CLOSED'::text) AND
(organisation_process_path = 'org'::ltree) AND ((abc_type)::text =
'secondary'::text) AND ((sub_product_type)::text = 'Prepaid'::text) AND
((((common_details -> 'commonDetails'::text) -> 'nchBundle'::text) ->>
'subGridName'::text) = '905811-22_MISCN_data'::text))"
"  Rows Removed by Filter: 1320434"



*"Planning Time: 18.563 ms""Execution Time: 5117.157 ms"*

*WHEN THE LIMIT GIVEN IS GREATER THAN OR EQUAL TO 15*

explain analyze SELECT * from abc where organisation_process_path =
cast('org' as ltree) and  abc_type='secondary' and
common_details->'commonDetails'->'nchBundle'->>'subGridName'='905811-22_MISCN_data'
and status <>
   'CLOSED' AND sub_product_type = 'Prepaid'
AND created_date >= cast('2021-03-23 00:00:00.000' AS TIMESTAMP)
AND created_date <= cast('2021-09-23 00:00:00.000' AS TIMESTAMP)
order by created_date asc *LIMIT 15*

   "Limit  (cost=12708.06..12708.09 rows=15 width=6947) (actual
time=0.428..0.431 rows=15 loops=1)"
"  ->  Sort  (cost=12708.06..12719.96 rows=4762 width=6947) (actual
time=0.426..0.428 rows=15 loops=1)"
"Sort Key: abc_serv_nch_q1_2021.created_date"
"Sort Method: top-N heapsort  Memory: 40kB"
"->  Append  (cost=7201.82..12591.22 rows=4762 width=6947) (actual
time=0.081..0.366 rows=299 loops=1)"
"  ->  Bitmap Heap Scan on abc_serv_nch_q1_2021
 (cost=7201.82..8338.60 rows=1005 width=7025) (actual time=0.038..0.038
rows=0 loops=1)"
"Recheck Cond: (common_details ->
'commonDetails'::text) -> 'nchBundle'::text) ->> 'subGridName'::text) =
'905811-22_MISCN_data'::text) AND ((sub_product_type)::text =
'Prepaid'::text) AND ((abc_type)::text = 'secondary'::text) AND
((status)::text <> 'CLOSED'::text) AND (created_date >= '2021-03-23
00:00:00'::timestamp without time zone) AND (created_date <= '2021-09-23
00:00:00'::timestamp without time zone))"
"Filter: (organisation_process_path = 'org'::ltree)"
"->  BitmapAnd  (cost=7201.82..7201.82 rows=1005
width=0)
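
One hedged workaround, based only on the two plans shown above: since the
LIMIT 15 form picks the fast bitmap-scan-plus-top-N-sort plan, the first row
can be taken from that result instead of letting LIMIT 1 choose the slow
ordered index scan. This is a sketch, not a guaranteed fix; it simply reuses
the behaviour already demonstrated in this message.

```sql
-- Sketch: let the inner query plan as the fast LIMIT 15 variant, then keep
-- only the earliest of those 15 rows. The outer LIMIT cannot be pushed below
-- the inner LIMIT, so the inner plan is preserved.
SELECT *
FROM (
    SELECT *
    FROM abc
    WHERE organisation_process_path = cast('org' AS ltree)
      AND abc_type = 'secondary'
      AND common_details->'commonDetails'->'nchBundle'->>'subGridName' = '905811-22_MISCN_data'
      AND status <> 'CLOSED'
      AND sub_product_type = 'Prepaid'
      AND created_date >= cast('2021-03-23 00:00:00.000' AS TIMESTAMP)
      AND created_date <= cast('2021-09-23 00:00:00.000' AS TIMESTAMP)
    ORDER BY created_date ASC
    LIMIT 15
) sub
ORDER BY created_date ASC
LIMIT 1;
```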

Re: Growth planning

2021-10-04 Thread Rob Sargent

On 10/4/21 3:09 PM, Israel Brewster wrote:
On Oct 4, 2021, at 12:46 PM, Ron wrote:


On 10/4/21 12:36 PM, Israel Brewster wrote:
[snip]
Indeed. Table per station as opposed to partitioning? The *most* I 
can reasonably envision needing is to query two stations, i.e. I 
could see potentially wanting to compare station a to some 
“baseline” station b. In general, though, the stations are 
independent, and it seems unlikely that we will need any 
multi-station queries. Perhaps query one station, then a second 
query for a second to display graphs for both side-by-side to look 
for correlations or something, but nothing like that has been 
suggested at the moment.




Postgresql partitions *are* tables.  What if you partition by station 
(or range of stations)?


Yeah, that’s what I thought, but Rob had said “Table per station”, so 
I wasn’t sure if he was referring to *not* using partitioning, but 
just making “plain” tables.


Regardless, I intend to try partitioning by station sometime this week, 
to see how performance compares to the “one big table” I currently 
have. Also to figure out how to get it set up, which from what I’ve 
seen appears to be a bit of a pain point.

---

My "strict" table per station suggestion was meant as an option to avoid 
the partitioning pain point entirely if it wasn't going to buy you 
anything. Namely querying more than one station's data.
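
To make that trade-off concrete, a rough sketch (all table names here are
purely illustrative): with an independent table per station, comparing two
stations means naming both tables explicitly, whereas a station-partitioned
parent table keeps the same comparison as a single query.

```sql
-- Plain table per station: a two-station comparison needs an explicit UNION ALL.
SELECT 27 AS station, datetime, rsam FROM station_27_data
UNION ALL
SELECT 28 AS station, datetime, rsam FROM station_28_data
ORDER BY datetime;

-- Station-partitioned parent table: one query, and single-station queries
-- are pruned down to a single partition.
SELECT station, datetime, rsam
FROM data
WHERE station IN (27, 28)
ORDER BY datetime;
```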


In a write-once scenario such as this,  would a "clustered index" on 
datetime be stable, performant?  Seems a read-for-export could put the 
head down at time point A and just go?





Re: Growth planning

2021-10-04 Thread Israel Brewster
> On Oct 4, 2021, at 1:21 PM, Rob Sargent  wrote:
> 
> On 10/4/21 3:09 PM, Israel Brewster wrote:
>>> On Oct 4, 2021, at 12:46 PM, Ron wrote:
>>> 
>>> On 10/4/21 12:36 PM, Israel Brewster wrote:
>>> [snip]
 Indeed. Table per station as opposed to partitioning? The *most* I can 
 reasonably envision needing is to query two stations, i.e. I could see 
 potentially wanting to compare station a to some “baseline” station b. In 
 general, though, the stations are independent, and it seems unlikely that 
 we will need any multi-station queries. Perhaps query one station, then a 
 second query for a second to display graphs for both side-by-side to look 
 for correlations or something, but nothing like that has been suggested at 
 the moment.
 
>>> 
>>> Postgresql partitions are tables.  What if you partition by station (or 
>>> range of stations)?
>> 
>> Yeah, that’s what I thought, but Rob had said “Table per station”, so I 
>> wasn’t sure if he was referring to *not* using partitioning, but just making 
>> “plain” tables.
>> 
>> Regardless, I intend to try partitioning by station sometime this week, to see 
>> how performance compares to the “one big table” I currently have. Also to 
>> figure out how to get it set up, which from what I’ve seen appears to be a 
>> bit of a pain point.
>> ---
>> 
> My "strict" table per station suggestion was meant as an option to avoid the 
> partitioning pain point entirely if it wasn't going to buy you anything. 
> Namely querying more than one station's data.

Ah, so in theory making “strict” tables for each would be easier than creating 
partitions for each? Something to consider for sure if so.

> 
> In a write-once scenario such as this,  would a "clustered index" on datetime 
> be stable, performant?  Seems a read-for-export could put the head down at 
> time point A and just go? 
> 
That’s beyond my level of DB admin knowledge, unfortunately :) I can certainly 
read up on it and give it a try though!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145  

Re: Growth planning

2021-10-04 Thread Israel Brewster
> On Oct 4, 2021, at 9:22 AM, Rob Sargent  wrote:
> 
> I think the date-station-channel could "take over" for the station-date.  
> Naturally the latter is chosen if you give just the two fields, but I would 
> be curious to see how well the former performs given just its first two 
> fields (when station-date doesn't exist).
> 

Interesting result here. Technically it appears you are correct - the 
date-station-channel index *can* “take over” for the station-date index. 
Unfortunately, it is about 6x slower (see the EXPLAIN ANALYZE output for the 
station_date_idx here: https://explain.depesz.com/s/COfy vs the one for the
date-station-channel index here: https://explain.depesz.com/s/hgBt) - using
the station_date_idx takes around
2.5 seconds while the date-station-channel index is over 12 seconds, even 
though it has an apparently simpler execution plan. Perhaps something about the 
different sizes of the indexes?

The query I used in both cases was this:

SELECT
to_char(datetime AT TIME ZONE 'UTC','YYYY-MM-DD"T"HH24:MI:SS"Z"') as 
text_date,
freq_max10,
sd_freq_max10,
rsam
FROM
data
WHERE datetime>='2021-09-27' 
AND station=27
AND channel='BHZ'

Which actually includes all three columns (which makes it even more interesting 
to me that the two column, non-UNIQUE index is preferable), and I ran the query 
several times both with and without the station-date index to (hopefully) make 
sure there were no caching issues.
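
For a query shaped like this one (two equality predicates plus a range on
datetime), the textbook fit is a composite index that leads with the equality
columns and ends with the range column. A sketch only; the index name and
column types are assumptions, not taken from the real schema:

```sql
-- Matches station = ? AND channel = ? AND datetime >= ? directly: the planner
-- can descend straight to station 27 / channel 'BHZ' and scan forward from
-- 2021-09-27 without filtering out rows from other stations or channels.
CREATE INDEX data_station_channel_datetime_idx
    ON data (station, channel, datetime);
```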

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



Re: Growth planning

2021-10-04 Thread Rob Sargent

On 10/4/21 3:37 PM, Israel Brewster wrote:
On Oct 4, 2021, at 1:21 PM, Rob Sargent wrote:


My "strict" table per station suggestion was meant as an option to 
avoid the partitioning pain point entirely if it wasn't going to buy 
you anything. Namely querying more than one station's data.


Ah, so in theory making “strict” tables for each would be easier than 
creating partitions for each? Something to consider for sure if so.




In a write-once scenario such as this,  would a "clustered index" on 
datetime be stable, performant?  Seems a read-for-export could put 
the head down at time point A and just go?


That’s beyond my level of DB admin knowledge, unfortunately :) I can 
certainly read up on it and give it a try though!




I was hoping one of the smart people would chime in;)


How to build psycopg2 for Windows

2021-10-04 Thread Dan Davis
Can anyone give me a solution to build psycopg2 statically on Windows?

I have succeeded in building it, but when I run dumpbin /dependents on the
generated file (the PYD file), it still depends on libpq.dll even when I
pass --static-libpq.

*Environment*

- OS: Windows 10
- Psycopg version: psycopg2-2.8.5
- Python version: 3.9
- PostgreSQL version: 14.0 (from ZIP)
- pip version: 21.2.4
- Visual C++ version: 2019

*Procedure*

- Make sure pg_config and psql are in the path
- Download as follows - pip download psycopg2==2.8.5 --no-binary :all:
- Expand the tarball
- Build in that directory as follows:

  python setup.py build_ext --static-libpq

- Try to verify it is indeed static

  dumpbin /dependents
build\lib.win-amd64-3.9\psycopg2\_psycopg.cp39-win_amd64.pyd


*Background*
The purpose here is twofold:

* Backfill support for Python 3.9 and psycopg2 to versions that may have
come out before 3.9 was available.
* Make sure our psycopg2 is built against a more recent version of the
PostgreSQL client libraries than 9.x


Re: How to build psycopg2 for Windows

2021-10-04 Thread Daniele Varrazzo
On Tue, 5 Oct 2021 at 00:30, Dan Davis  wrote:
>
> Can anyone give me a solution to build psycopg2 statically on Windows?

You can follow what Appveyor does, which is the CI that builds
psycopg2 packages.

- this is the setup
https://github.com/psycopg/psycopg2/blob/master/.appveyor/packages.yml
- this is the script it calls into:
https://github.com/psycopg/psycopg2/blob/master/scripts/build/appveyor.py

You can use the ML at psyc...@postgres.org for further questions.

-- Daniele




Re: How to build psycopg2 for Windows

2021-10-04 Thread Dan Davis
Daniele (namesake),

Thanks for the quick response.  I tried that, it looks like
https://github.com/psycopg/psycopg2/blob/master/scripts/build/appveyor.py#L291
shows the build step:

python setup.py build_ext -l "libpgcommon libpgport"

After that, I still get a binary that depends on libpq.dll - but when I
install psycopg2-binary, the DLL installed (the PYD file) does not depend
on libpq.dll - it is statically built.

On Mon, Oct 4, 2021 at 6:41 PM Daniele Varrazzo 
wrote:

> On Tue, 5 Oct 2021 at 00:30, Dan Davis  wrote:
> >
> > Can anyone give me a solution to build psycopg2 statically on Windows?
>
> You can follow what Appveyor does, which is the CI that builds
> psycopg2 packages.
>
> - this is the setup
> https://github.com/psycopg/psycopg2/blob/master/.appveyor/packages.yml
> - this is the script it calls into:
> https://github.com/psycopg/psycopg2/blob/master/scripts/build/appveyor.py
>
> You can use the ML at psyc...@postgres.org for further questions.
>
> -- Daniele
>


Re: Growth planning

2021-10-04 Thread Ryan Booz
As for clustering, unfortunately, it's a one-time operation in Postgres (as
far as I'm aware), so you'd have to "cluster" the index every time after an
insert or update of data. If it is partitioned, I presume it can be run on
the index of each partition table individually - but I'm not sure.
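
For reference, a sketch of what that per-partition clustering would look like.
The table and index names are assumptions carried over from earlier in the
thread, and CLUSTER takes an ACCESS EXCLUSIVE lock and rewrites the table, so
it is a maintenance step rather than something PostgreSQL keeps up to date:

```sql
-- One-time physical reordering of a single partition by a datetime index.
CLUSTER data_station_27 USING data_station_27_datetime_idx;

-- Later runs can omit USING; the previously used index is remembered.
CLUSTER data_station_27;
```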

On Mon, Oct 4, 2021 at 6:05 PM Rob Sargent  wrote:

> On 10/4/21 3:37 PM, Israel Brewster wrote:
>
> On Oct 4, 2021, at 1:21 PM, Rob Sargent  wrote:
>
> My "strict" table per station suggestion was meant as an option to avoid
> the partitioning pain point entirely if it wasn't going to buy you
> anything. Namely querying more than one station's data.
>
>
> Ah, so in theory making “strict” tables for each would be easier than
> creating partitions for each? Something to consider for sure if so.
>
>
> In a write-once scenario such as this,  would a "clustered index" on
> datetime be stable, performant?  Seems a read-for-export could put the head
> down at time point A and just go?
>
> That’s beyond my level of DB admin knowledge, unfortunately :) I can
> certainly read up on it and give it a try though!
>
>
> I was hoping one of the smart people would chime in;)
>


help implementing OGR Postgres foreign data wrapper

2021-10-04 Thread Brent Wood
Hi,

Apologies, this is not strictly a Postgres question, but Postgres is so 
fundamental to what I'm trying to do and this list is generally so helpful, so 
here I go...

I'm using FDWs to create virtual tables providing access to data from external 
data sources. This is working well for data in other Postgres databases using 
the Postgres FDW.

I also want to be able to link to data from WFS services via the OGR FDW 
(https://www.pgxn.org/dist/ogr_fdw/)
The first service I'm trying to connect to is:
https://dservices3.arcgis.com/fp1tibNcN9mbExhG/arcgis/services/BOMEC_15_Class_WFS/WFSServer

I can retrieve the capabilities doc in a browser (test 1 passed!) and plot the 
data as a layer in QGIS (test 2 passed). My next step is to test OGR access via 
ogrinfo. This I can't seem to get working. I'm assuming I need to sort this out 
before trying to get the OGR_FDW working.
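
For context, the kind of foreign server definition being aimed at would look
roughly like the sketch below. It follows the general pattern from the ogr_fdw
documentation; the server name, target schema, and the "WFS:" datasource prefix
are untested assumptions for this particular service.

```sql
CREATE EXTENSION IF NOT EXISTS ogr_fdw;

CREATE SERVER bomec_wfs
    FOREIGN DATA WRAPPER ogr_fdw
    OPTIONS (
        datasource 'WFS:https://dservices3.arcgis.com/fp1tibNcN9mbExhG/arcgis/services/BOMEC_15_Class_WFS/WFSServer',
        format 'WFS');

-- Let ogr_fdw introspect the layers the service exposes and create matching
-- foreign tables in the chosen schema.
IMPORT FOREIGN SCHEMA ogr_all
    FROM SERVER bomec_wfs INTO public;
```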

I've tried:
ogrinfo 
https://dservices3.arcgis.com/fp1tibNcN9mbExhG/arcgis/services/BOMEC_15_Class_WFS/WFSServer
ogrinfo 
https://dservices3.arcgis.com/fp1tibNcN9mbExhG/arcgis/services/BOMEC_15_Class_WFS/WFSServer?SERVICE=WFS&REQUEST=GetCapabilities

I can only get the following response
ogrinfo 
https://dservices3.arcgis.com/fp1tibNcN9mbExhG/arcgis/services/BOMEC_15_Class_WFS/WFSServer?SERVICE=WFS&REQUEST=GetCapabilities
[1] 3121502
ERROR 1: HTTP error code : 400
ERROR 1: HTTP error code : 400
FAILURE:
Unable to open datasource 
`https://dservices3.arcgis.com/fp1tibNcN9mbExhG/arcgis/services/BOMEC_15_Class_WFS/WFSServer?SERVICE=WFS'
 with the following drivers.
  -> ESRIC
...
  -> OGR_PDS
  -> WFS
  -> OAPIF
...

Can anyone suggest what might be the problem?

Thanks.

Brent Wood

Principal Technician, Fisheries
NIWA
DDI:  +64 (4) 3860529

Brent Wood
Principal Technician - GIS and Spatial Data Management
Programme Leader - Environmental Information Delivery
+64-4-386-0529

National Institute of Water & Atmospheric Research Ltd (NIWA)
301 Evans Bay Parade Hataitai Wellington New Zealand