badly scaling performance with appending to bytea
We are trying to implement PostgreSQL code to load a large object into a PostgreSQL bytea in chunks, to avoid loading the file into memory in the client.

First attempt was to do

    update build_attachment set chunk = chunk || newdata ;

this did not scale and got significantly slower after 4000-5000 updates. The chunks are 4K in size, and I'm testing with a 128MB input file, requiring 32,774 chunk updates.

Next, I tried creating an aggregate (taken from Stack Overflow):

    CREATE AGGREGATE bytea_agg(bytea) (SFUNC=byteacat, STYPE=bytea);

and changed the code to insert the chunks into a temporary table:

    create temporary table build_attachment (seq bigserial primary key, chunk bytea) on commit drop;

We then insert our 4K chunks into this, which takes very little time (20 seconds for the 32,774 inserts).

Here's an example, though, of trying to select the aggregate:

gary=> \timing
Timing is on.
gary=> select bytea_agg(chunk order by seq) from build_attachment where seq < 4000 \g output
Time: 13372.843 ms
gary=> select bytea_agg(chunk order by seq) from build_attachment where seq < 8000 \g output
Time: 54447.541 ms
gary=> select bytea_agg(chunk order by seq) from build_attachment where seq < 16000 \g output
Time: 582219.773 ms

So those partial aggregates completed in somewhat acceptable times, but ...

gary=> select bytea_agg(chunk order by seq) from build_attachment where seq < 32000 \g output

this one hadn't completed in an hour. The PostgreSQL backend process for my connection on the server goes to 100% CPU and stays there, not using much RAM and not doing much IO, oddly.

EXPLAINing these aggregate selects doesn't show anything useful.

Am I doomed to not be able to update a bytea this way? Is there some way I can tune this?
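For reference, the workflow described above can be reproduced end to end with a script along the following lines; the generate_series-based test data is an assumption for illustration and is not taken from the post.

    -- Aggregate and temporary table exactly as described above.
    CREATE AGGREGATE bytea_agg(bytea) (SFUNC = byteacat, STYPE = bytea);

    BEGIN;

    CREATE TEMPORARY TABLE build_attachment (
        seq   bigserial PRIMARY KEY,
        chunk bytea
    ) ON COMMIT DROP;

    -- Simulated input: 32,774 chunks of 4096 bytes each, roughly 128 MB in total.
    INSERT INTO build_attachment (chunk)
    SELECT decode(repeat('ab', 4096), 'hex')
    FROM generate_series(1, 32774);

    -- The aggregation step whose runtime grows much faster than linearly;
    -- octet_length() is used only to avoid printing 128 MB to the terminal.
    SELECT octet_length(bytea_agg(chunk ORDER BY seq)) FROM build_attachment;

    COMMIT;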
Re: badly scaling performance with appending to bytea
Can you use a materialized view to do the bytea_agg() and then refresh it concurrently whenever you need updated data? The refresh concurrently might take a few hours or days to run to keep the matview up to date, but your queries would be pretty fast.

A possible problem is that you are running out of memory, so the larger queries are going to disk. If you can set up temp space on a faster volume, or bump up your memory configuration, it might help; i.e., work_mem, shared_buffers, and the file system cache could all play into larger aggregations running faster.

On Wed, Mar 21, 2018 at 8:03 AM, Gary Cowell wrote:
> We are trying to implement postgresql code to load a large object into
> a postgresql bytea in chunks to avoid loading the file into memory in
> the client.
> [...]
> Am I doomed to not be able to update a bytea this way? Is there some
> way I can tune this?
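A minimal sketch of that idea, assuming the chunks live in a regular (non-temporary) table, since a materialized view cannot reference a temporary one; the view name and the work_mem value are illustrative:

    -- One-row materialized view holding the assembled bytea.
    CREATE MATERIALIZED VIEW build_attachment_blob AS
    SELECT 1 AS id, bytea_agg(chunk ORDER BY seq) AS blob
    FROM build_attachment;

    -- REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
    CREATE UNIQUE INDEX ON build_attachment_blob (id);

    -- Raise per-session memory before rebuilding, then refresh after new chunks
    -- have been loaded; readers keep seeing the old contents during the refresh.
    SET work_mem = '256MB';
    REFRESH MATERIALIZED VIEW CONCURRENTLY build_attachment_blob;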
Re: badly scaling performance with appending to bytea
2018-03-21 13:03 GMT+01:00 Gary Cowell:
> We are trying to implement postgresql code to load a large object into
> a postgresql bytea in chunks to avoid loading the file into memory in
> the client.
> [...]
> Am I doomed to not be able to update a bytea this way? Is there some
> way I can tune this?

bytea is an immutable object without preallocation, so updating a big value repeatedly is very expensive.

I think using the LO API and then transforming to bytea will be much more effective:

\lo_import path

Then you can use

CREATE OR REPLACE FUNCTION attachment_to_bytea(attachment oid)
RETURNS bytea AS $$
DECLARE
  fd integer;
  size integer;
BEGIN
  fd := lo_open(attachment, 262144);
  size := lo_lseek(fd, 0, 2);
  PERFORM lo_lseek(fd, 0, 0);
  RETURN loread(fd, size);
EXCEPTION WHEN undefined_object THEN
  PERFORM lo_close(fd);
  RETURN NULL;
END;
$$ LANGUAGE plpgsql STRICT SECURITY DEFINER SET search_path = 'pg_catalog';

Importing a file of circa 44 MB with this function took only a few seconds.

Regards

Pavel
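For example, from psql this could look like the following; the target table, column, and the OID value printed by \lo_import are illustrative, not from the thread:

    -- Import the file as a large object; psql prints the new OID,
    -- e.g. "lo_import 16401".
    \lo_import /path/to/input.bin

    -- Convert that large object into a bytea column using the function above.
    INSERT INTO attachments (data)
    SELECT attachment_to_bytea(16401);

    -- Remove the large object once the bytea copy is stored.
    \lo_unlink 16401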
Re: badly scaling performance with appending to bytea
2018-03-21 13:56 GMT+01:00 Pavel Stehule:
> I think using the LO API and then transforming to bytea will be much more
> effective:
> [...]
> Importing a file of circa 44 MB with this function took only a few seconds.

There is also a native function, lo_get:

https://www.postgresql.org/docs/current/static/lo-funcs.html
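For example, assuming \lo_import printed an OID of 16401 (the OID value and the attachments table are illustrative, not from the thread):

    -- Read the whole large object back as a bytea in one call (PostgreSQL 9.4+).
    INSERT INTO attachments (data)
    SELECT lo_get(16401);

    -- lo_get can also read a slice: lo_get(loid, offset, length).
    SELECT octet_length(lo_get(16401, 0, 4096));

    -- Remove the large object once the bytea copy is stored.
    SELECT lo_unlink(16401);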
Re: badly scaling performance with appending to bytea
Thank you Pavel for those ideas.

I should probably have mentioned we don't have access to the file system on the PostgreSQL server, as it's provided by the Amazon AWS RDS service.

These functions look good when you can push the file to be loaded onto the database file system. I'll see if it's possible to do this on the AWS PostgreSQL RDS service, but this sort of thing is usually not possible.

On 21 March 2018 at 12:59, Pavel Stehule wrote:
> There is also a native function, lo_get:
>
> https://www.postgresql.org/docs/current/static/lo-funcs.html
Re: badly scaling performance with appending to bytea
2018-03-21 14:04 GMT+01:00 Gary Cowell:
> Thank you Pavel for those ideas.
>
> I should probably have mentioned we don't have access to the file system
> on the PostgreSQL server, as it's provided by the Amazon AWS RDS service.
> [...]

The LO API doesn't need file access:

https://www.postgresql.org/docs/9.2/static/lo-interfaces.html

You can use the lo_write function.
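A minimal sketch of that approach, staying entirely at the SQL level so it also works without server file system access; the attachments table, the psql \gset variables, and the sample chunk literals are illustrative assumptions, and in practice each lowrite() call would carry one 4K chunk sent from the client:

    -- Illustrative target table for the assembled bytea.
    CREATE TABLE IF NOT EXISTS attachments (id serial PRIMARY KEY, data bytea);

    BEGIN;

    -- Create an empty large object and open it for writing (131072 = INV_WRITE).
    SELECT lo_creat(-1) AS loid \gset
    SELECT lo_open(:loid, 131072) AS fd \gset

    -- Repeat once per chunk, in order; each lowrite() appends at the current
    -- position of the open descriptor, which stays valid for this transaction.
    SELECT lowrite(:fd, '\xdeadbeef'::bytea);
    SELECT lowrite(:fd, '\xcafebabe'::bytea);

    SELECT lo_close(:fd);

    -- Convert the finished large object into a single bytea and clean up.
    INSERT INTO attachments (data) SELECT lo_get(:loid);
    SELECT lo_unlink(:loid);

    COMMIT;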
