query patterns for multipass aggregating

2018-11-11 Thread Rob Nikander
Hi,

I’ve run into this pattern a few times, and I usually get a little confused. 
I’m wondering if there are some common solutions or techniques.

Simplified example:

I have tables `items`, `colors`, and `images`. Items have many colors, and many 
images.

I want a query to list items, with their colors and images. Each result row is 
an item, so the colors and images must be aggregated into arrays or json. 

If there were only one other table, it’s a simple join and group…

select items.*, array_agg(color_name)
from items join colors on items.id = colors.item_id
group by items.id

Now to get images too, my first try uses a CTE… 

-- Tables:
--   items  (id, name, foo)
--   colors (item_id, color_name, color_foo)
--   images (item_id, image_file, image_foo)

with items_with_colors as (
  -- This is the same query as above
  select items.*, array_agg(colors.color_name) as color_names
  from items
  join colors on items.id = colors.item_id
  group by items.id
)
-- Same idea repeated, but now joining to the images table
select items.*, array_agg(images.image_file) as image_files
from items_with_colors items
join images on items.id = images.item_id
group by items.id, items.name, items.foo, items.color_names;  -- mmm :(

The first problem is on the last line: I have to list out all the columns. In
my real situation there are many more. Listing them is only an inconvenience, 
but I’m more worried that it is internally comparing all the columns, when 
really it could compare only `items.id`.

So… are there some other techniques that usually (or might) work better? It 
seems like the kind of thing that could run in parallel, if I wrote it right. 

Rob




Re: query patterns for multipass aggregating

2018-11-11 Thread Rob Nikander



> On Nov 11, 2018, at 12:54 PM, Andrew Gierth  
> wrote:
> …

Thank you, that is very helpful.

Could CTEs theoretically be optimized in a future version of PG to work
like the subqueries? I like to use them to give names to intermediate results, 
but I’ll stay away from them for now.
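
For the archives: my understanding of the suggestion is to aggregate each child
table in its own subquery and then join those on items.id, so the outer query
never needs the wide GROUP BY. A rough sketch against the tables above (using
left joins so items with no colors or images still show up):

select items.*, c.color_names, i.image_files
from items
left join (select item_id, array_agg(color_name) as color_names
           from colors group by item_id) c on c.item_id = items.id
left join (select item_id, array_agg(image_file) as image_files
           from images group by item_id) i on i.item_id = items.id;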

Rob


Re: help with aggregation query across a second text array column

2018-11-12 Thread Rob Nikander



> On Nov 12, 2018, at 9:40 AM, Scot Kreienkamp  
> wrote:
> …

I’m not too confident in my answer here (there could be a better way), but this 
might help. You could use the `unnest` function to transform the array into 
multiple rows. For example, given a table like 

create table t1 (env text, cls text, cls2 text[]);

I can query it like:

select env, string_agg(cls, ',')
from (select env, cls from t1
      union
      select env, unnest(cls2) from t1) t
group by env;
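
For example, given a hypothetical row like this:

insert into t1 values ('prod', 'a', array['b', 'c']);

the query returns one row per env, something like `prod | a,b,c` (the order
inside string_agg isn’t guaranteed unless you add an ORDER BY).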

Rob


querying both text and non-text properties

2018-12-04 Thread Rob Nikander
Hi, 

I’ve got an application where I’d like to search a collection of objects based 
on various properties, some text and others non-text (bools, enums, ints, etc). 
I’ve used full text search before, following the PG docs to set up an index on a
tsvector. And of course I’ve used normal indexes before for accelerating basic
queries that use non-text columns.

Any recommendations on combining the two? For example, if I search for objects 
where the description text matches [some ts_query] and the color = red, I can 
imagine putting the color property into the text index somehow - maybe with 
tokens like ‘color_red’, ‘color_blue’, but for something like an integer … ?

Rob


Re: querying both text and non-text properties

2018-12-05 Thread Rob Nikander



> On Dec 4, 2018, at 4:59 PM, Laurenz Albe  wrote:
> 
> You have two options:
> 
> A combined index:
> 
>  CREATE EXTENSION btree_gin;
>  CREATE INDEX ON fulltext USING gin (to_tsvector('english', doc), color);
> 
> That is the perfect match for a query with
> 
>  WHERE color = 'red' AND to_tsvector('english', doc) @@ to_tsquery('english', 
> 'word');
> 
> But you can also create two indexes:
> 
>  CREATE INDEX ON fulltext USING gin (to_tsvector('english', doc));
>  CREATE INDEX ON fulltext (color);
> 
> Then you don't need the extension, and PostgreSQL can still use them for the 
> search,
> either only one of them if the condition is selective enough, or a 
> "BitmapAnd" of both.

Thanks! I will try both these methods and compare the performance.
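
My plan is roughly to run the same query under both setups and compare the
plans, something like this (a sketch, reusing the table and column names from
your example):

explain (analyze, buffers)
select * from fulltext
where color = 'red'
  and to_tsvector('english', doc) @@ to_tsquery('english', 'word');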

Rob


procedures and transactions

2019-02-19 Thread Rob Nikander
Hi,

I’m trying to understand how procedures work with transactions. I tried the 
code below - it’s a simple procedure to print some notices and commit a 
transaction. If I call it from psql after a `begin`, then it gives an error. 
What does that error mean? Are procedures not allowed to commit/rollback if 
they are called within an outer transaction?

Also, I tried putting a `start transaction` command in the procedure. I got 
another error: `unsupported transaction command in PL/pgSQL`. Are procedures 
not allowed to start transactions? Or is there another command?

thanks,
Rob

create or replace procedure t_test(n integer)
as $$
begin
    raise notice 'current isolation level: %',
        (select current_setting('transaction_isolation'));
    raise notice 'current txid: %', (select txid_current());
    raise notice '---';
    commit;
    raise notice 'current isolation level: %',
        (select current_setting('transaction_isolation'));
    raise notice 'current txid: %', (select txid_current());
end;
$$ language plpgsql;

psql> begin;
psql> call t_test(1);

NOTICE:  current isolation level: read committed
NOTICE:  current txid: 111490
NOTICE:  ---
ERROR:  invalid transaction termination






Re: procedures and transactions

2019-02-19 Thread Rob Nikander
I thought if I had some application logic that needed a certain kind of 
transaction (maybe a non-default isolation level), I could hide that fact in a 
procedure. App code (Java/Python/whatever) could remain unaware of transactions 
(except maybe needing to retry after a failure) and simply send `call foo(?, 
?)` to the DB. But maybe that kind of design is not supported, and application 
code needs to start transactions and set isolation levels. Is that accurate? I 
suppose a procedure could throw an exception if it doesn’t like the value in 
`current_setting('transaction_isolation')`.
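
Something like this is what I had in mind (a sketch, with hypothetical names):

create or replace procedure do_report(p_id integer)
language plpgsql
as $$
begin
    -- can't change the isolation level of the transaction the CALL is
    -- already running in, so refuse to run if it isn't the one we need
    if current_setting('transaction_isolation') <> 'serializable' then
        raise exception 'do_report() must be called in a serializable transaction';
    end if;
    -- ... application logic using p_id ...
end;
$$;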

Rob

> On Feb 19, 2019, at 2:38 PM, David G. Johnston  
> wrote:
> 
> On Tuesday, February 19, 2019, Rob Nikander  wrote:
>  Are procedures not allowed to commit/rollback if they are called within in 
> an outer transaction?
> 
>  https://www.postgresql.org/docs/11/sql-call.html
> 
> Also, I tried putting a `start transaction` command in the procedure. I got 
> another error: `unsupported transaction command in PL/pgSQL`. Are procedures 
> not allowed to start transactions? Or is there another command?
> 
>  https://www.postgresql.org/docs/11/plpgsql-transactions.html
> 
> David J.
> 



Re: procedures and transactions

2019-02-20 Thread Rob Nikander



> On Feb 20, 2019, at 10:07 AM, Peter Eisentraut 
>  wrote:
> 
> You can run SET TRANSACTION ISOLATION LEVEL in a procedure.

I tried that before, but I got this error:

create or replace procedure t_test(n integer)
as $$
begin
    set transaction isolation level serializable;
    raise notice 'current isolation level: %',
        (select current_setting('transaction_isolation'));
    raise notice 'current txid: %', (select txid_current());
end;
$$ language plpgsql;

mydb=# call t_test(1);
ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
CONTEXT:  SQL statement "SET transaction isolation level serializable"
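
One variant that I think avoids that error is to COMMIT first, so the SET
applies to the brand-new transaction that starts after the commit. A sketch,
not something I've tested thoroughly:

create or replace procedure t_test(n integer)
as $$
begin
    commit;  -- ends the transaction started by CALL; a new one begins here
    set transaction isolation level serializable;
    raise notice 'current isolation level: %',
        current_setting('transaction_isolation');
end;
$$ language plpgsql;

call t_test(1);  -- called outside an explicit transaction block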


arrays of composite types, and client drivers like JDBC

2019-06-14 Thread Rob Nikander
Hi,

I'm experimenting with Java client libraries (the usual JDBC and some other
async projects, eg [1]). So far, I'm not finding ways to select/read
composite types without ugly string parsing. The simple cases are okay, but
if I have a column that is an array of composites, the client library might
give me a string for a column value like the following, with no builtin way
to parse it.

{"(10,\"abc \"\" \"\" \"\"\",\"2019-06-14
18:16:48.067969\",t)","(11,foo,\"2019-06-14 18:16:48.067969\",f)"}

Maybe I'm missing the part of the JDBC API that I can use here.

If not, then I'm wondering: is there something inherent in the underlying
PG protocol that makes this difficult for all these client/driver
libraries? In other words, maybe the protocol is sending strings meant for
display, not for parsing as data?

I was hoping I'd find an API like...

create type foo as (age int, color text);
create table t (a foo, b foo[]);

var resultSet = conn.prepareStatement("select a, b from t").executeQuery()
var foos = resultSet.getArray("b")
var foo = foos.getElement(0)
var age = foo.getInt(1)
var color = foo.getString(2)
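
One workaround I can think of in the meantime is to ask the server for JSON, so
the client only has to parse JSON rather than the composite literal syntax. A
sketch using the standard to_jsonb function (the example values are made up):

select to_jsonb(a) as a_json, to_jsonb(b) as b_json from t;
-- b_json comes back as a JSON array of objects, e.g.
-- [{"age": 10, "color": "abc"}, {"age": 11, "color": "foo"}]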

thanks,
Rob

[1]: https://github.com/cretz/pgnio


Re: arrays of composite types, and client drivers like JDBC

2019-06-16 Thread Rob Nikander



> On Jun 15, 2019, at 1:47 PM, Dave Cramer  wrote:
> 
> Basically because java would have to create a type dynamically to parse the 
> data into.
> There's nothing inherently difficult about parsing the data, the problem is 
> what do we put it into ?

(I accidentally replied off-list, so resending this.)

It could parse it to strings and convert only when I call `getInt`, 
`getTimestamp`, etc. But, like you said, I see that the parsing isn't that 
difficult, so I wrote something that seems to work. Probably that code should 
be in the library, not my application. If I test this out some more and it 
still seems to work, I’ll ask again about contributing to 
https://github.com/pgjdbc/pgjdbc.

Rob



Async client libraries - not worth it?

2019-06-16 Thread Rob Nikander
Hi,

I’m writing a new web app, and I’ve been experimenting with some async DB 
access libraries [1]. I also see some discussion online about a future Java 
standard to replace or supplement JDBC with an async API.

While I understand the benefits of async in some situations, it seems to me 
that these libraries are not going to give much performance benefit, given the 
architecture of a PostgreSQL server. (Nothing against PG; probably most RDBMSs 
are like this.)

I wonder if anyone else has looked at this and agrees, or not?

A client library with an async-style API may allow 100,000s of concurrent 
“operations”, but since the PG server itself doesn’t handle connections on that 
scale (and has no plans to, I assume?), the client library is really 
maintaining a queue of operations waiting for a connection pool. Maybe there is 
some performance benefit there, but the most important point - to free up the 
front end to handle many HTTP connections - can also happen by combining an 
operation queue with a synchronous API. 

Rob


[1]: Mentioned here: https://github.com/pgjdbc/pgadba/issues/17

Re: Async client libraries - not worth it?

2019-06-17 Thread Rob Nikander


> On Jun 17, 2019, at 1:12 PM, Dave Cramer  wrote:
> 
> https://www.techempower.com/benchmarks/#section=data-r17&hw=ph&test=db 
> 
> 
> Seems to be worth it.
> 
> Now it appears that ADBA is going to die on the vine, R2DBC and vertx seem to 
> be pretty good

The “async” frameworks are faster, but I think they might be getting the 
performance gain not from the async DB API, but from the fact that they don’t 
block OS threads that are handling frontend HTTP requests. They may be using an 
async DB API to achieve that, but they *could* (I think) also use traditional 
JDBC and other general purpose concurrency tools from Java’s standard library.  
That way would be easier to reason about, in my opinion.

I may just have to write something both ways and wait to get real world 
experience with it to see how it goes.






Re: Async client libraries - not worth it?

2019-06-17 Thread Rob Nikander



> On Jun 17, 2019, at 3:57 PM, Dave Cramer  wrote:
> […] Postgres can pipeline requests if the client is written correctly so it 
> is conceivable that this would be much faster.

Can the JDBC driver do this? I don’t see it documented anywhere.



Is array_append O(n)?

2019-06-18 Thread Rob Nikander
Hi,

Does `array_append(arr, elt)` create a new array and copy everything? In other 
words, is it O(n) or O(1)? I’m trying to use plpgsql and realizing I don’t 
really have my usual data structures for basic algorithms the way I’d normally 
write them. I probably shouldn’t be treating arrays like Java ArrayList or C++ 
vector, where you can add/remove at the end with little cost (amortized O(1)).

Rob



Re: Is array_append O(n)?

2019-06-18 Thread Rob Nikander



> On Jun 18, 2019, at 7:37 PM, Alvaro Herrera  wrote:
> 
> On 2019-Jun-18, Rob Nikander wrote:
> 
>> Does `array_append(arr, elt)` create a new array and copy everything?
>> In other words, is it O(n) or O(1)? […]
> 
> Starting with 9.5, there's an optimization that I think is supposed to
> make it O(1):

Sweet! Thanks. Now I’d like to know how to tell if that optimization is being 
used, or not. 

Same question for a way to pop an element off the end in O(1) time.
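
For context, the kind of loop I'm writing is roughly this (a sketch; I'm
assuming the O(1) path applies when array_append's result is assigned straight
back to the same plpgsql variable):

do $$
declare
    arr int[] := '{}';
begin
    for i in 1 .. 100000 loop
        arr := array_append(arr, i);  -- hoping this hits the fast path
    end loop;
    raise notice 'built array of length %', array_length(arr, 1);
end;
$$;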








delete inside for plpgsql loop on same relation?

2019-06-18 Thread Rob Nikander
Hi,

Are there guarantees about how this plpgsql behaves? It’s deleting from a table 
while it loops over it. So far it seems like the delete is logically after the 
select, as I hoped, and doesn’t interfere. 

for row in select * from some_stuff loop
delete from some_stuff where …
...
end loop;

I’m using a temporary table of “things to process” and looping over it, 
deleting the ones as I go.
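
Roughly like this (a sketch; the table and column names are made up):

create temp table to_process (id int primary key);
insert into to_process select id from some_stuff;  -- hypothetical source table

do $$
declare
    r record;
begin
    for r in select * from to_process loop
        -- ... process r.id ...
        delete from to_process where id = r.id;
    end loop;
end;
$$;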

I don’t see anything mentioned here: 
https://www.postgresql.org/docs/11/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING

Rob



reclaiming space from heavily used tables?

2017-12-18 Thread Rob Nikander
Hi,

I've got a large table from which I'd like to completely reclaim space. I read 
the docs and it sounds like I can’t run `vacuum full`, because this table is 
accessed constantly and can’t have downtime. Assuming that’s true, what do you 
think of the following idea? Is there a better alternative?

1. Replace the table T with two tables T1 and T2, and a view T that is `select * 
from T1 union all select * from T2`.
2. Write only to T1, and slowly move records from T2 to T1. 
3. When T2 is empty, redefine the view T to be simply `select * from T1`.
4. Recreate or vacuum full T2, so space is fully reclaimed.
5. Redefine the view as the union select and repeat the process going the other 
direction, from T1 to T2, as needed.

I guess one question is whether I can do 1 and 3 (renaming tables, redefining 
views) atomically without disturbing concurrent select queries.
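
For concreteness, steps 1 and 3 would be something like this (a sketch; it
assumes the real table is currently named t, and note that `like ... including
all` copies columns, defaults, constraints, and indexes but not foreign keys):

-- step 1: split the table and hide the split behind a view
begin;
alter table t rename to t2;
create table t1 (like t2 including all);
create view t as select * from t1 union all select * from t2;
commit;

-- step 3: once t2 is empty, point the view at t1 only
create or replace view t as select * from t1;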

Rob




Re: reclaiming space from heavily used tables?

2017-12-18 Thread Rob Nikander
I don’t see how I can use that extension. If some people are saying it’s 
dangerous and others are saying it’s fine, I don’t have the time to drill down 
into PG internals so that I can judge for myself. I’ll probably try the two 
table idea outlined in my original message.

> On Dec 18, 2017, at 5:24 PM, Jeremy Finzel  wrote:
> […]
>> On Dec 18, 2017, at 2:51 PM, Alvaro Herrera  wrote:
>> […]