Index/trigger implementation for accessing latest records

2018-05-02 Thread Alastair McKinley
Hi,


I have a table that stores a location identifier per person which will be 
appended to many times.

However, for many queries in this system we only need to know the most recent 
location per person, which is limited to about 1000 records.


Is the following trigger/index strategy a reasonable and safe approach to fast 
access to the latest location records per person?


  1.  A boolean column (latest_record default true) to identify the latest 
record per person
  2.  A before insert trigger that updates all other records for that person to 
latest_record = false
  3.  A partial index on the latest_record column where latest_record is true


Aside from performance, is it safe to update other records in the table from 
the insert trigger in this way?


Minimal example is shown below:


create table location_records
(
    id bigserial,
    person_id bigint,
    location_id bigint,
    latest_record boolean not null default true
);

create function latest_record_update() returns trigger as
$$
BEGIN
    update location_records
       set latest_record = false
     where person_id = new.person_id
       and latest_record is true
       and id != new.id;
    return new;
END;
$$ language plpgsql;

create trigger latest_record_trigger before insert on location_records
for each row execute procedure latest_record_update();

create index latest_record_index on location_records(latest_record)
    where latest_record is true;

insert into location_records(person_id,location_id) values (1,1);
insert into location_records(person_id,location_id) values (1,2);
insert into location_records(person_id,location_id) values (1,3);

insert into location_records(person_id,location_id) values (2,3);
insert into location_records(person_id,location_id) values (2,4);

select * from location_records;
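
The fast path the partial index is meant to serve would presumably be a
query along these lines:

select person_id, location_id
from location_records
where latest_record is true;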

Best regards,


Alastair


Re: Index/trigger implementation for accessing latest records

2018-05-02 Thread Gavin Flower

Hi Alastair,

See embedded comments.

On 02/05/18 21:51, Alastair McKinley wrote:


Hi,


I have a table that stores a location identifier per person which will 
be appended to many times.


However, for many queries in this system we only need to know the most 
recent location per person, which is limited to about 1000 records.



Is the following trigger/index strategy a reasonable and safe approach 
to fast access to the latest location records per person?



 1. A boolean column (latest_record default true) to identify the
latest record per person
 2. A before insert trigger that updates all other records for that
person to latest_record = false
 3. A partial index on the latest_record column where latest_record is
true

I suggest the simplest and fastest approach is to use timestamptz, a
timestamp with time zone (it copes with changes of daylight saving and
different time zones). Then you have no need of triggers.


Then all you need to do is search for the person_id with the maximum
value of the timestamptz!
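
A minimal sketch of that approach, reusing the example table but with a
recorded_at column instead of the boolean (the column and index names
here are invented):

create table location_records
(
    id bigserial,
    person_id bigint,
    location_id bigint,
    recorded_at timestamptz not null default now()
);

create index location_records_person_time_idx
    on location_records (person_id, recorded_at desc);

-- latest location per person, no trigger needed:
select distinct on (person_id) person_id, location_id
from location_records
order by person_id, recorded_at desc;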


[...]


Cheers,
Gavin



Re: Index/trigger implementation for accessing latest records

2018-05-02 Thread Alvaro Herrera
Alastair McKinley wrote:

> create function latest_record_update() returns trigger as
> $$
> BEGIN
> update location_records set latest_record = false where person_id = 
> new.person_id and latest_record is true and id != new.id;
> return new;
> END;
> $$ language plpgsql;
> 
> create trigger latest_record_trigger before insert on location_records
> for each row execute procedure latest_record_update();

For maximum safety, you should use BEFORE trigger only to modify the row
being inserted/updated (or to abort the operation); any change to other
rows should occur in an AFTER trigger.  One risk associated with failing
to do that is that some other BEFORE trigger further modifies the new
row, making your trigger-invoked UPDATE wrong.  Also, be sure to think
about possible infinite recursion.
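
Applied to the example upthread, that advice would look something like
this sketch (the UPDATE moves into an AFTER INSERT trigger):

create or replace function latest_record_update() returns trigger as
$$
BEGIN
    update location_records
       set latest_record = false
     where person_id = new.person_id
       and latest_record is true
       and id != new.id;
    return null;  -- the return value of an AFTER row trigger is ignored
END;
$$ language plpgsql;

-- after dropping the original BEFORE trigger:
create trigger latest_record_trigger after insert on location_records
for each row execute procedure latest_record_update();

As written, the trigger fires only on INSERT and issues only UPDATEs, so
it cannot recurse into itself.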

Another angle to keep in mind is what happens with insertions of
historical records, i.e. those that are not latest (today you think "ah,
but that never happens" and three months from now this is requested as a
feature.  Time to rethink the whole development ...)  You'd clobber the
latest_record flag without a replacement for it, which is probably
undesirable.
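
One hedged sketch of surviving historical inserts, assuming a
recorded_at timestamptz column is added (the column and function names
are invented): a BEFORE trigger that only adjusts the incoming row, per
the advice above, and leaves the flag alone on all other rows:

create or replace function mark_if_historical() returns trigger as
$$
BEGIN
    if exists (select 1 from location_records
                where person_id = new.person_id
                  and recorded_at > new.recorded_at) then
        new.latest_record := false;  -- a newer record exists; don't clobber it
    end if;
    return new;
END;
$$ language plpgsql;

The demotion of older rows (in the AFTER trigger) would then also have
to be skipped when new.latest_record is false.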

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN

2018-05-02 Thread Jim Michaels
What do you think about foreign data wrappers getting CSV file table I/O?
- I had thought that a CSVQL db could be implemented completely with a
small amount of memory and file I/O, a line at a time. EOL detection
would be needed; the EOL can be CR, LF, or CR+LF. Sometimes beginners get
it backwards (LF+CR), but that is still detectable, because it's an
immediate sequence detectable by a state machine or while loop. It should
be written as CR+LF for standards compliance (X2J4579 - that means go
look for it; I have not found it yet).
/* skip a run of EOL characters; fp stands for the FILE* elided as "..." */
do {
    if (1 != fread(&ch, 1, 1, fp)) {
        ch = 0;          /* EOF or read error */
        break;
    }
    if (ch != '\r' && ch != '\n') {
        break;           /* process non-EOL character */
    }
    /* at this point, it's an EOL character: keep consuming */
} while (true);
or
/* skip a run of tabs */
while (!feof(fp) && ch == '\t') {
    if (1 != fread(&ch, 1, 1, fp)) {   /* read 1 character */
        break;                         /* EOF or read error */
    }
}
This code could read in chunks using a buffer; the last (partial) chunk
would need to be handled as a special case, if it exists. A file can
easily run to 2,000,000 rows, so don't limit it. Try GCC's
fopen64/fsetpos, fgetpos/fpos_t/fclose; other vendors use plain fopen,
and that works with >= 64-bit file sizes, and it's fast. Avoid 32-bit
fseek/ftell.
- Indexing could be a file/RAM array of uint64_t-like file pointers,
convertible to fpos_t if needed.
- The biggest needed feature is an easier-to-use ALTER TABLE RENAME. A
memorable alternative/alias would be simply RENAME COLUMN columnName
TO newColumnName.

-- 
==
Jim Michaels 



Re: CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN

2018-05-02 Thread David G. Johnston
On the whole this email is very confusing/hard-to-follow...

On Wed, May 2, 2018 at 2:29 PM, Jim Michaels  wrote:

> what do you think about foreign data wrappers getting CSV file table I/O?
>

I don't understand the question...

> I had thought that CSVQL db could be implemented completely with
> small amount of memory and file I/O, line at a time
>

Do you no longer think that then?

I don't see PostgreSQL being all that open to implementing a second query
language beside SQL - and doing it functionally seems unlikely unless, like
with JSON, you are storing entire CSV files in a table field.


> - the microsoft patented CSV would be required for implementation. it
> handles special data with commas and double-quotes in them
>

If true this seems like a show-stopper to anything PostgreSQL would
implement.

> - biggest needed feature is an easier-to-use ALTER TABLE RENAME. a
> memorable alternative/alias would be simply RENAME COLUMN columnName
> TO newColumnName.
>

I don't see us adding new syntax for this...
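
For reference, the existing spelling is already close to the request:

ALTER TABLE my_table RENAME COLUMN columnname TO newcolumnname;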

David J.


Re: CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN

2018-05-02 Thread Ron

On 05/02/2018 04:49 PM, David G. Johnston wrote:
[snip]


- the microsoft patented CSV would be required for implementation. it
handles special data with commas and double-quotes in them


​If true this seems like a show-stopper to anything PostgreSQL would implement


If MSFT really holds a patent on the CSV format, then PostgreSQL is already 
in a world of hurt.


--
Angular momentum makes the world go 'round.


Re: CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN

2018-05-02 Thread John McKown
On Wed, May 2, 2018 at 4:29 PM, Jim Michaels  wrote:

> what do you think about foreign data wrappers getting CSV file table I/O?
> - I had thought that CSVQL db could be implemented completely with
> ​
>

​I don't know what you want to do with this. SQLite already supports it.
SQLite is embedded SQL database software. To a great extent, the SQL
that it understands is similar to what PostgreSQL understands. It is one of
the "standards" that the author uses. It implements "virtual tables" via
extensions. One of these is to process CSV files. ref:
http://sqlite.org/csv.html  The CSV it understands is RFC 4180 format (
https://www.ietf.org/rfc/rfc4180.txt). SQLite is open source and is _PUBLIC
DOMAIN_. That is, it has NO copyright at all. So you can do anything with
it that you want to. You might even be able to use the source to the SQLite
extension (
https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/misc/csv.c ) to
write the PostgreSQL foreign data wrapper.
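
A minimal sketch of that extension in use from the sqlite3 shell (the
file name is an assumption; columns get generic c0/c1-style names unless
a schema or header is supplied):

.load ./csv
CREATE VIRTUAL TABLE temp.t1 USING csv(filename='data.csv');
SELECT * FROM temp.t1;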

So, if you just need something so that you can read a CSV using SQL, then
you might consider using SQLite instead of, or in addition to, PostgreSQL.
Or, if you need to do subselects, unions, or other things containing both
PostgreSQL data & CSV data, then I guess you're stuck with the foreign data
wrapper.
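
On the PostgreSQL side, the bundled file_fdw module already exposes a
CSV file as a read-only foreign table; a minimal sketch (the path and
column list are assumptions):

CREATE EXTENSION file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE people (id int, name text)
    SERVER csv_files
    OPTIONS (filename '/tmp/people.csv', format 'csv', header 'true');

-- the foreign table can then be joined against ordinary tables:
SELECT * FROM people;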



>
> --
> ==
> Jim Michaels 
>
>


-- 
We all have skeletons in our closet.
Mine are so old, they have osteoporosis.

Maranatha! <><
John McKown


Re: Index/trigger implementation for accessing latest records

2018-05-02 Thread Tim Cross

Alastair McKinley  writes:

> Hi,
>
>
> I have a table that stores a location identifier per person which will be 
> appended to many times.
>
> However, for many queries in this system we only need to know the most recent 
> location per person, which is limited to about 1000 records.
>
>
> Is the following trigger/index strategy a reasonable and safe approach to 
> fast access to the latest location records per person?
>
>
>   1.  A boolean column (latest_record default true) to identify the latest 
> record per person
>   2.  A before insert trigger that updates all other records for that person 
> to latest_record = false
>   3.  A partial index on the latest_record column where latest_record is true
>
>
> Aside from performance, is it safe to update other records in the table from 
> the insert trigger in this way?
>
>
> Minimal example is shown below:
>
>
> create table location_records
>
> (
>
> id bigserial,
>
> person_id bigint,
>
> location_id bigint,
>
> latest_record boolean not null default true
>
> );
>
>
> create function latest_record_update() returns trigger as
>
> $$
>
> BEGIN
>
> update location_records set latest_record = false where person_id = 
> new.person_id and latest_record is true and id != new.id;
>
> return new;
>
> END;
>
> $$ language plpgsql;
>
>
> create trigger latest_record_trigger before insert on location_records
>
> for each row execute procedure latest_record_update();
>
>
> create index latest_record_index on location_records(latest_record) where 
> latest_record is true;
>
>
> insert into location_records(person_id,location_id) values (1,1);
>
> insert into location_records(person_id,location_id) values (1,2);
>
> insert into location_records(person_id,location_id) values (1,3);
>
>
> insert into location_records(person_id,location_id) values (2,3);
>
> insert into location_records(person_id,location_id) values (2,4);
>
>
> select * from location_records;
>

My personal bias will come out here 

I don't think using a trigger is a good solution here. Although very
powerful, the problem with triggers is that they are a 'hidden' side
effect which is easily overlooked and often adds an additional
maintenance burden which could be avoided using alternative approaches.

Consider a few months down the road when you're on holidays. One of your
colleagues is asked to add a new feature which involves inserting
records into this table. During testing, they observe an odd result - a
field changing which, according to the SQL they wrote, should not. The
simple new feature now takes twice as long to develop as your colleague
works out there is a trigger on the table. Worse yet, they don't notice,
put their changes into production, and then issues start getting raised
about communications going to the wrong location for customers, etc.

Triggers often become a lot more complicated than they initially
appear. In your example, what happens for updates as opposed to inserts?
What happens if the 'new' location is actually the same as a previously
recorded location, etc.?

In your case, I would try to make what you're doing more explicit and
avoid the trigger. There are a number of ways to do this, such as:

- A function to insert the record (see the sketch after this list). The
  function could check to see if that customer has any previous records
  and if so, set the boolean flag to false for all existing records and
  true for the new one. You might even want to break it up into two
  functions so that you have one which just sets the flag based on a
  unique key parameter - this would provide a way of resetting the
  current location without having to do an insert.

- Use a timestamp instead of a boolean and change your logic to select
  the current location by selecting the record with the latest
  timestamp.

- Keep the two actions as separate SQL - one to insert a record and one
  to set the current location. This has the advantage of making actions
  clear and easier to maintain and can be useful in domains where people
  move between locations (for example, I've done this for a University
where the data represented the student's current address, which would
  change in and out of semester periods, but often cycle between two
  addresses, their college and their parental home). The downside of
  this approach is that applications which insert this information must
  remember to execute both SQL statements. If you have multiple
  interfaces, this might become a maintenance burden (one of the
  advantages of using a DB function). 
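
A minimal sketch of the first option against the example upthread (the
function name is an invention):

create or replace function record_location(p_person_id bigint,
                                           p_location_id bigint)
returns void as
$$
BEGIN
    update location_records
       set latest_record = false
     where person_id = p_person_id
       and latest_record is true;

    insert into location_records(person_id, location_id)
    values (p_person_id, p_location_id);  -- latest_record defaults to true
END;
$$ language plpgsql;

-- usage: select record_location(1, 3);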


Tim


--
Tim Cross



Re: CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN

2018-05-02 Thread Adrian Klaver

On 05/02/2018 02:29 PM, Jim Michaels wrote:



- the microsoft patented CSV would be required for implementation. it
handles special data with commas and double-quotes in them


Huh?:
https://en.wikipedia.org/wiki/Comma-separated_values#History


--
Adrian Klaver
adrian.kla...@aklaver.com



Re: CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN

2018-05-02 Thread raf
Ron wrote:

> On 05/02/2018 04:49 PM, David G. Johnston wrote:
> [snip]
> > 
> > - the microsoft patented CSV would be required for implementation. it
> > handles special data with commas and double-quotes in them
> > 
> > 
> > If true this seems like a show-stopper to anything PostgreSQL would 
> > implement
> 
> If MSFT really holds a patent on the CSV format, then Postgresql is already
> in a world of hurt.

Even if the CSV format was patented, don't patents only last 17 years?
Has anyone found the patent? When was it granted? I would guess in the 1980s.
And can anyone remember MS ever demanding a fee for using it?