Index/trigger implementation for accessing latest records
Hi,

I have a table that stores a location identifier per person which will be
appended to many times. However, for many queries in this system we only need
to know the most recent location per person, which is limited to about 1000
records.

Is the following trigger/index strategy a reasonable and safe approach to
fast access to the latest location records per person?

1. A boolean column (latest_record default true) to identify the latest
   record per person
2. A before insert trigger that updates all other records for that person
   to latest_record = false
3. A partial index on the latest_record column where latest_record is true

Aside from performance, is it safe to update other records in the table from
the insert trigger in this way?

Minimal example is shown below:

create table location_records
(
    id bigserial,
    person_id bigint,
    location_id bigint,
    latest_record boolean not null default true
);

create function latest_record_update() returns trigger as
$$
BEGIN
    update location_records
       set latest_record = false
     where person_id = new.person_id
       and latest_record is true
       and id != new.id;
    return new;
END;
$$ language plpgsql;

create trigger latest_record_trigger before insert on location_records
for each row execute procedure latest_record_update();

create index latest_record_index on location_records(latest_record)
where latest_record is true;

insert into location_records(person_id,location_id) values (1,1);
insert into location_records(person_id,location_id) values (1,2);
insert into location_records(person_id,location_id) values (1,3);

insert into location_records(person_id,location_id) values (2,3);
insert into location_records(person_id,location_id) values (2,4);

select * from location_records;

Best regards,

Alastair
Re: Index/trigger implementation for accessing latest records
Hi Alastair,

See embedded comments.

On 02/05/18 21:51, Alastair McKinley wrote:
> Hi,
>
> I have a table that stores a location identifier per person which will be
> appended to many times. However, for many queries in this system we only
> need to know the most recent location per person, which is limited to
> about 1000 records.
>
> Is the following trigger/index strategy a reasonable and safe approach to
> fast access to the latest location records per person?
>
> 1. A boolean column (latest_record default true) to identify the latest
>    record per person
> 2. A before insert trigger that updates all other records for that person
>    to latest_record = false
> 3. A partial index on the latest_record column where latest_record is true

Suggest the simplest and fastest approach is to use timestamptz, a timestamp
with time zone (it copes with changes of daylight saving and different time
zones). Then you have no need of triggers: all you need to do is search for
the person_id with the maximum value of the timestamptz!

[...]

Cheers,
Gavin
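P.S. A minimal sketch of this, assuming a hypothetical recorded_at
timestamptz column in place of the latest_record flag:

    create index location_records_person_time_idx
        on location_records (person_id, recorded_at desc);

    -- one row per person: the most recent location
    select distinct on (person_id) person_id, location_id
    from location_records
    order by person_id, recorded_at desc;

The index makes "latest row for one person" a cheap index lookup, and the
DISTINCT ON form returns the latest row for every person in one query, with
no trigger maintenance needed on insert.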
Re: Index/trigger implementation for accessing latest records
Alastair McKinley wrote:

> create function latest_record_update() returns trigger as
> $$
> BEGIN
>     update location_records set latest_record = false where person_id =
>     new.person_id and latest_record is true and id != new.id;
>     return new;
> END;
> $$ language plpgsql;
>
> create trigger latest_record_trigger before insert on location_records
> for each row execute procedure latest_record_update();

For maximum safety, you should use a BEFORE trigger only to modify the row
being inserted/updated (or to abort the operation); any change to other rows
should occur in an AFTER trigger. One risk of failing to do that is that some
other BEFORE trigger further modifies the new row, making your
trigger-invoked UPDATE wrong. Also, be sure to think about possible infinite
recursion.

Another angle to keep in mind is what happens with insertions of historical
records, i.e. those that are not latest (today you think "ah, but that never
happens", and three months from now it is requested as a feature; time to
rethink the whole development...). You'd clobber the latest_record flag
without a replacement for it, which is probably undesirable.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
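A minimal sketch of the AFTER-trigger variant described above, applied to the
example from the original post. The function must return NULL here, since the
return value of a row-level AFTER trigger is ignored; and recursion is not a
danger in this particular case, because the trigger fires on INSERT but only
issues an UPDATE:

    create or replace function latest_record_update() returns trigger as
    $$
    BEGIN
        -- by AFTER INSERT time the new row is final, so no other BEFORE
        -- trigger can invalidate this UPDATE's assumptions
        update location_records
           set latest_record = false
         where person_id = new.person_id
           and latest_record is true
           and id != new.id;
        return null;
    END;
    $$ language plpgsql;

    create trigger latest_record_trigger after insert on location_records
    for each row execute procedure latest_record_update();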
CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN
what do you think about foreign data wrappers getting CSV file table I/O?

- I had thought that a CSVQL db could be implemented completely with a small
  amount of memory and file I/O, a line at a time. EOL detection would be
  needed; the EOL can be CR, LF, or CR+LF. Sometimes beginners get it
  backwards (LF+CR), but that's still detectable, because it's an immediate
  sequence detectable by a state machine or a while loop. It should be
  written as CR+LF because of standards compliance X2J4579 - that means go
  look for it (I have not found it yet).

      /* ch holds the current character; fp is the open CSV file.
         consume a run of EOL characters (CR, LF, CR+LF, even LF+CR),
         leaving ch holding the first character of the next line */
      while (!feof(fp) && (ch == '\r' || ch == '\n')) {
          if (1 != fread(&ch, 1, 1, fp)) {
              ch = 0;   /* read failed: treat as end of input */
              break;
          }
          if (ch != '\r' && ch != '\n')
              break;    /* first non-EOL character: process it */
      }

  or, for tab-separated fields:

      /* skip a run of tab separators the same way */
      while (!feof(fp) && ch == '\t') {
          if (1 != fread(&ch, 1, 1, fp))  /* read 1 character */
              break;
      }

  The input could be read in chunks using a buffer; the last chunk would then
  need to be handled as a special case. Files can easily run to 2,000,000
  rows, so don't limit the row count. Try GCC's
  fopen64/fsetpos/fgetpos/fpos_t/fclose; other vendors use plain fopen, which
  works with >=64-bit file sizes, and it's fast. Avoid 32-bit fseek/ftell.

- indexing could be a file or RAM array of uint64_t-like file pointers,
  convertible to fpos_t if needed.

- the biggest needed feature is an easier-to-use ALTER TABLE RENAME. a
  memorable alternative/alias would be simply RENAME COLUMN columnName TO
  newColumnName.

--
==
Jim Michaels
Re: CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN
On the whole this email is very confusing/hard-to-follow...

On Wed, May 2, 2018 at 2:29 PM, Jim Michaels wrote:

> what do you think about foreign data wrappers getting CSV file table I/O?

I don't understand the question...

> I had thought that CSVQL db could be implemented completely with
> small amount of memory and file I/O, line at a time

Do you no longer think that then? I don't see PostgreSQL being all that open
to implementing a second query language beside SQL - and doing it
functionally seems unlikely unless, like with JSON, you are storing entire
CSV files in a table field.

> - the microsoft patented CSV would be required for implementation. it
>   handles special data with commas and double-quotes in them

If true this seems like a show-stopper to anything PostgreSQL would
implement.

> - biggest needed feature is an easier-to-use ALTER TABLE RENAME. a
>   memorable alternative/alias would be simply RENAME COLUMN columnName
>   TO newColumnName.

I don't see us adding new syntax for this...

David J.
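For reference, the rename itself already has a fairly memorable spelling in
PostgreSQL's existing DDL; Jim's proposal amounts to an alias for it. For
example (table and column names here are hypothetical):

    alter table measurements rename column reading to sensor_reading;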
Re: CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN
On 05/02/2018 04:49 PM, David G. Johnston wrote:
[snip]
>> - the microsoft patented CSV would be required for implementation. it
>>   handles special data with commas and double-quotes in them
>
> If true this seems like a show-stopper to anything PostgreSQL would
> implement

If MSFT really holds a patent on the CSV format, then PostgreSQL is already
in a world of hurt.

--
Angular momentum makes the world go 'round.
Re: CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN
On Wed, May 2, 2018 at 4:29 PM, Jim Michaels wrote:

> what do you think about foreign data wrappers getting CSV file table I/O?
> - I had thought that CSVQL db could be implemented completely with

I don't know what you want to do with this. SQLite already supports it.

SQLite is embedded SQL database software. To a great extent, the SQL that it
understands is similar to what PostgreSQL understands; it is one of the
"standards" that the author uses. It implements "virtual tables" via
extensions, one of which processes CSV files. ref: http://sqlite.org/csv.html
The CSV it understands is RFC 4180 format
(https://www.ietf.org/rfc/rfc4180.txt).

SQLite is open source and is _PUBLIC DOMAIN_. That is, it has NO copyright at
all, so you can do anything with it that you want to. You might even be able
to use the source of the SQLite extension
(https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/misc/csv.c)
to write the PostgreSQL foreign data wrapper.

So, if you just need something so that you can read a CSV using SQL, then you
might consider using SQLite instead of, or in addition to, PostgreSQL. Or, if
you need to do subselects, unions, or other things combining both PostgreSQL
data & CSV data, then I guess you're stuck with the foreign data wrapper.

--
We all have skeletons in our closet.
Mine are so old, they have osteoporosis.

Maranatha! <><
John McKown
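If the foreign data wrapper route is what's wanted, note that PostgreSQL
already ships a contrib module, file_fdw, which exposes a server-side CSV
file as a read-only foreign table. A minimal sketch (the column list and file
path are hypothetical and must match the actual file):

    create extension file_fdw;

    create server csv_files foreign data wrapper file_fdw;

    -- column list must match the CSV layout
    create foreign table people (
        id   int,
        name text
    ) server csv_files
      options (filename '/path/to/people.csv', format 'csv', header 'true');

    select * from people where id = 42;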
Re: Index/trigger implementation for accessing latest records
Alastair McKinley writes:

> Hi,
>
> I have a table that stores a location identifier per person which will be
> appended to many times.
>
> However, for many queries in this system we only need to know the most
> recent location per person, which is limited to about 1000 records.
>
> Is the following trigger/index strategy a reasonable and safe approach to
> fast access to the latest location records per person?
>
> 1. A boolean column (latest_record default true) to identify the latest
>    record per person
> 2. A before insert trigger that updates all other records for that person
>    to latest_record = false
> 3. A partial index on the latest_record column where latest_record is true
>
> Aside from performance, is it safe to update other records in the table
> from the insert trigger in this way?
>
> Minimal example is shown below:
>
> create table location_records
> (
>     id bigserial,
>     person_id bigint,
>     location_id bigint,
>     latest_record boolean not null default true
> );
>
> create function latest_record_update() returns trigger as
> $$
> BEGIN
>     update location_records set latest_record = false where person_id =
>     new.person_id and latest_record is true and id != new.id;
>     return new;
> END;
> $$ language plpgsql;
>
> create trigger latest_record_trigger before insert on location_records
> for each row execute procedure latest_record_update();
>
> create index latest_record_index on location_records(latest_record)
> where latest_record is true;
>
> insert into location_records(person_id,location_id) values (1,1);
> insert into location_records(person_id,location_id) values (1,2);
> insert into location_records(person_id,location_id) values (1,3);
>
> insert into location_records(person_id,location_id) values (2,3);
> insert into location_records(person_id,location_id) values (2,4);
>
> select * from location_records;

My personal bias will come out here: I don't think using a trigger is a good
solution here. Although very powerful, the problem with triggers is that they
are a 'hidden' side effect which is easily overlooked and often adds an
additional maintenance burden which could be avoided using alternative
approaches.

Consider a few months down the road, when you're on holiday. One of your
colleagues is asked to add a new feature which involves inserting records
into this table. During testing, they observe an odd result - a field
changing which, according to the SQL they wrote, should not. The simple new
feature now takes twice as long to develop as your colleague works out that
there is a trigger on the table. Worse yet, they don't notice, put their
changes into production, and then issues start getting raised about
communications going to the wrong location for customers etc.

Triggers often become a lot more complicated than they initially appear. In
your example, what happens for updates as opposed to inserts? What happens if
the 'new' location is actually the same as a previously recorded location,
etc.?

In your case, I would try to make what you're doing more explicit and avoid
the trigger. There are a number of ways to do this, such as:

- A function to insert the record. The function could check whether that
  customer has any previous records and, if so, set the boolean flag to
  false for all existing records and true for the new one (a sketch follows
  below). You might even want to break it up into two functions, so that you
  have one which just sets the flag based on a unique key parameter - this
  would provide a way of resetting the current location without having to do
  an insert.

- Use a timestamp instead of a boolean and change your logic to select the
  current location by selecting the record with the latest timestamp.

- Keep the two actions as separate SQL - one to insert a record and one to
  set the current location. This has the advantage of making actions clear
  and easier to maintain, and can be useful in domains where people move
  between locations (for example, I've done this for a university where the
  data represented the student's current address, which would change in and
  out of semester periods, but often cycle between two addresses: their
  college and their parental home). The downside of this approach is that
  applications which insert this information must remember to execute both
  SQL statements. If you have multiple interfaces, this might become a
  maintenance burden (one of the advantages of using a DB function).

Tim

--
Tim Cross
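A minimal sketch of the function-based approach from Tim's first suggestion,
using the table from the original post and assuming the before-insert trigger
has been dropped (the function name is hypothetical):

    create function set_current_location(p_person_id bigint,
                                         p_location_id bigint)
    returns void as
    $$
    begin
        -- explicitly clear the flag on the person's existing records
        update location_records
           set latest_record = false
         where person_id = p_person_id
           and latest_record is true;

        -- insert the new current location; the column default sets the flag
        insert into location_records(person_id, location_id)
        values (p_person_id, p_location_id);
    end;
    $$ language plpgsql;

Applications then call "select set_current_location(1, 5);" instead of
issuing a raw insert, which keeps the flag maintenance explicit and in one
place.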
Re: CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN
On 05/02/2018 02:29 PM, Jim Michaels wrote:

> - the microsoft patented CSV would be required for implementation. it
>   handles special data with commas and double-quotes in them

Huh?:

https://en.wikipedia.org/wiki/Comma-separated_values#History

--
Adrian Klaver
adrian.kla...@aklaver.com
Re: CSVQL? CSV SQL? tab-separated table I/O? RENAME COLUMN
Ron wrote:
> On 05/02/2018 04:49 PM, David G. Johnston wrote:
> [snip]
>>> - the microsoft patented CSV would be required for implementation. it
>>>   handles special data with commas and double-quotes in them
>>
>> If true this seems like a show-stopper to anything PostgreSQL would
>> implement
>
> If MSFT really holds a patent on the CSV format, then PostgreSQL is
> already in a world of hurt.

Even if the CSV format were patented, don't patents only last 17 years? Has
anyone found the patent? When was it granted? I would guess in the 1980s. And
can anyone remember MS ever demanding a fee for using it?