Re: Faster str to int conversion (was Table with large number of int columns, very slow COPY FROM)

2018-07-19 Thread Andres Freund
Hi,

On 2018-07-18 14:34:34 -0400, Robert Haas wrote:
> On Sat, Jul 7, 2018 at 4:01 PM, Andres Freund  wrote:
> > FWIW, here's a rebased version of this patch. Could probably be polished
> > further. One might argue that we should do a bit more wide ranging
> > changes, to convert scanint8 and pg_atoi to be also unified. But it
> > might also just be worthwhile to apply without those, given the
> > performance benefit.
> 
> Wouldn't hurt to do that one too, but might be OK to just do this
> much.  Questions:
> 
> 1. Why the error message changes?  If there's a good reason, it should
> be done as a separate commit, or at least well-documented in the
> commit message.

Because there are a lot of "invalid input syntax for type %s: \"%s\""
error messages already, and we shouldn't force translators to maintain a
separate version that inlines the first %s.  But you're right, it'd be
worthwhile to point that out in the commit message.
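
For illustration, the two styles being contrasted look roughly like this
(a sketch, not code from the patch; "str" is just a placeholder for the
rejected input):

/* Keeping the type name as a parameter means one translatable string
 * serves int2, int4, int8, and friends: */
ereport(ERROR,
        (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
         errmsg("invalid input syntax for type %s: \"%s\"",
                "integer", str)));

/* ... whereas hard-coding the type name creates another message that
 * translators would have to maintain separately: */
ereport(ERROR,
        (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
         errmsg("invalid input syntax for type integer: \"%s\"",
                str)));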


> 2. Does the likely/unlikely stuff make a noticeable difference?

Yes. It's also largely a copy from existing code (scanint8), so I don't
really want to differ here.
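
(For reference, the hint macros are roughly what c.h provides when the
compiler supports __builtin_expect; they mark the error/overflow
branches as cold so the hot digit loop stays straight-line:)

/* Roughly the definitions in c.h; on compilers without
 * __builtin_expect they collapse to plain boolean tests. */
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)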


> 3. If this is a drop-in replacement for pg_atoi, why not just recode
> pg_atoi this way -- or have it call this -- and leave the callers
> unchanged?

Because pg_atoi supports a variable 'terminator'. Supporting that would
make the code a bit slower, without being particularly useful.  I think
there's only a single in-core caller left after the patch
(int2vectorin). There's a fair argument that that should just be
open-coded to handle the weird space parsing, but given there are
probably external pg_atoi() callers, I'm not sure it's worth doing so?

I don't think it's a good idea to continue to have pg_atoi as a wrapper
- it takes a size argument, which makes efficient code hard.
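
To make the shape of the approach concrete, here is a minimal
standalone sketch (not the patch itself): the result type is a fixed
int32, so no size argument is needed, and the only accepted terminator
is the trailing NUL. It uses the GCC/Clang overflow builtins directly
instead of the backend's pg_mul_s32_overflow() wrappers, and the
function name is made up.

#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

#define unlikely(x) __builtin_expect((x) != 0, 0)

static bool
strtoint32_sketch(const char *s, int32_t *result)
{
    const char *ptr = s;
    int32_t     tmp = 0;
    bool        neg = false;

    /* skip leading whitespace */
    while (isspace((unsigned char) *ptr))
        ptr++;

    /* optional sign */
    if (*ptr == '-')
    {
        neg = true;
        ptr++;
    }
    else if (*ptr == '+')
        ptr++;

    if (unlikely(!isdigit((unsigned char) *ptr)))
        return false;           /* no digits at all */

    /* accumulate as a negative number so INT32_MIN round-trips */
    while (isdigit((unsigned char) *ptr))
    {
        int32_t digit = *ptr++ - '0';

        if (unlikely(__builtin_mul_overflow(tmp, 10, &tmp)) ||
            unlikely(__builtin_sub_overflow(tmp, digit, &tmp)))
            return false;       /* out of range */
    }

    /* allow trailing whitespace, then require NUL: fixed terminator */
    while (isspace((unsigned char) *ptr))
        ptr++;
    if (unlikely(*ptr != '\0'))
        return false;

    if (!neg)
    {
        if (unlikely(tmp == INT32_MIN))
            return false;       /* -tmp would overflow */
        tmp = -tmp;
    }
    *result = tmp;
    return true;
}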


> 4. Are we sure this is faster on all platforms, or could it work out
> the other way on, say, BSD?

I'd be *VERY* surprised if any would be faster. It's not easy to write
a faster implementation than what I've proposed, and especially not so
if you use strtol() as the API (variable bases, a bit of locale
support).
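
For comparison, this is roughly what a strtol()-based int4 parser ends
up doing (plain standard C, nothing PostgreSQL-specific): pass a base it
never varies, go through the endptr/errno protocol, skip leading
whitespace in a locale-dependent way, and range-check separately because
long is 64 bits on most platforms; all of it work a dedicated base-10
int32 parser simply doesn't have.

#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

static bool
strtol_int32(const char *s, int *result)
{
    char   *end;
    long    val;

    errno = 0;
    val = strtol(s, &end, 10);      /* base argument, always 10 here */
    if (errno == ERANGE || end == s || *end != '\0')
        return false;               /* overflow, no digits, or junk */
    if (val < INT_MIN || val > INT_MAX)
        return false;               /* long may be wider than int */
    *result = (int) val;
    return true;
}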

Greetings,

Andres Freund



Re: Why HDD performance is better than SSD in this case

2018-07-19 Thread Mark Kirkwood

One more thought on this:

Query 9 does a lot of sorting to disk - so there will be writes for
that on top of all the reads for the table scans. Thus the location of
your instance's pgsql_tmp directory(ies) will significantly influence
the results.


I'm wondering if, in your HDD test, the pgsql_tmp on the *SSDs* is
being used. This would make the HDDs look faster (obviously - as they
only need to do reads now). You can check this with iostat while the
HDD test is running: there should be *no* activity on the SSDs...if
there is, you have just found one reason why the results are quicker
than they should be.
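
You can also make the server itself tell you where its temp files are
going. A quick check, sketched below (log_temp_files needs superuser to
SET):

-- Log every temporary file a backend creates (0 = no size threshold);
-- each log line includes the temp file's path, which shows whether it
-- lives under the HDD or the SSD data directory.
SET log_temp_files = 0;

-- Where base/pgsql_tmp sits by default:
SHOW data_directory;

-- Running query 9 under EXPLAIN ANALYZE also reports the spill, e.g.
-- "Sort Method: external merge  Disk: ...kB".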


FWIW, I had a play with this: I ran two version 10.4 instances, one on
a single 7200 rpm HDD, one on a (ahem, slow) Intel 600p NVMe. Running
query 9 on the scale 40 databases I get:


- SSD 30 minutes

- HDD 70 minutes

Note that I'm running these on an Intel i7 3.4 GHz, 16 GB RAM setup.
Also, both postgres instances have the default config apart from
random_page_cost.
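
(An easy way to verify that on both instances is to list everything
that differs from the built-in defaults; ideally only random_page_cost
shows up:)

-- Settings changed from their built-in defaults, whatever the source
-- (postgresql.conf, ALTER SYSTEM, command line, ...).
SELECT name, setting, source
FROM pg_settings
WHERE source NOT IN ('default', 'override');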


Comparing my results with yours - the SSD one is consistent...if I had
two SSDs in RAID0 I might halve the time (I might try this). However,
my HDD result is not at all like yours (mine makes more sense, to be
fair...I would expect the HDD to be slower in general).


Cheers (thanks for an interesting puzzle)!

Mark



On 18/07/18 13:13, Neto pr wrote:


Dear Mark,
To ensure that the test is honest and that the OS and the DBMS have the
same configuration, my OS is installed on the SSD and so is the DBMS. I
have only one instance of the DBMS and two databases:
- a database called tpch40gnorhdd with its tablespace on the HDD disk.
- a database called tpch40gnorssd with its tablespace on the SSD disk.
See below:







Re: Why HDD performance is better than SSD in this case

2018-07-19 Thread Mark Kirkwood

FWIW:

Re-running query 9 using the SSD setup as 2x Crucial M550 RAID0: 10
minutes.






Re: Why HDD performance is better than SSD in this case

2018-07-19 Thread Mark Kirkwood

And perhaps more interesting:

Re-running query 9 against the (single) HDD setup *but* with pgsql_tmp
symlinked to the 2x SSD RAID0: 15 minutes.


I'm thinking that you have inadvertently configured your HDD test in 
this way (you get 9 minutes because you have 2x HDDs). Essentially most 
of the time taken for this query is in writing and reading files for 
sorting/hashing, so where pgsql_tmp is located hugely influences the 
overall time.
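
For anyone reproducing this without hand-symlinking directories, the
same relocation can be done with a temp tablespace. A sketch (the
tablespace name and path are placeholders; the directory must already
exist and be owned by the postgres user):

-- Put temporary files (sort/hash spill) on the SSD array.
CREATE TABLESPACE ssd_temp LOCATION '/mnt/ssd_raid0/pg_temp';

-- Either per session ...
SET temp_tablespaces = 'ssd_temp';

-- ... or for the whole cluster:
ALTER SYSTEM SET temp_tablespaces = 'ssd_temp';
SELECT pg_reload_conf();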


regards

Mark

