Re: Suggestion about logging only every n-th statement

2018-06-24 Thread Adrien Nayrat
On 06/20/2018 03:06 PM, Janning Vygen wrote:
>> FYI I made this patch, which seems to do what you want:
>> https://www.postgresql.org/message-id/flat/c30ee535-ee1e-db9f-fa97-146b9f62caed%40anayrat.info#c30ee535-ee1e-db9f-fa97-146b9f62c...@anayrat.info
>>
>>
>> I will add an entry in september's commit fest.
> 
> excellent. Exactly what I want!

FYI, I added an entry in the next commit fest (the September one is not open yet):
https://commitfest.postgresql.org/18/1691/

Feel free to review it ;)

Regards,

-- 
Adrien NAYRAT





Using COPY to import large xml file

2018-06-24 Thread Anto Aravinth
Hello Everyone,

I have downloaded the Stack Overflow posts XML (it contains all SO questions
to date). The file is around 70GB. I want to import the data from that XML
into a table. Is there a way to do so in Postgres?


Thanks,
Anto.


Re: Using COPY to import large xml file

2018-06-24 Thread Adrien Nayrat
On 06/24/2018 05:25 PM, Anto Aravinth wrote:
> Hello Everyone,
> 
> I have downloaded the Stackoverflow posts xml (contains all SO questions till
> date).. the file is around 70GB.. I wanna import the data in those xml to my
> table.. is there a way to do so in postgres?
> 
> 
> Thanks, 
> Anto.

Hello Anto,

I used this tool:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres

Regards,

-- 
Adrien NAYRAT
https://blog.anayrat.info





Re: Using COPY to import large xml file

2018-06-24 Thread Adrien Nayrat
On 06/24/2018 06:07 PM, Anto Aravinth wrote:
> Thanks for the response. I'm not sure, how long does this tool takes for the
> 70GB data. 

From memory, it took several hours. I can't remember whether the XML conversion
or the inserts took longer.

> 
> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took 
> around
> 2 hrs using nodejs. Hence, I thought will use COPY command, as suggested on 
> the
> internet. 
> 
> Definitely, will try the code and let you know.. But looks like it uses the 
> same
> INSERT, not copy.. interesting if it runs quick on my machine. 

Yes, it uses INSERT. It probably would not be difficult to change the code to
use COPY instead.

-- 
Adrien NAYRAT
https://blog.anayrat.info





Re: Using COPY to import large xml file

2018-06-24 Thread Anto Aravinth
Thanks for the response. I'm not sure how long this tool takes for the 70GB
of data.

I used node to stream the XML file into inserts, which was very slow.
The XML actually contains 40 million records, of which 10 million took
around 2 hours using nodejs. Hence, I thought I would use the COPY command,
as suggested on the internet.

I will definitely try the code and let you know. But it looks like it uses the
same INSERT approach, not COPY; it will be interesting to see whether it runs
quickly on my machine.

On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat 
wrote:

> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
> > Hello Everyone,
> >
> > I have downloaded the Stackoverflow posts xml (contains all SO questions
> till
> > date).. the file is around 70GB.. I wanna import the data in those xml
> to my
> > table.. is there a way to do so in postgres?
> >
> >
> > Thanks,
> > Anto.
>
> Hello Anto,
>
> I used this tool :
> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>
> Regards,
>
> --
> Adrien NAYRAT
> https://blog.anayrat.info
>
>


Re: Using COPY to import large xml file

2018-06-24 Thread Adrian Klaver

On 06/24/2018 08:25 AM, Anto Aravinth wrote:

> Hello Everyone,
>
> I have downloaded the Stackoverflow posts xml (contains all SO questions
> till date).. the file is around 70GB.. I wanna import the data in those
> xml to my table.. is there a way to do so in postgres?


It is going to require some work. You will need to deal with:

1) The row schema inside the XML is here:

https://ia800107.us.archive.org/27/items/stackexchange/readme.txt

- **posts**.xml

2) The rows are inside a <posts> tag.

Seems to me you have two options:

1) Drop each row into a single XML field and deal with extracting the 
row components in the database.


2) Break down the row into column components before entering them into 
the database.


Adrien has pointed you at a Python program that covers the above:

https://github.com/Networks-Learning/stackexchange-dump-to-postgres

If you are comfortable in Python you can take a look at:

https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/row_processor.py

to see how the rows are broken down into elements.
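
If you just want the general idea without reading that file, here is a minimal
sketch of the streaming side (my own illustration, not code from that
repository), assuming the dump keeps its <row .../> elements under a <posts>
root:

import xml.etree.ElementTree as ET

def stream_rows(path):
    """Yield the attribute dict of each <row> element without loading the whole file."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)              # grab the <posts> root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)      # e.g. Id, PostTypeId, Body, ...
            root.clear()                 # drop finished rows so memory stays flat

# peek at the first few rows of a (smaller) dump
for i, row in enumerate(stream_rows("Posts.xml")):
    print(row.get("Id"), row.get("PostTypeId"))
    if i >= 4:
        break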

I would try this out first on one of the smaller datasets found here:

https://archive.org/details/stackexchange

I personally took a look at:

https://archive.org/download/stackexchange/beer.stackexchange.com.7z

because why not?




> Thanks,
> Anto.



--
Adrian Klaver
adrian.kla...@aklaver.com



Re: Using COPY to import large xml file

2018-06-24 Thread Christoph Moench-Tegeder
## Adrien Nayrat (adrien.nay...@anayrat.info):

> I used this tool :
> https://github.com/Networks-Learning/stackexchange-dump-to-postgres

That will be awfully slow: this tool commits each INSERT on its own,
see the loop in
https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/load_into_pg.py#L83

With only small changes - prepare the INSERT, execute it for all (or at
least a huge lot of) rows, COMMIT at the end - you can save quite a lot
of time (500 rows is not "a huge lot"). And when you do that, for
heaven's sake, do not try to build the INSERT statement as a string
with the values pasted in - Bobby Tables will eat you. See the psycopg
documentation on how it's done (especially watch the warnings):
http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries
On prepared statements with psycopg2, see
http://initd.org/psycopg/articles/2012/10/01/prepared-statements-psycopg/
(Python makes this rather hard, but I'll spare you the snark and wait until
the language has matured to at least version 5).

Using the COPY protocol with psycopg2 seems to require some hoop-jumping,
but could improve matters even more.
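
To make that concrete, a rough sketch of the batched, parameterized version
with psycopg2 (the table and column names are invented for illustration; this
is not the tool's actual code):

import psycopg2
from psycopg2.extras import execute_values

def insert_rows(rows, batch_size=10000):
    """Insert an iterable of (id, post_type_id, body) tuples in large batches."""
    conn = psycopg2.connect("dbname=so")   # connection string is a placeholder
    try:
        with conn.cursor() as cur:
            batch = []
            for row in rows:
                batch.append(row)
                if len(batch) >= batch_size:
                    # values travel as parameters, never pasted into the SQL string
                    execute_values(
                        cur,
                        "INSERT INTO posts (id, post_type_id, body) VALUES %s",
                        batch)
                    batch = []
            if batch:
                execute_values(
                    cur,
                    "INSERT INTO posts (id, post_type_id, body) VALUES %s",
                    batch)
        conn.commit()   # a single COMMIT at the end, not one per row
    finally:
        conn.close()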

Regards,
Christoph

-- 
Spare Space.



Re: Using COPY to import large xml file

2018-06-24 Thread Tim Cross


Anto Aravinth  writes:

> Thanks for the response. I'm not sure, how long does this tool takes for
> the 70GB data.
>
> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took
> around 2 hrs using nodejs. Hence, I thought will use COPY command, as
> suggested on the internet.
>
> Definitely, will try the code and let you know.. But looks like it uses the
> same INSERT, not copy.. interesting if it runs quick on my machine.
>
> On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat 
> wrote:
>
>> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>> > Hello Everyone,
>> >
>> > I have downloaded the Stackoverflow posts xml (contains all SO questions
>> till
>> > date).. the file is around 70GB.. I wanna import the data in those xml
>> to my
>> > table.. is there a way to do so in postgres?
>> >
>> >
>> > Thanks,
>> > Anto.
>>
>> Hello Anto,
>>
>> I used this tool :
>> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>>

If you are using nodejs, then you can easily use the pg-copy-streams
module to insert the records into your database. I've been using this
for inserting large numbers of records from NetCDF files. It takes between
40 and 50 minutes to insert 60 million+ records, and we are doing
additional calculations on the values, not just inserting them; we are
also inserting over the network, into a database that is performing
other processing at the same time.

We found a significant speed improvement with COPY over blocks of insert
transactions, which in turn was faster than individual inserts. The only
downside with using COPY is that it either completely works or completely
fails, and when it fails, it can be tricky to work out which record caused
the failure. A benefit of using blocks of transactions is that you have
more fine-grained control, allowing you to recover from some errors or to
get more specific detail about the cause of an error.

Be wary of what indexes you're defining on your table. Depending on their
type and number, these can have a significant impact on insert times as
well.


-- 
Tim Cross



Re: Using COPY to import large xml file

2018-06-24 Thread Anto Aravinth
On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross  wrote:

>
> Anto Aravinth  writes:
>
> > Thanks for the response. I'm not sure, how long does this tool takes for
> > the 70GB data.
> >
> > I used node to stream the xml files into inserts.. which was very slow..
> > Actually the xml contains 40 million records, out of which 10Million took
> > around 2 hrs using nodejs. Hence, I thought will use COPY command, as
> > suggested on the internet.
> >
> > Definitely, will try the code and let you know.. But looks like it uses
> the
> > same INSERT, not copy.. interesting if it runs quick on my machine.
> >
> > On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <
> adrien.nay...@anayrat.info>
> > wrote:
> >
> >> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
> >> > Hello Everyone,
> >> >
> >> > I have downloaded the Stackoverflow posts xml (contains all SO
> questions
> >> till
> >> > date).. the file is around 70GB.. I wanna import the data in those xml
> >> to my
> >> > table.. is there a way to do so in postgres?
> >> >
> >> >
> >> > Thanks,
> >> > Anto.
> >>
> >> Hello Anto,
> >>
> >> I used this tool :
> >> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
> >>
>
> If you are using nodejs, then you can easily use the pg-copy-streams
> module to insert the records into your database. I've been using this
> for inserting large numbers of records from NetCDF files. Takes between
> 40 to 50 minutes to insert 60 Million+ records and we are doing
> additional calculations on the values, not just inserting them,
> plus we are inserting into a database over the network and into a database
> which is
> also performing other processing.
>
> We found a significant speed improvement with COPY over blocks of insert
> transactions, which was faster than just individual inserts. The only
> downside with using COPY is that it either completely works or
> completely fails and when it fails, it can be tricky to work out which
> record is causing the failure. A benefit of using blocks of transactions
> is that you have more fine grained control, allowing you to recover from
> some errors or providing more specific detail regarding the cause of the
> error.
>

Sure, let me try that. I have a question here: COPY usually works when you
move data from files into your Postgres instance, right? Now, processing the
whole file in node.js, can I use COPY programmatically, something like
COPY Stackoverflow ... ?
Because from the doc:

https://www.postgresql.org/docs/9.2/static/sql-copy.html

I don't see that it's possible. Maybe I need to convert the file into
something COPY understands first?

Anto.

>
> Be wary of what indexes your defining on your table. Depending on the
> type and number, these can have significant impact on insert times as
> well.
>
>
> --
> Tim Cross
>


Re: Using COPY to import large xml file

2018-06-24 Thread Tim Cross
On Mon, 25 Jun 2018 at 11:38, Anto Aravinth 
wrote:

>
>
> On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross  wrote:
>
>>
>> Anto Aravinth  writes:
>>
>> > Thanks for the response. I'm not sure, how long does this tool takes for
>> > the 70GB data.
>> >
>> > I used node to stream the xml files into inserts.. which was very slow..
>> > Actually the xml contains 40 million records, out of which 10Million
>> took
>> > around 2 hrs using nodejs. Hence, I thought will use COPY command, as
>> > suggested on the internet.
>> >
>> > Definitely, will try the code and let you know.. But looks like it uses
>> the
>> > same INSERT, not copy.. interesting if it runs quick on my machine.
>> >
>> > On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <
>> adrien.nay...@anayrat.info>
>> > wrote:
>> >
>> >> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>> >> > Hello Everyone,
>> >> >
>> >> > I have downloaded the Stackoverflow posts xml (contains all SO
>> questions
>> >> till
>> >> > date).. the file is around 70GB.. I wanna import the data in those
>> xml
>> >> to my
>> >> > table.. is there a way to do so in postgres?
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Anto.
>> >>
>> >> Hello Anto,
>> >>
>> >> I used this tool :
>> >> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>> >>
>>
>> If you are using nodejs, then you can easily use the pg-copy-streams
>> module to insert the records into your database. I've been using this
>> for inserting large numbers of records from NetCDF files. Takes between
>> 40 to 50 minutes to insert 60 Million+ records and we are doing
>> additional calculations on the values, not just inserting them,
>> plus we are inserting into a database over the network and into a
>> database which is
>> also performing other processing.
>>
>> We found a significant speed improvement with COPY over blocks of insert
>> transactions, which was faster than just individual inserts. The only
>> downside with using COPY is that it either completely works or
>> completely fails and when it fails, it can be tricky to work out which
>> record is causing the failure. A benefit of using blocks of transactions
>> is that you have more fine grained control, allowing you to recover from
>> some errors or providing more specific detail regarding the cause of the
>> error.
>>
>
> Sure, let me try that.. I have a question here, COPY usually works when
> you move data from files to your postgres instance, right? Now in node.js,
> processing the whole file, can I use COPY
> programmatically like COPY Stackoverflow ?
> Because from doc:
>
> https://www.postgresql.org/docs/9.2/static/sql-copy.html
>
> I don't see its possible. May be I need to convert the files to copy
> understandable first?
>
> Anto.
>
>>
>>
>
Yes. Essentially what you do is create a stream and feed whatever
information you want to copy into that stream. PG sees the data as if it
were reading each line from a file, so you push data onto the stream where
each item is separated by a tab (or whatever delimiter you choose). Here is
the basic low-level function I use (I don't know how the formatting will
survive!)

// Dependencies assumed by this snippet (not shown in the original post):
const assert = require("assert");
const { Readable } = require("stream");
const { from: copyFrom } = require("pg-copy-streams");
const VError = require("verror");
// getClient()/releaseClient() are our own connection-pool helpers and
// moduleName is a module-level constant used in log messages.

async function copyInsert(sql, stringifyFN, records) {
  const logName = `${moduleName}.copyInsert`;
  var client;

  assert.ok(Array.isArray(records), "The records arg must be an array");
  assert.ok(typeof(stringifyFN) === "function",
            "The stringifyFN arg must be a function");

  return getClient()
    .then(c => {
      client = c;
      return new Promise(function(resolve, reject) {
        var stream, rs;
        var idx = 0;

        function done() {
          releaseClient(client);
          client = undefined;
          resolve(idx + 1);
        }

        function onError(err) {
          if (client !== undefined) {
            releaseClient(client);
          }
          reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
        }

        // Readable._read implementation: push one stringified record per call,
        // then signal end-of-stream with null once all records are consumed.
        function arrayRead() {
          if (idx === records.length) {
            rs.push(null);
          } else {
            let rec = records[idx];
            rs.push(stringifyFN(rec));
            idx += 1;
          }
        }

        rs = new Readable();
        rs._read = arrayRead;
        rs.on("error", onError);
        // client.query(copyFrom(sql)) gives a writable COPY FROM STDIN stream
        stream = client.query(copyFrom(sql));
        stream.on("error", onError);
        stream.on("end", done);
        rs.pipe(stream);
      });
    })
    .catch(err => {
      throw new VError(err, `${logName} Failed COPY insert`);
    });
}

and I will call it like

copyInsert(sql, stringifyClimateRecord, records)

where the sql and stringifyClimateRecord arguments are

  const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
    + "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
    + "vprp_09,vprp_15,wind_speed) FROM STDIN";

  function stringifyClimateRecord(rec) {
    return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
      + `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
  }

The stringifyClimateRecord function returns the record as a single
tab-separated line, terminated by a newline, matching the column list in the
COPY statement.

Re: Using COPY to import large xml file

2018-06-24 Thread Christoph Moench-Tegeder
## Anto Aravinth (anto.aravinth@gmail.com):

> Sure, let me try that.. I have a question here, COPY usually works when you
> move data from files to your postgres instance, right? Now in node.js,
> processing the whole file, can I use COPY
> programmatically like COPY Stackoverflow ?
> Because from doc:
> 
> https://www.postgresql.org/docs/9.2/static/sql-copy.html
> 
> I don't see its possible. May be I need to convert the files to copy
> understandable first?

"COPY ... FROM STDIN"
 STDIN Specifies that input comes from the client application.

It's on the page...
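
For the record, a minimal psycopg2 sketch of feeding COPY from the client
(table and columns invented for illustration; real data would also need the
COPY text-format escaping of tabs, newlines and backslashes):

import io
import psycopg2

def copy_rows(rows):
    """Stream already-parsed rows to the server via COPY ... FROM STDIN."""
    # For a 70GB dump you would wrap your parser in a file-like reader instead
    # of building the whole buffer in memory, but the COPY call is the same.
    buf = io.StringIO()
    for r in rows:
        buf.write("\t".join(str(c) for c in r) + "\n")
    buf.seek(0)

    conn = psycopg2.connect("dbname=so")
    try:
        with conn.cursor() as cur:
            cur.copy_expert("COPY posts (id, post_type_id, body) FROM STDIN", buf)
        conn.commit()
    finally:
        conn.close()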

Regards,
Christoph

-- 
Spare Space.