Re: Suggestion about logging only every n-th statement
On 06/20/2018 03:06 PM, Janning Vygen wrote:
>> FYI I made this patch, which seems to do what you want:
>> https://www.postgresql.org/message-id/flat/c30ee535-ee1e-db9f-fa97-146b9f62caed%40anayrat.info#c30ee535-ee1e-db9f-fa97-146b9f62c...@anayrat.info
>>
>> I will add an entry in September's commit fest.
>
> Excellent. Exactly what I want!

FYI, I added an entry in the next commit fest (September's is not open yet):
https://commitfest.postgresql.org/18/1691/

Feel free to review it ;)

Regards,

--
Adrien NAYRAT
Using COPY to import large xml file
Hello Everyone,

I have downloaded the Stack Overflow posts XML (contains all SO questions to
date). The file is around 70GB. I want to import the data from that XML into
my table. Is there a way to do so in Postgres?

Thanks,
Anto.
Re: Using COPY to import large xml file
On 06/24/2018 05:25 PM, Anto Aravinth wrote:
> Hello Everyone,
>
> I have downloaded the Stack Overflow posts XML (contains all SO questions to
> date). The file is around 70GB. I want to import the data from that XML into
> my table. Is there a way to do so in Postgres?
>
> Thanks,
> Anto.

Hello Anto,

I used this tool:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info
Re: Using COPY to import large xml file
On 06/24/2018 06:07 PM, Anto Aravinth wrote:
> Thanks for the response. I'm not sure how long this tool takes for the 70GB
> of data.

In my memory it took several hours. I can't remember whether the XML
conversion or the inserts took longer.

> I used Node to stream the XML files into INSERTs, which was very slow. The
> XML actually contains 40 million records, of which 10 million took around
> 2 hours using Node.js. Hence I thought I would use the COPY command, as
> suggested on the internet.
>
> Definitely will try the code and let you know. But it looks like it uses the
> same INSERT, not COPY. Interesting to see whether it runs quickly on my
> machine.

Yes, it uses INSERT; it should not be difficult to change the code to use COPY
instead.

--
Adrien NAYRAT
https://blog.anayrat.info
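To make that last point concrete, here is a minimal psycopg2 sketch of swapping
per-row INSERTs for COPY. This is not the tool's actual code: the posts table,
its columns, and the connection string are assumptions for illustration only.

import csv
import io

import psycopg2


def copy_rows(conn, rows):
    # Write the already-parsed rows into an in-memory CSV buffer and hand
    # the whole batch to COPY in one round trip.
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY posts (id, title, body) FROM STDIN WITH (FORMAT csv)",
            buf,
        )
    conn.commit()


conn = psycopg2.connect("dbname=so")
copy_rows(conn, [(1, "A question", "Some text"), (2, "An answer", "More text")])

For a 70GB dump you would call this per batch of parsed rows rather than
buffering everything in memory at once.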
Re: Using COPY to import large xml file
Thanks for the response. I'm not sure how long this tool takes for the 70GB of
data.

I used Node to stream the XML files into INSERTs, which was very slow. The XML
actually contains 40 million records, of which 10 million took around 2 hours
using Node.js. Hence I thought I would use the COPY command, as suggested on
the internet.

Definitely will try the code and let you know. But it looks like it uses the
same INSERT, not COPY. Interesting to see whether it runs quickly on my
machine.

On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat wrote:

> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
> > Hello Everyone,
> >
> > I have downloaded the Stack Overflow posts XML (contains all SO questions
> > to date). The file is around 70GB. I want to import the data from that XML
> > into my table. Is there a way to do so in Postgres?
> >
> > Thanks,
> > Anto.
>
> Hello Anto,
>
> I used this tool:
> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>
> Regards,
>
> --
> Adrien NAYRAT
> https://blog.anayrat.info
Re: Using COPY to import large xml file
On 06/24/2018 08:25 AM, Anto Aravinth wrote:
> Hello Everyone,
>
> I have downloaded the Stack Overflow posts XML (contains all SO questions to
> date). The file is around 70GB. I want to import the data from that XML into
> my table. Is there a way to do so in Postgres?

It is going to require some work. You will need to deal with:

1) The row schema inside the XML, documented here:
   https://ia800107.us.archive.org/27/items/stackexchange/readme.txt
   (see the **posts**.xml section)

2) The rows are <row/> elements nested inside an enclosing tag.

Seems to me you have two options:

1) Drop each row into a single XML field and deal with extracting the row
   components in the database.

2) Break down each row into column components before entering them into the
   database.

Adrien has pointed you at a Python program that covers the above:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres

If you are comfortable in Python, you can take a look at:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/row_processor.py
to see how the rows are broken down into elements.

I would try this out first on one of the smaller datasets found here:
https://archive.org/details/stackexchange

I personally took a look at:
https://archive.org/download/stackexchange/beer.stackexchange.com.7z
because why not?

> Thanks,
> Anto.

--
Adrian Klaver
adrian.kla...@aklaver.com
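To illustrate option 2, here is a rough sketch (not row_processor.py itself) of
streaming the dump with Python's standard-library iterparse and pulling a few
attributes out of each row. The attribute names Id, PostTypeId, CreationDate
and Body come from the readme linked above; the file name and the choice of
columns are assumptions.

import xml.etree.ElementTree as ET


def iter_rows(path):
    # Stream <row> elements one at a time instead of loading the 70GB file.
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row":
            yield (
                elem.get("Id"),
                elem.get("PostTypeId"),
                elem.get("CreationDate"),
                elem.get("Body"),
            )
            elem.clear()  # release the element so memory use stays flat


for post_id, post_type, created, body in iter_rows("Posts.xml"):
    print(post_id, post_type, created)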
Re: Using COPY to import large xml file
## Adrien Nayrat (adrien.nay...@anayrat.info):
> I used this tool:
> https://github.com/Networks-Learning/stackexchange-dump-to-postgres

That will be awfully slow: this tool commits each INSERT on its own, see the
loop in
https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/load_into_pg.py#L83

With only small changes - prepare the INSERT, execute it for all (or at least
a huge lot of) rows, COMMIT at the end - you can save quite a lot of time
(500 rows is not "a huge lot").

And when you do that, for heaven's sake, do not try to create the INSERT
statement as a string with the values - Bobby Tables will eat you. See the
psycopg documentation on how it's done (especially watch the warnings):
http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries

On prepared statements with psycopg2, see
http://initd.org/psycopg/articles/2012/10/01/prepared-statements-psycopg/
(Python makes stuff rather hard, but I'll spare you the snark and wait until
the language has matured to at least version 5).

Using the COPY protocol with psycopg2 seems to require some hoop-jumping, but
could improve matters even more.

Regards,
Christoph

--
Spare Space.
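As a rough illustration of both points (parameter binding instead of
string-built SQL, and one COMMIT per large batch instead of per row), here is a
psycopg2 sketch. It is not the tool's code; the posts table, its columns and
the connection string are made up.

import psycopg2
from psycopg2.extras import execute_values


def insert_batch(conn, rows):
    # The driver does the quoting (no SQL assembled by string concatenation),
    # execute_values packs many rows into one statement, and the batch is
    # committed once at the end.
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO posts (id, post_type_id, body) VALUES %s",
            rows,
        )
    conn.commit()


conn = psycopg2.connect("dbname=so")
insert_batch(conn, [(1, 1, "first body"), (2, 2, "an answer")])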
Re: Using COPY to import large xml file
Anto Aravinth writes:

> Thanks for the response. I'm not sure how long this tool takes for the 70GB
> of data.
>
> I used Node to stream the XML files into INSERTs, which was very slow. The
> XML actually contains 40 million records, of which 10 million took around
> 2 hours using Node.js. Hence I thought I would use the COPY command, as
> suggested on the internet.
>
> Definitely will try the code and let you know. But it looks like it uses the
> same INSERT, not COPY. Interesting to see whether it runs quickly on my
> machine.
>
> On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat wrote:
>
>> Hello Anto,
>>
>> I used this tool:
>> https://github.com/Networks-Learning/stackexchange-dump-to-postgres

If you are using Node.js, then you can easily use the pg-copy-streams module
to insert the records into your database. I've been using this for inserting
large numbers of records from NetCDF files. It takes between 40 and 50 minutes
to insert 60 million+ records, and we are doing additional calculations on the
values, not just inserting them; plus we are inserting over the network into a
database which is also performing other processing.

We found a significant speed improvement with COPY over blocks of INSERT
transactions, which in turn was faster than individual INSERTs. The only
downside of COPY is that it either completely works or completely fails, and
when it fails it can be tricky to work out which record caused the failure. A
benefit of using blocks of transactions is that you have more fine-grained
control, allowing you to recover from some errors or to get more specific
detail about the cause of an error.

Be wary of what indexes you're defining on your table. Depending on their type
and number, these can have a significant impact on insert times as well.

--
Tim Cross
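One way to get the error locality Tim describes while keeping batch speed is
to load each batch in its own transaction and, if a batch fails, replay it row
by row to pinpoint the offending record. A psycopg2 sketch with a hypothetical
posts table (not Tim's code, which uses Node.js and pg-copy-streams):

import psycopg2


def load_batches(conn, batches):
    # One transaction per batch: fast on the happy path, recoverable on error.
    for batch in batches:
        try:
            with conn:  # commits on success, rolls back on exception
                with conn.cursor() as cur:
                    cur.executemany(
                        "INSERT INTO posts (id, body) VALUES (%s, %s)", batch
                    )
        except psycopg2.Error:
            # Replay the failed batch one row at a time to find the bad record.
            for row in batch:
                try:
                    with conn:
                        with conn.cursor() as cur:
                            cur.execute(
                                "INSERT INTO posts (id, body) VALUES (%s, %s)",
                                row,
                            )
                except psycopg2.Error as exc:
                    print("bad record:", row, exc)


conn = psycopg2.connect("dbname=so")
load_batches(conn, [[(1, "first"), (2, "second")]])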
Re: Using COPY to import large xml file
On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross wrote:

> If you are using Node.js, then you can easily use the pg-copy-streams module
> to insert the records into your database. I've been using this for inserting
> large numbers of records from NetCDF files. It takes between 40 and 50
> minutes to insert 60 million+ records, and we are doing additional
> calculations on the values, not just inserting them; plus we are inserting
> over the network into a database which is also performing other processing.
>
> We found a significant speed improvement with COPY over blocks of INSERT
> transactions, which in turn was faster than individual INSERTs. The only
> downside of COPY is that it either completely works or completely fails, and
> when it fails it can be tricky to work out which record caused the failure.
> A benefit of using blocks of transactions is that you have more fine-grained
> control, allowing you to recover from some errors or to get more specific
> detail about the cause of an error.

Sure, let me try that. I have a question here: COPY usually works when you
move data from files into your Postgres instance, right? Now, in Node.js,
while processing the whole file, can I use COPY programmatically, something
like COPY Stackoverflow ...? Because from the docs:

https://www.postgresql.org/docs/9.2/static/sql-copy.html

I don't see that it's possible. Maybe I need to convert the files into
something COPY understands first?

Anto.

> Be wary of what indexes you're defining on your table. Depending on their
> type and number, these can have a significant impact on insert times as
> well.
>
> --
> Tim Cross
Re: Using COPY to import large xml file
On Mon, 25 Jun 2018 at 11:38, Anto Aravinth wrote:

> Sure, let me try that. I have a question here: COPY usually works when you
> move data from files into your Postgres instance, right? Now, in Node.js,
> while processing the whole file, can I use COPY programmatically, something
> like COPY Stackoverflow ...? Because from the docs:
>
> https://www.postgresql.org/docs/9.2/static/sql-copy.html
>
> I don't see that it's possible. Maybe I need to convert the files into
> something COPY understands first?
>
> Anto.

Yes. Essentially what you do is create a stream and feed whatever information
you want to copy into that stream. PG sees the data as if it were reading each
line from a file, so you push data onto the stream where each item is
separated by a tab (or whatever). Here is the basic low-level function I use
(don't know how the formatting will go!):
// Assumes the usual requires, e.g.
//   const assert = require("assert");
//   const { Readable } = require("stream");
//   const VError = require("verror");
//   const copyFrom = require("pg-copy-streams").from;
// getClient()/releaseClient() manage a pg connection pool.

async function copyInsert(sql, stringifyFN, records) {
  const logName = `${moduleName}.copyInsert`;
  var client;

  assert.ok(Array.isArray(records), "The records arg must be an array");
  assert.ok(typeof(stringifyFN) === "function",
            "The stringifyFN arg must be a function");

  return getClient()
    .then(c => {
      client = c;
      return new Promise(function(resolve, reject) {
        var stream, rs;
        var idx = 0;

        function done() {
          releaseClient(client);
          client = undefined;
          resolve(idx + 1);
        }

        function onError(err) {
          if (client !== undefined) {
            releaseClient(client);
          }
          reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
        }

        // Readable._read implementation: push one stringified record per call,
        // then end the stream with null.
        function arrayRead() {
          if (idx === records.length) {
            rs.push(null);
          } else {
            let rec = records[idx];
            rs.push(stringifyFN(rec));
            idx += 1;
          }
        }

        rs = new Readable();
        rs._read = arrayRead;
        rs.on("error", onError);

        // pg-copy-streams turns the COPY ... FROM STDIN query into a writable stream.
        stream = client.query(copyFrom(sql));
        stream.on("error", onError);
        stream.on("end", done);

        rs.pipe(stream);
      });
    })
    .catch(err => {
      throw new VError(err, `${logName} Failed COPY insert`);
    });
}

and I will call it like

  copyInsert(sql, stringifyClimateRecord, records)

where the sql and stringifyClimateRecord arguments are

const sql = `COPY access_s.climate_data_ensemble_${ensemble} ` +
      "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds," +
      "vprp_09,vprp_15,wind_speed) FROM STDIN";

function stringifyClimateRecord(rec) {
  return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t` +
    `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
}

stringifyClimateRecord returns each record as a single tab-separated line,
terminated by a newline, which is what COPY ... FROM STDIN expects in the
default text format.
Re: Using COPY to import large xml file
## Anto Aravinth (anto.aravinth@gmail.com):
> Sure, let me try that. I have a question here: COPY usually works when you
> move data from files into your Postgres instance, right? Now, in Node.js,
> while processing the whole file, can I use COPY programmatically, something
> like COPY Stackoverflow ...? Because from the docs:
>
> https://www.postgresql.org/docs/9.2/static/sql-copy.html
>
> I don't see that it's possible. Maybe I need to convert the files into
> something COPY understands first?

"COPY ... FROM STDIN"

    STDIN
        Specifies that input comes from the client application.

It's on the page...

Regards,
Christoph

--
Spare Space.
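For anyone following the thread in Python rather than Node.js, a minimal
sketch of feeding COPY ... FROM STDIN directly from the client application
with psycopg2; the posts table, its columns and the connection string are
assumptions, and no intermediate file is involved.

import io

import psycopg2

conn = psycopg2.connect("dbname=so")

# Any file-like object works as STDIN; in a real loader this would be a
# streaming wrapper around the XML parser rather than an in-memory string.
lines = io.StringIO("1\tfirst post\n2\tsecond post\n")

with conn, conn.cursor() as cur:
    cur.copy_expert("COPY posts (id, title) FROM STDIN", lines)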