Importing a Large .ndjson file
Hi

I have a .ndjson file. It is a newline-delimited JSON file. It is about 10GB and has about 100,000 records. Some sample records:

```
{ "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": { "k111": "v111" } }
{ "key21": "value21", "key22": [ "value22.1", "value22.2"] }
```

Now I want to INSERT these json records into my postgres table of the following schema:

```
CREATE TABLE myTable (id BIGSERIAL, content JSONB);
```

where I want the records to be inserted into the `content` field of my postgres table.

What is the best way to do this on a postgresql database, deployed in kubernetes, with 1 GB RAM allocated?

I can probably write a program that would read this file line-by-line and INSERT into the database, in a transaction. But I believe that would generate a lot of network traffic and I want to know if there is a better way to do this.

Thanks.

--
Sankar P
http://psankar.blogspot.com
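[A minimal sketch of the line-by-line idea mentioned above, batching several records into each multi-row INSERT so a single statement carries many rows; the values shown are just the two sample records from the mail, and the batch size would be up to the program doing the reading:]

```
BEGIN;
-- each INSERT carries one batch of lines read from the .ndjson file
INSERT INTO myTable (content) VALUES
  ('{ "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": { "k111": "v111" } }'),
  ('{ "key21": "value21", "key22": [ "value22.1", "value22.2"] }');
-- ... repeat for the remaining batches ...
COMMIT;
```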
Re: Importing a Large .ndjson file
> It looks like plain old COPY would do this just fine, along the lines
> of (in psql)
>
> \copy myTable(content) from 'myfile.ndjson'

Indeed. Thanks.

--
Sankar P
http://psankar.blogspot.com
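[A usage sketch of that suggestion, with one caveat that is an assumption on my part rather than something from the thread: COPY's default text format treats backslashes as escape characters, so JSON values containing literal backslashes can be mangled; the CSV variant with quote and delimiter set to control bytes that cannot appear in JSON is a commonly used workaround.]

```
-- run from psql; \copy streams the file from the client side, so the server's
-- 1 GB RAM limit matters for the table and indexes, not for the load itself
\copy myTable(content) from 'myfile.ndjson'

-- if the JSON contains backslash escapes, this variant sidesteps the
-- text-format escaping rules (quote/delimiter bytes cannot occur in JSON)
\copy myTable(content) from 'myfile.ndjson' csv quote e'\x01' delimiter e'\x02'
```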
Using postgresql and JSONB on very large datasets
Hi

I have a table t, with just two columns, an `id SERIAL` and a `record JSONB`.

I have about 3 million records where I save some JSON data in each record. Each `record` jsonb column has about 40 json key-values, of which about 10 are nested fields.

Now, if I run a query like:

```
select distinct record -> 'destinationServiceName' from t;
```

it takes about 1 minute or so to return the list of unique values for this column, searching across the 3 million records. The number of records in my setup may go up to a few tens of millions. I want these queries to be faster. I also intend to do a few more aggregation queries in future, like "Give the maximum responseTime for each distinct destinationServiceName".

I have used Elasticsearch and Kibana in the past and I am planning to migrate to a postgres jsonb field (not individual columns), as vertical scaling is preferred for my case and the schema cannot be fixed.

I assumed that if I add more indexes, these types of queries would become faster. So I tried to create an index using the command:

```
CREATE INDEX idx_records ON t USING GIN (record);
```

But this index creation fails now as my postbird client times out.

So my questions are:

1) Is it a good idea to use postgres as a JSON database, if I have tens of millions of records? If you run such a setup, can you share any words of advice on Dos and Don'ts?

2) Is there a better way to do the index creation, reliably, for a table with a few million records already?

3) There are various types of indexes. I am using GIN based on some short-term googling. Are there any good blog posts, links, tutorials, courses etc. that you have found useful in understanding postgres jsonb performance tuning?

Thanks.

--
Sankar P
http://psankar.blogspot.com
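[On question 2, a sketch of running the build from a plain psql session instead of a GUI client that times out; the memory value, the index names other than idx_records, and the jsonb_path_ops choice are assumptions, not tuned recommendations from this thread.]

```
-- give the one-off index build more sort memory (value is a guess; size it
-- to what the server actually has)
SET maintenance_work_mem = '512MB';

-- CONCURRENTLY avoids blocking writes at the cost of a slower build; it must
-- run outside a transaction block. jsonb_path_ops gives a smaller, faster
-- index than the default opclass, but only accelerates @> containment queries.
CREATE INDEX CONCURRENTLY idx_records ON t USING GIN (record jsonb_path_ops);

-- a targeted expression index on the one key the DISTINCT query touches
-- (index name is hypothetical)
CREATE INDEX idx_dest_service ON t ((record -> 'destinationServiceName'));
```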
Re: Importing a Large .ndjson file
> > It looks like plain old COPY would do this just fine, along the lines
> > of (in psql)
> >
> > \copy myTable(content) from 'myfile.ndjson'

I spoke too soon. While this worked fine when there were no indexes and finished within 10 minutes, with a GIN index on the jsonb column it is taking hours and still not completing.

--
Sankar P
http://psankar.blogspot.com
Re: Importing a Large .ndjson file
> Sankar P writes:
> > I spoke too soon. While this worked fine when there were no indexes
> > and finished within 10 minutes, with GIN index on the jsonb column, it
> > is taking hours and still not completing.
>
> There's the generic advice that building an index after-the-fact
> is often cheaper than updating it incrementally. For GIN indexes
> specifically, see also
>
> https://www.postgresql.org/docs/current/gin-tips.html

Thanks guys.

--
Sankar P
http://psankar.blogspot.com
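[A sketch of the drop-load-rebuild sequence that advice and the GIN tips page describe, using the table from the original post; the index name is hypothetical (the thread never names it) and the memory setting is illustrative.]

```
-- drop the GIN index so COPY only has to write heap pages
DROP INDEX IF EXISTS idx_content_gin;

-- bulk load with no index maintenance in the way
\copy myTable(content) from 'myfile.ndjson'

-- rebuild once, with extra sort memory as the GIN tips page suggests
SET maintenance_work_mem = '512MB';
CREATE INDEX idx_content_gin ON myTable USING GIN (content);
```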
DISTINCT on jsonb fields and Indexes
I have a table with the schema:

CREATE TABLE fluent (id BIGSERIAL, record JSONB);

Then I created a couple of indexes:

1) CREATE INDEX idx_records ON fluent USING GIN (record);

2) CREATE INDEX idx_btree_records ON fluent USING BTREE ((record ->> 'destinationServiceName'));

Now, if I run a query to look up the distinct values of the field `destinationServiceName`, via:

select distinct record ->> 'destinationServiceName' from fluent;

this query takes a lot of time and does a full table scan. The query plan is:

# explain analyze select distinct record ->> 'destinationServiceName' from fluent;
                                                        QUERY PLAN
 Unique  (cost=1103803.97..1118803.97 rows=300 width=32) (actual time=77282.528..78549.877 rows=10 loops=1)
   ->  Sort  (cost=1103803.97..303.97 rows=300 width=32) (actual time=77282.525..78046.992 rows=300 loops=1)
         Sort Key: ((record ->> 'destinationServiceName'::text))
         Sort Method: external merge  Disk: 117456kB
         ->  Seq Scan on fluent  (cost=0.00..637500.00 rows=300 width=32) (actual time=14.440..69545.867 rows=300 loops=1)
 Planning Time: 0.187 ms
 Execution Time: 78574.221 ms

I see that none of the indexes are used.

I want to do a few aggregations on these records, like "what are the distinct pairs of `destinationServiceName` and `sourceServiceName`", etc. Now, is such querying possible at all without doing full table scans? I get this kind of aggregation support in elasticsearch + kibana without doing a full-table scan (I believe so, but I do not have data to back this claim) and I am trying to see if this is possible with any other extra index creation in postgres.

Any suggestions?

Thanks.

--
Sankar P
http://psankar.blogspot.com
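[For the pairs question, a sketch of a two-column expression index that the loose-index-scan recipe pointed to in the reply below could walk; the key names come from this mail, but the index name and the assumption that both keys are top-level are mine.]

```
-- ordered first by source, then destination, so a skip-scan style query can
-- enumerate distinct (source, destination) pairs without touching every row
CREATE INDEX idx_src_dest ON fluent (
    (record ->> 'sourceServiceName'),
    (record ->> 'destinationServiceName')
);
```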
Re: DISTINCT on jsonb fields and Indexes
> There is some work in progress to improve this type of query, but
> it'll be at least PG14 before we see that.

oh okay.

> For your version, you might want to look at
> https://wiki.postgresql.org/wiki/Loose_indexscan and try some of the
> proposed solutions from there.

Thanks a lot :)

--
Sankar P
http://psankar.blogspot.com
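[For completeness, a sketch of the recursive-CTE flavour of the loose index scan from that wiki page, adapted to the idx_btree_records expression index defined earlier in the thread; NULL values of the key are simply skipped.]

```
WITH RECURSIVE distinct_services AS (
    -- smallest value, served by one probe of idx_btree_records
    (SELECT record ->> 'destinationServiceName' AS svc
       FROM fluent
      ORDER BY 1
      LIMIT 1)
    UNION ALL
    -- next value strictly greater than the previous one, again one index probe
    SELECT (SELECT f.record ->> 'destinationServiceName'
              FROM fluent f
             WHERE f.record ->> 'destinationServiceName' > d.svc
             ORDER BY 1
             LIMIT 1)
      FROM distinct_services d
     WHERE d.svc IS NOT NULL
)
SELECT svc FROM distinct_services WHERE svc IS NOT NULL;
```

Each recursive step fetches the next-larger value with a single index lookup, so the query does roughly one probe per distinct value instead of sorting all rows, which is why it avoids the external-merge sort seen in the plan above.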