Importing a Large .ndjson file

2020-06-17 Thread Sankar P
Hi

I have a .ndjson file, i.e. a newline-delimited JSON file. It is
about 10 GB and has about 100,000 records.

Some sample records:
```
{ "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": {
"k111": "v111" } } \n\r
{ "key21": "value21", "key22": [ "value22.1", "value22.2"] }
```
Now I want to INSERT these JSON records into my Postgres table, which
has the following schema:

```
CREATE TABLE myTable (id BIGSERIAL, content JSONB);
```

I want each record to be inserted into the `content` column of this
table.

What is the best way to do this on a PostgreSQL database deployed in
Kubernetes, with 1 GB of RAM allocated?

I can probably write a script that reads this file line by line and
INSERTs each record into the database in a transaction, as sketched
below. But I believe that would generate a lot of network traffic,
and I want to know if there is a better way to do this.
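
Something like the following, one INSERT per line of the file (just a
sketch, reusing the sample records from above):

```
BEGIN;
-- each line of the .ndjson file becomes one INSERT over the network
INSERT INTO myTable (content) VALUES
  ('{ "key11": "value11", "key12": [ "value12.1", "value12.2" ], "key13": { "k111": "v111" } }');
INSERT INTO myTable (content) VALUES
  ('{ "key21": "value21", "key22": [ "value22.1", "value22.2" ] }');
COMMIT;
```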

Thanks.

-- 
Sankar P
http://psankar.blogspot.com




Re: Importing a Large .ndjson file

2020-06-18 Thread Sankar P
> It looks like plain old COPY would do this just fine, along the lines
> of (in psql)
>
> \copy myTable(content) from 'myfile.ndjson'
>

Indeed. Thanks.
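
For the archives, the full sequence that worked for me (one caveat:
COPY's default text format treats backslash as an escape character,
so JSON strings containing backslashes would need extra handling):

```
CREATE TABLE myTable (id BIGSERIAL, content JSONB);
-- \copy reads the file on the client side and streams it to the
-- server; each input line becomes one jsonb value in `content`
\copy myTable(content) from 'myfile.ndjson'
```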

-- 
Sankar P
http://psankar.blogspot.com




Using postgresql and JSONB on very large datasets

2020-06-18 Thread Sankar P
Hi

I have a table t with just two columns, an `id SERIAL` and a `record
JSONB`. I have about 3 million rows, each saving some JSON data in
`record`.

Each `record` JSONB value has about 40 JSON key-value pairs, of which
about 10 are nested fields.

Now, if I run a query, like:

```
select distinct record -> 'destinationServiceName' from t;
```

it takes about a minute to return the list of unique values for this
field across the 3 million records. The number of records in my setup
may grow to a few tens of millions.

I want these queries to be faster. I also intend to run a few more
aggregation queries in future, like "give the maximum responseTime
for each distinct destinationServiceName".
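
For example, something along these lines (a sketch; I am assuming
`responseTime` is a top-level key holding a number):

```
SELECT record ->> 'destinationServiceName' AS service,
       MAX((record ->> 'responseTime')::numeric) AS max_response_time
FROM t
GROUP BY 1;
```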

I have used Elasticsearch and Kibana in the past, and I am planning
to migrate to a Postgres JSONB column (not individual columns), as
vertical scaling is preferred for my case and the schema cannot be
fixed in advance.

I assumed that if I added more indexes, these types of queries would
become faster. So I tried to create an index using the command:

```
CREATE INDEX idx_records ON t USING GIN (record);
```

But this index creation now fails, as my Postbird client times out.

So my questions are:
1) Is it a good idea to use Postgres as a JSON database if I have
tens of millions of records? If you run such a setup, can you share
any dos and don'ts?
2) Is there a better way to do the index creation reliably on a
table that already has a few million records?
3) There are various types of indexes. I am using GIN based on some
quick googling. Are there any good blog posts, links, tutorials,
courses etc. that you have found useful for understanding Postgres
JSONB performance tuning?

Thanks.

-- 
Sankar P
http://psankar.blogspot.com




Re: Importing a Large .ndjson file

2020-06-18 Thread Sankar P
> > It looks like plain old COPY would do this just fine, along the lines
> > of (in psql)
> >
> > \copy myTable(content) from 'myfile.ndjson'

I spoke too soon. While this worked fine when there were no indexes
and finished within 10 minutes, with a GIN index on the JSONB column
it has been running for hours and has still not completed.

-- 
Sankar P
http://psankar.blogspot.com




Re: Importing a Large .ndjson file

2020-06-18 Thread Sankar P
> Sankar P  writes:
> > I spoke too soon. While this worked fine when there were no indexes
> > and finished within 10 minutes, with GIN index on the jsonb column, it
> > is taking hours and still not completing.
>
> There's the generic advice that building an index after-the-fact
> is often cheaper than updating it incrementally.  For GIN indexes
> specifically, see also
>
> https://www.postgresql.org/docs/current/gin-tips.html

Thanks guys.
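
For anyone finding this thread later, the shape of what that advice
suggests, run in psql (a sketch; the index name and memory setting
are placeholders, and the setting must fit within the pod's 1 GB):

```
-- drop the GIN index before the bulk load, then rebuild it after
DROP INDEX IF EXISTS mytable_content_gin;
\copy myTable(content) from 'myfile.ndjson'
-- per https://www.postgresql.org/docs/current/gin-tips.html a larger
-- maintenance_work_mem speeds up building the whole index at once
SET maintenance_work_mem = '256MB';
CREATE INDEX mytable_content_gin ON myTable USING GIN (content);
```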

-- 
Sankar P
http://psankar.blogspot.com




DISTINCT on jsonb fields and Indexes

2020-06-21 Thread Sankar P
I have a table with the schema:

CREATE TABLE fluent (id BIGSERIAL, record JSONB);

Then I created a couple of indexes:
1) CREATE INDEX idx_records ON fluent USING GIN (record);
2) CREATE INDEX idx_btree_records ON fluent USING BTREE ((record ->> 'destinationServiceName'));

Now, if I run a query to look up the distinct values of the field
`destinationServiceName`, via:

select distinct record ->> 'destinationServiceName' from fluent;

This query takes a long time and does a full table scan. The query plan is:

# explain analyze select distinct record ->> 'destinationServiceName' from fluent;
QUERY PLAN
 Unique  (cost=1103803.97..1118803.97 rows=3000000 width=32) (actual time=77282.528..78549.877 rows=10 loops=1)
   ->  Sort  (cost=1103803.97..1111303.97 rows=3000000 width=32) (actual time=77282.525..78046.992 rows=3000000 loops=1)
         Sort Key: ((record ->> 'destinationServiceName'::text))
         Sort Method: external merge  Disk: 117456kB
         ->  Seq Scan on fluent  (cost=0.00..637500.00 rows=3000000 width=32) (actual time=14.440..69545.867 rows=3000000 loops=1)
 Planning Time: 0.187 ms
 Execution Time: 78574.221 ms

I see that none of the indexes is used. I want to do a few
aggregations on these records, like "what are the distinct pairs of
`destinationServiceName` and `sourceServiceName`" etc. Now, is such
querying possible at all without full table scans? I get this kind of
aggregation support in Elasticsearch + Kibana without full-table
scans (I believe so, but I do not have data to back this claim), and
I am trying to see if this is possible with some additional index in
Postgres.

Any suggestions ? Thanks.

-- 
Sankar P
http://psankar.blogspot.com




Re: DISTINCT on jsonb fields and Indexes

2020-06-22 Thread Sankar P
>
> There is some work in progress to improve this type of query, but
> it'll be at least PG14 before we see that.

Oh, okay.

>
> For your version, you might want to look at
> https://wiki.postgresql.org/wiki/Loose_indexscan and try some of the
> proposed solutions from there.

Thanks a lot :)
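
For completeness, the loose index scan emulation from that wiki page,
adapted to this table (a sketch; it relies on the btree expression
index `idx_btree_records` created earlier):

```
WITH RECURSIVE services AS (
    -- seed: the smallest non-null value, via the expression index
    (SELECT record ->> 'destinationServiceName' AS svc
     FROM fluent
     ORDER BY 1 LIMIT 1)
  UNION ALL
    -- step: the next value strictly greater than the previous one
    SELECT (SELECT record ->> 'destinationServiceName'
            FROM fluent
            WHERE record ->> 'destinationServiceName' > s.svc
            ORDER BY 1 LIMIT 1)
    FROM services s
    WHERE s.svc IS NOT NULL
)
SELECT svc FROM services WHERE svc IS NOT NULL;
```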

-- 
Sankar P
http://psankar.blogspot.com