On Sat, Sep 2, 2017 at 8:58 PM, Jennifer Lyon <[email protected]> wrote:
> I have a 2.1GB JSON file. Typically I use readLines() and
> jsonlite::fromJSON() to extract data from a JSON file.
If your data consists of one JSON object per line, the format is called 'ndjson'. There are several packages specialized for reading ndjson files:

  - corpus::read_ndjson
  - ndjson::stream_in
  - jsonlite::stream_in

In particular, the 'corpus' package handles large files really well because it has an option to memory-map the file instead of reading all of its data into memory.

If the data is too large to read at once, you can preprocess it using https://stedolan.github.io/jq/ to extract only the fields that you need. You really don't need hadoop/spark/etc. for this.

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
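As a minimal sketch of the jsonlite route above (the file name "data.ndjson" and its two toy records are made up for illustration), stream_in() parses the file page by page rather than loading it all as one string, which is what makes it usable on multi-gigabyte ndjson files:

```r
library(jsonlite)

# Create a tiny ndjson file: one JSON object per line (illustrative data)
writeLines(c('{"id": 1, "name": "a"}',
             '{"id": 2, "name": "b"}'),
           "data.ndjson")

# stream_in() reads the connection incrementally (in pages of records),
# so the whole file never has to sit in memory as a single string
df <- stream_in(file("data.ndjson"), verbose = FALSE)

str(df)  # a data.frame with columns id and name, one row per line
```

The same call works unchanged on a 2GB file; for even tighter memory use, corpus::read_ndjson(..., mmap = TRUE) maps the file instead of copying it into RAM.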
