Re: [Rd] readLines() segfaults on large file & question on how to work around

Tomas Kalibera Mon, 04 Sep 2017 04:37:50 -0700

As of R-devel 72925 one gets a proper error message instead of the crash.


Tomas


On 09/04/2017 08:46 AM, rh...@eoos.dds.nl wrote:

Although the problem can apparently be avoided in this case, readLines causing a segfault still seems unwanted behaviour to me. I can replicate this with the example below (sessionInfo is further down):
# Generate an example file
l <- paste0(sample(c(letters, LETTERS), 1E6, replace = TRUE),
  collapse="")
con <- file("test.txt", "wt")
for (i in seq_len(2500)) {
  writeLines(l, con, sep ="")
}
close(con)


# Causes segfault:
readLines("test.txt")
The error reported by readr is also reproduced (a more informative error message and checking for integer overflows would be nice). I will report this to readr.
library(readr)
read_file("test.txt")
# Error in read_file_(ds, locale) : negative length vectors are not
# allowed
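
The size arithmetic behind both failures can be checked directly. This is only a rough sketch: the wrap-around line assumes the length is at some point squeezed into a signed 32-bit integer, which is a guess about the internals and not something confirmed in this thread.

```r
# The example writes 2500 chunks of 1e6 characters without any newline,
# so the file consists of one single "line".
n_chars <- 2500 * 1e6            # held as a double, so no overflow here
int_max <- .Machine$integer.max  # 2^31 - 1, the largest R integer

# TRUE: the line is longer than the largest R integer
too_long <- n_chars > int_max

# Assumption: if this length were stored in a signed 32-bit integer it
# would wrap around to a negative value, which would be consistent with
# the "negative length vectors" message from readr.
wrapped <- n_chars - 2^32
```

Here `too_long` and `wrapped` are names made up for the illustration.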


--
Jan








> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_1.1.1

loaded via a namespace (and not attached):
[1] compiler_3.4.1 R6_2.2.2       hms_0.3        tools_3.4.1    tibble_1.3.3   Rcpp_0.12.12   rlang_0.1.2

On 03-09-17 20:50, Jennifer Lyon wrote:
Jeroen:

Thank you for pointing me to ndjson, which I had not heard of and is
exactly my case.

My experience:
jsonlite::stream_in - segfaults
ndjson::stream_in - my fault, I am running Ubuntu 14.04 and it is too old
       so it won't compile the package
corpus::read_ndjson - works!!! Of course it does a different simplification
       than jsonlite::fromJSON, so I have to change some code, but it works
       beautifully at least in simple tests. The memory-map option may be of
       use in the future.

Another correspondent said that strings in R can only be 2^31-1 long, which
is why any "solution" that tries to load the whole file into R first as a
string will fail.
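
For data with one record per line, a way to stay under that limit in base R is to read from an open connection in blocks of lines. The following is only a minimal sketch; `process_in_chunks`, `fun` and `chunk_size` are hypothetical names, not part of any package mentioned in this thread.

```r
# Hypothetical helper: apply `fun` to successive blocks of lines from
# `path`, so the whole file is never held in memory as one string.
process_in_chunks <- function(path, fun, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  repeat {
    # On an open connection, readLines() continues where it left off.
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    results[[length(results) + 1L]] <- fun(lines)
  }
  results
}

# Usage on a small temporary file: count the lines seen in each block.
tmp <- tempfile(fileext = ".ndjson")
writeLines(c('{"a":1}', '{"a":2}', '{"a":3}'), tmp)
chunk_sizes <- process_in_chunks(tmp, length, chunk_size = 2L)
```

This only helps when every individual line is short enough. A file that consists of a single line longer than 2^31-1 characters, like test.txt above, would still hit the limit.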

Thanks for suggesting a path forward for me!

Jen
On Sun, Sep 3, 2017 at 2:15 AM, Jeroen Ooms <jeroeno...@gmail.com> wrote:
On Sat, Sep 2, 2017 at 8:58 PM, Jennifer Lyon <jennifer.s.l...@gmail.com>
wrote:
I have a 2.1GB JSON file. Typically I use readLines() and
jsonlite::fromJSON() to extract data from a JSON file.

If your data consists of one json object per line, this is called
'ndjson'. There are several packages specialized to read ndjson files:

  - corpus::read_ndjson
  - ndjson::stream_in
  - jsonlite::stream_in

In particular the 'corpus' package handles large files really well
because it has an option to memory-map the file instead of reading all
of its data into memory.

If the data is too large to read, you can preprocess it using
https://stedolan.github.io/jq/ to extract the fields that you need.

You really don't need hadoop/spark/etc for this.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel