On 15/07/2017 11:33 AM, Anthony Damico wrote:
hi, i realized that the segfault happens on the text file in a new R session. so, creating the segfault-generating text file requires a contributed package, but prompting the actual segfault does not -- pretty sure that means this is a base R bug? submitted here: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 hopefully i am not doing something remarkably stupid. the text file itself is 4GB so cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the previous message, i think most or all of it needs to be there to trigger the segfault. thanks!
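Since the file is too large to attach, one way to describe it in the bug report is to summarise its raw bytes. What follows is only a minimal sketch, not something posted in the thread: `scan_raw_file` and its `chunk_bytes` argument are hypothetical names, and it assumes lines end in a newline byte (0x0a). It reads the file in binary chunks, so it never builds an R character string from the long line.

```r
# Hypothetical helper (not part of base R or of the original thread):
# count nul bytes and newlines, and measure the longest line in bytes,
# by reading the file as raw chunks rather than as text.
scan_raw_file <- function(path, chunk_bytes = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  nul_count   <- 0  # number of 0x00 bytes seen (double, so it can exceed 2^31)
  line_count  <- 0  # number of newline (0x0a) bytes seen
  current_len <- 0  # bytes in the line currently being scanned
  longest     <- 0  # longest line seen so far, in bytes
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0) break
    nul_count <- nul_count + sum(chunk == as.raw(0))
    nl <- which(chunk == as.raw(10))
    if (length(nl) == 0) {
      # no newline in this chunk: the current line just keeps growing
      current_len <- current_len + length(chunk)
    } else {
      # lengths of the lines completed inside this chunk (newline excluded)
      starts <- c(1, head(nl, -1) + 1)
      lens <- nl - starts
      lens[1] <- lens[1] + current_len  # add bytes carried over from earlier chunks
      longest <- max(longest, lens)
      line_count <- line_count + length(nl)
      current_len <- length(chunk) - nl[length(nl)]
    }
  }
  # account for a final line that has no trailing newline
  longest <- max(longest, current_len)
  list(nul_bytes = nul_count, newlines = line_count,
       longest_line_bytes = longest)
}

# Usage, assuming `infile` is the extracted text file from the example:
# scan_raw_file(infile)
```

If the longest line turns out to be larger than 2^31 - 1 bytes, that would be consistent with the guess that a single oversized line is involved, though this sketch does not by itself confirm the cause of the segfault.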
Hopefully someone can debug it with the info you provided. Duncan Murdoch
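For anyone trying to build a smaller reproducer, a file with one long line and embedded nuls can be written with `writeBin()` on a raw vector, since `writeLines()` cannot emit nul bytes from a character string. This is a scaled-down sketch under stated assumptions: `make_nul_file` and all of its arguments are hypothetical, the sizes are illustrative, and it has not been shown to trigger the segfault. Getting close to the real case would mean raising `long_line_bytes` to roughly 4e9, which needs several GB of RAM for the raw vector alone.

```r
# Hypothetical helper (not from the thread): write `n_lines` short lines,
# then one long line of "a" bytes with a nul byte at every `nul_every`-th
# position, then a closing newline.
make_nul_file <- function(path, n_lines = 10, long_line_bytes = 1e6,
                          nul_every = 1000) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  # ordinary short lines first
  writeBin(charToRaw(paste0(rep("short line\n", n_lines), collapse = "")), con)
  # one long line, built as raw bytes so that 0x00 can be embedded
  long_line <- rep(charToRaw("a"), long_line_bytes)
  long_line[seq(nul_every, long_line_bytes, by = nul_every)] <- as.raw(0)
  writeBin(long_line, con)
  writeBin(charToRaw("\n"), con)
  invisible(path)
}

tf <- tempfile()
make_nul_file(tf)

# the variant that worked (with a warning) in the original report
x <- readLines(tf, skipNul = TRUE)
```

Calling `readLines(tf)` without `skipNul = TRUE` on the small file is the scaled-down analogue of the crashing call; whether it misbehaves at full size is exactly what would need testing.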
On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdam...@gmail.com> wrote:

hi, thanks Dr. Murdoch

i'd appreciate if anyone on r-help could help me narrow this down? i believe the segfault occurs because there's a single line with 4GB and also embedded nuls, but i am not sure how to artificially construct that?

the lodown package can be removed from my example.. it is just for file download caching, so `lodown::cachaca` can be replaced with `download.file`. my current example requires a huge download, so sort of painful to repeat but i'm pretty confident that's not the issue.

the archive::archive_extract() function unzips a (probably corrupt) .RAR file and creates a text file with 80,937 lines. this file is 4GB:

> file.size(infile)
[1] 4078192743

i am pretty sure that nearly all of that 4GB is contained on a single line in the file. here's what happens when i create a file connection and scan through..

> file_con <- file( infile , 'r' )
>
> first_80936_lines <- readLines( file_con , n = 80936 )
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "1000023930632009"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "36F2924009PAULO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA11"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "00000"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "00"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "2924009PAULO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA1111"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "467.20"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "346.10"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "414.40"
> scan( w , n = 1 , what = character() )
Error in scan(w, n = 1, what = character()) :
  could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

making a huge single-line file does not reproduce the problem, i think the embedded nuls have something to do with it--

# WARNING do not run with less than 64GB RAM
tf <- tempfile()
a <- rep( "a" , 1000000000 )
b <- paste( a , collapse = '' )
writeLines( b , tf ) ; rm( b ) ; gc()
d <- readLines( tf )

On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

On 15/07/2017 7:35 AM, Anthony Damico wrote:

hello, the last line of the code below causes a segfault for me on 3.4.1. i think i should submit to https://bugs.r-project.org/ unless others have advice? thanks

Segfaults are usually worth reporting as bugs. Try to come up with a self-contained example, not using the lodown and archive packages. I imagine you can do this by uploading the file you downloaded, or enough of a subset of it to trigger the segfault. If you can't do that, then likely the bug is with one of those packages, not with R.

Duncan Murdoch

install.packages( "devtools" )
devtools::install_github("ajdamico/lodown")
devtools::install_github("jimhester/archive")

file_folder <- file.path( tempdir() , "file_folder" )

tf <- tempfile()

# large download! cachaca saves on your local disk if already downloaded
lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )

archive::archive_extract( tf , dir = normalizePath( file_folder ) )

unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )

infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )

# works
R.utils::countLines( infile )

# works with warning
my_file <- readLines( infile , skipNul = TRUE )

# crash
my_file <- readLines( infile )

# run just before crash
sessionInfo()
# R version 3.4.1 (2017-06-30)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 15063)

# Matrix products: default

# locale:
# [1] LC_COLLATE=English_United States.1252
# [2] LC_CTYPE=English_United States.1252
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United States.1252

# attached base packages:
# [1] stats graphics grDevices utils datasets methods base

# loaded via a namespace (and not attached):
# [1] httr_1.2.1 compiler_3.4.1 R6_2.2.1 withr_1.0.2
# [5] tibble_1.3.3 curl_2.6 Rcpp_0.12.11 memoise_1.1.0
# [9] R.methodsS3_1.7.1 git2r_0.18.0 digest_0.6.12 lodown_0.1.0
# [13] R.utils_2.5.0 rlang_0.1.1 devtools_1.13.2 R.oo_1.21.0
# [17] archive_0.0.0.9000

[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.