On 15/07/2017 11:33 AM, Anthony Damico wrote:
hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not --
pretty sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
am not doing something remarkably stupid.  the text file itself is 4GB
so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
in the previous message, i think most or all of it needs to be there to
trigger the segfault.  thanks!

Hopefully someone can debug it with the info you provided.

Duncan Murdoch


On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdam...@gmail.com> wrote:

    hi, thanks Dr. Murdoch


    i'd appreciate if anyone on r-help could help me narrow this down?
    i believe the segfault occurs because there's a single line with 4GB
    and also embedded nuls, but i am not sure how to artificially
    construct that?
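
One possible way to build a small stand-in for such a file is to write raw bytes through a binary connection, since an R character string cannot itself hold a nul. This is only a sketch at a few kilobytes rather than 4GB: the chunk contents and the repeat count are made up for illustration, and nothing here is known to reproduce the segfault.

```r
# Build a three-line file whose last line contains embedded nul bytes.
tf <- tempfile()
con <- file(tf, "wb")                      # binary mode: bytes are written as-is
writeBin(charToRaw("first line\n"), con)
writeBin(charToRaw("second line\n"), con)

# hypothetical payload: "abc", one nul byte, "def"
chunk <- c(charToRaw("abc"), as.raw(0), charToRaw("def"))
for (i in seq_len(1000)) writeBin(chunk, con)
writeBin(charToRaw("\n"), con)
close(con)

file.size(tf)

# default behaviour warns about the embedded nul; warnings are silenced here
with_nuls <- suppressWarnings(readLines(tf))
# skipNul = TRUE drops the nul bytes while reading
without_nuls <- readLines(tf, skipNul = TRUE)

nchar(with_nuls[3])
nchar(without_nuls[3])
```

Scaling the repeat count up would give a longer line, but whether that triggers the same crash is untested.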


    the lodown package can be removed from my example..  it is just for
    file download caching, so `lodown::cachaca` can be replaced with
    `download.file`.  my current example requires a huge download, so
    sort of painful to repeat but i'm pretty confident that's not the issue.


    the archive::archive_extract() function unzips a (probably corrupt)
    .RAR file and creates a text file with 80,937 lines.  this file is 4GB:

        > file.size(infile)
        [1] 4078192743


    i am pretty sure that nearly all of that 4GB is contained on a
    single line in the file.  here's what happens when i create a file
    connection and scan through..

        > w <- file( infile , 'r' )
        >
        > first_80936_lines <- readLines( w , n = 80936 )
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "1000023930632009"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "36F2924009PAULO"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "AFONSO"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "BA11"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "00000"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "00"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "2924009PAULO"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "AFONSO"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "BA1111"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "467.20"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "346.10"
        > scan( w , n = 1 , what = character() )
        Read 1 item
        [1] "414.40"
        > scan( w , n = 1 , what = character() )
        Error in scan(w, n = 1, what = character()) :
          could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
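
To check how the bytes are distributed without asking R to hold the long line as one string, the file could be read as raw bytes in fixed-size chunks. A rough sketch follows; `scan_bytes` and its arguments are hypothetical names, not base R functions, and it assumes that only LF (byte 10) ends a line.

```r
# hypothetical helper: count lines and nul bytes, and find the longest line,
# reading the file in raw chunks so no single huge string is allocated
scan_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  n_nul <- 0      # doubles, so counts above 2^31 do not overflow
  n_lines <- 0
  longest <- 0
  current <- 0    # bytes seen so far on the line still being read
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    n_nul <- n_nul + sum(chunk == as.raw(0))
    nl <- which(chunk == as.raw(10))
    if (length(nl) == 0L) {
      current <- current + length(chunk)
    } else {
      # byte lengths of the lines that end inside this chunk
      lens <- nl - c(0, head(nl, -1)) - 1
      lens[1] <- lens[1] + current
      longest <- max(longest, lens)
      n_lines <- n_lines + length(nl)
      current <- length(chunk) - nl[length(nl)]
    }
  }
  # a final line with no trailing newline still counts
  if (current > 0) {
    longest <- max(longest, current)
    n_lines <- n_lines + 1
  }
  list(lines = n_lines, nul_bytes = n_nul, longest_line_bytes = longest)
}

# usage on the file from this thread would look like:
# scan_bytes( infile )
```

If the guess above is right, `longest_line_bytes` should come out close to the full file size and `nul_bytes` should be nonzero.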



    making a huge single-line file does not reproduce the problem, i
    think the embedded nuls have something to do with it--


        # WARNING do not run with less than 64GB RAM
        tf <- tempfile()
        a <- rep( "a" , 1000000000 )
        b <- paste( a , collapse = '' )
        writeLines( b , tf ) ; rm( b ) ; gc()
        d <- readLines( tf )



    On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

        On 15/07/2017 7:35 AM, Anthony Damico wrote:

            hello, the last line of the code below causes a segfault for
            me on 3.4.1.  i think i should submit to
            https://bugs.r-project.org/ unless others have advice?  thanks


        Segfaults are usually worth reporting as bugs.  Try to come up
        with a self-contained example, not using the lodown and archive
        packages.  I imagine you can do this by uploading the file you
        downloaded, or enough of a subset of it to trigger the
        segfault.  If you can't do that, then likely the bug is with one
        of those packages, not with R.
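
Carving out a subset of a large file could be done along these lines, copying a byte range in chunks so the whole file never has to fit in memory. This is a sketch only; `copy_byte_range` and its parameters are hypothetical names. It skips the leading bytes by reading and discarding them rather than calling `seek()`, whose documentation discourages its use on Windows.

```r
# hypothetical helper: copy `n_bytes` bytes of `src`, starting after the
# first `skip` bytes, into `dest`
copy_byte_range <- function(src, dest, skip = 0, n_bytes, chunk_size = 1e6) {
  inp <- file(src, "rb")
  on.exit(close(inp), add = TRUE)
  out <- file(dest, "wb")
  on.exit(close(out), add = TRUE)

  # discard the first `skip` bytes by reading them in chunks
  while (skip > 0) {
    got <- length(readBin(inp, what = "raw", n = min(chunk_size, skip)))
    if (got == 0L) break          # reached end of file early
    skip <- skip - got
  }

  # copy up to `n_bytes` bytes, or fewer if the file ends first
  copied <- 0
  while (copied < n_bytes) {
    chunk <- readBin(inp, what = "raw", n = min(chunk_size, n_bytes - copied))
    if (length(chunk) == 0L) break
    writeBin(chunk, out)
    copied <- copied + length(chunk)
  }
  invisible(copied)               # number of bytes actually written
}
```

Repeatedly halving `n_bytes` and retrying would be one way to look for the smallest piece that still crashes, assuming the crash depends only on the bytes in that range.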

        Duncan Murdoch






            install.packages( "devtools" )
            devtools::install_github("ajdamico/lodown")
            devtools::install_github("jimhester/archive")


            file_folder <- file.path( tempdir() , "file_folder" )

            tf <- tempfile()

            # large download!  cachaca saves on your local disk if already downloaded
            lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )

            archive::archive_extract( tf , dir = normalizePath( file_folder ) )

            unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )

            infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )

            # works
            R.utils::countLines( infile )

            # works with warning
            my_file <- readLines( infile , skipNul = TRUE )

            # crash
            my_file <- readLines( infile )


            # run just before crash
            sessionInfo()
            # R version 3.4.1 (2017-06-30)
            # Platform: x86_64-w64-mingw32/x64 (64-bit)
            # Running under: Windows 10 x64 (build 15063)

            # Matrix products: default

            # locale:
            # [1] LC_COLLATE=English_United States.1252
            # [2] LC_CTYPE=English_United States.1252
            # [3] LC_MONETARY=English_United States.1252
            # [4] LC_NUMERIC=C
            # [5] LC_TIME=English_United States.1252

            # attached base packages:
            # [1] stats     graphics  grDevices utils     datasets  methods   base

            # loaded via a namespace (and not attached):
            #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
            #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
            #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
            # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
            # [17] archive_0.0.0.9000


            ______________________________________________
            R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
            https://stat.ethz.ch/mailman/listinfo/r-help
            PLEASE do read the posting guide
            http://www.R-project.org/posting-guide.html
            and provide commented, minimal, self-contained, reproducible code.





