Dear R Users

This is a summary of the things I tried with read.table.ffdf and fixed-width 
files. I would like to thank Jan Wijffels and Jan van der Laan for their 
suggestions and the time they spent on my problem!

My objective was to import a file with 6'079'455 lines and 32 variables using 
the tools provided by the ff package. The "fixed-width file" I got was supposed 
to have a total width of 238. But it turned out that the last column, which 
should have had a width of four, contained either no entry, or entries with one 
or two characters followed by \n\r. The corresponding spaces were dropped when 
the file was created. This could be shown by

       lines <- readLines("my_file.txt")
       range(nchar(lines))

which resulted in 235 237 instead of 238. So the file was not really fixed 
width...

I tried importing the file with

library(ff)
library(stringr)
my.data  <- read.table.ffdf(file="my_file.txt",
                    FUN="read.fwf",
                    widths = my.widths,
                    header=F, VERBOSE=TRUE, first.rows=100000,
                    col.names = my.names,
                    fileEncoding = "LATIN1",
                    transFUN=function(x){
                      z <- sapply(x, function(y) {
                        y <- str_trim(y)
                        y[y==""] <- NA
                        factor(y)})
                      as.data.frame(z)
                    }
)

This took 4168 seconds and resulted in an object that included only 100'000 
lines instead of 6'079'455 lines (I still don't know why...).


Another approach was to use laf_open_fwf from package LaF and then laf_to_ffdf 
from package ffbase, which is really a simple approach as long as the width is 
not shorter than the given width (i.e. 238). So the idea was to add the missing 
spaces by running

con <- file("my_file.txt", "rt")
out <- file("my_file_converted.txt", "wt")
system.time(
       while (TRUE) {
             lines <- readLines(con, encoding='LATIN1', n=1E5)
             if (length(lines) == 0) break
             lines <- sprintf("%-238s", lines)
             writeLines(lines, out, useBytes=TRUE) }
)
close(con)
close(out)

and then

library(LaF)
library(ffbase)
my.data.laf <- laf_open_fwf("my_file_converted.txt ", column_types=my.types, 
column_widths = my.widths, column_names = my.names)
my.data <- laf_to_ffdf(my.data.laf)

This worked really well, except that the whole process took quite some time. 
Appending the spaces took 2436 seconds, and converting the file from laf to 
ffdf took another 2628 seconds.


The third approach I tested was the fastest, but used the Unix/Linux program 
awk outside R (run on Cygwin installed on Windows 7 32-bit):

First, I converted my original file into a tab-delimited text file using awk:
awk -v FIELDWIDTHS='3 28  4 30 28  6  3 30 10  3  3  6  6  5  1  2  1  1  2  2  
2  4  2  4  7 30  1  1  3  2  4  4' -v OFS='\t' '{ $1=$1 ""; print }' 
<my_file.txt> my_file_delimited.txt

Then I used read.delim.ffdf provided by the ff package:

library(ff)
library(stringr)
my.data <- read.delim.ffdf(file="my_file_delimited.txt",
                                      header=F, VERBOSE=TRUE, first.rows=100000,
                                      col.names = my.names,
                                      colClasses=my.classes,
                                      fileEncoding = "LATIN1",
                                      transFUN=function(x) {
                                        z <- sapply(x, function(y) {
                                              y <- str_trim(y)
                                              y[y==""] <- NA
                                              factor(y)})
                                        as.data.frame(z)
                                      }
                    )


Running awk took only 203 seconds! And the import of the delimited file was 
finished after 1141 seconds.

What I like most about the variants of read.table.ffdf and also about 
laf_to_ffdf is the fransFUN argument! Have a look at it, it allows a lot of 
fine tuning.

Best Regard

Christian Kamenik
Project Manager

Federal Department of the Environment, Transport, Energy and Communications 
DETEC
Federal Roads Office FEDRO
Division Road Traffic
Road Accident Statistics

Mailing Address: 3003 Bern
Location: Weltpoststrasse 5, 3015 Bern

Tel +41 31 323 14 89
Fax +41 31 323 43 21

christian.kame...@astra.admin.ch<mailto:christian.kame...@astra.admin.ch>
www.astra.admin.ch<http://www.astra.admin.ch/>

Von: Jan Wijffels [mailto:jwijff...@bnosac.be]
Gesendet: Donnerstag, 8. August 2013 11:46
An: Kamenik Christian ASTRA
Betreff: Re: read.table.ffdf and fixed width files

Christian,

You probably misspecified column names in the transFUN. Mark that 
read.table.ffdf reads in your data in chunks and puts that chunk to an ffdf. In 
transFUN you get one chunk in RAM based on which you can do data manipulations. 
It should return a data.frame which will be appended to your ffdf.

So. This worked out fine for me.

Jan

require(ff)
fwffile <- "/home/janw/Desktop/testdata.txt"
info <- list()
info$widths <- 
c(3,28,4,30,28,6,3,30,10,3,3,6,6,5,1,2,1,1,2,2,2,4,2,4,7,30,1,1,3,2,4,4)
info$colnames <- paste("column", 1:length(info$widths))
x <- read.table.ffdf(file=fwffile, FUN="read.fwf",
                     ## Change widths and col.names accordingly
                     widths=info$widths,
                     col.names = info$colnames,
                     fileEncoding = "LATIN1",
                     transFUN=function(x){
                       ## Handle fixed width spaces
                       x$column.2 <- factor(gsub(" *$", "", x$column.2))
                       x$column.5 <- factor(gsub(" *$", "", x$column.5))
                       x
                     })
class(x)
str(as.data.frame(x))

2013/8/8 
<christian.kame...@astra.admin.ch<mailto:christian.kame...@astra.admin.ch>>
Hi Jan,

With the following column widths
3 28  4 30 28  6  3 30 10  3  3  6  6  5  1  2  1  1  2  2  2  4  2  4  7 30  1 
 1  3  2  4  4

I got this error:
Error in `$<-.data.frame`(`*tmp*`, "b", value = integer(0)) :
  replacement has 0 rows, data has 1000

Best Regard

Christian Kamenik
Project Manager

Federal Department of the Environment, Transport, Energy and Communications 
DETEC
Federal Roads Office FEDRO
Division Road Traffic
Road Accident Statistics

Mailing Address: 3003 Bern
Location: Weltpoststrasse 5, 3015 Bern

Tel +41 31 323 14 89
Fax +41 31 323 43 21

christian.kame...@astra.admin.ch<mailto:christian.kame...@astra.admin.ch>
www.astra.admin.ch<http://www.astra.admin.ch/>

Von: Jan Wijffels [mailto:jwijff...@bnosac.be<mailto:jwijff...@bnosac.be>]
Gesendet: Donnerstag, 8. August 2013 11:15

An: Kamenik Christian ASTRA
Betreff: Re: read.table.ffdf and fixed width files

Hi Christian,

The following worked out for me on your data.

best,
Jan

require(ff)
fwffile <- "/home/janw/Desktop/testdata.txt"
x <- read.table.ffdf(file=fwffile, FUN="read.fwf",
                     ## Change widths and col.names accordingly
                     widths=c(3,28,4,30,281,142),
                     col.names = c("a","b","c","e","f","g"),
                     fileEncoding = "LATIN1",
                     transFUN=function(x){
                       ## Handle fixed width spaces
                       x$b <- factor(gsub(" *$", "", x$b))
                       x$e <- factor(gsub(" *$", "", x$e))
                       x
                     })
class(x)
str(as.data.frame(x))

2013/8/8 
<christian.kame...@astra.admin.ch<mailto:christian.kame...@astra.admin.ch>>
Hello Jan

Many thanks for your help! I attached part of the file.

Best Regard

Christian Kamenik
Project Manager

Federal Department of the Environment, Transport, Energy and Communications 
DETEC
Federal Roads Office FEDRO
Division Road Traffic
Road Accident Statistics

Mailing Address: 3003 Bern
Location: Weltpoststrasse 5, 3015 Bern

Tel +41 31 323 14 89
Fax +41 31 323 43 21

christian.kame...@astra.admin.ch<mailto:christian.kame...@astra.admin.ch>
www.astra.admin.ch<http://www.astra.admin.ch/>

Von: Jan Wijffels [mailto:jwijff...@bnosac.be<mailto:jwijff...@bnosac.be>]
Gesendet: Donnerstag, 8. August 2013 08:53
An: Kamenik Christian ASTRA
Betreff: Re: read.table.ffdf and fixed width files

Hello Christian,

If you need further help, it would probably be best to send a part of the file.
So that we can check out if read.table.ffdf or LaF will both work.

Jan
2013/8/8 
<christian.kame...@astra.admin.ch<mailto:christian.kame...@astra.admin.ch>>
Hi Jan

Many thanks for your response. LaF is a great package, and I tried your 
suggestion using it. But I ran into another problem (see my recent post on  
"laf_open_fwf").

Best Regard

Christian Kamenik
Project Manager

Federal Department of the Environment, Transport, Energy and Communications 
DETEC
Federal Roads Office FEDRO
Division Road Traffic
Road Accident Statistics

Mailing Address: 3003 Bern
Location: Weltpoststrasse 5, 3015 Bern

Tel +41 31 323 14 89
Fax +41 31 323 43 21

christian.kame...@astra.admin.ch<mailto:christian.kame...@astra.admin.ch>
www.astra.admin.ch<http://www.astra.admin.ch/>

Von: Jan Wijffels [mailto:jwijff...@bnosac.be<mailto:jwijff...@bnosac.be>]
Gesendet: Mittwoch, 7. August 2013 10:28
An: Kamenik Christian ASTRA
Betreff: read.table.ffdf and fixed width files

Hi Christian,

Regarding your question on R-help about 'read.table.ffdf and fixed width 
files'. Maybe this post is of interest to you 
http://r.789695.n4.nabble.com/Any-way-to-get-read-table-ffdf-in-the-ff-package-to-pass-colClasses-or-comment-char-parameters-throu-td4643171.html
For my purposes, I regularly use the package LaF together with ffbase. If you 
set up the definition of your fixed width as a LaF object and use laf_to_ffdf 
from package ffbase, you have your dataset immediately in an ffdf.

groeten/kind regards,
Jan

Jan Wijffels
Statistical Data Miner
www.bnosac.be<http://www.bnosac.be>  | +32 486 611708



--
groeten/kind regards,
Jan

Jan Wijffels
Statistical Data Miner
www.bnosac.be<http://www.bnosac.be>  | +32 486 611708



--
groeten/kind regards,
Jan

Jan Wijffels
Statistical Data Miner
www.bnosac.be<http://www.bnosac.be>  | +32 486 611708



--
groeten/kind regards,
Jan

Jan Wijffels
Statistical Data Miner
www.bnosac.be<http://www.bnosac.be>  | +32 486 611708

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to