Hi Nilza,

Just to add to David's comments: if you are reading in your file with read.table(..., fill=TRUE), and assuming that you haven't yet replaced -9999 with NA, you don't need grep. You can just use the number of NAs in each line to locate the data blocks.

Date records have 3 NAs
Location records have 2 NAs
Data records have none.

my.data2 <- read.table("d2010100100.txt", fill=TRUE, nrows=20000)
na.count <- apply( my.data2, 1, function(x) sum( is.na(x) ) )
date.recs <- which( na.count == 3)
num.stns <- length(date.recs)
stn.data.length <- c(diff(date.recs) - 2, nrow(my.data2) - date.recs[num.stns] - 1)

Michael

On 4 October 2010 13:05, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Oct 3, 2010, at 9:40 PM, Nilza BARROS wrote:
>
>> Hi, Michael
>> Thank you for your help. I have already done what you said.
>> But I am still facing problems to deal with my data.
>>
>> I need to split the data according to station..
>>
>> I was able to identify where the station information start using:
>>
>> my.data<-file("d2010100100.txt",open="rt")
>> indata <- readLines(my.data, n=20000)
>> i<-grep("^[837]",indata) #station number
>
> That would give you the line numbers for any line that had an 8 , _or_ a 3,
> _or_ a 7 as its first digit. Was that your intent? My guess is that you did
> not really want to use the square braces and should have been using "^837".
>
> ?regex # Paragraph starting "A character class .... "
>
>> my.data2<-read.table("d2010100100.txt",fill=TRUE,nrows=20000)
>> stn<- my.data2$V1[i]
>
> That would give you the first column values for the lines you earlier
> selected.
>
>
>> ====
>
> This does not look like what I would expect as a value for stn. Is that what
> you wanted us to think this was?
>
> --
> David.
>
>
>> 2010 10 01 00
>> *82599 -35.25 -5.91 52 1
>> * 1008.0 -9999 115 3.1 298.6 294.6 64
>> 2010 10 01 00
>> *83649 -40.28 -20.26 4 7*
>> 1011.0 -9999 0 0.0 298.4 296.1 64
>> 1000.0 96 40 5.7 297.9 295.1 32
>> 925.0 782 325 3.1 295.4 294.1 32
>> 850.0 1520 270 4.1 293.8 289.4 32
>> 700.0 3171 240 8.7 284.1 279.1 32
>> 500.0 5890 275 8.2 266.2 262.9 32
>> 400.0 7600 335 9.8 255.4 242.4 32
>> ===========
>> As you can see in the data above, the line shows the number of levels (or
>> lines) for each station.
>> I need to catch these lines so as to be able to feed my database.
>> By the way, I didn't understand the regular expression you've used. I've
>> tried to run it but it did not work.
>>
>> Hope you can help me!
>> Best Regards,
>> Nilza
>>
>>
>> On Sun, Oct 3, 2010 at 2:18 AM, Michael Bedward
>> <michael.bedw...@gmail.com> wrote:
>>
>>> Hello Nilza,
>>>
>>> If your file is small you can read it into a character vector like this:
>>>
>>> indata <- readLines("foo.dat")
>>>
>>> If your file is very big you can read it in batches like this...
>>>
>>> MAXRECS <- 1000 # for example
>>> fcon <- file("foo.dat", open="r")
>>> indata <- readLines(fcon, n=MAXRECS)
>>>
>>> The number of lines read will be given by length(indata).
>>>
>>> You can check to see if the end of the file has been read yet with:
>>> isIncomplete( fcon )
>>>
>>> If a leading "*" character is a flag for the start of a station data
>>> block you can find this in the indata vector with grepl...
>>>
>>> start.pos <- which( grepl("^\\s*\\*", indata) )
>>>
>>> When you're finished reading the file...
>>> close(fcon)
>>>
>>> Hope this helps,
>>>
>>> Michael
>>>
>>>
>>> On 3 October 2010 13:31, Nilza BARROS <nilzabar...@gmail.com> wrote:
>>>>
>>>> Dear R-users,
>>>>
>>>> I would like to know how could I read a file with different lines
>>>> lengths.
>>>>
>>>> I need read this file and create an output to feed my database.
>>>> So after reading I'll need create an output like this
>>>>
>>>> "INSERT INTO TEMP (DATA,STATION,VAR1,VAR2) VALUES (20100910,837460, 39,390)"
>>>>
>>>> I mean, each line should be read. But I don`t how to do this when these
>>>> lines have different lengths
>>>>
>>>> I really appreciate any help.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> ====Below the file that should be read ===========
>>>>
>>>>
>>>> *2010 10 01 00
>>>> 83746 -43.25 -22.81 6 51*
>>>> 1012.0 -9999 320 1.5 299.1 294.4 64
>>>> 1000.0 114 250 4.1 298.4 294.8 32
>>>> 925.0 797 0 0.0 293.6 292.9 32
>>>> 850.0 1524 195 3.1 289.6 288.9 32
>>>> 700.0 3156 290 11.3 280.1 280.1 32
>>>> 500.0 5870 280 20.1 266.1 260.1 32
>>>> 400.0 7570 265 23.7 256.6 222.7 32
>>>> 300.0 9670 265 28.8 240.2 218.2 32
>>>> 250.0 10920 280 27.3 230.2 220.2 32
>>>> 200.0 12390 260 32.4 218.7 206.7 32
>>>> 176.0 -9999 255 37.6 -9999.0 -9999.0 8
>>>> 150.0 14180 245 35.5 205.1 196.1 32
>>>> 100.0 16560 300 17.0 195.2 186.2 32
>>>> *2010 10 01 00
>>>> 83768 -51.13 -23.33 569 41
>>>> * 1000.0 79 -9999 -9999.0 -9999.0 -9999.0 32
>>>> 946.0 -9999 270 1.0 295.8 292.1 64
>>>> 925.0 763 15 2.1 296.4 290.4 32
>>>> 850.0 1497 175 3.6 290.8 288.4 32
>>>> 700.0 3140 295 9.8 282.9 278.6 32
>>>> 500.0 5840 285 23.7 267.1 232.1 32
>>>> 400.0 7550 255 35.5 255.4 231.4 32
>>>> 300.0 9640 265 37.0 242.2 216.2 32
>>>>
>>>>
>>>> Best Regards,
>>>>
>>>> --
>>>> Abraço,
>>>> Nilza Barros
>>
>
>
> David Winsemius, MD
> West Hartford, CT
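
To tie the NA-count approach back to the INSERT statements requested in the original post, here is a minimal sketch on a few sample lines taken from the thread. Several things in it are assumptions rather than anything confirmed above: the "*" marks are taken to be e-mail emphasis and not part of the real file, VAR1 and VAR2 are mapped to the first two data columns purely for illustration, -9999 is written as SQL NULL, and the names sample.lines, sql.value and stmts are hypothetical.

```r
# Sample records in the layout shown in the thread: date / location / data.
# Assumption: the real file has no "*" characters.
sample.lines <- c(
  "2010 10 01 00",
  "83746 -43.25 -22.81 6 51",
  "1012.0 -9999 320 1.5 299.1 294.4 64",
  "1000.0 114 250 4.1 298.4 294.8 32",
  "925.0 797 0 0.0 293.6 292.9 32",
  "2010 10 01 00",
  "83768 -51.13 -23.33 569 41",
  "1000.0 79 -9999 -9999.0 -9999.0 -9999.0 32",
  "946.0 -9999 270 1.0 295.8 292.1 64"
)

# fill=TRUE pads short records with NA (7 columns, set by the first 5 lines)
my.data2 <- read.table(text = sample.lines, fill = TRUE)

# Count NAs per row BEFORE touching the -9999 missing-value code
na.count <- rowSums(is.na(my.data2))
date.recs <- unname(which(na.count == 3))   # date records have 3 NAs
num.stns <- length(date.recs)
stn.data.length <- c(diff(date.recs) - 2,
                     nrow(my.data2) - date.recs[num.stns] - 1)

# Hypothetical helper: write missing values (NA or -9999) as SQL NULL
sql.value <- function(x) ifelse(is.na(x) | x == -9999, "NULL", as.character(x))

stmts <- character(0)
for (k in seq_len(num.stns)) {
  d <- date.recs[k]                  # row holding the date record
  n <- stn.data.length[k]            # number of data rows for this station
  if (n == 0) next
  station <- as.integer(my.data2[d + 1, 1])   # location record follows the date
  rows <- my.data2[seq(d + 2, length.out = n), ]
  date.id <- sprintf("%04d%02d%02d",
                     as.integer(my.data2[d, 1]),
                     as.integer(my.data2[d, 2]),
                     as.integer(my.data2[d, 3]))
  # Assumption: VAR1 and VAR2 are the first two data columns
  stmts <- c(stmts,
             sprintf("INSERT INTO TEMP (DATA,STATION,VAR1,VAR2) VALUES (%s,%d,%s,%s)",
                     date.id, station, sql.value(rows$V1), sql.value(rows$V2)))
}

print(stmts)
```

With these sample lines stn.data.length is 3 and 2, so five statements are produced, one per data row. For the real file the same steps should apply after swapping the text= argument for the file name, as long as one of the first five lines is a full 7-field data record, since read.table(fill=TRUE) sizes the table from those lines.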