Hi,

I'm trying to read a file containing HTML markup (discussion board posts) and write the parts of each post (date, author, title, body) to separate fields of a record in an output file. This is a one-off job and I'm trying to use R to do it.
The file looks something like this:

<br><ul>Created: --- - Dr. Johnsons's article -
concerns<br><p>After writing some text..........</p><br>-- by
Anon.<br>
<br><ul>Created: --- - RE:Dr. Johnson's article - concerns<br><p>With
some advance notice about some text........<br>-- by
Anon.<br>

So the <br> tags indicate where the field entries begin and end, and "Created" and "by Anon" mark the beginning and end of each post.
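
To make the field layout concrete, here is a rough sketch of how one post could be split into its four fields once it has been isolated. The sample string and the variable names are made up for illustration, and the sketch assumes the body itself contains no <br> tags and that the date never contains " - ".

```r
## Hypothetical single post, already cut out of the file
post <- "Created: --- - Some title<br><p>Some body\ntext</p><br>-- by Anon"

## Line breaks inside a post carry no meaning, so collapse them first
post <- gsub("[\r\n]+", " ", post)

## <br> separates the header, the body and the signature
parts <- strsplit(post, "<br>", fixed = TRUE)[[1]]

## Header looks like "Created: <date> - <title>"
header <- sub("^Created: ", "", parts[1])
date   <- sub(" - .*$", "", header)                 # text before the first " - "
title  <- sub("^.*? - ", "", header, perl = TRUE)   # text after the first " - "

## Body may or may not have a closing </p>, so strip either tag
body   <- gsub("</?p>", "", parts[2])
author <- sub("^-- by ", "", parts[3])

record <- data.frame(date, author, title, body, stringsAsFactors = FALSE)
```

A data frame of such records could then be written out with write.table() or write.csv().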

The file is named "Module_1.txt". Here is what I have so far:

## adapted from http://finzi.psych.upenn.edu/R/Rhelp02a/archive/64261.html
## gives post beginning and ending points
starts <- gregexpr("Created", readChar("Module_1.txt", file.info("Module_1.txt")$size))[[1]]

ends <- gregexpr("by Anon", readChar("Module_1.txt", file.info("Module_1.txt")$size))[[1]]

## open connection
chk <- file("Module_1.txt", "r")
#seek(chk, origin = "start", ends[1]) # moves through file

## initialize an array
hold <- array(rep(NA, length(starts)))

## write a function to read the connection
catchtext <- function(start, end, source) {
   for(i in 1:length(starts)) {
      hold[i] <- readChar(chk, nchar = ends[i] - starts[i])
   }
}

#close(chk)

and my output is
> hold
  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[126] NA NA

At this point I'm just trying to cut the entire file into post-sized chunks. Later I'll go through and write the separate bits into fields. As the output shows, hold is never filled in: I'm having trouble moving through the connection to the places I want to read from. Suggestions welcome.
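
For comparison, a minimal sketch of one way the chunking might be done without a connection at all, by working on the string that readChar() already returns. The sample text below is a made-up stand-in for the file, and the sketch assumes every "Created" has exactly one matching "by Anon" after it. Note also that in the code above catchtext() is defined but never called, and that an assignment to hold inside a function changes only a local copy, so hold would stay NA either way.

```r
## Made-up stand-in for the contents of Module_1.txt.
## With the real file this would be:
## txt <- readChar("Module_1.txt", file.info("Module_1.txt")$size)
txt <- paste0(
  "<br><ul>Created: --- - First title<br><p>First body</p><br>-- by Anon.<br>\n",
  "<br><ul>Created: --- - RE:First title<br><p>Second body<br>-- by Anon.<br>\n"
)

start_tag <- "Created"
end_tag   <- "by Anon"

## Character positions where each post begins
starts <- as.integer(gregexpr(start_tag, txt, fixed = TRUE)[[1]])

## Positions of the last character of each "by Anon"
ends <- as.integer(gregexpr(end_tag, txt, fixed = TRUE)[[1]]) +
  nchar(end_tag) - 1

## substring() is vectorised over first and last, so no loop or seek() is needed
posts <- substring(txt, starts, ends)
```

Each element of posts then holds one post, from "Created" through "by Anon".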

Thanks,

Scot

> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          6.2
year           2008
month          02
day            08
svn rev        44383
language       R
version.string R version 2.6.2 (2008-02-08)

--
Scot McNary
smcnary at charm dot net

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
