On Sat, Jan 14, 2012 at 12:41 PM, Milan Bouchet-Valat <nalimi...@club.fr> wrote:

> On Saturday, January 14, 2012 at 12:24 -0600, Andy Adamiec wrote:
> > Hi Milan,
> >
> >
> > The Solr XML files are not in a typical format; here is an example:
> > http://www.omegahat.org/RSXML/solr.xml
> > I'm not sure how to parse the documents without using the function in
> > solrDocs.R, or how to make that function compatible with the tm package.
> Indeed, this doesn't seem to be easy to parse using the generic XML
> source from tm. So it will be easier for you to create your own custom
> source from scratch. Have a look at the source.R and reader.R files in
> the tm source: you need to replicate the behavior of one of the sources.
>
> The code should include the following functions:
>
> readSorl <- FunctionGenerator(function(...) {
>    function(elem, language, id) {
>        # Use elem$content, which contains an item set by SorlSource() below,
>        # and create a PlainTextDocument() from it,
>        # putting the data where appropriate (text, meta-data)
>    }
> })
>
> SorlSource <- function(x) {
>    # Parse the XML file using functions from solrDocs.R, and
>    # create "content", which is a list with one item for each document,
>    # to pass to readSorl() one by one
>
>    s <- tm:::.Source(readSorl, "UTF-8", length(content), FALSE,
>                      seq(1, length(content)), 0, FALSE)
>    s$Content <- content
>    s$URI <- match.call()$x
>    class(s) <- c("SorlSource", "Source")
>    s
> }
>
> getElem <- function(x) UseMethod("getElem", x)
> getElem.SorlSource <- function(x) {
>    list(content = x$Content[[x$Position]], uri = match.call()$x)
> }
>
> eoi <- function(x) UseMethod("eoi", x)
> eoi.SorlSource <- function(x) length(x$Content) <= x$Position
>
>
> Hope this helps
>
>
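
To make the roles of the source, getElem() and eoi() in the quoted code more concrete, here is a minimal base-R sketch of the same iteration pattern. It does not use tm at all: toy_source, get_elem, end_of_input, step_next, toy_reader and read_all are hypothetical names invented for this illustration, the two sample documents are made up, and the loop only imitates how a source might be consumed one element at a time. It is not tm's actual implementation.

```r
# Toy stand-in for a tm-style source (hypothetical; does not use tm).
# `content` is a list with one item per document.
toy_source <- function(content) {
  structure(list(Content = content, Position = 0L), class = "ToySource")
}

# Generics playing the roles of getElem() and eoi() in the quoted code.
get_elem <- function(x) UseMethod("get_elem")
get_elem.ToySource <- function(x) list(content = x$Content[[x$Position]])

end_of_input <- function(x) UseMethod("end_of_input")
end_of_input.ToySource <- function(x) length(x$Content) <= x$Position

# Advance to the next item and return the updated copy.
step_next <- function(x) {
  x$Position <- x$Position + 1L
  x
}

# Toy reader: turns one element into a plain list with text and meta-data.
toy_reader <- function(elem, id) {
  list(id = id, text = elem$content$text, title = elem$content$title)
}

# Consume the source element by element until the end is reached.
read_all <- function(src) {
  docs <- list()
  while (!end_of_input(src)) {
    src <- step_next(src)
    docs[[length(docs) + 1L]] <- toy_reader(get_elem(src),
                                            as.character(src$Position))
  }
  docs
}

# Made-up sample data standing in for parsed Solr documents.
content <- list(
  list(title = "First",  text = "alpha beta"),
  list(title = "Second", text = "gamma delta")
)
docs <- read_all(toy_source(content))
```

In a real custom source, the made-up `content` list would be replaced by whatever the solrDocs.R parsing produces, and the reader would build a PlainTextDocument() instead of a plain list.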


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
