> On Dec 25, 2014, at 1:04 AM, Mike Miller <mbmille...@gmail.com> wrote: > > I just wanted to put this out there. It's just some of my observations about > things that happen with R, or happened in this particular investigation. > There were definitely some lessons for me in this, and maybe that will be > true of someone else. The main thing I picked up is that it is good to put > plenty of checks into our code -- if we expect input of a certain type or > class, then I should either coerce input into that structure or test the > input and throw an error. If the function works very differently for > different kinds of input, this should be documented. The more people are > doing this, the better things will go for everyone. > > > I was working with a CRAN package called RFGLS... > > http://cran.r-project.org/web/packages/RFGLS/index.html > > ...and I was getting an error. After a few rounds of testing I realized that > the error was caused by a FAMID variable that was of character type.
But The Details section of the help page does say that the accepted FTYPES are all integers between 1 and 6 and the INDIV variables are integers in range 1:4. > The problem seemed to be that gls.batch() expected FAMID to be integers, but > the default ought to be character type because family and individual IDs in > nearly all genetic-analysis software are character strings (they might even > be people's names). You are making up rules that were not in accord with the documentation. > This was the error: > > Error in sum(blocksize) : invalid 'type' (character) of argument > Calls: gls.batch -> bdsmatrix > > To figure out more about it, I spent a bunch of time to go from CMD BATCH > mode to an interactive session so that I could look at traceback(). Generally the first thing to check is the help page. And if there is a worked example to look at its data: > data(pedigree, package="RFGLS") > str(pedigree) 'data.frame': 4050 obs. of 5 variables: $ FAMID: int 10 10 10 10 20 20 20 20 30 30 ... $ ID : int 11 12 13 14 21 22 23 24 31 32 ... $ PID : int 14 14 0 0 24 24 0 0 34 34 ... $ MID : int 13 13 0 0 23 23 0 0 33 33 ... $ SEX : num 1 1 2 1 2 2 2 1 2 2 … — David. > That got me this additional info: > >> traceback() > 2: bdsmatrix(sizelist, lme.out$sigma@blocks, dimnames = list(id, id)) > > bdsmatrix() is from a package on which RFGLS depends: > > http://cran.r-project.org/web/packages/bdsmatrix/index.html > > The problem is that RFGLS's gls.batch() function is sending something to > bdsmatrix's bdsmatrix() that it can't handle. So I look at the code for > bdsmatrix() and I see this: > > if (any(blocksize <= 0)) > stop("Block sizes must be >0") > if (any(as.integer(blocksize) != blocksize)) > stop("Block sizes must be integers") > n1 <- as.integer(sum(blocksize)) > > The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE) only > if blocksize contains one or more noninteger numeric values. It doesn't fail > if blocksize is character or logical if the character strings are integers. > Example: > >> 4=="4" > [1] TRUE > > That's an interesting feature of R, but I guess that's how it works. Also > this: > >> 1=="1" > [1] TRUE >> 1==TRUE > [1] TRUE >> "1"==TRUE > [1] FALSE > > bdsmatrix() has no test that blocksize is numeric, so it fails when > sum(blocksize) cannot sum character strings. > > Next I had to figure out where RFGLS's gls.batch() is going wrong in > producing sizelist. It is created in a number of steps, but I identified > this line as especially suspicious: > > test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length) > > famsize was later converted to sizelist, and this line also includes FAMID, > so this is likely where the problem originates. Of course this is the big > problem with debugging -- it's hard to find the source of an error that > occurs far downstream in another function from a different package. I see > that ave() is used, so I have to understand ave(). > > William Dunlap provided some guidance: > > "ave() uses its first argument, 'x', to set the length of its output and to > make an initial guess at the type of its output. The return value of FUN can > alter the type, but only in an 'upward' direction where > logical<integer<numeric<complex<character<list. (This is the same rule that > x[i]<-newvalue uses.)" > > In other words, if x is of character type, the output cannot be of integer or > numeric type even if the output of FUN is always of integer or numeric type. > Looking at the ave() code, I can understand that choice: > > function (x, ..., FUN = mean) > { > if (missing(...)) > x[] <- FUN(x) > else { > g <- interaction(...) > split(x, g) <- lapply(split(x, g), FUN) > } > x > } > > If the factor is missing an element, then the corresponding element of X is > not changed in the output: > >> fact <- gl(2,2) >> fact[3] <- NA >> fact > [1] 1 1 <NA> 2 > Levels: 1 2 >> ave(1:4, fact) > [1] 1.5 1.5 3.0 4.0 > > That's a reasonable plan, but it isn't the documented functioning of ave(). > From the document... > > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ave.html > > ...you get next to nothing about what the function actually does. It does > say that x is "a numeric," but the function does not throw an error when x is > not numeric. So if someone writes code expecting numeric x, but a user > provides a non-numeric x, there may be trouble. > > I suspect that the programmer saw that the code worked in her examples and > she went on to other things. I can't blame the documentation for that, but > it is possible that if it said something about the relation between the type > of the input and the type of the output she might have written it > differently. In addition, I probably would have caught it sooner and I would > have understood the problem. > > This is how I'll recommend they fix the bug in the code (thanks to those of > you who helped with this): > > temp.vec <- as.character( test.dat$FAMID[ test.dat$FTYPE != 6 ] ) > test.dat$famsize[ test.dat$FTYPE != 6 ] <- as.vector( table( temp.vec )[ > temp.vec ] ) > rm(temp.vec) > > I think we should force FAMID to be character from the beginning, though. > > Best, > Mike > > FYI -- RFGLS code that fails in RFGLS version 1.1: > > library(RFGLS) > > data(pheno) > data(geno) > data(map) > data(pedigree) > data(rescovmtx) > #comment out the following two lines and it will run correctly > pedigree$FAMID <- as.character(pedigree$FAMID) > pheno$FAMID <- as.character(pheno$FAMID) > minigwas <- gls.batch(phenfile=pheno,genfile=data.frame(t(geno)), > pedifile=pedigree, > snp.names=map[,2],input.mode=c(1,2,3),pediheader=FALSE, > pedicolname=c("FAMID","ID","PID","MID","SEX"), > phen="Zscore",covars="IsFemale", > outfile=NULL,col.names=TRUE,return.value=TRUE) > str(minigwas) > > Mike > > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.