> On Dec 25, 2014, at 1:04 AM, Mike Miller <mbmille...@gmail.com> wrote:
> 
> I just wanted to put this out there.  It's just some of my observations about 
> things that happen with R, or happened in this particular investigation.  
> There were definitely some lessons for me in this, and maybe that will be 
> true of someone else.  The main thing I picked up is that it is good to put 
> plenty of checks into our code -- if we expect input of a certain type or 
> class, then I should either coerce input into that structure or test the 
> input and throw an error.  If the function works very differently for 
> different kinds of input, this should be documented.  The more people are 
> doing this, the better things will go for everyone.
> 
> 
> I was working with a CRAN package called RFGLS...
> 
> http://cran.r-project.org/web/packages/RFGLS/index.html
> 
> ...and I was getting an error.  After a few rounds of testing I realized that 
> the error was caused by a FAMID variable that was of character type.

But The Details section of the help page does say that the accepted FTYPES are 
all integers between 1 and 6 and the INDIV variables are integers in range 1:4.

> The problem seemed to be that gls.batch() expected FAMID to be integers, but 
> the default ought to be character type because family and individual IDs in 
> nearly all genetic-analysis software are character strings (they might even 
> be people's names).

You are making up rules that were not in accord with the documentation.

> This was the error:
> 
> Error in sum(blocksize) : invalid 'type' (character) of argument
> Calls: gls.batch -> bdsmatrix
> 
> To figure out more about it, I spent a bunch of time to go from CMD BATCH 
> mode to an interactive session so that I could look at traceback().  

Generally the first thing to check is the help page. And if there is a worked 
example to look at its data:

> data(pedigree, package="RFGLS")
> str(pedigree)
'data.frame':   4050 obs. of  5 variables:
 $ FAMID: int  10 10 10 10 20 20 20 20 30 30 ...
 $ ID   : int  11 12 13 14 21 22 23 24 31 32 ...
 $ PID  : int  14 14 0 0 24 24 0 0 34 34 ...
 $ MID  : int  13 13 0 0 23 23 0 0 33 33 ...
 $ SEX  : num  1 1 2 1 2 2 2 1 2 2 …

— 
David.

> That got me this additional info:
> 
>> traceback()
> 2: bdsmatrix(sizelist, lme.out$sigma@blocks, dimnames = list(id, id))
> 
> bdsmatrix() is from a package on which RFGLS depends:
> 
> http://cran.r-project.org/web/packages/bdsmatrix/index.html
> 
> The problem is that RFGLS's gls.batch() function is sending something to 
> bdsmatrix's bdsmatrix() that it can't handle.  So I look at the code for 
> bdsmatrix() and I see this:
> 
>     if (any(blocksize <= 0))
>         stop("Block sizes must be >0")
>     if (any(as.integer(blocksize) != blocksize))
>         stop("Block sizes must be integers")
>     n1 <- as.integer(sum(blocksize))
> 
> The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE) only 
> if blocksize contains one or more noninteger numeric values.  It doesn't fail 
> if blocksize is character or logical if the character strings are integers.  
> Example:
> 
>> 4=="4"
> [1] TRUE
> 
> That's an interesting feature of R, but I guess that's how it works. Also 
> this:
> 
>> 1=="1"
> [1] TRUE
>> 1==TRUE
> [1] TRUE
>> "1"==TRUE
> [1] FALSE
> 
> bdsmatrix() has no test that blocksize is numeric, so it fails when 
> sum(blocksize) cannot sum character strings.
> 
> Next I had to figure out where RFGLS's gls.batch() is going wrong in 
> producing sizelist.  It is created in a number of steps, but I identified 
> this line as especially suspicious:
> 
> test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length)
> 
> famsize was later converted to sizelist, and this line also includes FAMID, 
> so this is likely where the problem originates.  Of course this is the big 
> problem with debugging -- it's hard to find the source of an error that 
> occurs far downstream in another function from a different package. I see 
> that ave() is used, so I have to understand ave().
> 
> William Dunlap provided some guidance:
> 
> "ave() uses its first argument, 'x', to set the length of its output and to 
> make an initial guess at the type of its output.  The return value of FUN can 
> alter the type, but only in an 'upward' direction where 
> logical<integer<numeric<complex<character<list.  (This is the same rule that 
> x[i]<-newvalue uses.)"
> 
> In other words, if x is of character type, the output cannot be of integer or 
> numeric type even if the output of FUN is always of integer or numeric type.  
> Looking at the ave() code, I can understand that choice:
> 
> function (x, ..., FUN = mean)
> {
>    if (missing(...))
>        x[] <- FUN(x)
>    else {
>        g <- interaction(...)
>        split(x, g) <- lapply(split(x, g), FUN)
>    }
>    x
> }
> 
> If the factor is missing an element, then the corresponding element of X is 
> not changed in the output:
> 
>> fact <- gl(2,2)
>> fact[3] <- NA
>> fact
> [1] 1    1    <NA> 2
> Levels: 1 2
>> ave(1:4, fact)
> [1] 1.5 1.5 3.0 4.0
> 
> That's a reasonable plan, but it isn't the documented functioning of ave().  
> From the document...
> 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ave.html
> 
> ...you get next to nothing about what the function actually does.  It does 
> say that x is "a numeric," but the function does not throw an error when x is 
> not numeric.  So if someone writes code expecting numeric x, but a user 
> provides a non-numeric x, there may be trouble.
> 
> I suspect that the programmer saw that the code worked in her examples and 
> she went on to other things.  I can't blame the documentation for that, but 
> it is possible that if it said something about the relation between the type 
> of the input and the type of the output she might have written it 
> differently.  In addition, I probably would have caught it sooner and I would 
> have understood the problem.
> 
> This is how I'll recommend they fix the bug in the code (thanks to those of 
> you who helped with this):
> 
> temp.vec <- as.character( test.dat$FAMID[ test.dat$FTYPE != 6 ] )
> test.dat$famsize[ test.dat$FTYPE != 6 ] <- as.vector( table( temp.vec )[ 
> temp.vec ] )
> rm(temp.vec)
> 
> I think we should force FAMID to be character from the beginning, though.
> 
> Best,
> Mike
> 
> FYI -- RFGLS code that fails in RFGLS version 1.1:
> 
> library(RFGLS)
> 
> data(pheno)
> data(geno)
> data(map)
> data(pedigree)
> data(rescovmtx)
> #comment out the following two lines and it will run correctly
> pedigree$FAMID <- as.character(pedigree$FAMID)
> pheno$FAMID <- as.character(pheno$FAMID)
> minigwas <- gls.batch(phenfile=pheno,genfile=data.frame(t(geno)),
>  pedifile=pedigree,
>  snp.names=map[,2],input.mode=c(1,2,3),pediheader=FALSE,
>  pedicolname=c("FAMID","ID","PID","MID","SEX"),
>  phen="Zscore",covars="IsFemale",
>  outfile=NULL,col.names=TRUE,return.value=TRUE)
> str(minigwas)
> 
> Mike
> 
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to