Re: [Rd] unable to load shared object
Hm. I cannot find it now, but I saw notes about problems with OneDrive, maybe also Dropbox, because file paths are not exactly what R expects. Usually I keep the package library in a local folder. Did your R choose that location automatically?

Paul Johnson
University of Kansas

On Fri, Oct 5, 2018, 3:01 AM Witold E Wolski wrote:
> Thanks for asking.
>
> The path where R tries to load the DLL from does not exist, I think (or the install process deletes it after the error - might this be the case?):
> C:/Users/wewol/OneDrive/Documents/R/win-library/3.5/grpc/libs/x64/grpc.dll
>
> When I check after the error, that path is only valid up to:
> C:/Users/wewol/OneDrive/Documents/R/win-library/3.5/
>
> So maybe (?) the install process wrongly assumes that the package is installed (but it never was) and then it executes
> ** testing if installed package can be loaded
> and runs into the error:
>
> Error: package or namespace load failed for 'grpc' in inDL(x, as.logical(local), as.logical(now), ...):
>  unable to load shared object
>  'C:/Users/wewol/OneDrive/Documents/R/win-library/3.5/grpc/libs/x64/grpc.dll':
>  LoadLibrary failure: The specified module could not be found.
>
> On Fri, 5 Oct 2018 at 09:37, Paul Johnson wrote:
> >
> > Greetings.
> > Is it possible that OneDrive is causing trouble? Do other packages you build and install work well from that directory?
> >
> > Paul Johnson
> > University of Kansas
> >
> > On Wed, Oct 3, 2018, 3:13 AM Witold E Wolski wrote:
> >>
> >> Hello,
> >>
> >> I am trying to install a package with some src files on Windows (the Linux install works fine). The sources seem to build, and there is a *.dll in the src folder, but then the installation process fails with:
> >>
> >> ```
> >> ** testing if installed package can be loaded
> >> Error: package or namespace load failed for 'grpc' in inDL(x, as.logical(local), as.logical(now), ...):
> >>  unable to load shared object
> >>  'C:/Users/wewol/OneDrive/Documents/R/win-library/3.5/grpc/libs/x64/grpc.dll':
> >>  LoadLibrary failure: The specified module could not be found.
> >>
> >> Error: loading failed
> >> Execution halted
> >> ```
> >>
> >> Do I need to point the installer to the *.dll in the src directory by creating some special function (e.g. dyn.load) or creating some special file?
> >>
> >> Help would be greatly appreciated.
> >>
> >> The package sources are here:
> >> https://github.com/wolski/grpc-1
> >>
> >> Witek
> >>
> >> --
> >> Witold Eryk Wolski

--
Witold Eryk Wolski

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
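A minimal sketch of the usual workaround (the folder name below is only an example, not taken from the post): keep the package library outside OneDrive and point R at it.

## Use a plain local library instead of the OneDrive-synced default.
local_lib <- "C:/R/win-library/3.5"                     # example path
dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)

## Put the local library first on the search path for this session.
.libPaths(c(local_lib, .libPaths()))

## Install into and load from that library.  For a source package from
## GitHub one would use R CMD INSTALL -l <local_lib>, or a helper such as
## remotes::install_github(..., lib = local_lib).
install.packages("somepackage", lib = local_lib)        # placeholder name
library(somepackage, lib.loc = local_lib)

## To make the change permanent, set R_LIBS_USER in ~/.Renviron, e.g.
##   R_LIBS_USER=C:/R/win-library/3.5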
[Rd] Proposed function file.backup
Dear R Core:

In the kutils package, I wrote a function that is so handy that I would like to ask you to put it in R itself. file.backup() will look at a file, find its last modification time, and create a new backup with a name that appends YYYYMMDD-HHMM to the file name. So now, whenever I worry that running write.csv or saving a graph might destroy something valuable, I use an idiom like

fn <- "filename.csv"
file.backup(fn)
write.csv(x, file = fn)

So if "filename.csv" already exists, it gets a backup copy. If an old backup version exists, the function doesn't re-copy. In my workflow this is very convenient, and it could be integrated with other R file writers that have overwrite as a parameter. If overwrite could be set to "backup" instead of TRUE or FALSE, it would be convenient. I sincerely hope you will consider it:

##' Create a backup version of a file by renaming it.
##'
##' Inserts the date-time of the most recent modification at the
##' end of the file name, before the extension.
##'
##' Return is the new file name that was created, using whatever
##' path information was provided in the file's original name. However,
##' the fullpath argument can be set to TRUE, so a path with the
##' full directory name will be created and returned.
##'
##' @param name A character string for the name of the file.
##' @param fullpath Return the full directory path to the
##'     file. Default FALSE, return only the file name.
##' @param keep.old If FALSE (default), rename the file. Otherwise, keep the old copy.
##' @param verbose If TRUE, list the files in the output directory.
##' @author Shadi Pirhosseinloo, Paul Johnson
##' @return The name of the newly created file.
##' @importFrom tools file_ext
##' @importFrom tools file_path_sans_ext
##' @export
##' @examples
##' tdir <- tempdir()
##' owd <- getwd()
##'
##' setwd(tdir)
##' system("touch test.1.txt")
##' system("touch test.2.txt")
##' system("touch test.3.txt")
##' system("touch test.4.txt")
##' system("touch test.5.txt")
##' ## note: no extension next
##' system("touch test.6")
##' list.files()
##' file.backup("test.1.txt")
##' file.backup("test.2.txt", fullpath=TRUE)
##' list.files()
##' setwd(owd)
##' file.backup(file.path(tdir, "test.3.txt"))
##' ## Next should be same path because input had a full path
##' file.backup(file.path(tdir, "test.4.txt"), fullpath=TRUE)
##' file.backup(file.path(tdir, "test.5.txt"), fullpath = TRUE, verbose = TRUE)
##' file.backup(file.path(tdir, "test.6"))
file.backup <- function(name, fullpath = FALSE, keep.old = FALSE, verbose = FALSE){
    if(!file.exists(name)){
        MESSG <- paste("file", name, "does not exist. No backup created.")
        warning(MESSG)
        return(NULL)
    }
    dir.source <- dirname(normalizePath(name))
    date_cr <- base::format(base::file.info(name)$mtime, "%Y%m%d-%H%M")
    ext_name <- tools::file_ext(name)
    noext_name <- tools::file_path_sans_ext(name)
    new_name <- paste0(noext_name, "-", date_cr,
                       if(!ext_name == "") {paste0(".", ext_name)})
    ## Abort if new file name already exists
    if (file.exists(new_name)){
        MESSG <- paste("backup file already exists. No new backup created.")
        warning(MESSG)
        return(new_name)
    }
    ret <- if(keep.old){
        file.copy(name, new_name, recursive = TRUE, overwrite = TRUE,
                  copy.mode = TRUE, copy.date = TRUE)
    } else {
        file.rename(name, new_name)
    }
    if(!ret) {
        MESSG <- paste("file.rename(", name, ",", new_name, ") failed")
        stop(MESSG)
    }
    if(verbose){
        cat("Full file list of directory:", dir.source, "\n")
        print(list.files(dir.source))
        cat("End of file list\n")
    }
    if(fullpath){
        new_name_fullpath <- normalizePath(new_name)
        return(new_name_fullpath)
    }
    ## else, give back adjusted with same path information it had
    new_name
}

--
Paul E. Johnson, http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis, http://crmda.ku.edu
To write to me directly, please address me at pauljohn at ku.edu.

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
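A rough illustration of the overwrite = "backup" idea mentioned above; write.csv.backup() is invented for the example and simply calls file.backup() before writing:

## Hypothetical wrapper sketching how an existing writer could accept
## "backup" in addition to TRUE/FALSE for overwrite.
write.csv.backup <- function(x, file, overwrite = "backup", ...) {
    if (file.exists(file)) {
        if (identical(overwrite, "backup")) {
            file.backup(file, keep.old = TRUE)   # keep a dated copy first
        } else if (identical(overwrite, FALSE)) {
            stop("file ", file, " exists and overwrite = FALSE")
        }
    }
    write.csv(x, file = file, ...)
}

## Usage sketch:
## dat <- data.frame(x = 1:3)
## write.csv.backup(dat, "filename.csv")  # first write, nothing to back up
## write.csv.backup(dat, "filename.csv")  # creates filename-YYYYMMDD-HHMM.csv, then overwrites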
Re: [Rd] Large discrepancies in the same object being saved to .RData
On Wed, Jul 7, 2010 at 7:12 AM, Duncan Murdoch wrote:
> On 06/07/2010 9:04 PM, julian.tay...@csiro.au wrote:
>>
>> Hi developers,
>>
>> After some investigation I have found there can be large discrepancies in the same object being saved as an external "xx.RData" file. The immediate repercussion of this is the possible increased size of your .RData workspace for no apparent reason.
>>
> I haven't worked through your example, but in general the way that local objects get captured is when part of the return value includes an environment.

Hi, can I ask a follow-up question?

Is there a tool to browse *.RData files without loading them into R? In HDF5 (a data storage format we use sometimes), there is a CLI program "h5dump" that will spit out line-by-line all the contents of a storage entity. It will literally track through all the metadata, all the vectors of scores, etc.

I've found that handy to "see what's really in there" in cases like the one the OP asked about. Sometimes we find that there are things that are "in there" by mistake, as Duncan describes, and then we can try to figure out why they are in there.

pj

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
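Not an h5dump equivalent, but a minimal sketch of inspecting an .RData file from inside R without touching the workspace (the file and object names are made up):

## Load the file into a throwaway environment rather than the global workspace.
e <- new.env()
load("work.RData", envir = e)          # file name is just an example

## What is in there, and how big is each piece?
ls.str(e)                              # names plus a brief structure dump
sizes <- sapply(ls(e), function(nm) object.size(get(nm, envir = e)))
sort(sizes, decreasing = TRUE)         # biggest objects first

## Environments captured in fitted models are a common source of bloat, e.g.
## str(attributes(e$some_model$terms))   # hypothetical object name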
[Rd] Can you share a working example of R program aided by fast BLAS?
Can one of you give me an R program that displays the benefits of an accelerated BLAS in R? Here's why I ask, in case you wonder.

In a Linux cluster, I've hit some bumps in the road. The worst one by far was that I installed R, then GotoBLAS2 with default settings, and after that, jobs using Rmpi were *really* *really* slow. I mean horrible. If a job took 15 minutes when run by itself, outside of MPI, it took 1 full day when run inside MPI. Literally the same job.

I learned later that GotoBLAS2 defaults to allowing threads equal to the number of cores, and that the threads are not compatible with MPI. This latter point is not clearly stated in the GotoBLAS2 documents, so far as I can find, but after I realized that was the problem, I did find one other cluster website that mentioned the same problem: "If your application uses GotoBLAS and all cores as MPI threads, setting GOTO_NUM_THREADS larger than one will usually result in drastically slower performance." (http://hpc.uark.edu/hpc/support/software/numerical.html#gotoblas). In the GotoBLAS2 documentation, it warns of weird thread-related delays, but it implies that the slowdown--if it happens--is a result of bad user code, rather than this more fundamental mismatch between OpenMPI (or MPI in general) and GotoBLAS2.

In the process of diagnosing the big slowdown, I've been making many time comparisons. When I installed GotoBLAS2 in the first place, it was because so many people (and the R admin manual) said that R's ordinary BLAS is rudimentary/slow. In the test cases I've tried, R's BLAS is not that bad. In fact, in the test programs we run, the time is not substantially different with GotoBLAS2 and R's BLAS. I also compared the Intel Math Kernel Library BLAS and didn't notice a huge difference. So, well, I think that means I'm running bad test cases for R and GotoBLAS2.

Oh, and one more thing. I have not been able to find an example R program that benefitted at all from allowing threads > 1 in the GotoBLAS2 environment settings. In fact, if a one-thread job takes 15 minutes, the one that allows 2 or more threads takes 21 minutes. And allowing more threads causes the job to take longer. This is literally the same job, same cluster node; the only difference is changing the environment variable that adjusts the GotoBLAS2 threads allowed. So if you know whether your example depends on threads or not, I would appreciate the warning.

pj

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
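In case it helps anyone answer, the kind of test case I have been timing looks like this -- a sketch with arbitrary matrix sizes; dense products and decompositions are where an optimized BLAS should show up:

## BLAS-bound operations: large dense products and decompositions.
set.seed(12345)
n <- 2000
X <- matrix(rnorm(n * n), n, n)

system.time(crossprod(X))          # t(X) %*% X, handled by the BLAS
system.time(X %*% X)               # plain dense matrix product
system.time(solve(crossprod(X)))   # inversion via LAPACK, sitting on the BLAS
system.time(svd(X))                # singular value decomposition

## With R's reference BLAS these should take noticeably longer than with
## GotoBLAS2/MKL; loops and data-management code should be unaffected.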
Re: [Rd] C or Java code generation
On Thu, Aug 19, 2010 at 5:47 AM, Vyacheslav Karamov wrote:
>> Hi All!
>>
>> I'm new to R and I need to know is it possible for R to generate C/C++ source code, Java byte code or native Win32 DLL like MatLab?
>
> Is there any possibility to use R without installing?
> I mean that I have my own application written in MS Visual C++ and I need to use R script in this app. I can install R and use it via DCOM, but it's not convenient for the end users of my program.

Do your users have Web access? If so, you can set up an Rweb (or similar) server that could run the jobs for those users. I have one set up like that. Users transmit their code via http and can get results back in a web page.

I think you could probably program around the output management question, but I did not try to do it. I mean, something like "wget" could download the images directly, by-passing the http server.

Rweb has been abandoned by its original author, but it can be made to work, and there are other, newer http server programs to run R. I have not tried them yet, mainly because my customers needed to run Rweb because their clients told them to use it.

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] No RTFM?
On Thu, Aug 19, 2010 at 7:08 PM, Spencer Graves wrote: > What do you think about adding a "No RTFM" policy to the R mailing lists? > Per, "http://en.wikipedia.org/wiki/RTFM": > I think this is a great suggestion. I notice the R mailing list already has a gesture in this direction: "Rudeness and ad hominem comments are not acceptable. Brevity is OK." But the people who behave badly don't care about policies like this and they will keep doing what they do. pj -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] No RTFM?
On Sat, Aug 21, 2010 at 11:04 AM, Hadley Wickham wrote:
>> previous suggestion by a regular contributor. I still think a better response is not to escalate: Either ignore the post or say something like, "I don't understand your question. Please provide a self-contained minimal example as suggested in the Posting Guide ... ."
>
> I agree wholeheartedly. I have tried to do this with the ggplot2 mailing list, and I think it has been extremely successful in fostering a community that is friendly and polite, yet still provides excellent technical support (and these days, most of it doesn't come from me!).

I've been insulted in r-help and it's no fun. After Gabor pointed out the logic in it, I have to admit he is right. It does keep traffic down. I don't go there anymore unless I'm completely desperate.

I agree also, sometimes RTFM is the right answer, especially if it is stated as "That is discussed on p. 162 of the R Guide for Whatever...". I don't think people are insulted if you tell them to check something specific. Usually, first-time visitors don't know about the R posting guide, and when they ask an incomplete question, we should just refer them to it. We don't need the angry tone that we often see, but I don't think people mind being referred. This presupposes the posting guide is helpful. If somebody forgets the posting guide twice, then I think we should all berate and insult them. I mean vigorously :)

My personal opinion is that the R posting guide could be revised to be more direct. Exhausted, frustrated people don't benefit as much as they could because the document is too long. This is hard to fix because everything in there was added for good reasons. (I've been here long enough to remember a time before the guide, sad to say.) I tried to reshape the posting guide and I can see that it is a really hard problem.

What do you think of this: The priority is to put the most important thing at the top. The second priority is brevity.

=
Posting Guide: How to ask good questions that prompt useful answers

People are busy, so ask your question in a useful way.

1. Every question to r-help should begin with the following.

A. Output from the command sessionInfo()
B. Output from Sys.getlocale()
C. Information about how you installed R. Did you make any changes, such as incorporating a BLAS library? If you don't know, ask your system administrator.
D. If you see an error or something unexpected, your message MUST include the EXACT code that you were running. We mean ALL of your commands that were run before the error occurred. If you are unsure of what you did, close your session, clean up your code, and start over to reproduce the problem in the most direct way. Your post MUST include the EXACT VERBATIM error message.

If you are working on a long program that requires a dataset, post the dataset and the program somewhere on the Internet and refer us to it. It is simply not possible to guess about what might be going wrong in your program unless we can see it. If you don't want to share your data, construct a small example dataset that produces the same problem. Post it and refer us to it.

E. If you have isolated the problem to a certain part of your project, write a small, self-contained 'working example' that causes the problem and include it with your post.

Example. Why does this code:

dat <- data.frame(x=c(1,2,3), y=c(3,4,5))
plot (x, y, data=dat)

cause this:

Error in plot(x, y, data = dat) : object 'x' not found

We can easily reproduce that and explain the problem to you. We can't help with a question like "my plot failed, something about an object was missing."

2. How to avoid making the members of r-help angry at you.

A. Do not call a problem a "bug" unless you really mean to say that someone has made a mistake. If you say "bug", they hear "embarrassing personal flaw" or "gigantic boil". We know you don't mean that, but sometimes they forget.

B. Before you write to r-help, search for answers from previous questions.
   1. Try Google? Or her cousin Yahoo?
   2. Try these inside R:
      help.search("whatever")
      RSiteSearch("whatever")
      apropos("whatever")

C. Read R-intro, R help pages, and the R FAQs.
      ?whatever

3. Do not send your question to r-help unless your question is about R or the base R packages that are supported by the R Core Team.

A. Don't ask statistics questions, unless they fall into the form "Which R procedure or package should I use to conduct an analysis of ..." or "Does anybody have a working example of procedure XYZ?" or "Can I obtain XYZ result from an object of class ZYX?"

B. If you have trouble with a package from CRAN or elsewhere, write to the author of that package. r-help might be a good place to ask, "I've been using package XYZ and the author seems to have abandoned the project. Does anybody know of a replacement?" Otherwise, don't.

Note that the
Re: [Rd] No RTFM?
Hey Ben:

One of my colleagues bought your book and was reading it during a faculty meeting last Tuesday. Everybody kept asking "what's that?" If you know how to put it in the wiki, would you please do it and let us know where it is. I was involved with an R wiki about 5 or 6 years ago, but completely lost track of it.

I'm going to keep pushing for sharp, clear guidelines. If I could get R-help set up as a template for messages like Bugzilla, I think it would be awesome! To disagree with Ted: people won't mind giving us what we need, as long as we tell them exactly what to do. We save ourselves the circle of "give us sessionInfo()" and "give us your LANG" and "what version is that" and so forth. But I'll consider it a victory if we can get the requirement of sessionInfo() in as the first thing in the posting guide!

I agree that "which packages can we help with" in r-help is a good question. I had in mind that we should field questions about anything in the R tarball, including the recommended ones. I think I did not use the right words when I tried to express that. Maybe we just need to list the packages to be clear. Things like "foreign" that have passed from the ownership of the original writer to the R Core Team are certainly eligible.

I also admit that I have not looked at R-FAQ since about 2005, and I don't want to duplicate that effort. If R-FAQ has a nice concise list of ways to get information about specific things, the posting guide might just refer to that section by name/number, rather than duplicate. I'm torn here, because people often assume that if you give them a link, it must not be important (that's why you left it out).

pj

On Sat, Aug 21, 2010 at 6:04 PM, Ben Bolker wrote:
> Paul Johnson gmail.com> writes:
>
> [snip: lots more snippage to try get gmane to let me post]
>
>> What do you think of this: The priority is to put the most important thing at the top. The second priority is brevity.
>
> I really like this. Some suggestions:
>
>> =
>> Posting Guide: How to ask good questions that prompt useful answers
>>
>> People are busy, so ask your question in a useful way.
>> 1. Every question to r-help should begin with the following.
>> A. Output from the command sessionInfo()
>> B. Output from Sys.getlocale()
>> C. Information about how you installed R. Did you make any changes, such as incorporating a BLAS library? If you don't know, ask your system administrator.
>
> I would make this optional or put it further down. I don't see that many problems on the list that are due to non-standard installations. Most of the most clueless people are (a) using stock installations and/or (b) don't even know who installed R on the computer they are using. I don't mind sending them to find out/ask if it's a real issue, but it feels like an unnecessary hurdle.
>
>> D. If you see an error or something unexpected, your message MUST include the EXACT code that you were running. We mean ALL of your commands that were run before the error occurred. If you are unsure of what you did, close your session, clean up your code, and start over to reproduce the problem in the most direct way. Your post MUST include the EXACT VERBATIM error message.
>>
>> If you are working on a long program that requires a dataset, post the dataset and the program somewhere on the Internet and refer us to it. It is simply not possible to guess about what might be going wrong in your program unless we can see it.
>>
>> If you don't want to share your data, construct a small example dataset that produces the same problem. Post it and refer us to it.
>
> This is where we have to emphasize 'small, reproducible example' more strongly -- perhaps move the next item up. I dread the pages and pages of random R-session crap that will be posted following this advice literally ...
>
>> E. If you have isolated the problem to a certain part of your project, write a small, self-contained 'working example' that causes the problem and include it with your post.
>>
>> Example. Why does this code:
>>
>> dat <- data.frame(x=c(1,2,3), y=c(3,4,5))
>> plot (x, y, data=dat)
>>
>> cause this:
>>
>> Error in plot(x, y, data = dat) : object 'x' not found
>>
>> We can easily reproduce that and explain the problem to you. We can't help with a question like "my plot failed, something about an object was missing."
>>
>> 2. How to avoid making the members
Re: [Rd] No RTFM?
On Sun, Aug 22, 2010 at 9:22 PM, Ted Harding wrote:
> (with Cc: to r-devel)
> I presume you mean "sessionInfo()". "systemInfo()" hasn't been mentioned so far, I think.

Brain fart. I'm old, you know :)

> I am questioning your proposal that
> 1. Every question to r-help should begin with the following.
> A. Output from the command sessionInfo()
> B. Output from Sys.getlocale()
> [etc.]
> Not *every* question, as I've said before. It depends on the case.

I promise this is my last post in this thread. How about this. I aim for more concise, direct instruction. Hence I replace

"This guide is intended to help you get the most out of the R mailing lists, and to avoid embarrassment. Like many responses posted on the list, it is written in a concise manner. This is not intended to be unfriendly - it is more a consequence of allocating the limited available time and space to technical issues rather than to social niceties. The list: Remember that R is free software, constructed and maintained by volunteers. They have various reasons for contributing software and participating on the mailing lists, but often have limited time."

with

"People are busy, so ask your question in a useful way."

I think we need to get the most vital instructions up to the top of the guide. And the single most vital thing is sessionInfo(), as far as I can tell. I wish sessionInfo would just grab the locale information.

You interpret my plan as harsh/restrictive, and I don't mean it that way. I'm aiming for clear and understandable to tired/frustrated people. Suppose it is phrased this way instead:

1. Unless you are confident that the problem you are asking about is not affected by your OS or version of R, please include this information with your post:
A. Output from the command sessionInfo()
B. Output from Sys.getlocale()
...

I think your experience with the befuddled users must be different from mine. Mine are looking for a clear instruction on what to do; they don't know what's wrong, and don't mind doing something specific as long as they are sure you want it.

My users are perplexed by a statement like:

"Surprising behavior and bugs: What you think is a bug may be many other things, such as a default behavior that you do not like, a feature, an undocumented feature, or a bug in the documentation. You do not need to commit yourself to one of these in order to ask a question. If it is a real bug, someone will notice it. Before you post a real bug report, make sure you read R Bugs in the R-faq. If you're not completely and utterly sure something is a bug, post a question to r-help, not a bug report to r-bugs - every bug report requires manual action by one of the R-core members. Also, see Simon Tatham's essay on How to Report Bugs Effectively. For questions about unexpected behavior or a possible bug, you should, at a minimum, copy and paste the output from sessionInfo() into your message."

I don't think they get much out of this. They need to know that "bug" is a "fighting word" (US legal term: an expression for which you can reasonably expect to be punched in the nose, so it is not protected free speech) for programmers; they are deeply insulted by it, and users should never use it. The only person who should call something a bug is the author. Only the mother can say a baby is ugly.

> [3] I have tried to argue for a moderate and flexible spirit in what is advised in the Posting Guide.

I agree, but can it be done more clearly and concisely?

pj

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How do you make a formal "feature" request?
On Sat, Aug 21, 2010 at 10:41 AM, Donald Winston wrote:
> Who decides what features are in R and how they are implemented? If there is someone here who has that authority I have this request:
>
> A report() function analogous to the plot() function that makes it easy to generate a report from a table of data. This should not be in some auxiliary package, but part of the core R just like plot(). As a long time SAS user I cannot believe R does not have this. Please don't give me any crap about Sweave, LaTeX, and the "power" of R to roll your own. You don't have to "roll your own" plot do you? Reports are no different. If you don't agree do not bother me. If you agree then please bring this request to the appropriate authorities for consideration or tell me how to do it.
>
> Thanks.

Dear Donald:

I've wondered about your question about "how do things get into R." It seems to me the tone of your post pissed people off and they are giving you some abuse. By telling you to write it yourself, they are expressing their frustration toward you. Really, though, you have only yourself to blame for that. You treat the readers like your employees.

I think the nicer answer might go along these lines.

1. I recently suggested code to patch R in this list and one of the readers pointed me to this note in R-FAQ, which seems to be about as close as you are going to get to a "request": "There is a section of the bug repository for suggestions for enhancements for R labelled 'wishlist'. Suggestions can be submitted in the same ways as bugs, but please ensure that the subject line makes clear that this is for the wishlist and not a bug report, for example by starting with 'Wishlist:'."

2. How do you suppose plot got into R? Somebody volunteered it, it worked nicely, and many people have improved and perfected it over the years. If you could give us a working version of "report", the same might happen.

3. I hate SAS report. You should too. It stinks. We don't need that. It is for boneheads. You don't even tell us what is good about report or give us an example. Or a link to an example.

4. In the R family, there are many functions like report that you should try. Please note that the function "summary" is supplied by almost all modeling tools. It is supposed to do roughly what you want. Does it have too much information or too little? There are several ways to beautify that output. Try xtable; it can make HTML or LaTeX. I wonder if the memisc package function mtable does what you want. It collects up a bunch of regressions for you in a nice looking table; I've found it very useful. I don't know what your objection is against LaTeX, incidentally. Similar tables can be created with the Hmisc and apsrtable packages.

Good luck; next time don't tell us to go to hell when you ask your question.

pj

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
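For instance, the xtable route mentioned in point 4 looks roughly like this (data and model invented for the example):

## Minimal sketch: turn a fitted model into an HTML or LaTeX table.
library(xtable)

dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
fit <- lm(y ~ x1 + x2, data = dat)

tab <- xtable(fit, caption = "A regression table, report-style")
print(tab, type = "html", file = "fit.html")   # or type = "latex" for Sweave/LaTeX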
[Rd] environment question
Hello, everybody.

I'm putting together some lecture notes and course exercises on R programming. My plan is to pick some R packages, ask students to read through the code and see why things work, maybe make some changes. As I look for examples, I'm running up against the problem that packages use coding idioms that are unfamiliar to me.

A difficult thing for me is explaining the scope of variables in R functions. When should we pass an object to a function, and when should we let the R system search about for an object? I've been puzzling through ?environment for quite a while.

Here's an example from one of the packages that I like, called "ltm". In the function "ltm.fit" the work of calculating estimates is sent to different functions like "EM" and "loglikltm" and "scoreltm". Before that, this is used:

environment(EM) <- environment(loglikltm) <- environment(scoreltm) <- environment()
## and then EM is called
res.EM <- EM(betas, constraint, control$iter.em, control$verbose)

I want to make sure I understand this. The environment line gets the current environment and then assigns it for those 3 functions, right? All variables and functions that can be accessed from the current position in the code become available to the functions EM, loglikltm, and scoreltm.

So, which objects should be explicitly inserted into a function call, and which should be left in the environment for R to find when it needs them?

1. I *think* that when EM is called, the variables "betas", "constraint", and "control" are already in the environment. The EM function is declared like this, using the same words "betas" and "constraint":

EM <- function (betas, constraint, iter, verbose = FALSE)

It seems to me that if I wrote the function call like this (leaving out "betas" and "constraint")

res.EM <- EM(control$iter.em, control$verbose)

R will run EM and go find "betas" and "constraint" in the environment; there was no need to name them as arguments.

2. Is a function like EM allowed to alter objects that it finds through the environment, ones that are not passed as arguments? I understand that a function cannot alter an object that is passed explicitly, but what about the ones it grabs from the environment?

If you have ideas about packages that might be handy teaching examples, please let me know.

pj

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
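A self-contained toy version of that idiom (made-up functions, not the ltm code), which also bears on question 2:

## helper() is defined at "package level": 'betas' is not one of its arguments.
helper <- function(iter) {
    ## After the environment reset below, 'betas' is found by lexical lookup
    ## in fitter()'s frame, not in the global workspace.
    betas * iter
}

fitter <- function(betas, control) {
    environment(helper) <- environment()   # local copy of helper now sees fitter()'s locals
    helper(control$iter)
}

fitter(betas = c(1, 2), control = list(iter = 3))   # returns c(3, 6)

## On question 2: an ordinary assignment inside helper() would only create a
## local binding; changing fitter()'s 'betas' would need an explicit <<- or
## assign(..., envir = ...), so objects found this way are not silently altered.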
Re: [Rd] 'libRblas.so' missing in R 2.12.1
On Thu, Dec 16, 2010 at 7:31 AM, Christian Kohler wrote:
> Dear R developers,
>
> I just compiled the latest version of R (2.12.1) and noticed that 'libRblas.so' is missing in the '/x86_64/src/extra/blas' subdirectory of my R-installation.
>
> Did I miss ongoing discussions on the Mailinglist about this or might it be a local problem?
>
> Christian Kohler

Hi, Christian:

Did you get an answer? I'm guessing you asked because you want to follow along with the optimization advice in the R Installation and Administration manual, section "A.3.1.5 Shared BLAS". I've asked the same thing myself.

Depending on the configure options you specified, libRblas may be built as a shared library, or R may be linked against an external BLAS. There seems to be some tension, and the experts will give you differing advice about whether R should be a shared library and whether you should allow libRblas.so to be built. The Install manual says that a non-shared BLAS will load more quickly, but, of course, if the BLAS is built into R itself, then you can't play games pointing the symbolic link to other shared library implementations.

If you have access to a Ubuntu system, you will notice there is no libRblas.so supplied in the packages they provide. They are pointing to a shared library BLAS, such as the Intel Math Kernel Library. The last time I looked, that was configured like this:

./configure --prefix=/usr \
    --with-cairo \
    --with-jpeglib \
    --with-pango \
    --with-png \
    --with-readline \
    --with-tcltk \
    --with-system-bzlib \
    --with-system-pcre \
    --with-system-zlib \
    --mandir=/usr/share/man \
    --infodir=/usr/share/info \
    --datadir=/usr/share/R/share \
    --includedir=/usr/share/R/include \
    $(atlas) \
    $(lapack) \
    --without-gnome \
    --enable-R-profiling \
    --enable-R-shlib \
    --enable-memory-profiling \
    --without-recommended-packages \
    --build $(buildarch)

$(atlas) draws its value from the rules file:

atlas = --with-blas

Similarly, $(lapack):

lapack = --with-lapack

If you want to get libRblas.so out, you need to remove the atlas and lapack lines. Also, watch out for this configure option: --disable-BLAS-shlib. If you take out the atlas/lapack statements, you get libRblas.so, and then you can follow along with the R Install manual to replace that with a link to one of the optimized BLAS libraries.

However, my experience, and that of at least one other person, is that the speedup you will observe from that is not too substantial:

http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Request: Suggestions for "good teaching" packages, esp. with C code
Hello,

I am looking for CRAN packages that don't teach bad habits. Can I have suggestions? I don't mean the recommended packages that come with R, I mean the contributed ones.

I've been sampling a lot of examples and am surprised that many ignore seemingly agreed-upon principles of R coding. In r-devel, almost everyone seems to support the "functional programming" theme in Chambers's book on Software for Data Analysis, but when I go look at randomly selected packages, programmers don't follow that advice. In particular:

1. Functions must avoid "mystery variables from nowhere." Consider a function's code: it should not be necessary to ask "what's variable X?" and go hunting in the commands that lead up to the function call. If X is used in the function, it should be in a named argument, or extracted from one of the named arguments. People who rely on variables floating around in the user's environment are creating hard-to-find bugs.

2. We don't want functions with indirect effects (no <<-), almost always.

3. Code should be vectorized where possible; C-style for loops over vector members should be avoided.

4. We don't want gratuitous use of "return" at the end of functions. Why do people still do that?

5. Neatness counts. Code should look nice! Check out how beautiful the functions in MASS look! I want code with spaces and " <- " rather than everything jammed together with "=".

I don't mean to criticize any particular person's code in raising this point.

For teaching examples, where to focus? Here's one candidate I've found: MNP. As far as I can tell, it meets the first 4 requirements. And it has some very clear C code with it as well. I'm only hesitant there because I'm not entirely sure that a package's C code should introduce its own functions for handling vectors and matrices, when some general purpose library might be more desirable. But that's a small point, and clarity and completeness count a great deal in my opinion.

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
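A small invented before/after sketch of points 1-3 above:

## Discouraged: relies on a "mystery variable" (denom) from the caller's
## workspace, modifies it with <<-, loops element by element, and ends with
## a gratuitous return().
rescale_bad <- function(x) {
    out <- numeric(length(x))
    for (i in seq_along(x)) out[i] <- x[i] / denom   # 'denom' comes from nowhere
    denom <<- denom + 1                              # indirect effect on the workspace
    return(out)
}

## Preferred: everything arrives as an argument, the body is vectorized,
## and the last expression is the value.
rescale_good <- function(x, denom = 1) {
    x / denom
}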
[Rd] Set up new CRAN mirror; but have questions before finalizing.
Hi, everybody

I have an account on Dreamhost.com and when I renewed it recently, their message said my usage of storage and bandwidth had been reasonably low. In an idle moment about 3 weeks ago, I followed your instructions to set up a CRAN mirror on their server. Here it is:

http://www.freefaculty.org/cran

It is not hosted at my University, but it is a working, high-availability server. Is there any reason it could not be listed as a CRAN mirror? (Although I really have no idea where these machines exist. I'm pretty sure it is in the USA. I'll try to find out.)

Maybe you might try it and see? It has updated several times, no trouble I can see in that.

I have a couple of small details to ask about. Maybe this first one is a potential "bug report" for the CRAN mirror instructions.

1. Permissions on "src" and "web" folders are 700, and so running "update.packages" or an apt-get update against the Debian stuff results in permission-denied errors. I re-set the permissions manually, but wonder if I'm actually supposed to mess around with your archive. After doing that, it works. But I worry a little bit about what else might not be readable "down there" in the hierarchy. And I wonder why anybody else's mirror works without doing that.

Notice before:

> update.packages(repos = c("http://www.freefaculty.org/cran"))
Warning: unable to access index for repository http://www.freefaculty.org/cran/src/contrib

and after giving others rx permission on "src":

> update.packages(repos = c("http://www.freefaculty.org/cran"))
RandomFields :
 Version 1.3.47 installed in /usr/local/lib/R/site-library
 Version 2.0.40 available at http://www.freefaculty.org/cran
Update (y/N/c)?

So at least that part works, right?

2. When I run "apt-get update" against my mirror, I get a lot of harassment about the lack of a security key for my repository. Should I be publishing an R Core team key to fix that, or my own key, to do what? I've never administered an apt repository before. I have administered yum repositories, and the security there is in the key on the individual RPMs; I don't quite understand what the Debian thing is asking me to do.

If just a couple of you would test this out, I would feel more brave in asking to have this listed on the CRAN mirror system.

Regards,
pj

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Set up new CRAN mirror; but have questions before finalizing.
On Sat, Feb 19, 2011 at 12:43 AM, Friedrich Leisch wrote:
> On Fri, 18 Feb 2011 22:00:28 -0600, Paul Johnson (PJ) wrote:
>
> > Hi, everybody
> > Maybe you might try it and see? It has updated several times, no trouble I can see in that.
> > I have a couple of small details to ask about. Maybe this first one is a potential "bug report" for the CRAN mirror instructions.
> > 1. Permissions on "src" and "web" folders are 700,
>
> Then you must have made a mistake in setting up the mirror, because on the master we have 775 for both directories and all directories within these two. We also had never complaints about this before.
>
> Greetings, Fritz

I'm surprised too. All I did was follow the very clear and simple instructions in the HOWTO. I run this script in the folder where the web server lives:

$ cat downloadCran.sh
rsync -rtlzv --delete cran.r-project.org::CRAN /home/freefaculty/freefaculty.org/cran

I promise (sincerely) I changed nothing that forced src or web to 700. I manually put all the directories to permission 755 and then will run an update and see if they stay that way.

If all stays well, I'll notify the address listed in the mirror HOWTO that this thing is in California.

pj

> Best regards,
> Fritz Leisch

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] function call overhead
Snipping down to bare minimum history before comment:

On Wed, Feb 16, 2011 at 4:28 PM, Olaf Mersmann wrote:
> Dear Hadley, dear list,
>
> On Wed, Feb 16, 2011 at 9:53 PM, Hadley Wickham wrote:
>
>>> system.time(replicate(1e4, base::print))
>>   user  system elapsed
>>  0.539   0.001   0.541
>>> system.time(replicate(1e4, print))
>>   user  system elapsed
>>  0.013   0.000   0.012
>
>> library("microbenchmark")
>> res <- microbenchmark(print, base::print, times=1)
>> res
>> print(res, unit="eps")
> Unit: evaluations per second
>                     min          lq      median          uq        max
> print       17543859.65 15384615.38 14705882.35 14492753.62 20665.8538
> base::print    23944.64    23064.33    22584.32    20659.88   210.5329

I think it is important to say that this slowdown is not unique to R and is unrelated to the fact that R is interpreted. The same happens in compiled object-oriented languages like C++ or Objective-C. There is an inherent cost in the runtime system to find a function or method that is suitable to an object.

In agent-based modeling simulations, we call it the cost of "method lookup" because the runtime system has to check for the existence of a method each time it is called for a given object. There is a time-saving approach where one can cache the result of the lookup and then call that result directly each time through the loop. Implementing this is pretty complicated, however, and it is discouraged unless you really need it. It is especially dangerous because this optimization throws away the runtime benefit of matching the correct method to the class of the object. (See http://www.mulle-kybernetik.com/artikel/Optimization/opti-3.html, where it shows how one can even cache C library functions to avoid lookup overhead. I'm told that the Obj-C 2.0 runtime will try to optimize this automatically; I've not tested.)

The R solution achieves that exact same kind of speed-up by saving the function lookup in a local variable. The R approach, however, is implemented much more easily than the Objective-C solution. There is an obvious danger: if the saved method is not appropriate to an object to which it applies, something unpredictable will happen.

The same is true in C++. I was fiddling around with the C++ code that is included with the R package Siena (awesome package, incidentally) last year and noticed a similar slowdown with method lookup. In C++, I was surprised to find a slowdown inside a class using an instance variable prefixed with "this->". For an instance variable, "this->x" and "x" are the same thing, but to the runtime system, well, there's a slowdown in resolving "this" and getting x, compared to just using x. To the programmer who is trying to be clear and careful, putting "this->" on the front of an instance variable is tidy, but it also slows down the runtime a lot.

Hope this is not more confusing than when I started :)

pj

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
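In R terms, the caching idiom looks roughly like this -- a sketch using the microbenchmark package already mentioned in the thread; timings will vary:

library(microbenchmark)

## Pays for the namespace lookup on every call.
f_lookup <- function() base::print

## Does the lookup once, captures the result in an enclosing environment,
## and reuses the cached binding on every call.
f_cached <- local({
    prt <- base::print
    function() prt
})

microbenchmark(f_lookup(), f_cached(), times = 1000L)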
Re: [Rd] Standalone C++ application for processing R parser output(SEXP)
On Wed, Mar 23, 2011 at 6:35 PM, Rob Anderson wrote:
> Hi All,
>
> I am trying to write a source-to-source compiler for R. I am trying to leverage the R parser code for the purpose. I am trying to transform the SEXP returned from the parser into an AST for our own Ruby embedded domain-specific language.
>
> I tried using R CMD SHLIB to compile a C function that parses arbitrary R expressions. But I think the generated .so file can be used from within R and not be called from other C or Ruby programs (I get linker errors).

I hope I am not telling you what you already know. There are working examples of "C standalone" programs that link with R's mathlib. I learned from the examples here:

http://www.math.ncu.edu.tw/~chenwc/R_note/index.php?item=standalone
http://www.stat.berkeley.edu/classes/s243/rmath.html

I can't say for sure if these give you access to all of the R stuff you want, but you do get access to quite a bit.

PJ

> My idea is to use the SEXP processing functions/macros (CAR, CDR, CADR, etc.) from within C code and transform it to our AST format. I tried linking to libR.a and other lib*.so's when I compile the C code using gcc, but it doesn't work.
>
> I read that R exposes only a small number of functions for library/package writers and that the compiled *.so can only be used from within R.
>
> Any ideas on what is wrong, or how I can go about it?
>
> Appreciate any help.
>
> Thanks
> RJ
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Wish there were a "strict mode" for R interpreter. What about You?
Years ago, I did lots of Perl programming. Perl will let you be lazy and write functions that refer to undefined variables (like R does), but there is also a strict mode, so the interpreter refuses to run code that mentions a variable that has not been defined. I wish there were a strict mode for checking R functions.

Here's why. We have a lot of students writing R functions around here, and they run into trouble because they use the same name for things inside and outside of functions. When they call functions that have mistaken or undefined references to names that they use elsewhere, then variables that are in the environment are accidentally used. Know what I mean?

dat <- whatever

someNewFunction <- function(z, w){
    ## do something with z and w and create a new "dat",
    ## but forget to name it "dat"
    lm(y ~ x, data = dat)   # lm just used the wrong data
}

I wish R had a strict mode to return an error in that case. Users don't realize they are getting nonsense because R finds things to fill in for their mistakes.

Is this possible? Does anybody agree it would be good?

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
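Not a true strict mode, but the codetools package that ships with R catches much of this at check time; a sketch (the function is a toy, and the printed notes shown in comments are approximate):

library(codetools)

someNewFunction <- function(z, w) {
    newdat <- data.frame(z = z, w = w)   # meant to be called 'dat'...
    summary(dat)                          # ...so 'dat' here is undefined
}

checkUsage(someNewFunction, all = TRUE)
## roughly: "no visible binding for global variable 'dat'"
##          "local variable 'newdat' assigned but may not be used"

findGlobals(someNewFunction, merge = FALSE)$variables
## [1] "dat"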
[Rd] Wish R Core had a standard format (or generic function) for "newdata" objects
Is anybody working on a way to standardize the creation of "newdata" objects for predict methods?

When using predict, I find it difficult/tedious to create newdata data frames when there are many variables. It is necessary to set all variables at the mean/mode/median, and then, for some variables of interest, one has to insert values for which predictions are desired.

I was at a presentation by Scott Long last week and he was discussing the increasing emphasis in Stata on calculations of marginal predictions, in "SPost" and several other packages. Coincidentally, I had a student visit who is learning to use MASS's polr (W. Venables and B. Ripley), and we wrestled for quite a while to try to make the same calculations that Stata makes automatically. It spits out predicted probabilities for each independent variable, keeping the other variables at a reference level.

I've found R packages that aim to do essentially the same thing.

In Frank Harrell's Design/rms framework, he uses a "datadist" function that generates an object that the user has to put into the R options. I think many users trip over the use of "options" there. If I don't use that for a month or two, I completely forget the fine points and have to fight with it. But it does "work" to give plot and predict functions the information they require.

In Zelig (by Kosuke Imai, Gary King, and Olivia Lau), a function "setx" does the work of creating "newdata" objects. That appears to be about right as a candidate for a generic "newdata" function. Perhaps it could directly generalize to all R regression functions, but right now it is tailored to the models in Zelig. It has separate methods for the different types of models, and that is a bit confusing to me, since the "newdata" in one model should be the same as the newdata in another, I'm guessing. But his code is all there, I'll keep looking.

In effects (by John Fox), there are internal functions to create newdata and plot the marginal effects. If you load effects and run, for example, "effects:::effect.lm", you see Prof. Fox has his own way of grabbing information from model columns and calculating predictions.

I think it is time the R Core Team looked at this and told "us" what the right way to do it is. I think the interface to setx in Zelig is pretty easy to understand, at least for numeric variables. In R's termplot function, such a thing could be put to use. As far as I can tell now, termplot is doing most of the work of creating a newdata object, but not exactly. It seems a shame to proliferate more functions that do the same job, when it is such a common thing.

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
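For concreteness, the kind of helper I have in mind -- a rough sketch only, handling just simple formulas with numeric and factor predictors, and with an invented name (newdata_at):

## Hold every predictor at a typical value (mean for numerics, baseline level
## for factors), then vary one focal variable over a supplied set of values.
newdata_at <- function(model, focal, values) {
    mf <- model.frame(model)
    mf <- mf[, -1, drop = FALSE]                   # drop the response column
    typical <- lapply(mf, function(v) {
        if (is.numeric(v)) mean(v, na.rm = TRUE)
        else if (is.factor(v)) factor(levels(v)[1], levels = levels(v))
        else v[1]                                  # fallback: first observed value
    })
    nd <- as.data.frame(typical)[rep(1, length(values)), , drop = FALSE]
    nd[[focal]] <- values
    rownames(nd) <- NULL
    nd
}

## Usage sketch:
fit <- lm(Fertility ~ Education + Agriculture, data = swiss)
nd  <- newdata_at(fit, "Education", c(5, 15, 25))
predict(fit, newdata = nd)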
Re: [Rd] Wish R Core had a standard format (or generic function) for "newdata" objects
On Tue, Apr 26, 2011 at 7:39 PM, Duncan Murdoch wrote:
> If you don't like the way this was done in my three lines above, or by Frank Harrell, or the Zelig group, or John Fox, why don't you do it yourself, and get it right this time? It's pretty rude to complain about things that others have given you for free, and demand they do it better.
>
> Duncan Murdoch

I offer a sincere apology for sounding that way. I'm not attacking anybody. I'm just talking, asking don't you agree this should be standardized. And you disagree, and I respect that, since you are actually doing the work.

From a "lowly user's point of view", I wish "you experts" out there would tell us one way to do this, and we could follow your example. When there's a regression model fitted with 20 variables in it, and half of them are numeric, 4 are unordered factors, 3 are ordinal factors, and what not, then this is a hard problem for many of us ordinary users. Or it is tedious. They want to "keep everything fixed," except one variable that takes on different specified values. And they want to do that for every variable, one at a time. Stata has made this easy for many models; R could as well, if we coalesced on a more-or-less standard way to create newdata objects for predict.

But, in the end, I agree with your sentiment. I just have to do this and show you it is handy. I think Zelig's setx has it about right; I'll pursue that strategy.

pj

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] median and data frames
On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns wrote:
> Here are some data frames:
>
> df3.2 <- data.frame(1:3, 7:9)
> df4.2 <- data.frame(1:4, 7:10)
> df3.3 <- data.frame(1:3, 7:9, 10:12)
> df4.3 <- data.frame(1:4, 7:10, 10:13)
> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>
> Now here are some commands and their answers:
>> median(df4.4)
> [1]  8.5 11.5
>> median(df3.2[c(1,2,3),])
> [1] 2 8
>> median(df3.2[c(1,3,2),])
> [1]  2 NA
> Warning message:
> In mean.default(X[[2L]], ...) :
>  argument is not numeric or logical: returning NA
>
> The sessionInfo is below, but it looks to me like the present behavior started in 2.10.0.
>
> Sometimes it gets the right answer. I'd be grateful to hear how it does that -- I can't figure it out.

Hello, Pat. Nice poetry there!

I think I have an actual answer, as opposed to the usual crap I spew. I would agree if you said median.data.frame ought to be written to work columnwise, similar to mean.data.frame. apply and sapply always give the correct answer:

> apply(df3.3, 2, median)
  X1.3   X7.9 X10.12
     2      8     11
> apply(df3.2, 2, median)
X1.3 X7.9
   2    8
> apply(df3.2[c(1,3,2),], 2, median)
X1.3 X7.9
   2    8

mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

I think we would suggest this for medians:

median.data.frame <- function(x, ...) sapply(x, median, ...)

It works, see:

> median.data.frame(df3.2[c(1,3,2),])
X1.3 X7.9
   2    8

Would our next step be to enter that somewhere in R bugzilla? (I'm not joking--I'm that naive.)

I think I can explain why the current median works intermittently in those cases you mention. Give it a small set of pre-sorted data, and all is well. median.default uses a sort function, and it is confused when it is given a data.frame object rather than just a vector. I put a browser() at the top of median.default:

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at #4: if (is.factor(x)) stop("need numeric data")
Browse[2]> n
debug at #4: NULL
Browse[2]> n
debug at #6: if (length(names(x))) names(x) <- NULL
Browse[2]> n
debug at #6: names(x) <- NULL
Browse[2]> n
debug at #8: if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at #8: if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at #8: NULL
Browse[2]> n
debug at #12: n <- length(x)
Browse[2]> n
debug at #13: if (n == 0L) return(x[FALSE][NA])
Browse[2]> n
debug at #13: NULL
Browse[2]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
 argument is not numeric or logical: returning NA

Note the sort there in step 16. I think that's what is killing us. If you are lucky and give it a small data frame that is in order, like df3.2, the sort doesn't produce gibberish. When I get to that point, I will show you the sort's effect.

First, the case that "works". I moved the browser() down, because I got tired of looking at the same old not-yet-erroneous output.

> median(df3.2)
Called from: median.default(df3.2)
Browse[1]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively, type

Browse[2]> sort(x, partial = half + 0L:1L)
  NA NA NA NA NA NA 1 1 7 NULL NULL NULL NULL 2 2 8 3 3 9
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
 corrupt data frame: columns will be truncated or padded with NAs

But it still gives you a "right" answer:

Browse[2]> n
[1] 2 8

But if you give it data out of order, the second column turns to NA, and that causes doom.

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively:

Browse[2]> sort(x, partial = half + 0L:1L)
  NA NA NA NA NA NA 1 1 NULL 7 NULL NULL NULL 3 3 9 2 2 8
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
 corrupt data frame: columns will be truncated or padded with NAs

Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
 argument is not numeric or logical: returning NA

Here's a larger test case. Note columns 1 and 3 turn to NULL.

> df8.8 <- data.frame(a=2:8, b=1:7)
median(df8.8)
[Rd] requesting a mentor for R development
I'd like to learn the process of revising R functions & packages and then submitting proposed patches to the R Core team. Would someone be willing to mentor me through one example?

For starters, consider an example. I'd like to revise the t.test function to return the stderr value to the user. We only need to change the "rval" in the third-from-the-end line of stats:::t.test.default.

Change this:

    rval <- list(statistic = tstat, parameter = df, p.value = pval,
                 conf.int = cint, estimate = estimate, null.value = mu,
                 alternative = alternative, method = method, data.name = dname)
    class(rval) <- "htest"
    return(rval)

To this:

    rval <- list(statistic = tstat, parameter = df, stderr = stderr, p.value = pval,
                 conf.int = cint, estimate = estimate, null.value = mu,
                 alternative = alternative, method = method, data.name = dname)
    class(rval) <- "htest"
    return(rval)

Here is where I need help.

1. What other changes in the R code & documents would be required for consistency?
2. How does the R Core Team expect/accept patches? I do understand a bit about SVN and CVS. I think I could mark this up and make a diff, but I'm uncertain about the details.
3. How would I best prepare a suggestion like this so as to get it accepted into R?

If somebody will raise a hand, I'm willing to do the work to learn.

pj

--
Paul E. Johnson, Professor, Political Science, 1541 Lilac Lane, Room 504, University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
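To show what the extra component would buy, a small sketch of how it could be used once such a change were in place (tt$stderr is the proposed, not-yet-existing component):

x  <- rnorm(30, mean = 0.3)
tt <- t.test(x, mu = 0)

## With the proposed change the standard error would be available directly:
## tt$stderr

## Today it has to be reconstructed from the other pieces; for a one-sample
## test, tstat = (estimate - null.value) / stderr, so:
se <- unname((tt$estimate - tt$null.value) / tt$statistic)
se
sd(x) / sqrt(length(x))   # same number, computed from the data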
Re: [Rd] Randomness not due to seed
On Tue, Jul 19, 2011 at 8:13 AM, jeroen00ms wrote: > I am working on a reproducible computing platform for which I would like to > be able to _exactly_ reproduce an R object. However, I am experiencing > unexpected randomness in some calculations. I have a hard time finding out > exactly how it occurs. The code below illustrates the issue. > > mylm1 <- lm(dist~speed, data=cars); > mylm2 <- lm(dist~speed, data=cars); > identical(mylm1, mylm2); #TRUE > > makelm <- function(){ > return(lm(dist~speed, data=cars)); > } > > mylm1 <- makelm(); > mylm2 <- makelm(); > identical(mylm1, mylm2); #FALSE > > When inspecting both objects there seem to be some rounding differences. > Setting a seed does not make a difference. Is there any way I can remove > this randomness and exactly reproduce the object every time? > William Dunlap was correct. Observe in the sequence of comparisons below, the difference in the "terms" object is causing the identical to fail: Everything else associated with this model--the coefficients, the r-square, cov matrix, etc, exactly match. > mylm1 <- lm(dist~speed, data=cars); > mylm2 <- lm(dist~speed, data=cars); > identical(mylm1, mylm2); #TRUE [1] TRUE > makelm <- function(){ +return(lm(dist~speed, data=cars)); + } > mylm1 <- makelm(); > mylm2 <- makelm(); > identical(mylm1, mylm2); #FALSE [1] FALSE > identical(coef(mylm1), coef(mylm2)) [1] TRUE > identical(summary(mylm1), summary(mylm2)) [1] FALSE > identical(coef(summary(mylm1)), coef(summary(mylm2))) [1] TRUE > all.equal(mylm1, mylm2) [1] TRUE > identical(summary(mylm1)$r.squared, summary(mylm2)$r.squared) [1] TRUE > identical(summary(mylm1)$adj.r.squared, summary(mylm2)$adj.r.squared) [1] TRUE > identical(summary(mylm1)$sigma, summary(mylm2)$sigma) [1] TRUE > identical(summary(mylm1)$fstatistic, summary(mylm2)$fstatistic) [1] TRUE > identical(summary(mylm1)$residuals, summary(mylm2)$residuals) [1] TRUE > identical(summary(mylm1)$cov.unscaled, summary(mylm2)$cov.unscaled) [1] TRUE > identical(summary(mylm1)$call, summary(mylm2)$call) [1] TRUE > identical(summary(mylm1)$terms, summary(mylm2)$terms) [1] FALSE > summary(mylm2)$terms dist ~ speed attr(,"variables") list(dist, speed) attr(,"factors") speed dist 0 speed 1 attr(,"term.labels") [1] "speed" attr(,"order") [1] 1 attr(,"intercept") [1] 1 attr(,"response") [1] 1 attr(,".Environment") attr(,"predvars") list(dist, speed) attr(,"dataClasses") dist speed "numeric" "numeric" > > summary(mylm1)$terms dist ~ speed attr(,"variables") list(dist, speed) attr(,"factors") speed dist 0 speed 1 attr(,"term.labels") [1] "speed" attr(,"order") [1] 1 attr(,"intercept") [1] 1 attr(,"response") [1] 1 attr(,".Environment") attr(,"predvars") list(dist, speed) attr(,"dataClasses") dist speed "numeric" "numeric" -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
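The difference found above lives in the ".Environment" attribute of the terms objects, because each call to makelm() creates a fresh closure environment. A hedged sketch of a comparison that ignores those environments, assuming the numerical content is all one cares about:

stripEnv <- function(fit) {
    ## drop environment attributes that differ between otherwise equal fits
    attr(fit$terms, ".Environment") <- NULL
    attr(attr(fit$model, "terms"), ".Environment") <- NULL
    fit
}
identical(stripEnv(mylm1), stripEnv(mylm2))
## expected TRUE when the environments were the only difference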
[Rd] CRAN mirror size mushrooming; consider archiving some?
Hi, everybody.

I'm setting up a new CRAN mirror and filled up the disk space the server allotted me. I asked for more, then filled that up. Now the system administrators want me to buy an $800 fiber channel card and a storage device. I'm going to do that, but it does make me want to suggest to you that this is a problem.

CRAN now is about 68GB, and about 3/4 of that is in the bin folder, where one finds copies of compiled packages for Mac OS X and Windows. If the administrators of CRAN would move the packages for R before, say, 2.12, to long-term storage, then mirror management would be a bit more, well, manageable. Moving the R for Windows packages for, say, R 2.0 through 2.10 would save some space, and possibly establish a useful precedent for the long term.

Here's the bin/windows folder. Note it is expanding exponentially (or nearly so):

$ du --max-depth=1 | sort
1012644  ./2.6
103504   ./1.7
122200   ./1.8
1239876  ./2.7
1487024  ./2.8
15220    ./ATLAS
167668   ./1.9
17921604 .
1866196  ./2.9
204392   ./2.0
2207708  ./2.10
2340120  ./2.13
2356272  ./2.12
2403176  ./2.11
298620   ./2.1
364292   ./2.2
438044   ./2.3
595920   ./2.4
698064   ./2.5

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Please explain your workflow from R code -> package -> R code -> package
Hi,

I'm asking another one of those questions that would be obvious if I could watch your work while you do it. I'm having trouble understanding the workflow of code and package maintenance.

Stage 1. Make some R functions in a folder. This is in a Subversion repo R/trunk/myproject

Stage 2. Make a package: After the package.skeleton, and R check, I have a new folder with the project in it,

R/trunk/myproject/mypackage
    DESCRIPTION
    man
    R

I go into the man folder and manually edit the Rd files. I don't change anything in the R folder because I think it is OK so far. And eventually I end up with a tarball mypackage_1.0.tar.gz.

Stage 3. How to make the round trip? I add more R code, and re-generate a package. package.skeleton obliterates the help files I've already edited. So keeping the R code in sync with the documentation appears to be a hassle.

In other languages, I've seen the practice of writing the documentation inside the code files and then post-processing it to produce the manuals. Is there a similar thing for R, to unify the R code development and documentation/package-making process?

pj
--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
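One commonly suggested answer to this round-trip problem is to keep the documentation inside the R source files as structured comments and post-process them into Rd files, so package.skeleton never has to be re-run. A hedged sketch, assuming the roxygen2 (and optionally devtools) packages; the function and file names are invented for illustration:

## In R/myfun.R -- the documentation lives next to the code:
#' Add two numbers
#'
#' @param a first number
#' @param b second number
#' @return the sum of a and b
#' @export
myfun <- function(a, b) {
    a + b
}

## Regenerate man/*.Rd (and NAMESPACE) from those comments,
## run from the package directory:
## roxygen2::roxygenise(".")
## or: devtools::document()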
[Rd] Proposed change in file.exists() to tolerate Windows
I'm writing to ask if R Core would make file.exists more Windows tolerant when the argument has a trailing slash. This has been discussed by users a few times here, I know it is not a new topic. But it is not a solved problem, yet. I acknowledge that CRAN packages exist which fix this by replacing file.exists(), but it seems more elegant to me to fix the problem in R itself. R Core goes to great extremes to accommodate Windows users and the refusal to make file.exists() work in a cross-platform way is incongruous. I often do have slashes on the end of directory names being tested. Now that I understand the meaning of ?file.exists, I need to wrap the name being checked in a slash-deleter ## delete trailing slashes dts <- function(name) gsub("/$", "", name) if(!file.exists(dts(any_name))) { ... Can't you make file.exists do this invisibly? Maybe the argument could be filtered through normalizePath() instead. If not, would you please consider putting a workaround like mine into the file.exists documentation so Windows users can see how easy this is to avoid? Respectfully yours, pj -- Paul E. Johnson Professor, Political ScienceDirector 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://crmda.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
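For completeness, a hedged variant of the slash-deleting idiom above, wrapped so it can stand in for file.exists at call sites; illustrative only, not a proposed patch:

## tolerate one or more trailing slashes on each path; a sketch
file.exists2 <- function(...) {
    args <- lapply(list(...), function(p) sub("/+$", "", p))
    do.call(file.exists, args)
}
## file.exists2("C:/Users/someone/Documents/")   # hypothetical path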
[Rd] MKL Acceleration encouraging; need adjust package builds?
Dear R-devel: The Cluster administrators at KU got enthusiastic about testing R-3.2.2 with Intel MKL when I asked for some BLAS integration. Below I forward a performance report, which is encouraging, and thought you would like to know the numbers. Appears to my untrained eye there are some extraordinary speedups on Cholesky decomposition, determinants, and matrix inversion. They had difficulty getting R to compile with R shared BLAS (don't know what went wrong there), so they went the other direction. In his message to me, the technician says that I should consider adjusting the compilation flags on the packages that use BLAS. Do you think that is needed? R is compiled with non-shared BLAS libraries, won't packages know where to look for BLAS headers? 2. If I need to do that, I wonder how to do it and which packages need attention. Eigen and Armadillo packages, and possibly the ones that depend on them, lme4, anything flowing through Rcpp. Here's the build for some packages. Are they finding MKL BLAS? How would I know? * installing *source* package 'RcppArmadillo' ... ** package 'RcppArmadillo' successfully unpacked and MD5 sums checked * checking LAPACK_LIBS: divide-and-conquer complex SVD available via system LAPACK ** libs g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include -I/usr/local/include -I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include" -I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic-c RcppArmadillo.cpp -o RcppArmadillo.o g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include -I/usr/local/include -I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include" -I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic-c RcppExports.cpp -o RcppExports.o g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include -I/usr/local/include -I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include" -I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic-c fastLm.cpp -o fastLm.o g++ -shared -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -L/usr/local/lib64 -o RcppArmadillo.so RcppArmadillo.o RcppExports.o fastLm.o -L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread -lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm -lgfortran -lm -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -lR installing to /panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/RcppArmadillo/libs ** R ** inst ** preparing package for lazy loading ** help *** installing help indices ** building package indices ** installing vignettes ** testing if installed package can be loaded * DONE (RcppArmadillo) * installing *source* package 'RcppEigen' ... 
** package 'RcppEigen' successfully unpacked and MD5 sums checked ** libs g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include -I/usr/local/include -I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include" -I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic-c RcppEigen.cpp -o RcppEigen.o g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include -I/usr/local/include -I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include" -I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic-c RcppExports.cpp -o RcppExports.o g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include -I/usr/local/include -I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include" -I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic-c fastLm.cpp -o fastLm.o g++ -shared -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -L/usr/local/lib64 -o RcppEigen.so RcppEigen.o RcppExports.o fastLm.o -L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread -lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm -lgfortran -lm -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -lR installing to /panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/RcppEigen/libs ** R ** inst ** preparing package for lazy loading ** help *** installing help indices ** building package indices ** installing vignettes ** testing if installed package can be loaded * DONE (RcppEigen) * installing *source* package 'MatrixModels' ... ** package 'MatrixModels' successfully unpacked and MD5 sums checked ** R ** preparing package for lazy loading Creating a generic function for 'resid' from package 'stats' in package 'MatrixModels' Creating a generic function for 'fitted.va
Re: [Rd] MKL Acceleration encouraging; need adjust package builds?
On Mon, Nov 23, 2015 at 11:39 AM, David Smith wrote: > Hi Paul, > > We've been through this process ourselves for the Revolution R Open project. > There are a number of pitfalls to avoid, but you can take a look at how we > achieved it in the build scripts at: > > https://github.com/RevolutionAnalytics/RRO > > There are also some very useful notes in the R Installation guide: > https://cran.r-project.org/doc/manuals/r-release/R-admin.html#BLAS > > Most packages do benefit from MKL (or any multi-threaded BLAS) to some > degree, although the actual benefit depends on the R functions they call. > Some packages (and some built-in R functions) don't call into BLAS endpoints, > so you won't see benefits in all cases. > > # David Smith > Dear David I'm in the situation mentioned here in the docs, since BLAS is not shared. "Note that under Unix (but not under Windows) if R is compiled against a non-default BLAS and --enable-BLAS-shlib is not used, then all BLAS-using packages must also be. So if R is re-built to use an enhanced BLAS then packages such as quantreg will need to be re-installed. " I am building all of the modules from scratch, so if the default build is sufficient, then I'll be done. When I asked the other day, I was worried that packages would find the wrong shared library. As far as I can tell now, I should not have been so worried. Today, while browsing the R installation, I find the Makeconf file and that has all the information a package should need. I've verified that the quantreg package detects this information, and we'll just hope the others do too :) In case anybody else comes along later and wonders how R can be configured to make this go, here's the top of our Makeconf from the installed R, which has the configure line as well as BLAS_LIBS, which, so far as I can tell, is making all of this go. Makeconf content # etc/Makeconf. Generated from Makeconf.in by configure. # # ${R_HOME}/etc/Makeconf # R was configured using the following call # (not including env. vars and site configuration) # configure '--prefix=/tools/cluster/6.2/R/3.2.2_mkl' '--with-tcltk' '--enable-R-shlib' '--enable-shared' '--with-pic' '--disable-BLAS-shlib' '--with-blas=-L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread -lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm' '--with-lapack' 'CFLAGS=-I/panfs/pfs.acf.ku.edu/cluster/system/pkg/R/curl7.45_install/include -L/panfs/pfs.acf.ku.edu/cluster/6.2/R/3.2.2_mkl/lib64' 'JAVA_HOME=/tools/cluster/6.2/java/jdk1.8.0_66' ## This fails if it contains spaces, or if it is quoted include $(R_SHARE_DIR)/make/vars.mk AR = ar ## Used by packages 'maps' and 'mapdata' AWK = gawk BLAS_LIBS = -L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread -lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm C_VISIBILITY = -fvisibility=hidden ... pj -- Paul E. Johnson Professor, Political ScienceDirector 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://crmda.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
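A hedged way to confirm at run time that the MKL BLAS is actually being exercised is a crude timing comparison on dense matrix work; the matrix size and the expectations in the comments are rough guides, not benchmarks:

set.seed(1)
n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(B <- crossprod(A))        # BLAS level-3 work
system.time(ch <- chol(B + diag(n)))  # LAPACK Cholesky on top of BLAS
## The reference BLAS typically needs several seconds here; a
## multi-threaded MKL build usually finishes in a small fraction of that.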
[Rd] Package Dependency
Lately, I see a dependency problem with R-3.2 systems. install.packages allows dependencies, but they are not installed so the install fails. I see this a lot with students who are able to install packages like ggplot2 but then can't load it because a dependency "stringi" is missing. Those students are generally using Windows, but today I reproduced the same on RedHat Linux with R-3.2.2. desiredPackages is a big vector of packages to install. This is using a trick that one of R Core taught me about 15 years ago... > installedPackages <- rownames (installed.packages() ) > alreadyHave <- desiredPackages %in% installedPackages > install.packages(desiredPackages[!alreadyHave], + dependencies = c("Depends"), + lib = TARGETLIB) trying URL 'http://rweb.crmda.ku.edu/cran/src/contrib/DiagrammeR_0.8.1.tar.gz' Content type 'application/x-gzip' length 3438156 bytes (3.3 MB) == downloaded 3.3 MB trying URL 'http://rweb.crmda.ku.edu/cran/src/contrib/emplik_1.0-2.tar.gz' Content type 'application/x-gzip' length 199124 bytes (194 KB) == downloaded 194 KB trying URL 'http://rweb.crmda.ku.edu/cran/src/contrib/glmc_0.2-4.tar.gz' Content type 'application/x-gzip' length 19721 bytes (19 KB) == downloaded 19 KB trying URL 'http://rweb.crmda.ku.edu/cran/src/contrib/sem_3.1-6.tar.gz' Content type 'application/x-gzip' length 156194 bytes (152 KB) == downloaded 152 KB trying URL 'http://rweb.crmda.ku.edu/cran/src/contrib/OpenMx_2.3.1.tar.gz' Content type 'application/x-gzip' length 2537264 bytes (2.4 MB) == downloaded 2.4 MB ERROR: dependency 'visNetwork' is not available for package 'DiagrammeR' * removing '/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/DiagrammeR' However, I can manually install that individual package > install.packages("visNetwork", + dependencies = c("Depends"), + lib = TARGETLIB) trying URL 'http://rweb.crmda.ku.edu/cran/src/contrib/visNetwork_0.1.2.tar.gz' Content type 'application/x-gzip' length 1828012 bytes (1.7 MB) == downloaded 1.7 MB * installing *source* package 'visNetwork' ... ** package 'visNetwork' successfully unpacked and MD5 sums checked ** R ** inst ** preparing package for lazy loading ** help *** installing help indices ** building package indices ** installing vignettes ** testing if installed package can be loaded * DONE (visNetwork) The downloaded source packages are in '/library/tmp/RtmpBImblN/downloaded_packages' After that, I start over, the install of the desiredPackages vector proceeds, but hits another snag: install.packages(desiredPackages[!alreadyHave], dependencies = c("Depends"), + + lib = TARGETLIB) trying URL 'http://rweb.crmda.ku.edu/cran/src/contrib/DiagrammeR_0.8.1.tar.gz' Content type 'application/x-gzip' length 3438156 bytes (3.3 MB) == downloaded 3.3 MB trying URL 'http://rweb.crmda.ku.edu/cran/src/contrib/sem_3.1-6.tar.gz' Content type 'application/x-gzip' length 156194 bytes (152 KB) == downloaded 152 KB trying URL 'http://rweb.crmda.ku.edu/cran/src/contrib/OpenMx_2.3.1.tar.gz' Content type 'application/x-gzip' length 2537264 bytes (2.4 MB) == downloaded 2.4 MB * installing *source* package 'DiagrammeR' ... 
** package 'DiagrammeR' successfully unpacked and MD5 sums checked ** R ** inst ** preparing package for lazy loading ** help *** installing help indices ** building package indices ** testing if installed package can be loaded * DONE (DiagrammeR) ERROR: dependency 'StanHeaders' is not available for package 'OpenMx' Go back, install just that one > install.packages("StanHeaders", dependencies = c("Depends"), lib = TARGETLIB) trying URL 'http://rweb.crmda.ku.edu/cran/src/contrib/StanHeaders_2.8.0.tar.gz' Content type 'application/x-gzip' length 479231 bytes (467 KB) == downloaded 467 KB * installing *source* package 'StanHeaders' ... ** package 'StanHeaders' successfully unpacked and MD5 sums checked ** inst ** help *** installing help indices ** building package indices ** testing if installed package can be loaded * DONE (StanHeaders) The downloaded source packages are in '/library/tmp/RtmpBImblN/downloaded_packages' I don't recall seeing this problem until the last few months. -- Paul E. Johnson Professor, Political ScienceDirector 1541 Lilac Lane, Room 504 Center
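A hedged helper for this situation is to loop the install until nothing on the wish list is still missing, letting install.packages resolve dependencies on each pass; desiredPackages and TARGETLIB are the objects from the messages above, and maxTries is invented for illustration:

installMissing <- function(desired, lib, repos = getOption("repos"),
                           maxTries = 3) {
    for (i in seq_len(maxTries)) {
        have <- rownames(installed.packages(lib.loc = c(lib, .libPaths())))
        need <- setdiff(desired, have)
        if (length(need) == 0) break
        install.packages(need, lib = lib, repos = repos,
                         dependencies = TRUE)
    }
    ## report anything still missing after the retries
    setdiff(desired, rownames(installed.packages(lib.loc = c(lib, .libPaths()))))
}
## installMissing(desiredPackages, lib = TARGETLIB)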
[Rd] something wrong in package submission procedure/website
Yesterday I uploaded new rockchalk_1.8.97. Then I received email saying that I needed to confirm the submission. Here's the message. Dear Paul E. Johnson Someone has submitted the package rockchalk to CRAN. You are receiving this email to confirm the submission as the maintainer of this package. To confirm the submission to CRAN, follow or copy & paste the following link into your browser: http://xmpalantir.wu.ac.at/cransubmit/conf_mail.php?code=b95e95eb5317901449ed7d311053bad8 If you did not submit the package or do not wish for it to be submitted to CRAN, simply ignore this email Submission Information: Submitter: Paul Johnson Package: rockchalk Version: 1.8.97 Note the email has the correct package version 1.8.97, what I uploaded yesterday. The email points to a webpage where I confirm submission. The 3rd box there requires me to agree to "I have fixed all probelms shown on the package check page(which links to http://cran.r-project.org/web/checks/check_results_rockchalk.html). The problem is that the package check has a report on the PREVIOUS rockchalk 1.8.92, not the one I uploaded yesterday. So none of the beautiful repairs I implemented show up :( pj -- Paul E. Johnson Professor, Political ScienceDirector 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://crmda.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] configure statement for R-devel with updated zlib in user account
I'm aware R-devel no longer includes zlib. This works fine on up-to-date Linux systems. On the older CentOS 6 cluster at KU, they have a zlib too old for R-devel. The R-devel configure fails thus:

checking if zlib version >= 1.2.5... no
checking whether zlib support suffices... configure: error: zlib library and headers are required

In other software, I've seen this kind of thing, so I installed zlib 1.2.8 into $HOME/packages and I can verify that the right pieces are installed there. Now I need you to tell me how to tell R to use that. Guesses so far:

$ export PATH=$HOME/packages/bin:$PATH
$ export LD_LIBRARY_PATH=$HOME/packages/lib:$LD_LIBRARY_PATH

What does R's configure statement need in order to find zlib in my user account? In R-devel, the ./configure --help doesn't give me a simple-enough explanation of what I'm supposed to do. I thought I had a solution by adding this

--with-libpth-prefix=$HOME/packages

but still rejection, new zlib not found.

--
Paul E. Johnson                     Director
Professor, Political Science        Center for Research Methods
1541 Lilac Lane, Room 504           University of Kansas
University of Kansas                http://crmda.ku.edu
http://pj.freefaculty.org

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] configure statement for R-devel with updated zlib in user account
On Mon, Feb 15, 2016 at 3:28 AM, Jesper Gådin wrote: > Hi Paul, > > This might be what you are looking for. > https://stat.ethz.ch/pipermail/r-devel/2015-April/070951.html > > Jesper > Thanks to help form Wes Mason at KU and Jesper Gadin in r-devel, I compiled R-devel on RHEL 6. R-devel requires install of newer gzip, bzip2, xz, pcre, and curl. Compiles of bzip2 and pcre require a slightly greater-than-usual amount of attention to detail. I wrote it all out: http://pj.freefaculty.org/blog/?p=315 "Building R-devel on RedHat Linux 6" Regards pj -- Paul E. Johnson Professor, Political ScienceDirector 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://crmda.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Suggestion: pdf.options(embed=TRUE)
I learned last week that I generated a lot of PDFs and forgot to go back and embed the fonts in them. It would be very pleasant for me if pdf.options included an argument to turn on font embedding and have fonts always embedded. Always. Thanks for your time, as always. -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu I only use this account for email list memberships. To write directly, address me at pauljohn at ku.edu. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
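Until an option like that exists, one workaround is to run embedFonts() on each file after the device is closed; a hedged sketch that assumes Ghostscript is available, as the embedFonts help page requires:

fn <- tempfile(fileext = ".pdf")
pdf(fn)
plot(rnorm(100), main = "example")
dev.off()
embedFonts(fn, outfile = fn)   # rewrite the file with fonts embedded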
[Rd] seq.int does not return a sequence of integers sometimes
I have a script that goes wrong because I assumed that seq.int would return integers. Below, please see that it does not, unless the user is super cautious about marking the inputs with "L". I think seq.int should do the coercion for me before returning the sequence.

> xx <- seq.int(1,10)
> class(xx)
[1] "integer"
> is.integer(xx)
[1] TRUE
> xx <- seq.int(1,10, 2)
> class(xx)
[1] "numeric"
> is.integer(xx)
[1] FALSE
> xx <- seq.int(1,10, 2L)
> class(xx)
[1] "numeric"
> is.integer(xx)
[1] FALSE
> xx <- seq.int(1L, 10L, 2L)
> class(xx)
[1] "integer"
> is.integer(xx)
[1] TRUE

I think all of those should have the same return value, if the function is correctly named seq.int.

--
Paul E. Johnson                     http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis    http://crmda.ku.edu

I only use this account for email list memberships. To write directly, address me at pauljohn at ku.edu.

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
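A hedged workaround while the behavior stands is to coerce the result explicitly, or build the sequence from integer arithmetic; a sketch only:

xx <- as.integer(seq.int(1, 10, 2))
is.integer(xx)                   # TRUE

yy <- seq.int(0L, 4L) * 2L + 1L  # 1 3 5 7 9, stays integer throughout
is.integer(yy)                   # TRUE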
[Rd] Cluster: Various GCC, how important is consistency?
On a cluster that is based on RedHat 6.2, we are updating to R-3.3.1. I have, from time to time, run into problems with various R packages and some older versions of GCC. I wish we had newer Linux in the cluster, but with 1000s of nodes running 1000s of jobs, well, they don't want a restart.

The administrator suggested I try to build with the GCC that is provided with the nodes, which is gcc-4.4.7. To my surprise, R-3.3.1 compiled with that. After that, I got quite far, many 100s of packages compiled, but then I hit a snag: RcppArmadillo explicitly refuses to build with anything older than gcc-4.6. The OpenMx and emplik packages also refuse to compile with old gcc.

The cluster uses a module system, so it is easy enough to swap in various gcc versions to see what compiles. I did succeed compiling RcppArmadillo with gcc 4.9.2. But Rcpp is not picky, it compiled with gcc-4.4.7.

I worry...

1) Will reliance on various GCC versions make the packages incompatible with R, or each other? I logged out, logged back in, and with R 3.3.1 I can run library(RcppArmadillo) and library(Rcpp) with no errors so far. But I'm not stress testing it much. Should I rebuild everything? I expect that if I were to use gcc-6 on one package, it would not be compatible with binaries built with 4.4.7. But is there a zone of tolerance allowing 4.4.7 and 4.9 packages to coexist?

2) If I build with a non-default GCC, are all of the R users going to hit trouble if they don't have the same GCC I use? Unless I make some extraordinary effort, they are getting GCC 4.4.7. If they try to install a package, they are getting that GCC, not the one I use to build RcppArmadillo or the other trouble cases (or everything, if you say I need to go back and rebuild).

From an administrative point of view, should I tie R-3.3.1 to a particular version of GCC? I think I could learn how to do that. On the cluster, they use the module framework. There are about 50 versions of GCC. It is easy enough to ask for a newer one:

$ module load gcc/4.9.2

It puts the gcc 4.9.2 binaries and shared libraries at the front of the PATHs.

pj
--
Paul E. Johnson                     http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis    http://crmda.ku.edu

To write me directly, address me at pauljohn at ku.edu.

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Cluster: Various GCC, how important is consistency?
Dear Jeroen Did you rebuild R-3.3.1 and all of the packages with GCC-5.3 in order to make this work? The part that worries me is that the shared libraries won't be consistent, with various versions of GCC in play. On Tue, Oct 18, 2016 at 5:55 AM, Jeroen Ooms wrote: > On Tue, Oct 18, 2016 at 1:44 AM, Paul Johnson wrote: >> >> Administrator suggested I try to build with the GCC that is provided >> with the nodes, which is gcc-4.4.7. > > Redhat provides an alternative compiler (gcc 5.3 based) in one of it's > opt-in repositories called "redhat developer toolkit" (RDT). In CentOS > you install it as follows: > > yum install -y centos-release-scl > yum install -y devtoolset-4-gcc-c++ > > This compiler is specifically designed to be used alongside the EL6 > stock gcc 4.4.7. It includes a simple 'enable' script which will put > RDT gcc and g++ in front of your PATH and LD_LIBRARY_PATH and so on. > > So what I do on CentOS is install R from EPEL (built with stock gcc > 4.4.7) and whenever I need to install an R package that uses e.g. > CXX11, simply start an R shell using the RDT compilers: > >source /opt/rh/devtoolset-4/enable >R > > From what I have been able to test, this works pretty well (though I > am not a regular EL user). But I was able to build R packages that use > C++11 (such as feather) and once installed, these packages can be used > even in a regular R session (without RDT enabled). -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu I only use this account for email list memberships. To write directly, address me at pauljohn at ku.edu. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] syntax difference clusterExport in parallel and snow
We got some errors and eventually figured out that the second argument of parallel::clusterExport is "varlist" while in snow::clusterExport it is "list". The user had loaded parallel first, but then did something else which inadvertently loaded snow; after that, clusterExport failed because we had supplied "varlist" and not "list". Are these different on purpose?

pj
--
Paul E. Johnson                     http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis    http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
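A hedged defensive idiom until the two APIs agree is to qualify the call with the package name, so a stray library(snow) cannot change which clusterExport is found; a small sketch:

library(parallel)
cl <- makeCluster(2)
x <- 1:10
parallel::clusterExport(cl, varlist = "x")   # explicit, cannot be shadowed
clusterEvalQ(cl, sum(x))
stopCluster(cl)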
[Rd] duplicated factor labels.
Dear R devel I've been wondering about this for a while. I am sorry to ask for your time, but can one of you help me understand this? This concerns duplicated labels, not levels, in the factor function. I think it is hard to understand that factor() fails, but levels() after does not > x <- 1:6 > xlevels <- 1:6 > xlabels <- c(1, NA, NA, 4, 4, 4) > y <- factor(x, levels = xlevels, labels = xlabels) Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, : factor level [3] is duplicated > y <- factor(x, levels = xlevels) > levels(y) <- xlabels > y [1] 1 444 Levels: 1 4 If the latter use of levels() causes a good, expected result, couldn't factor(..., labels = xlabels) be made to the same thing? That's the gist of it. To signal to you that I've been trying to figure this out on my own, here is a revision I've tested in R's factor function which "seems" to fix the matter. (Of course, probably causes lots of other problems I don't understand, that's why I'm writing to you now.) In the factor function, the class of f is assigned *after* levels(f) is called levels(f) <- ## nl == nL or 1 if (nl == nL) as.character(labels) else paste0(labels, seq_along(levels)) class(f) <- c(if(ordered) "ordered", "factor") At that point, f is an integer, and levels(f) is a primitive > `levels<-` function (x, value) .Primitive("levels<-") That's what generates the error. I don't understand well what .Primitive means here. I need to walk past that detail. Suppose I revise the factor function to put the class(f) line before the level(). Then `levels<-.factor` is called and all seems well. factor <- function (x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA) { if (is.null(x)) x <- character() nx <- names(x) if (missing(levels)) { y <- unique(x, nmax = nmax) ind <- sort.list(y) levels <- unique(as.character(y)[ind]) } force(ordered) if (!is.character(x)) x <- as.character(x) levels <- levels[is.na(match(levels, exclude))] f <- match(x, levels) if (!is.null(nx)) names(f) <- nx nl <- length(labels) nL <- length(levels) if (!any(nl == c(1L, nL))) stop(gettextf("invalid 'labels'; length %d should be 1 or %d", nl, nL), domain = NA) ## class() moved up 3 rows class(f) <- c(if (ordered) "ordered", "factor") levels(f) <- if (nl == nL) as.character(labels) else paste0(labels, seq_along(levels)) f } > assignInNamespace("factor", factor, "base") > x <- 1:6 > xlevels <- 1:6 > xlabels <- c(1, NA, NA, 4, 4, 4) > y <- factor(x, levels = xlevels, labels = xlabels) > y [1] 1 444 Levels: 1 4 > attributes(y) $class [1] "factor" $levels [1] "1" "4" That's a "good" answer for me. But I broke your function. I eliminated the check for duplicated levels. > y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels) > y [1] 14 Levels: 1 4 Rather than have factor return the "duplicated levels" error when there are duplicated values in labels, I wonder why it is not better to have a check for duplicated levels directly. 
For example, insert a new else in this stanza if (missing(levels)) { y <- unique(x, nmax = nmax) ind <- sort.list(y) levels <- unique(as.character(y)[ind]) } ##next is new part else { levels <- unique(levels) } That will cause an error when there are duplicated levels because there are more labels than levels: > y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels) Error in factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels) : invalid 'labels'; length 6 should be 1 or 2 So, in conclusion, if levels() can work after creating a factor, I wish equivalent labels argument would be accepted. What is your opinion? pj -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu To write to me directly, please address me at pauljohn at ku.edu. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] duplicated factor labels.
On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys wrote: > To extwnd on Martin 's explanation : > > In factor(), levels are the unique input values and labels the unique output > values. So the function levels() actually displays the labels. > Dear Joris I think we agree. Currently, factor insists both levels and labels be unique. I wish that it would not accept nonunique labels. I also understand it is impractical to change this now in base R. I don't think I succeeded in explaining why this would be nicer. Here's another example. Fairly often, we see input data like x <- c("Male", "Man", "male", "Man", "Female") The first four represent the same value. I'd like to go in one step to a new factor variable with enumerated types "Male" and "Female". This fails xf <- factor(x, levels = c("Male", "Man", "male", "Female"), labels = c("Male", "Male", "Male", "Female")) Instead, we need 2 steps. xf <- factor(x, levels = c("Male", "Man", "male", "Female")) levels(xf) <- c("Male", "Male", "Male", "Female") I think it is quirky that `levels<-.factor` allows the duplicated labels, whereas factor does not. I wrote a function rockchalk::combineLevels to simplify combining levels, but most of the students here like plyr::mapvalues to do it. The use of levels() can be tricky because one must enumerate all values, not just the ones being changed. But I do understand Martin's point. Its been this way 25 years, it won't change. :). > Cheers > Joris > > -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu To write to me directly, please address me at pauljohn at ku.edu. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Rmpi, openMPI editions.
Greetings. I see a warning message while compiling OpenMPI and would appreciate it if you tell me what it means. This warning happens with any OpenMPI > 1.6.5. Even before starting a cluster, just "sessionInfo" triggers this warning. I'm pasting in the message from R-3.3.2 (this is MRO). Do the R parallel package cluster functions violate the warnings described here? > library("Rmpi") > sessionInfo() -- An MPI process has executed an operation involving a call to the "fork()" system call to create a child process. Open MPI is currently operating in a condition that could result in memory corruption or other system errors; your MPI job may hang, crash, or produce silent data corruption. The use of fork() (or system() or other calls that create child processes) is strongly discouraged. The process that invoked fork was: Local host: n401 (PID 114242) MPI_COMM_WORLD rank: 0 If you are *absolutely sure* that your application will successfully and correctly survive a call to fork(), you may disable this warning by setting the mpi_warn_on_fork MCA parameter to 0. -- R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rmpi_0.6-6 RevoUtilsMath_10.0.0 loaded via a namespace (and not attached): [1] RevoUtils_10.0.2 parallel_3.3.2 tools_3.3.2 > What I think this means is that we need to never run any multicore functions and we need to be very careful that MKL or such does not launch threads. Is that right? Is it worse than that? Why am I chasing this one today? I've been on an adventure compiling R in a RedHat 6 cluster again. The cluster admins here like the Microsoft R, and they had both 3.3 and 3.4 installed. However, we found some packaging flaws in 3.4 and so that MRO was removed. I'm interested in building R-3.4, but it is a pretty big job on the old RedHat. I want to get this correct. I've run into the problem I'd forgotten about OpenMPI. If OpenMPI >= 2, then Rmpi will compile, but jobs hang with "stopCluster". With OpenMPI-1.6.5, we get a clean build and no warnings, and clusters do start and stop cleanly. With newer version 1 editions of OpenMPI, such as 1.10 or 1.12 (I suspect any versions (> 1.6.5), the Rmpi generates an intimidating warning, but the cluster will stop when asked. -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu To write to me directly, please address me at pauljohn at ku.edu. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
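One way to act on the advice in that warning text, assuming Open MPI honors its usual OMPI_MCA_* environment variables, is to set the MCA parameter before Rmpi initializes MPI; a hedged sketch, and not a claim that forking is actually safe in an MPI job:

## set before library(Rmpi) triggers MPI_Init; only sensible after
## deciding that fork()-based calls (mclapply, system, ...) are avoided
Sys.setenv(OMPI_MCA_mpi_warn_on_fork = "0")
library(Rmpi)
mpi.universe.size()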
[Rd] arrows: no vectors for "code" and "angle" parameters
I was teaching new R users to make some fun graphs. I had some arrows examples worked up when we came across a problem. The arrows function ignores the 2nd and following elements of vectors given as code and angle. Would you please consider 1) allowing vectors for code and angle, or 2) returning an error or warning when a user mistakenly supplies a vector for those parameters? When code like this is wrapped into a Sweave document--where I don't see the graph on the screen--this error is difficult to catch while proofreading.

Example:

x0 <- c(-1, -4, 4.5)
y0 <- c(-1, -4, -8)
x1 <- c(2, -2, -3)
y1 <- c(4, 4, 18)
mylengths <- c(0.2, 0.3, 0.15)
mycodes <- c(3, 2, 1)
myangle <- c(10, 60, 80)
plot(x = c(-5, 5), y = c(-10, 20), type = "n", xlab = "", ylab = "")
arrows(x0 = x0, y0 = y0, x1 = x1, y1 = y1, length = mylengths,
       code = mycodes, angle = myangle)

I found a workaround, but this is more difficult to explain to beginners...

plot(x = c(-5, 5), y = c(-10, 20), type = "n", xlab = "", ylab = "")
mapply(arrows, x0 = x0, y0 = y0, x1 = x1, y1 = y1, length = mylengths,
       angle = myangle, code = mycodes)

pj
--
Paul E. Johnson                     http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis    http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] duplicated factor labels.
return(x) } mapidx <- match(x, from) mapidxNA <- is.na(mapidx) # index of items in `from` that were found in `x` from_found <- sort(unique(mapidx)) if (warn_missing && length(from_found) != length(from)) { message("The following `from` values were not present in `x`: ", paste(from[!(1:length(from) %in% from_found) ], collapse = ", ")) } x[!mapidxNA] <- to[mapidx[!mapidxNA]] x } In the rockchalk package, I wrote a function called combineLevels that is careful with ordinal variables and only allows adjacent values to be combined. I'm not suggesting you go that far with this simple piece. > > > > >> >> -pd >> >>> On 23 Jun 2017, at 10:42 , Martin Maechler >>> wrote: >>> >>>>>>>> Martin Maechler >>>>>>>> on Thu, 22 Jun 2017 11:43:59 +0200 writes: >>> >>> >>>>>>>> Paul Johnson >>>>>>>> on Fri, 16 Jun 2017 11:02:34 -0500 writes: >>> >>> >>>>> On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys >>>>> wrote: >>>>>> >>>>>> To extwnd on Martin 's explanation : >>>>>> >>>>>> In factor(), levels are the unique input values and labels the unique >>>>>> output >>>>>> values. So the function levels() actually displays the labels. >>>>>> >>> >>>>> Dear Joris >>> >>> >>>>> I think we agree. Currently, factor insists both levels and labels be >>>>> unique. >>> >>> >>>>> I wish that it would not accept nonunique labels. I also understand it >>>>> is impractical to change this now in base R. >>> >>> >>>>> I don't think I succeeded in explaining why this would be nicer. >>>>> Here's another example. Fairly often, we see input data like >>> >>> >>>>> x <- c("Male", "Man", "male", "Man", "Female") >>> >>> >>>>> The first four represent the same value. I'd like to go in one step >>>>> to a new factor variable with enumerated types "Male" and "Female". >>>>> This fails >>> >>> >>>>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"), >>>>> labels = c("Male", "Male", "Male", "Female")) >>> >>> >>>>> Instead, we need 2 steps. >>> >>> >>>>> xf <- factor(x, levels = c("Male", "Man", "male", "Female")) >>>>> levels(xf) <- c("Male", "Male", "Male", "Female") >>> >>> >>>>> I think it is quirky that `levels<-.factor` allows the duplicated >>>>> labels, whereas factor does not. >>> >>> >>>>> I wrote a function rockchalk::combineLevels to simplify combining >>>>> levels, but most of the students here like plyr::mapvalues to do it. >>>>> The use of levels() can be tricky because one must enumerate all >>>>> values, not just the ones being changed. >>> >>> >>>>> But I do understand Martin's point. Its been this way 25 years, it >>>>> won't change. :). >>> >>> >>>> Well.. the above is a bit out of context. >>> >>> >>>> Your first example really did not make a point to me (and Joris) >>>> and I showed that you could use even two different simple factor() calls >>>> to >>>> produce what you wanted >>>> yc <- factor(c("1",NA,NA,"4","4","4")) >>>> yn <- factor(c( 1, NA,NA, 4, 4, 4)) >>> >>> >>>> Your new example is indeed much more convincing ! >>> >>> >>>> (Note though that the two steps that are needed can be written >>>> more shortly >>> >>> >>>> The "been this way 25 years" is one a reason to be very >>>> cautious(*) with changes, but not a reason for no changes! >>> >>> >>>> (*) Indeed as some of you have noted we really should not "break >>>> behavior". >>>> This means to me we cannot accept a change there which gives >>>> an error or a different result in cases the old behavior gave a valid >>>> factor. >>> >>> >>>> I'm looking at a possible change currently >>>> [not promising that a change will happen ...] 
>>> >>> >>> In the end, I've liked the change (after 2-3 iterations), and >>> now been brave to commit to R-devel (svn 72845). >>> >>> With the change, I had to disable one of our own regression >>> checks (tests/reg-tests-1b.R, line 726): >>> >>> The following is now (in R-devel -> R 3.5.0) valid: >>> >>>> factor(1:2, labels = c("A","A")) >>> >>>[1] A A >>>Levels: A >>>> >>>> >>> >>> I wonder how many CRAN package checks will "break" from >>> this (my guess is in the order of a dozen), but I hope >>> that these breakages will be benign, e.g., similar to the above >>> case where before an error was expected via tools :: assertError(.) >>> >>> Martin >>> >>> __ >>> R-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >> >> > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu To write to me directly, please address me at pauljohn at ku.edu. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Rmpi, openMPI editions.
API v2.0, Component v1.6.5) MCA btl: openib (MCA v2.0, API v2.0, Component v1.6.5) MCA btl: self (MCA v2.0, API v2.0, Component v1.6.5) MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.5) MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5) And like this after changing either ~/openmpi/mca-params.conf or, etc/openmpi-mca-params.conf). MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6.5) MCA btl: self (MCA v2.0, API v2.0, Component v1.6.5) MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.5) MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5) I believe it is worth mentioning that, if some of your compute nodes have Infiniband, an some do not, then OpenMPI jobs will crash if they try to integrate nodes connected with ethernet and Infiniband. That is another reason to tell OpenMPI not to try to use Infiniband at all. pj On Mon, Jun 19, 2017 at 2:34 PM, Paul Johnson wrote: > Greetings. > > I see a warning message while compiling OpenMPI and would appreciate > it if you tell me what it means. > > This warning happens with any OpenMPI > 1.6.5. Even before starting a > cluster, just "sessionInfo" triggers this warning. > > I'm pasting in the message from R-3.3.2 (this is MRO). > > Do the R parallel package cluster functions violate the warnings described > here? > >> library("Rmpi") >> sessionInfo() > -- > An MPI process has executed an operation involving a call to the > "fork()" system call to create a child process. Open MPI is currently > operating in a condition that could result in memory corruption or > other system errors; your MPI job may hang, crash, or produce silent > data corruption. The use of fork() (or system() or other calls that > create child processes) is strongly discouraged. > > The process that invoked fork was: > > Local host: n401 (PID 114242) > MPI_COMM_WORLD rank: 0 > > If you are *absolutely sure* that your application will successfully > and correctly survive a call to fork(), you may disable this warning > by setting the mpi_warn_on_fork MCA parameter to 0. > -- > R version 3.3.2 (2016-10-31) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rmpi_0.6-6 RevoUtilsMath_10.0.0 > > loaded via a namespace (and not attached): > [1] RevoUtils_10.0.2 parallel_3.3.2 tools_3.3.2 >> > > What I think this means is that we need to never run any multicore > functions and we need to be very careful that MKL or such does not > launch threads. Is that right? Is it worse than that? > > Why am I chasing this one today? > > I've been on an adventure compiling R in a RedHat 6 cluster again. The > cluster admins here like the Microsoft R, and they had both 3.3 and > 3.4 installed. However, we found some packaging flaws in 3.4 and so > that MRO was removed. I'm interested in building R-3.4, but it is a > pretty big job on the old RedHat. I want to get this correct. > > I've run into the problem I'd forgotten about OpenMPI. If OpenMPI >= > 2, then Rmpi will compile, but jobs hang with "stopCluster". With > OpenMPI-1.6.5, we get a clean build and no warnings, and clusters do > start and stop cleanly. With newer version 1 editions of OpenMPI, > such as 1.10 or 1.12 (I suspect any versions (> 1.6.5), the Rmpi > generates an intimidating warning, but the cluster will stop when > asked. > > > -- > Paul E. 
Johnson http://pj.freefaculty.org > Director, Center for Research Methods and Data Analysis http://crmda.ku.edu > > To write to me directly, please address me at pauljohn at ku.edu. -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu To write to me directly, please address me at pauljohn at ku.edu. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] what do you think about write.table(... qmethod = "excel")?
Last week one of our clients reported trouble with a csv file I generated with write.table. He said that columns with quotes for character variables were rejected by their data importer, which was revised to match the way Microsoft Excel uses quotation marks in character variables. I explained to them that quoted character variables are virtuous and wise, of course, but they say Microsoft Excel CSV export no longer quotes characters unless they include commas in the values. They showed me a CSV file from Excel that looked like this x1,x2,x3,x4 5 6 fred,barney,betty,x bambam,"fred,wilma",pebbles,y Note how the quotes only happen on row 2 column 2. I was surprised it did that, but now I have some pressure to write a csv maker that has that structure. Its weird, even when there are spaces in values there are no quotation marks. Has anybody done this and verified that it matches CSV from MS Excel? If I succeed will you consider a patch? pj -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu To write to me directly, please address me at pauljohn at ku.edu. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] what do you think about write.table(... qmethod = "excel")?
On Tue, Sep 19, 2017 at 4:45 PM, Duncan Murdoch wrote: > > > That's true, but if that's what they want to do, and they're willing to pay > to be able to write files that imitate Excel, then why not do what they ask? > > On the other hand, if they aren't willing to pay for the work, then you > should lecture them on how silly their request is. > > In any case, base R functions should not include nonsense, so this is not > something that should go into R. > > Duncan Murdoch > I understand. This is a paying client, I'm going where the money goes. Here's my current working example of a function that writes a CSV exactly as Excel does. I've posted this into StackOverflow (https://stackoverflow.com/questions/25743018/how-to-conditionally-remove-quotes-in-write-csv). It is buried under a thread, maybe will not get much attention and bug-checking. Oh, well, I can hope. Processes variables column-by-column, I don't know how to do it differently. ##' Write CSV files with quotes same as MS Excel 2013 or newer ##' ##' R's write.csv inserts quotes around all elements in a character ##' vector (if quote = TRUE). In contrast, MS Excel CSV export no ##' longer inserts quotation marks on all elements in character ##' variables, except when the cells include commas or quotation ##' marks. This function generates CSV files that are, so far as we ##' know, exactly the same "quoted style" as MS Excel CSV export ##' files. ##' ##' This works by manually inserting quotation marks where necessary and ##' turning FALSE R's own method to insert quotation marks. ##' @param x a data frame ##' @param file character string for file name ##' @param row.names Default FALSE for row.names ##' @importFrom utils write.table ##' @return the return from write.table, using revised quotes ##' @export ##' @author Paul Johnson ##' @examples ##' set.seed(234) ##' x1 <- data.frame(x1 = c("a", "b,c", "b", "The \"Washington, DC\""), ##' x2 = rnorm(4), stringsAsFactors = FALSE) ##' x1 ##' fn <- tempfile(pattern = "testcsv", fileext = ".csv") ##' writeCSV(x1, file = fn) ##' readLines(fn) ##' x2 <- read.table(fn, sep = ",", header = TRUE, stringsAsFactors = FALSE) ##' all.equal(x1,x2) writeCSV <- function(x, file, row.names = FALSE){ xischar <- colnames(x)[sapply(x, is.character)] for(jj in xischar){ x[ , jj] <- gsub('"', '""', x[ , jj], fixed = TRUE) needsquotes <- grep('[\",]', x[ ,jj]) x[needsquotes, jj] <- paste0("\"", x[needsquotes, jj], "\"") } write.table(x, file = file, sep = ",", quote = FALSE, row.names = row.names) } Output: > set.seed(234) > x1 <- data.frame(x1 = c("a", "b,c", "b", "The \"Washington, DC\""), +x2 = rnorm(4), stringsAsFactors = FALSE) > x1 x1 x2 1a 0.6607697 2 b,c -2.0529830 3b -1.4992061 4 The "Washington, DC" 1.4712331 > fn <- tempfile(pattern = "testcsv", fileext = ".csv") > writeCSV(x1, file = fn) > readLines(fn) [1] "x1,x2" [2] "a,0.660769736644892" [3] "\"b,c\",-2.052983003941" [4] "b,-1.49920605110092" [5] "\"The \"\"Washington, DC\"\"\",1.4712331168047" > x2 <- read.table(fn, sep = ",", header = TRUE, stringsAsFactors = FALSE) > all.equal(x1,x2) [1] TRUE I already see one problem, that I've got no special arrangement for column names with commas or quotes. People who want column names with those things are even more wrong than the people write a parser that can't understand quotes on character variables. -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu To write to me directly, please address me at pauljohn at ku.edu. 
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
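For the column-name gap noted at the end of that message, a hedged extension would apply the same quoting rule to the header and write it separately; the helper below is invented for illustration and has not been checked against Excel's own output:

excelQuote <- function(v) {
    ## same rule as the cells: double embedded quotes, then quote any
    ## element containing a comma or a quotation mark
    v <- gsub('"', '""', v, fixed = TRUE)
    needs <- grepl('[",]', v)
    v[needs] <- paste0('"', v[needs], '"')
    v
}
## inside writeCSV, one might write the header line first:
## writeLines(paste(excelQuote(colnames(x)), collapse = ","), con = file)
## write.table(x, file = file, sep = ",", quote = FALSE,
##             row.names = FALSE, col.names = FALSE, append = TRUE)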
[Rd] package check fail on Windows-release only?
I mistakenly left a write in "/tmp" in the rockchalk package (version 1.8.109) that I uploaded last Friday. Kurt H wrote and asked me to fix today. While uploading a new one, I became aware of a problem I had not seen. The version I uploaded last Friday, 1.8.109, has OK status on all platforms except r-release-windows-ix86+x86_64. I get OK on oldrel-windows and also on devel-windows. https://cran.r-project.org/web/checks/check_results_rockchalk.html Can you please advise me on what to do? I don't understand, well, anything about this :( The error says: * installing *source* package 'rockchalk' ... ** package 'rockchalk' successfully unpacked and MD5 sums checked ** R ** data ** inst ** preparing package for lazy loading Warning: S3 methods 'print.sparseSummary', 'print.diagSummary', 'c.abIndex', 'c.sparseVector', 'as.array.Matrix', 'as.array.sparseVector', 'as.matrix.Matrix', 'as.matrix.sparseVector', 'as.vector.Matrix', 'as.vector.sparseVector' were declared in NAMESPACE but not found Error in namespaceExport(ns, exports) : undefined exports: %&%, Cholesky, .SuiteSparse_version, Diagonal, .symDiagonal, .sparseDiagonal, .trDiagonal, Hilbert, KhatriRao, Matrix, MatrixClass, spMatrix, sparseMatrix, rsparsematrix, Schur, abIseq, abIseq1, rep2abI, band, bandSparse, bdiag, .bdiag, c.sparseVector, condest, onenormest, .asmatrix, .dsy2mat, .dsy2dsp, .dxC2mat, .T2Cmat, ..2dge, .dense2sy, .C2nC, .nC2d, .nC2l, .diag.dsC, .solve.dgC.chol, .solve.dgC.qr, .solve.dgC.lu, diagN2U, diagU2N, .diagU2N, .diag2tT, .diag2sT, .diag2mat, drop0, expand, expm, facmul, fac2sparse, fac2Sparse, forceSymmetric, T2graph, graph2T, anyDuplicatedT, uniqTsparse, isTriangular, isDiagonal, isLDL, is.null.DN, invPerm, lu, nearPD, nnzero, formatSpMatrix, formatSparseM, .formatSparseSimple, printSpMatrix, printSpMatrix2, qrR, rankMatrix, readHB, readMM, sparse.model.matrix, sparseVector, symmpart, skewpart, tril, triu, updown, pack, unpack, .updateCHMfactor, .validateCsparse, writeMM, cBind, rBind ERROR: lazy loading failed for package 'rockchalk' * removing 'd:/Rcompile/CRANpkg/lib/3.4/rockchalk' * restoring previous 'd:/Rcompile/CRANpkg/lib/3.4/rockchalk' -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu To write to me directly, please address me at pauljohn at ku.edu. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Two R editions in Unix cluster systems
Dear R Devel,

Some of our R users are still insisting we run R-2.15.3 because of difficulties with a package called OpenMx. It can't cooperate with new R, oh well. Other users need to run R-3.0.1. I'm looking for the most direct route to install both, and allow users to choose at runtime. In the cluster, things run faster if I install RPMs to each node, rather than putting R itself on the NFS share (but I could do that if you think it's really better).

In the past, I've used the SRPM packaging from the EPEL repository to make a few little path changes and build an R RPM for our cluster nodes. Now I face the problem of building 2 RPMs, one for R-2.15.3 and one for R-newest, and somehow keeping them separate. If you were me, how would you approach this? Here's my guess.

First, the RPM packages need unique names, of course.

Second, leave the RPM packaging for R-newest exactly the same as it always was. R is in the path, the R script and references among all the bits will be fine, no need to fight. It will find what it needs in /usr/lib64/R or whatnot.

For the legacy R, I'm considering 2 ideas. I could install R with the same prefix, /usr, but be very careful so the R bits are installed into separate places. I just made a fresh build of R, and on RedHat 6 it appears to me R installs these directories: bin, libdir, share. So what if the configure line has the magic bindir=/usr/bin-R-2.15.3, libdir=/usr/lib64/R-2.15.3, and whatnot? If I were doing Debian packaging, I suppose I'd be obligated (by the file system standard) to do that kind of thing. But it looks like a headache.

The easy road is to set the prefix at some out-of-the-way place, like /opt/R-2.15.3, and then use a post-install script to link /opt/R-2.15.3/bin/R to /usr/bin/R-2.15.3. When I tried that, it surprised me because R did not complain about lack of access to devel headers. It configures and builds fine.

R is now configured for x86_64-unknown-linux-gnu

  Source directory:          .
  Installation directory:    /tmp/R
  C compiler:                gcc -std=gnu99 -g -O2
  Fortran 77 compiler:       gfortran -g -O2
  C++ compiler:              g++ -g -O2
  Fortran 90/95 compiler:    gfortran -g -O2
  Obj-C compiler:            gcc -g -O2 -fobjc-exceptions
  Interfaces supported:      X11, tcltk
  External libraries:        readline, ICU, lzma
  Additional capabilities:   PNG, JPEG, TIFF, NLS, cairo
  Options enabled:           shared BLAS, R profiling, Java
  Recommended packages:      yes

Should I worry about any runtime complications of this older R finding bits of the newer R in the PATH ahead of it? I worry I'm making lazy assumptions. After that, I need to do some dancing with the RPM packaging.

I suppose there'd be some comfort if I could get the users to define R_HOME in their user environment before launching jobs. I think that would eliminate the danger of confusion between versions, wouldn't it?

pj
--
Paul E. Johnson
Professor, Political Science        Assoc. Director
1541 Lilac Lane, Room 504           Center for Research Methods
University of Kansas                University of Kansas
http://pj.freefaculty.org           http://quant.ku.edu

[[alternative HTML version deleted]]

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Tab formatting in dummy.coef.R
Happy New Year I recognize this is a low priority issue, but... I'll fix it if you let me. There are some TABs where R style calls for 4 spaces. For example R-3.0.2/src/library/stats/R/dummy.coef.R. I never noticed this until today, when I was stranded on a deserted island with only the R source code and a Swiss Army knife (vi). Now I realize my ~.vimrc has tabstop set at 2, and it makes tab indentations in R source "go backwards". if you are strictly following R Internals 9 "R Coding Standards", the TABs should be replaced with spaces, I shouldn't have to reconfigure the editor. Here's the way it looks in my vi dummy.coef.aovlist <- function(object, use.na = FALSE, ...) { Terms <- terms(object, specials="Error") err <- attr(Terms,"specials")$Error - 1 tl <- attr(Terms, "term.labels")[-err] int <- attr(Terms, "intercept") facs <- attr(Terms, "factors")[-c(1,1+err), -err, drop=FALSE] vars <- rownames(facs) xl <- attr(object, "xlevels") if(!length(xl)) { # no factors in model return(as.list(coef(object))) } nxl <- setNames(rep.int(1, length(vars)), vars) tmp <- unlist(lapply(xl, length)) nxl[names(tmp)] <- tmp lterms <- apply(facs, 2L, function(x) prod(nxl[x > 0])) nl <- sum(lterms) args <- setNames(vector("list", length(vars)), vars) for(i in vars) args[[i]] <- if(nxl[[i]] == 1) rep.int(1, nl) else factor(rep.int(xl[[i]][1L], nl), levels = xl[[i]]) dummy <- do.call("data.frame", args) pos <- 0 rn <- rep.int(tl, lterms) rnn <- rep.int("", nl) for(j in tl) { i <- vars[facs[, j] > 0] ifac <- i[nxl[i] > 1] if(length(ifac) == 0L) {# quantitative factor rnn[pos + 1] <- j } else if(length(ifac) == 1L) { # main effect dummy[ pos+1L:lterms[j], ifac ] <- xl[[ifac]] rnn[ pos+1L:lterms[j] ] <- as.character(xl[[ifac]]) [... snip ] The patch looks like this in my vi: *** dummy.coef.R 2013-03-05 17:02:23.0 -0600 --- dummy.coef-indent.R 2014-01-02 21:49:25.828163544 -0600 *** *** 28,35 Terms <- delete.response(Terms) vars <- all.vars(Terms) xl <- object$xlevels ! if(!length(xl)) { # no factors in model ! return(as.list(coef(object))) } nxl <- setNames(rep.int(1, length(vars)), vars) tmp <- unlist(lapply(xl, length)) ## ?? vapply(xl, length, 1L) --- 28,35 Terms <- delete.response(Terms) vars <- all.vars(Terms) xl <- object$xlevels ! if(!length(xl)) {# no factors in model ! return(as.list(coef(object))) } nxl <- setNames(rep.int(1, length(vars)), vars) tmp <- unlist(lapply(xl, length)) ## ?? vapply(xl, length, 1L) *** *** 38,64 nl <- sum(lterms) args <- setNames(vector("list", length(vars)), vars) for(i in vars) ! args[[i]] <- if(nxl[[i]] == 1) rep.int(1, nl) ! else factor(rep.int(xl[[i]][1L], nl), levels = xl[[i]]) dummy <- do.call("data.frame", args) pos <- 0 rn <- rep.int(tl, lterms) rnn <- rep.int("", nl) for(j in tl) { ! i <- vars[facs[, j] > 0] ! ifac <- i[nxl[i] > 1] ! if(length(ifac) == 0L) {# quantitative factor ! rnn[pos+1] <- j ! } else if(length(ifac) == 1L) { # main effect ! dummy[ pos+1L:lterms[j], ifac ] <- xl[[ifac]] ! rnn[ pos+1L:lterms[j] ] <- as.character(xl[[ifac]]) ! } else { # interaction ! tmp <- expand.grid(xl[ifac]) ! dummy[ pos+1L:lterms[j], ifac ] <- tmp ! rnn[ pos+1L:lterms[j] ] <- ! apply(as.matrix(tmp), 1L, function(x) paste(x, collapse=":")) ! } ! pos <- pos + lterms[j] } ## some terms like poly(x,1) will give problems here, so allow ## NaNs and set to NA afterwards. --- 38,64 nl <- sum(lterms) args <- setNames(vector("list", length(vars)), vars) for(i in vars) ! args[[i]] <- if(nxl[[i]] == 1) rep.int(1, nl) ! 
else factor(rep.int(xl[[i]][1L], nl), levels = xl[[i]]) dummy <- do.call("data.frame", args) pos <- 0 rn <- rep.int(tl, lterms) rnn <- rep.int("", nl) for(j in tl) { ! i <- vars[facs[, j] > 0] ! ifac <- i[nxl[i] > 1] ! if(length(ifac) == 0L) {# quantitative factor ! rnn[pos+1] <- j ! } else if(length(ifac) == 1L) {# main effect ! dummy[ pos+1L:lterms[j], ifac ] <- xl[[ifac]] ! rnn[ pos+1L:lterms[j] ] <- as.character(xl[[ifac]]) ! } else {# interaction ! tmp <- expand.grid(xl[ifac]) ! dummy[ pos+1L:lterms[j], ifac ] <- tmp ! rnn[ pos+1L:lterms[j] ] <- ! apply(as.matrix(tmp), 1L, function(x) paste(x, collapse=":")) ! } ! pos <- pos + lterms[j] } [...snip...] I could make a patch that applies at the top level, maybe one for each library. If one of you will consider using it. Again, I recognize this is a small thing, the editor c
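While preparing such a patch, a quick way to see how widespread the TAB indentation is (a sketch; the directory is whatever local checkout of the R sources is at hand):

```
## list .R files under a directory that still contain TAB characters
srcdir <- "R-3.0.2/src/library/stats/R"   # hypothetical local checkout
rfiles <- list.files(srcdir, pattern = "\\.R$", full.names = TRUE)
hasTab <- vapply(rfiles,
                 function(f) any(grepl("\t", readLines(f, warn = FALSE), fixed = TRUE)),
                 logical(1))
basename(rfiles)[hasTab]
```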
[Rd] predict.glm line 28. Please explain
I imitated predict.glm, my thing worked, now I need to revise. It would help me very much if someone would explain predict.glm line 28, which says object$na.action <- NULL # kill this for predict.lm calls I want to know 1) why does it set the object$na.action to NULL 2) what does the comment after mean? Maybe I need a pass by value lesson too, because I can't see how changing that in object would have any effect on calculations done elsewhere. pj Here's the context from predict.glm, to save you trouble of looking it up: predict.glm <- function(object, newdata = NULL, type = c("link", "response", "terms"), se.fit = FALSE, dispersion = NULL, terms = NULL, na.action = na.pass, ...) { ## 1998/06/23 KH: predict.lm() now merged with the version in lm.R type <- match.arg(type) na.act <- object$na.action object$na.action <- NULL # kill this for predict.lm calls if (!se.fit) { ## No standard errors if(missing(newdata)) { pred <- switch(type, link = object$linear.predictors, response = object$fitted.values, terms = predict.lm(object, se.fit = se.fit, scale = 1, type = "terms", terms = terms) ) if(!is.null(na.act)) pred <- napredict(na.act, pred) } else { pred <- predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == "link", "response", type), terms = terms, na.action = na.action) switch(type, response = {pred <- family(object)$linkinv(pred)}, link = , terms = ) } } else { ## summary.survreg has no ... argument. if(inherits(object, "survreg")) dispersion <- 1. if(is.null(dispersion) || dispersion == 0) dispersion <- summary(object, dispersion=dispersion)$dispersion residual.scale <- as.vector(sqrt(dispersion)) pred <- predict.lm(object, newdata, se.fit, scale = residual.scale, type = ifelse(type == "link", "response", type), terms = terms, na.action = na.action) fit <- pred$fit se.fit <- pred$se.fit switch(type, response = { se.fit <- se.fit * abs(family(object)$mu.eta(fit)) fit <- family(object)$linkinv(fit) }, link = , terms = ) if( missing(newdata) && !is.null(na.act) ) { fit <- napredict(na.act, fit) se.fit <- napredict(na.act, se.fit) } pred <- list(fit = fit, se.fit = se.fit, residual.scale = residual.scale) } pred } -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
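On the pass-by-value point: assigning to a component of `object` inside the function changes only that function's copy, so the line is invisible to the caller and only matters for the internal `predict.lm` calls. As I read it, the point of nulling `na.action` is that `predict.glm` wants to apply `napredict()` itself (after transforming to the response scale), rather than having `predict.lm` pad the result with NAs first. A toy illustration of the copy semantics, using made-up data:

```
## pass-by-value in action: the modification is local to the call
f <- function(object) {
    object$na.action <- NULL      # alters only f()'s copy
    names(object)
}
dat <- mtcars
dat$mpg[1] <- NA
m <- lm(hp ~ mpg, data = dat, na.action = na.exclude)
"na.action" %in% f(m)             # FALSE inside the function
"na.action" %in% names(m)         # TRUE: the caller's object is untouched
```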
[Rd] package compositions removed from CRAN. Please explain the output?
I notice compositions was removed on CRAN, that's one I want to try out. I downloaded the last tarball, the build output doesn't look that bad. A warning, no errors. $ R CMD build compositions * checking for file ‘compositions/DESCRIPTION’ ... OK * preparing ‘compositions’: * checking DESCRIPTION meta-information ... OK * cleaning src * checking for LF line-endings in source and make files * checking for empty or unneeded directories * looking to see if a ‘data/datalist’ file should be added * building ‘compositions_1.40-0.tar.gz’ $ R CMD check --as-cran compositions_1.40-0.tar.gz * using log directory ‘/tmp/compositions.Rcheck’ * using R version 3.1.0 (2014-04-10) * using platform: x86_64-pc-linux-gnu (64-bit) * using session charset: UTF-8 * checking for file ‘compositions/DESCRIPTION’ ... OK * this is package ‘compositions’ version ‘1.40-0’ * checking CRAN incoming feasibility ... WARNING Maintainer: ‘K. Gerald van den Boogaart ’ New submission Package was archived on CRAN Insufficient package version (submitted: 1.40.0, existing: 1.40.0) CRAN repository db overrides: X-CRAN-Comment: Archived on 2014-05-01 as undefined-behaviour errors were not corrected (and previouslu archived on 2014-03-07). * checking package namespace information ... OK * checking package dependencies ... NOTE No repository set, so cyclic dependency check skipped * checking if this is a source package ... OK * checking if there is a namespace ... OK * checking for executable files ... OK * checking for hidden files and directories ... OK * checking for portable file names ... OK * checking for sufficient/correct file permissions ... OK * checking whether package ‘compositions’ can be installed ... OK * checking installed package size ... OK * checking package directory ... OK * checking DESCRIPTION meta-information ... OK * checking top-level files ... OK * checking for left-over files ... OK * checking index information ... OK * checking package subdirectories ... OK * checking R files for non-ASCII characters ... OK * checking R files for syntax errors ... OK * checking whether the package can be loaded ... OK * checking whether the package can be loaded with stated dependencies ... OK * checking whether the package can be unloaded cleanly ... OK * checking whether the namespace can be loaded with stated dependencies ... OK * checking whether the namespace can be unloaded cleanly ... OK * checking loading without being on the library search path ... OK * checking dependencies in R code ... OK * checking S3 generic/method consistency ... OK * checking replacement functions ... OK * checking foreign function calls ... OK * checking R code for possible problems ... OK * checking Rd files ... OK * checking Rd metadata ... OK * checking Rd line widths ... OK * checking Rd cross-references ... OK * checking for missing documentation entries ... OK * checking for code/documentation mismatches ... OK * checking Rd \usage sections ... OK * checking Rd contents ... OK * checking for unstated dependencies in examples ... OK * checking contents of ‘data’ directory ... OK * checking data for non-ASCII characters ... OK * checking data for ASCII and uncompressed saves ... OK * checking line endings in C/C++/Fortran sources/headers ... OK * checking line endings in Makefiles ... OK * checking compilation flags in Makevars ... OK * checking for portable use of $(BLAS_LIBS) and $(LAPACK_LIBS) ... OK * checking compiled code ... OK * checking sizes of PDF files under ‘inst/doc’ ... OK * checking installed files from ‘inst/doc’ ... 
OK * checking files in ‘vignettes’ ... OK * checking examples ... OK Examples with CPU or elapsed time > 5s user system elapsed SimulatedAmounts 18.810 0.188 19.064 rmahalanobis 8.573 0.016 8.619 replot7.078 0.012 7.113 * checking for unstated dependencies in tests ... OK * checking tests ... Running ‘AitchisonTest.R’ Running ‘CheckGeometry.R’ [38s/38s] Running ‘RobustTest.R’ Running ‘TestMissing.R’ [39s/40s] Running ‘TestSimp.R’ [12s/12s] Running ‘acheck.R’ Running ‘geostat.R’ Running ‘missings.R’ Running ‘quickcheck.R’ OK * checking PDF version of manual ... OK WARNING: There was 1 warning. NOTE: There was 1 note. See ‘/tmp/compositions.Rcheck/00check.log’ for details. -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] portableParalleSeeds Package violation, CRAN exception?
I'm writing to ask for a policy exception, or advice on how to make this package CRAN allowable. http://rweb.quant.ku.edu/kran/src/contrib/portableParallelSeeds_0.9.tar.gz Yesterday I tried to submit a package on CRAN and Dr Ripley pointed out that I had not understood the instructions about packages. Here's the part where the R check gives a Note * checking R code for possible problems ... NOTE Found the following assignments to the global environment: File ‘portableParallelSeeds/R/initPortableStreams.R’: assign("currentStream", n, envir = .GlobalEnv) assign("currentStates", curStates, envir = .GlobalEnv) assign("currentStream", 1L, envir = .GlobalEnv) assign("startStates", runSeeds, envir = .GlobalEnv) assign("currentStates", runSeeds, envir = .GlobalEnv) assign("currentStream", as.integer(currentStream), envir = .GlobalEnv) assign("startStates", runSeeds, envir = .GlobalEnv) assign("currentStates", runSeeds, envir = .GlobalEnv) Altering the user's environment requires a special arrangement with CRAN. I believe this is justified, I'll sketch the reasons now. But, mostly, I'm at your mercy and if there is any way to make this possible, I would be very grateful. To control & replace random number streams, it really is necessary to alter the workspace. That's where the random generator state is stored. It is acknowledged in Robert Gentleman' s Book, R Programming for Bionformatics "The decision to have these [random generator] functions manipulate a global variable, .Random.seed, is slightly unfortunate as it makes it somewhat more difficult to manage several different random number streams simultaneously” (Gentleman, 2009, p. 201). I have developed an understandable set of wrapper functions that handle this. Some of you may recall this project. I've asked about it here a couple of times. We allow separate streams of randoms for different purposes within a single R run. There is a framework to save 1000s of those sets in a file, so it can be used on a cluster or in a single workstation. This is handy because, when 1 run in 10,000 on the cluster exhibits some weird behavior, we can easily re-initiate that interactively and see what's going on. I have a vignette "pps" that explains. I dropped a copy of that here in case you don't want to get the package: http://pj.freefaculty.org/scraps/pps.pdf While working on that, I gained a considerably deeper understanding of random generators and seeds. That is what this vignette is about http://pj.freefaculty.org/scraps/PRNG-basics.pdf We've been running simulations on our cluster with the portableParallelSeeds framework for 2 years, we've never had any trouble. We are able to re-start runs, verify random number draws in separate streams. PJ -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
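One half of the problem can be moved out of the user's workspace without any CRAN exception: the bookkeeping variables (currentStream, currentStates, startStates) can live in an environment owned by the package instead of .GlobalEnv. The part that genuinely has to touch the workspace is .Random.seed itself, since the generators read it from there (set.seed assigns it there too). A sketch of the bookkeeping side, with hypothetical names:

```
## package-local state instead of assign(..., envir = .GlobalEnv)
.pps.state <- new.env(parent = emptyenv())

setCurrentStream <- function(n) {
    assign("currentStream", as.integer(n), envir = .pps.state)
}
getCurrentStream <- function() {
    get("currentStream", envir = .pps.state)
}

setCurrentStream(1L)
getCurrentStream()
```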
[Rd] S3 generic method dispatch on promises
Dear R friends I wanted a function to make a simple percent table that would be easy for students to use. The goal originally was to have a simple thing people would call like this pctable(rowvar, colvar, data) and the things "rowvar" and "colvar" might be names of variables in data. I wanted to avoid the usage of "with" (as we now see in the table help). Then some people wanted more features, and I agreed with the suggestion to create a formula interface that people can call like so: pctable(rowvar ~ colvar, data) I end up with a generic function pctable and methods pctable.default, pctable.formula, pctable.character. I got that working, mostly I understand what's going on. Except the following, which, actually, is a good lesson to me about promises and method dispatch in R. An S3 generic will not send a call with a promise in the first argument to pctable.default (as I had mistakenly hoped). I'll paste in all the code below, but I think you will know the answer even without running it. pctable is a generic function. In workspace, I have no objects x and y, but there are variables inside data.frame dat named x and y. Since y is not an object, the method dispatch fails thus: > pctable(y, x, dat) Error in pctable(y, x, dat) (from #3) : object 'y' not found This direct call on pctable.default works (recall y and x are promises): > pctable.default(y, x, dat) Count (column %) x y 1 2 3 4 Sum A 5(20%) 3(12%) 5(20%) 6(24%) 19 B 9(36%) 5(20%) 4(16%) 6(24%) 24 C 1(4%) 6(24%) 3(12%) 2(8%) 12 D 4(16%) 4(16%) 6(24%) 5(20%) 19 E 6(24%) 7(28%) 7(28%) 6(24%) 26 Sum 25 25 25 25 100 All the methods work fine when the first argument is a language object. This works (dispatches to pctable.formula) > pctable(y ~ x, dat) Count (column %) x y 1 2 3 4 Sum A 5(20%) 3(12%) 5(20%) 6(24%) 19 B 9(36%) 5(20%) 4(16%) 6(24%) 24 C 1(4%) 6(24%) 3(12%) 2(8%) 12 D 4(16%) 4(16%) 6(24%) 5(20%) 19 E 6(24%) 7(28%) 7(28%) 6(24%) 26 Sum 25 25 25 25 100 This works (dispatches to pctable.default) > pctable(dat$y, dat$x) Count (column %) dat$x dat$y 1 2 3 4 Sum A 5(20%) 3(12%) 5(20%) 6(24%) 19 B 9(36%) 5(20%) 4(16%) 6(24%) 24 C 1(4%) 6(24%) 3(12%) 2(8%) 12 D 4(16%) 4(16%) 6(24%) 5(20%) 19 E 6(24%) 7(28%) 7(28%) 6(24%) 26 Sum 25 25 25 25 100 However, this fails because y is not an object with a type > pctable(y, x, dat) Error in pctable(y, x, dat) (from #3) : object 'y' not found Can R be tricked to send that call to pctable.default, where it does work? Here's the code, I'm working on documentation, will put in package rockchalk eventually, but hate to leave this problem until I fully understand it. pctable <- function(rv, ...) 
{ UseMethod("pctable") } ## rv: row variable, quoted or not ## cv: column variable, quoted or not pctable.default <- function(rv, cv, data = parent.frame(), rvlab = NULL, cvlab = NULL, colpct = TRUE, rowpct = FALSE, exclude = c(NA, NaN), rounded = FALSE) { rvlabel <- if (!missing(rv)) deparse(substitute(rv)) cvlabel <- if (!missing(cv)) deparse(substitute(cv)) rvlab <- if (is.null(rvlab)) rvlabel else rvlab cvlab <- if (is.null(cvlab)) cvlabel else cvlab rvin <- eval(substitute(rv), envir = data, enclos = parent.frame()) cvin <- eval(substitute(cv), envir = data, enclos = parent.frame()) t1 <- table(rvin, cvin, dnn = c(rvlab, cvlab), exclude = exclude) rownames(t1)[is.na(rownames(t1))] <- "NA" ## symbol to letters colnames(t1)[is.na(colnames(t1))] <- "NA" if (rounded) t1 <- round(t1, -1) t2 <- addmargins(t1, c(1,2)) t1colpct <- round(100*prop.table(t1, 2), 1) t1rowpct <- round(100*prop.table(t1, 1), 1) t1colpct <- apply(t1colpct, c(1,2), function(x) gsub("NaN", "", x)) t1rowpct <- apply(t1rowpct, c(1,2), function(x) gsub("NaN", "", x)) res <- list("count" = t2, "colpct" = t1colpct, "rowpct" = t1rowpct, call = match.call()) class(res) <- "pctable" print(res, colpct = colpct, rowpct = rowpct) invisible(res) } pctable.formula <- function(formula, data = NULL, rvlab = NULL, cvlab = NULL, colpct = TRUE, rowpct = FALSE, exclude = c(NA, NaN), rounded = FALSE, ..., subset = NULL) { if (missing(data) || !is.data.frame(data)) stop("pctable requires a data frame") if (missing(formula) || (length(formula) != 3L)) stop("pctable requires a two sided formula") mt <- terms(formula, data = data) if (attr(mt, "response") == 0L) stop("response variable is required") mf <- match.call(expand.dots = FALSE) keepers <- match(c("formula", "data", "subset", "na.action"), names(mf), 0L) mf <- mf[c(1L, keepers)] mf$drop.unused.levels <- F
[Rd] "What Calls What" diagram. Flow Chart?
I don't know the right computer science words for this question, I'm afraid. Apology in advance. How do you find your way around in somebody else's code? If the user runs a specific command, and wants to know how the data is managed until the result is returned, what to do ? I've tried this manually with tools like mtrace and browser. This is a bit frustrating because the browser does not stay in effect until the work is done. It quits at the end of the function. So you have to attach the browser to the functions that are called there, and so forth. But that doesn't quite put everything together. Example. Recently I was trying to find out where the package lavaan's calculations for the function cfa are actually done and it was a maddening chase from one function to the next, as data was re-organized and options were filled in. lavaan's "cfa" function reformats some options, then the work gets done by an eval. cfa> fit <- cfa(HS.model, data=HolzingerSwineford1939) debugging in: cfa(HS.model, data = HolzingerSwineford1939) debug: { mc <- match.call() mc$model.type = as.character(mc[[1L]]) if (length(mc$model.type) == 3L) mc$model.type <- mc$model.type[3L] mc$int.ov.free = TRUE mc$int.lv.free = FALSE mc$auto.fix.first = !std.lv mc$auto.fix.single = TRUE mc$auto.var = TRUE mc$auto.cov.lv.x = TRUE mc$auto.cov.y = TRUE mc[[1L]] <- as.name("lavaan") eval(mc, parent.frame()) } The value of "mc" that gets executed by eval is this Browse[2]> mc lavaan(model.syntax = HS.model, data = HolzingerSwineford1939, model.type = "cfa", int.ov.free = TRUE, int.lv.free = FALSE, auto.fix.first = TRUE, auto.fix.single = TRUE, auto.var = TRUE, auto.cov.lv.x = TRUE, auto.cov.y = TRUE) So then I need to but a debug on "lavaan" and step through that, see what it does. Is there a way to make a list of the functions that are called "pop out", possibly with a flow chart? Consider lm, I want to know lm -> lm.fit -> .Fortran("dqrls") I'm not asking for a conceptual UML diagram, so far as I know. The kind of trace information you get with gdb in C programs and shallow steps with "n" would probably help. I would not need to keep attaching more functions with debug. pj -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] "What Calls What" diagram. Flow Chart?
On Sun, Oct 9, 2011 at 5:29 PM, wrote: > Hi Paul > > Have you tried > > mvbutils::foodweb( where=asNamespace( 'lavaan')) > > (assuming lavaan has a namespace, otherwise where='package:lavaan')? > > Sounds like it's what you're after-- > > Mark > Thanks, Mark. The foodweb graph for lavaan is a bit overwhelming. The graph shows everything it finds that might be called any time, it doesn't help me trace the path of a specific user call to a particular function. So I'm not entirely sure it is doing what I hope for. While matching the graph against the source code, it seems to me some R language idioms can confuse/break the foodweb. When eval is called on a string object, then I think function calls can escape detection. In the cfa example code I put in the original post, the function "lavaan" is called by eval, and as far as I can tell in the foodweb output, that connection is not found. I'm still studying your package, of course, but here's (I think) an example, I know "cfa" does call "lavaan" though eval, but this code library(lavaan) library(mvbutils) mvbutils::foodweb( where=asNamespace( 'lavaan')) myfw <- mvbutils::foodweb( where=asNamespace( 'lavaan')) callers.of("lavaan", myfw) > [1] "independence.model" "independence.model.fit" [3] "independence.model.fit2" "setLavaanOptions" -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
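Since foodweb's static scan cannot see calls that go through eval(), a dynamic complement is to trace every function in the namespace and watch which ones actually fire during one call. A sketch (assumes lavaan is loaded and HS.model is defined as in the lavaan examples; it is slow and chatty, but it prints the call sequence in order):

```
fns <- ls(asNamespace("lavaan"))
for (f in fns) {
    ## trace() fails on non-functions and some locked objects; skip those
    try(trace(f, tracer = bquote(cat("-> ", .(f), "\n")),
              where = asNamespace("lavaan"), print = FALSE),
        silent = TRUE)
}
fit <- cfa(HS.model, data = HolzingerSwineford1939)
for (f in fns) {
    try(untrace(f, where = asNamespace("lavaan")), silent = TRUE)
}
```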
[Rd] extra & in some lines of R KEYWORDS file
The KEYWORDS file (on R 2.13.1 anyway) , seems to have some extra "&" symbols. Look below at "&" before "character" "complex" "category" and "NA" ?intentional or mistaken? If not mistaken, what do the & mean? Basics sysdata & Basic System Variables [!= S] datasets& Datasets available by data(.) [!= S] data& Environments, Scoping, Packages [~= S] manip & Data Manipulation attribute & Data Attributes classes & Data Types (not OO) & character & Character Data ("String") Operations & complex & Complex Numbers & category & Categorical Data & NA& Missing Values [!= S] list& Lists chron & Dates and Times package & Package Summaries -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] One step way to create data frame with variable "variable names"?
Suppose plotx <- "someName" modx <- "otherName" plotxRange <- c(10,20) modxVals <- c(1,2,3) It often happens I want to create a dataframe or object with plotx or modx as the variable names. But can't understand syntax to do that. I can get this done in 2 steps, creating the data frame and then assigning names, as in newdf <- data.frame( c(1, 2, 3, 4), c(4, 4, 4, 4)) colnames(newdf) <- c(plotx, modx) I was trying to hit this in one step, but can't find how. If I could get this in one step, it would simplify some book keeping, because this current method requires me to build up the name vector to match newdf, and sometimes I make mistakes. After the data.frame already exists, I understand I can insert new variables using the variable names with syntax like newdf[[plotx]] <- c(23, 23, 23, 23) PJ -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
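For the record, two one-step forms that seem to do it (sketches): wrap data.frame() in setNames(), or build a named list first and convert it.

```
plotx <- "someName"
modx  <- "otherName"

## wrap the data.frame in setNames()
newdf <- setNames(data.frame(c(1, 2, 3, 4), c(4, 4, 4, 4)), c(plotx, modx))

## or build a named list and convert
newdf2 <- as.data.frame(setNames(list(c(1, 2, 3, 4), c(4, 4, 4, 4)),
                                 c(plotx, modx)))
str(newdf)
```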
[Rd] termplot & predict.lm. some details about calculating predicted values with "other variables set at the mean"
I'm making some functions to illustrate regressions and I have been staring at termplot and predict.lm and residuals.lm to see how this is done. I've wondered who wrote predict.lm originally, because I think it is very clever. I got interested because termplot doesn't work with interactive models: > m1 <- lm(y ~ x1*x2) > termplot(m1) Error in `[.data.frame`(mf, , i) : undefined columns selected Digging into that, I realized some surprising implications of nonlinear formulas. This issue arises when there are math functions in the regression formula. The question focuses on what we mean by the mean of "x" when we are discussing predictions and deviations. Suppose one fits: m1 <- lm (y ~ x1 + log(x2), data=dat) I had thought the partial residual was calculated with reference to the log of the mean of x2. But that's not right. It is calculated with reference to mean(log(x2)). That seems misleading, termplot shows a graph illustrating the effect of x2 on the horizontal axis (not "log(x2)"). I should not say misleading. Rather, it is unexpected. I think users who want the reference value in the plot of x2 to be the mean of x2 have a legitimate concern here. With a more elaborate formula, the mismatch gets more confusing. Suppose the regression formula is m2 <- lm (y ~ x1 + poly(x2,3), data=dat) The model frame has these variables: yx1 poly(x2, 3).1 poly(x2, 3).2 poly(x2, 3).3 and the partial residual calculation for variable x1, which I had expected would be based on a polynomial transformation of mean(x2), is the weighted sum of the means of the 3 polys. Can you help me see this more clearly? (Or less wrongly?) Perhaps you think I don't understand partial residuals in termplot, but I am pretty sure I do. I made notes about it. See slides 54 and 55 in here: http://pj.freefaculty.org/guides/Rcourse/regression-tableAndPlot-1/regression-tableAndPlot.pdf -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
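The distinction in question is easy to see numerically; with any skewed x2 the two candidate reference values differ noticeably (a sketch with made-up data):

```
set.seed(1)
x2 <- rexp(100, rate = 0.1)
log(mean(x2))   # log of the mean of x2: the reference value I expected
mean(log(x2))   # mean of the model-frame column log(x2): the one actually used
```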
[Rd] returning information from functions via attributes rather than return list
I would like to ask for advice from R experts about the benefits or dangers of using attr to return information with an object that is returned from a function. I have a feeling as though I have cheated by using attributes, and wonder if I've done something fishy. Maybe I mean to ask, where is the dividing line between attributes and instance variables? The separation is not clear in my mind anymore. Background: I paste below a function that takes in a regression object and make changes to the data and/or call and then run a revised regression. In my earlier effort, I was building a return list, including the new fitted regression object plus some variables that have information about the changes that a were made. That creates some inconvenience, however. When the regression is in a list object, then methods for lm objects don't apply to that result object. The return is not an lm anymore. I either need to write custom methods for every function or remember to extract the object from the list before sending to the generic function. I *guessed* it would work to write the new information as object attributes, and it seems to work. There is a generic function "meanCenter" and a method "meanCenter.default". At the end of meanCenter.default, here's my use (or abuse) of attributes. res <- eval(mc) class(res) <- c("mcreg", class(model)) attr(res, "centeredVars") <- nc attr(res, "centerCall") <- match.call() res I wrote print and summary methods, but other methods that work for lm objects like plot will also work for these new ones. meanCenter <- function(model, centerOnlyInteractors=TRUE, centerDV=FALSE, standardize=FALSE, centerContrasts = F){ UseMethod("meanCenter") } meanCenter.default <- function(model, centerOnlyInteractors=TRUE, centerDV=FALSE, standardize=FALSE, centerContrasts = F){ std <- function(x) { if( !is.numeric(x) ){ stop("center.lm tried to center a factor variable. No Can Do!") } else { scale(x, center = TRUE, scale = standardize) } } rdf <- get_all_vars(formula(model), model$model) #raw data frame t <- terms(model) tl <- attr(t, "term.labels") tmdc <- attr(t, "dataClasses") ##term model data classes isNumeric <- names(tmdc)[ which(tmdc %in% c("numeric"))] isFac <- names(tmdc)[ which(tmdc %in% c("factor"))] if (tmdc[1] != "numeric") stop("Sorry, DV not a single numeric column") ##Build "nc", a vector of variable names that "need centering" ## if (!centerDV) { if (centerOnlyInteractors == FALSE){ nc <- isNumeric[-1] #-1 excludes response unique(nc) }else{ interactTerms <- tl[grep(":", tl)] nc <- unique(unlist(strsplit( interactTerms, ":"))) nc <- nc[which(nc %in% isNumeric)] } }else{ if (centerOnlyInteractors == FALSE){ nc <- isNumeric }else{ interactTerms <- tl[grep(":", tl)] nc <- unique(unlist(strsplit( interactTerms, ":"))) nc <- nc[which(nc %in% isNumeric)] nc <- c( names(tmdc)[1] , nc) } } mc <- model$call # run same model call, replacing non centered data with centered data. 
## if no need to center factor contrasts: if (!centerContrasts) { stddat <- rdf for (i in nc) stddat[ , i] <- std( stddat[, i]) mc$data <- quote(stddat) }else{ ##dm: design matrix, only includes intercept and predictors dm <- model.matrix(model, data=rdf, contrasts.arg = model$contrasts, xlev = model$xlevels) ##contrastIdx: indexes of contrast variables in dm contrastIdx <- which(attr(dm, "assign")== match(isFac, tl)) contrastVars <- colnames(dm)[contrastIdx] nc <- c(nc, contrastVars) dm <- as.data.frame(dm) hasIntercept <- attr(t, "intercept") if (hasIntercept) dm <- dm[ , -1] # removes intercept, column 1 dv <- rdf[ ,names(tmdc)[1]] #tmdc[1] is response variable name dm <- cbind(dv, dm) colnames(dm)[1] <- names(tmdc)[1] #put colname for dv dmnames <- colnames(dm) hasColon <- dmnames[grep(":", dmnames)] dm <- dm[ , -match(hasColon, dmnames)] ##remove vars with colons (lm will recreate) ##Now, standardise the variables that need standardizing for (i in nc) dm[ , i] <- std( dm[, i]) fmla <- formula(paste(dmnames[1], " ~ ", paste(dmnames[-1], collapse=" + "))) cat("This fitted model will use those centered variables\n") cat("Model-constructed interactions such as \"x1:x3\" are built from centered variables\n") mc$formula <- formula(fmla) mc$data <- quote(dm) } cat("These variables", nc, "Are centered in the design matrix \n") res <- eval(mc) class(res) <- c("mcreg", class(model)) attr(res, "centeredVars") <- nc attr(res, "centerCall") <- match.call() res } summary.mcreg <- function(object, ...){ nc <- attr(object, "centeredVars") cat("The centered variables were: \n") print(nc) cat("Even though the variables here have the same names as their non-centered
Re: [Rd] returning information from functions via attributes rather than return list
On Tue, Jan 3, 2012 at 3:59 PM, Simon Urbanek wrote: > Paul, > > On Jan 3, 2012, at 3:08 PM, Paul Johnson wrote: > >> I would like to ask for advice from R experts about the benefits or >> dangers of using attr to return information with an object that is >> returned from a function. I have a feeling as though I have cheated by >> using attributes, and wonder if I've done something fishy. >> >> Maybe I mean to ask, where is the dividing line between attributes and >> instance variables? The separation is not clear in my mind anymore. >> >> Background: I paste below a function that takes in a regression object >> and make changes to the data and/or call and then run a >> revised regression. In my earlier effort, I was building a return >> list, including the new fitted regression object plus some >> variables that have information about the changes that a were made. >> >> That creates some inconvenience, however. When the regression is in a >> list object, then methods for lm objects don't apply to that result >> object. The return is not an lm anymore. > > Why don't you just subclass it? That's the "normal" way of doing things - you > simply add additional entries for your subclass (e.g. m$myItem1, m$myItem2, > ...), prepend your new subclass name and you're done. You can still dispatch > on your subclass before the superclass while superclass methods just work as > well.. > > Cheers, > Simon > Yes. I see that now. But I'm still wondering about the attribute versus variable question. To the programmer, is there any difference between returning information as attributes or variables? Does R care if I do this: class(res) <- c("mcreg", "lm") attr(res, "centeredVars") <- nc Or this: class(res) <- c("mcreg", "lm") res$centeredVars <- nc Is there some place "down there," in the C foundations of R, where an attribute is just a variable, as is centeredVars? pj -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
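At the R level the practical difference shows up in a small example: a list element appears in names() and length() and survives ordinary subsetting, while an attribute rides along outside the element list (class is itself an attribute) and is dropped by many operations, such as [ on the list. As far as I understand the internals, both are stored on the same underlying object, elements in the vector's slots and attributes on its attribute list, so neither is more of a "variable" than the other. A sketch:

```
res <- list(coef = 1:3)
res$centeredVars <- c("x1", "x2")        # list element
attr(res, "centerCall") <- quote(f(x))   # attribute

names(res)                 # "coef" "centeredVars"
names(attributes(res))     # "names" "centerCall"
names(attributes(res[1]))  # just "names": [ dropped the custom attribute
```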
[Rd] delete.response leaves response in attribute dataClasses
I posted this one as an R bug (https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14767), but Prof. Ripley says I'm premature, and I should raise the question here. Here's the behavior I assert is a bug: The output from delete.response on a terms object alters the formula by removing the dependent variable. It removes the response from the "variables" attribute and it changes the response attribute from 1 to 0. The response is removed from "predvars" But it leaves the name of the dependent variable first in the in "dataClasses". It caused an unexpected behavior in my code, so (as usual) the bug may be mine, but in my heart, I believe it belongs to delete.response. To illustrate, here's a terms object from a regression. > tt y ~ x1 * x2 + x3 + x4 attr(,"variables") list(y, x1, x2, x3, x4) attr(,"factors") x1 x2 x3 x4 x1:x2 y 0 0 0 0 0 x1 1 0 0 0 1 x2 0 1 0 0 1 x3 0 0 1 0 0 x4 0 0 0 1 0 attr(,"term.labels") [1] "x1""x2""x3""x4""x1:x2" attr(,"order") [1] 1 1 1 1 2 attr(,"intercept") [1] 1 attr(,"response") [1] 1 attr(,".Environment") attr(,"predvars") list(y, x1, x2, x3, x4) attr(,"dataClasses") yx1x2x3x4 "numeric" "numeric" "numeric" "numeric" "numeric" Now observe that delete.response removes the response from all attributes except dataClasses. > delete.response(tt) ~x1 * x2 + x3 + x4 attr(,"variables") list(x1, x2, x3, x4) attr(,"factors") x1 x2 x3 x4 x1:x2 x1 1 0 0 0 1 x2 0 1 0 0 1 x3 0 0 1 0 0 x4 0 0 0 1 0 attr(,"term.labels") [1] "x1""x2""x3""x4""x1:x2" attr(,"order") [1] 1 1 1 1 2 attr(,"intercept") [1] 1 attr(,"response") [1] 0 attr(,".Environment") attr(,"predvars") list(x1, x2, x3, x4) attr(,"dataClasses") yx1x2x3x4 "numeric" "numeric" "numeric" "numeric" "numeric" pj -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
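A workaround in the meantime, for anyone else who hits this (a sketch built on the tt object shown above): keep the response index from the original terms object and drop that entry from dataClasses by hand after delete.response().

```
resp <- attr(tt, "response")        # position of the response among the variables
ttr  <- delete.response(tt)
attr(ttr, "dataClasses") <- attr(tt, "dataClasses")[-resp]
attr(ttr, "dataClasses")            # "y" is gone; x1..x4 remain
```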
Re: [Rd] informal conventions/checklist for new predictive modeling packages
I agree with almost all, except the last point. Since I have participated in wheel-reinvention lately, I agree with the bulk of your comment. I don't think the fix is as easy as you suspect, RSiteSearch won't help me find a function I need when I don't know the magic words. Some R functions have such unexpected names that only a fastidious source-code reader would find them ("pretty", for example). But I agree with your concern. But, as far as the last one is concerned, I think you are mistaken. Explanation below. On Wed, Jan 4, 2012 at 8:19 AM, Max Kuhn wrote: > > (14) [OCD] For binary classification models, model the probability of > the first level of a factor as the event of interest (again, for > consistency) Note that glm() does not do this but most others use the > first level. > When the DV is thought of as 0 and 1, and 1 is an "event" "success" or "win" and 0 is a "non event" "failure" or "loss", if there is to be a single predicted probability, I want it to be the probability of the higher outcome. glm is doing the thing I want, and I don't know of others that go the other way, except PROC LOGISTIC in SAS. And that has a long history of causing confusion and despair. I'd like to consider adding one thing to your list, though. I have wished (in this list and elsewhere) that there were a more regular approach for calculating "newdata" objects that are used in predict. Many packages have re-invented this (datadist in rms, effects), and almost nobody here agreed with my wish for a more standard approach. But if there were a standard approach, it would be much easier to hold up R as an alternative to Stata when users pop up with "marginal effects tables" from Stata that are very difficult to reproduce with R. Regards, pj -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
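For reference, what glm() itself does with a factor response: the first level is the reference ("failure") category and the fitted probabilities are for the second level, which is the behavior point (14) would invert. A quick check with made-up data:

```
y <- factor(c("No", "Yes", "No", "Yes", "Yes"), levels = c("No", "Yes"))
x <- 1:5
fit <- glm(y ~ x, family = binomial)
fitted(fit)   # P(y == "Yes"): probabilities of the *second* level
```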
Re: [Rd] delete.response leaves response in attribute dataClasses
Thanks, Bill Counter-arguments at the end On Thu, Jan 5, 2012 at 3:15 PM, William Dunlap wrote: > My feeling that everyone would index dataClasses by name was > wrong. I looked through the packages that used dataClasses > and saw code that would break if the first (response) entry > were omitted. (I didn't check to see if passing the output > of delete.response to these functions would be appropriate.) > E.g., > file: AICcmodavg/R/predictSE.mer.r > ##matrix with info on factors > fact.frame <- attr(attr(orig.frame, "terms"), "dataClasses")[-1] > > ##continue if factors > if(any(fact.frame == "factor")) { > id.factors <- which(fact.frame == "factor") > fact.name <- names(fact.frame)[id.factors] #identify the rows for factors > > Some packages create a dataClass attribute for a model.frame > (not its terms attribute) that does not have any names: > file: caper/R/macrocaic.R > attr(mf, "dataClasses") <- rep("numeric", dim(termFactors)[2]) > .checkMFClasses() does not throw an error for that, but it > doesn't do any real checking either. > > Most users of dataClasses do pass it to .checkMFClasses() to > compare it with newdata and that doesn't care if you have extra > entries in dataClasses. > > Bill Dunlap > Spotfire, TIBCO Software > wdunlap tibco.com > I can't understand what your point is. I agree we can work around the problem, but why should we have to? If you confine yourself to the output of "delete.response" applied to a terms object from a regression, can you point to any package or usage that depends on leaving the response variable in the dataClasses attribute? I can't find one. In R base, these are all the references to delete.response: stats/R/models.R:delete.response <- function (termobj) stats/R/lm.R:Terms <- delete.response(tt) stats/R/lm.R:Terms <- delete.response(tt) stats/R/ppr.R:Terms <- delete.response(object$terms) stats/R/loess.R: as.matrix(model.frame(delete.response(terms(object)), newdata, stats/R/dummy.coef.R:Terms <- delete.response(Terms) I've looked it over carefully and predict.lm (in lm.R) would not be affected by the change I propose. I can't find any usage in loess.R of the dataClasses attribute. Furthermore, I can't see how a person would use the dataClasses attribute at all, after the other markers of the response are eliminated. How is a method to find which variable is the response, after response=0? I'm not disagreeing with you that I can workaround the peculiarity that the response is left in the dataClasses attribute of the output object from delete.response. I'm just saying it is a complication that programmers should not have to put up with, because I think delete.response should delete the response from all attributes of a terms object. pj -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Example of "task seeds" with R parallel. Critique?
Greetings: In R parallel's vignette, there is a comment "It would however take only slightly more work to allocate a stream to each task." (p.6). I've written down a working example that can allocate not just one, but several separate seeds for each task. (We have just a few project here that need multiple streams). I would like to help work that up for inclusion in the parallel package, if possible. This approach is not original. It combines the idea behind snowFT and ideas for setting and monitoring seeds to replicate random streams in John Chambers Software for Data Analysis, Section 6.10. I'm able to save a block of seeds in a file, run a simulation for each one, and then re-run any particular task and match the random numbers. But I realize there are a lot of dangers I am not yet aware of, so I'm asking you what can go wrong? I see danger in conflicts between my effort to manage seeds and the work of parallel functions that try to manage seeds for me. That's why I wish I could integrate task-based seeds into parallel itself. RNGkind("L'Ecuyer-CMRG") set.seed(23456) library(parallel) ## nrep = number of repetitions (or tasks) ## streamsPerRep = number of streams needed by each repetition nReps <- 2000 streamsPerRep <- 2 ## projSeeds=list of lists of stream seeds projSeeds <- vector(mode="list", nReps) for (i in 1:nReps) projSeeds[[i]] <- vector(mode="list", streamsPerRep) runif(1) ##establishes .Random.seed ##Grab first seed s <- .Random.seed origSeed <- s x <- rnorm(4) ##will compare later x for (i in 1:nReps) { for (j in 1:streamsPerRep){ projSeeds[[i]][[j]] <- s s <- nextRNGStream(s) } } save(projSeeds, file="projSeeds.rda") rm(projSeeds) load("projSeeds.rda") ##Note that origSeed does match projSeeds origSeed projSeeds[[1]][[1]] ## And we get same random draws from project 1, stream 1 .Random.seed <- projSeeds[[1]][[1]] rnorm(4) x ##Another way (preferred?) 
to reset stream assign(".Random.seed", projSeeds[[1]][[1]], envir = .GlobalEnv) rnorm(4) ## Now, how to make this more systematic ## Each task needs streamsPerRep seeds ## startSeeds = for each stream, a starting seed ## currentSeeds = for each stream, a seed recording stream's current position ## currentStream = integer indicator of which stream is in use ## Test that interactively currentStream <- 1 currentSeeds <- startSeeds <- projSeeds[[1]] .Random.seed <- startSeeds[[currentStream]] useStream <- function(n = NULL, origin = FALSE){ if (n > length(currentSeeds)) stop("requested stream does not exist") currentSeeds[[currentStream]] <- .Random.seed if (origin) assign(".Random.seed", startSeeds[[n]], envir = .GlobalEnv) else assign(".Random.seed", currentSeeds[[n]], envir = .GlobalEnv) currentStream <<- n } useStream(n=1, origin=TRUE) rnorm(4) currentStream useStream(n=2, origin=TRUE) rnorm(4) currentStream ## Now, make that work in a clustered environment cl <- makeCluster(9, "MPI") ## run on worker, so can retrieve seeds for particular run initSeeds <- function(p = NULL){ currentStream <<- 1 projSeeds[[p]] } clusterEvalQ(cl, { RNGkind("L'Ecuyer-CMRG") }) clusterExport(cl, c("projSeeds", "useStream", "initSeeds")) someHorrendousFunction <- function(run, parm){ currentStream <- 1 currentSeeds <- startSeeds <- initSeeds(run) assign(".Random.seed", startSeeds[[currentStream]], envir = .GlobalEnv) ##then some gigantic, long lasting computation occurs dat <- data.frame(x1 = rnorm(parm$N), x2 = rnorm(parm$N), y = rnorm(parm$N)) m1 <- lm(y ~ x1 + x2, data=dat) list(m1, summary(m1), model.matrix(m1)) } whatever <- list("N" = 999) res <- parLapply(cl, 1:nReps, someHorrendousFunction, parm = whatever) res[[77]] ##Prove I can repeat 77'th task res77 <- someHorrendousFunction(77, parm = whatever) ## well, that worked. stopCluster(cl) -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
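If I read useStream() above correctly, the line `currentSeeds[[currentStream]] <- .Random.seed` changes only the function's local copy of currentSeeds, so the stream being left would forget its position once the call returns. A possible refinement (a sketch, assuming currentSeeds, startSeeds, and currentStream live in the global environment as in the example) uses superassignment to keep the bookkeeping:

```
useStream <- function(n = NULL, origin = FALSE) {
    if (n > length(currentSeeds)) stop("requested stream does not exist")
    ## remember where the stream we are leaving stopped
    currentSeeds[[currentStream]] <<- .Random.seed
    if (origin) {
        assign(".Random.seed", startSeeds[[n]], envir = .GlobalEnv)
    } else {
        assign(".Random.seed", currentSeeds[[n]], envir = .GlobalEnv)
    }
    currentStream <<- n
}
```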
[Rd] Last Call: Who says this is not a bug in delete.response()?
On January 5, I posted here concerning this issue (https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14767), and I have not heard any answers from people who think this is not a bug, or a simple accident in the delete.response function. The output from delete.response on a terms object alters the formula by removing the dependent variable. It removes the response from the "variables" attribute and it changes the response attribute from 1 to 0. The response is removed from "predvars" But it leaves the name of the dependent variable first in the in "dataClasses" attribute. I believe delete.response should be fixed, and if I'm wrong, I wish one of you would tell me. Otherwise, I will ask Prof. Ripley to re-open the bug report. This is an important API question, since functions that use regression models (in my opinion) ought to be able to count on delete.response to leave behind a dataClasses attribute that no longer includes the response. If we could count on that, then, some methods that work with fitted regressions could work more easily. For example, the work in termplot with variables "carrier" and "carrier.terms" would be unnecessary, since the column names of "dataClasses" would be, exactly, the carrier.name vector and the data type information of those variables would eliminate a lot of fancy footwork one has to do in order to calculate predicted values. -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] need gui matrix editor: does R Core team have advice on how?
Dear R-devel: Would R core team consider endorsing a graphical toolkit and trying to facilitate development of little GUI tools? I need a gui matrix editor for users that want to be able to write matrices that are later used to simulate data. Instead of teaching them to write a covariance matrix (for example, for mvtnorm), I want to tell them run a function that pops up a table they can fill in. The users need to be able to enter variable names in the matrix, so something that would accept a 0 0 0 b 0 c d e would be great. Something that would accept formulae like this would be even more great. a 0 0 0 b a^2 c d e I want this gui matrix editor to "just work" on Windows, Mac, Linux. I don't mind building this on top of some widget set, but it is important that the widget set be easily installable on the user end. That's why I wish R core would offer us some guidance or advice. I'm not a programmer, but I can learn to program with a library, as long as it is not a waste of time. I've been searching R archives and here's what I found so far. 1. tcl/tk Building on R tcltk2, for people that have the Tcl addon widget TkTable installed, there are several packages. in tcltk2 itself, there is a function tk2edit, and there are others that try to embellish. I've tried several of these, they seem to be not-quite done yet, one can't copy a rectangle, for example. But maybe I could learn how to fix them up and make yet another tktable based editor. Problem: requires user to have enough understanding to install the Tcl widget TkTable. And, for platforms like Windows, user has to install tcl/tk itself. On Linux and Mac, that is not as big of a hurdle, so far as I can tell. On Debian linux, I found that in a package libtktable that works, but I have no idea how tough that would be on other linux systems or Macintosh or Windows. Another problem is that tcl/tk editions change rapidly, and on several of our systems, we still don't have access to tcl/tk 8.5 (RedHat/Centos based clusters still running version 5 are like that). 2. Gtk Building on Rgtk, I found the package RGtk2Extras. This of course requires a function gtk tool chain, which used to be a big problem on the various platforms. But the function dfedit appears to be almost exactly what I'm looking for. I can create a character matrix and put in letters how I want, but I later face the problem of how to evaluate the matrix. Problem: even more than tcl/tk, GTK versions change and become incompatible, especially across platforms. What about QT or WX-Widgets. I gather RkWard is built on KDE, and hence Qt. I don't find matrix editors using those languages, but I don't know why not. Maybe this is impossible for R core to advise us because you may disagree on which widget library is best, but if there is some consensus, I would be glad to know because I would do whatever you recommend. I'm pretty sure that, if you said, "use library X, version XYZ", then the worldwide usage of R is sufficiently broad that people would step forward and help make sure those libraries are packaged for all OS. I mean, if there were no tktable package for any linux for which I make packages (RedHat, Fedora, Debian, Ubuntu), I would create the packages. But I'm not doing it now because I have no reason to believe that is a good path from here on out. -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
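Separate from the choice of toolkit, whatever editor is used will hand back a character matrix, and the "a", "b", "a^2" entries still have to be turned into numbers. That part needs no GUI at all; a sketch, assuming the user supplies values for the symbols (all names here are hypothetical):

```
chmat <- matrix(c("a", "0", "0",
                  "0", "b", "a^2",
                  "c", "d", "e"),
                nrow = 3, byrow = TRUE)
vals <- list(a = 1.5, b = 2, c = 0.3, d = 0.4, e = 1)

## evaluate each cell in the context of the supplied values
nummat <- apply(chmat, c(1, 2), function(s) eval(parse(text = s), envir = vals))
nummat
```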
Re: [Rd] tcltk GUIs (was need gui matrix editor: does R Core team have advice on how?)
On Sun, Jan 29, 2012 at 6:10 AM, Prof Brian Ripley wrote: > On 28/01/2012 22:04, John Fox wrote: >> >> Dear Paul and Gabor, >> >> The Rcmdr GUI uses the tcltk package, so I have some experience with >> providing an R tcltk-based GUI for various platforms. >> >> As Gabor says, everything works very smoothly on Windows because the R >> Windows binary includes Tcl/Tk. > > > Maybe, but getting it there was very far from smooth. Tcl/Tk compiled under > the compilers we used, but the resulting DLLs crashed R. No one has ever > found the cause and I used the system SDK (essentiallly a version of VC++) > to build them. And that puts us in a bind since the current system SDKs > generate code depending on DLLs that are not part of the minimal OS versions > we support (e.g. Windows XP and Server 2003, and the machine used to build > was retired 2 years ago). > Thanks, this is clearing things up. I believe these comments mean that, at the current time, tcl/tk is as close as there is to an officially endorsed graphical toolkit. As I search more, I find many other community contributors (besides Prof. Fox) using tcl/tk (Sciviews). So I should learn how to work with that. Prof Ripley's comment makes me think the endorsement is not entirely enthusiastic, though. If there were a change to emphasize Gtk2, I don't think I would be disappointed. I've been testing table-making examples today. On Debian Linux, I am having more luck with the Gtk2 based packages. dfedit in RGtk2Extras "just works" for me. Example: > library(RGtk2Extras) > mat <- matrix(rnorm(100), 10, 10) > dfedit(mat) That edits the R object mat as expected. On the other hand, I don't have success with the tk2edit from tcltk2, even with the example in the help page: > library(tcltk2) Loading required package: tcltk Loading Tcl/Tk interface ... done > ?tk2edit > data(iris) > tk2edit(iris) Error in matrix("", nrow = nrow(tA), ncol = ncol(tA)) : non-numeric matrix extent I've fiddled with this quite a bit, I believe there's some little mismatch between this particular system's tcl/tk libraries and the ones that tcltk2 is expecting. Packaging of tcl/tk has caused lots of trouble with Swarm simulations that we run, maybe that's breaking tktable usage too. I'm going to look into that some more. I think the idea behind gWidgetstcltk is great, it aims to create R functions that can use either Gtk2 or tclk. But the implementation is a big hassle, it seems to me. It inherits all of the management troubles of both tcltk and Gtk2. For example. > library(gWidgetstcltk) > mat <- matrix(rnorm(100), 10 , 10) > gdf(mat) Select a GUI toolkit 1: gWidgetsRGtk2 2: gWidgetstcltk Selection: 2 guiWidget of type: NULL for toolkit: guiWidgetsToolkittcltk Warning message: In .gdf(toolkit, items = items, name = name, do.subset = do.subset, : Container is not correct. No NULL containers possible When I run the example at the end of the help from ?gdf in gWidgetstcltk, I get this (even before trying to use the table at all). >obj[,] <- head(mtcars) ## replace df Error in `.leftBracket<-`(`*tmp*`, toolkit, ..., value = list(mpg = c(21, : Value has different number of rows than the replacement area If I make the other selection, opting for Gtk2, I don't get an error, but nothing happens--no table pops up either. 
> library(gWidgetstcltk) > mat <- matrix(100, 10, 10) > gdf(mat) Select a GUI toolkit 1: gWidgetsRGtk2 2: gWidgetstcltk Selection: 1 Loading required package: gWidgetsRGtk2 guiWidget of type: gGridRGtk for toolkit: guiWidgetsToolkitRGtk2 If I had not seen the Gtk2 table work well with RGtk2Extras, I'd have no faith at all. In conclusion, what am I supposed to work on? If tcl/tk is likely to stay in the R for Windows package, then we can work on streamlining the Macintosh and Windows instructions for tcltk maintenance, then I see my mission now is to make TkTable based widgets work. Right? Something Prof. Grothendieck said made me curious. One can package the TkTable library with an R package? Why is it not already included in packages like tcltk2 or gWidgetstcltk? Debian package libtktable2.9 installs these files: /usr/lib/Tktable2.9/libTktable2.9.so /usr/lib/Tktable2.9/pkgIndex.tcl /usr/lib/Tktable2.9/tkTable.tcl So TkTable requries not just the tcl bit, but a shared library. Is that a substantial roadblock to R packaging of TkTable? (I've never tried it). pj -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] portable parallel seeds project: request for critiques
I've got another edition of my simulation replication framework. I'm attaching 2 R files and pasting in the readme. I would especially like to know if I'm doing anything that breaks .Random.seed or other things that R's parallel uses in the environment. In case you don't want to wrestle with attachments, the same files are online in our SVN http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/ ## Paul E. Johnson CRMDA ## Portable Parallel Seeds Project. ## 2012-02-18 Portable Parallel Seeds Project This is how I'm going to recommend we work with random number seeds in simulations. It enhances work that requires runs with random numbers, whether runs are in a cluster computing environment or in a single workstation. It is a solution for two separate problems. Problem 1. I scripted up 1000 R runs and need high quality, unique, replicable random streams for each one. Each simulation runs separately, but I need to be confident their streams are not correlated or overlapping. For replication, I need to be able to select any run, say 667, and restart it exactly as it was. Problem 2. I've written a Parallel MPI (Message Passing Interface) routine that launches 1000 runs and I need to assure each has a unique, replicatable, random stream. I need to be able to select any run, say 667, and restart it exactly as it was. This project develops one approach to create replicable simulations. It blends ideas about seed management from John M. Chambers Software for Data Analysis (2008) with ideas from the snowFT package by Hana Sevcikova and Tony R. Rossini. Here's my proposal. 1. Run a preliminary program to generate an array of seeds run1: seed1.1 seed1.2 seed1.3 run2: seed2.1 seed2.2 seed2.3 run3: seed3.1 seed3.2 seed3.3 ... ... ... run1000 seed1000.1 seed1000.2 seed1000.3 This example provides 3 separate streams of random numbers within each run. Because we will use the L'Ecuyer "many separate streams" approach, we are confident that there is no correlation or overlap between any of the runs. The projSeeds has to have one row per project, but it is not a huge file. I created seeds for 2000 runs of a project that requires 2 seeds per run. The saved size of the file 104443kb, which is very small. By comparison, a 1400x1050 jpg image would usually be twice that size. If you save 10,000 runs-worth of seeds, the size rises to 521,993kb, still pretty small. Because the seeds are saved in a file, we are sure each run can be replicated. We just have to teach each program how to use the seeds. That is step two. 2. Inside each run, an initialization function runs that loads the seeds file and takes the row of seeds that it needs. As the simulation progresses, the user can ask for random numbers from the separate streams. When we need random draws from a particular stream, we set the variable "currentStream" with the function useStream(). The function initSeedStreams creates several objects in the global environment. It sets the integer currentStream, as well as two list objects, startSeeds and currentSeeds. At the outset of the run, startSeeds and currentSeeds are the same thing. When we change the currentStream to a different stream, the currentSeeds vector is updated to remember where that stream was when we stopped drawing numbers from it. Now, for the proof of concept. A working example. Step 1. Create the Seeds. Review the R program seedCreator.R That creates the file "projSeeds.rda". Step 2. Use one row of seeds per run. 
Please review "controlledSeeds.R" to see an example usage that I've tested on a cluster. "controlledSeeds.R" can also be run on a single workstation for testing purposes. There is a variable "runningInMPI" which determines whether the code is supposed to run on the RMPI cluster or just in a single workstation. The code for each run of the model begins by loading the required libraries and loading the seed file, if it exists, or generating a new "projSeed" object if it is not found. library(parallel) RNGkind("L'Ecuyer-CMRG") set.seed(234234) if (file.exists("projSeeds.rda")) { load("projSeeds.rda") } else { source("seedCreator.R") } ## Suppose the "run" number is: run <- 232 initSeedStreams(run) After that, R's random generator functions will draw values from the first random random stream that was initialized in projSeeds. When each repetition (run) occurs, R looks up the right seed for that run, and uses it. If the user wants to begin drawing observations from the second random stream, this command is used: useStream(2) If the user has drawn values from stream 1 already, but wishes to begin again at the initial point in that stream, use this command useStream(1, origin = TRUE) Question: Why is this approach better for parallel runs? Answer: After a batch of simulations, we can re-start any one of them and repeat it exactly. This builds on the idea of the snowFT package, by Hana Sevcikova and A.J
Re: [Rd] portable parallel seeds project: request for critiques
On Fri, Feb 17, 2012 at 3:23 PM, Paul Gilbert wrote: > Paul > > I think (perhaps incorrectly) of the general problem being that one wants to > run a random experiment, on a single node, or two nodes, or ten nodes, or > any number of nodes, and reliably be able to reproduce the experiment > without concern about how many nodes it runs on when you re-run it. > > From your description I don't have the impression your solution would do > that. Am I misunderstanding? > Well, I think my approach does that! Each time a function runs, it grabs a pre-specified set of seed values and initializes the R .Random.seed appropriately. Since I take the pre-specified seeds from the L'Ecuyer et al approach (cite below), I believe that means each separate stream is dependably uncorrelated and non-overlapping, both within a particular run and across runs. > A second problem is that you want to use a proven algorithm for generating > the numbers. This is implicitly solved by the above, because you always get > the same result as you do on one node with a well proven RNG. If you > generate a string of seed and then numbers from those, do you have a proven > RNG? > Luckily, I think that part was solved by people other than me: L'Ecuyer, P., Simard, R., Chen, E. J. and Kelton, W. D. (2002) An object-oriented random-number package with many long streams and substreams. Operations Research 50 1073–5. http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf > Paul > > > On 12-02-17 03:57 PM, Paul Johnson wrote: >> >> I've got another edition of my simulation replication framework. I'm >> attaching 2 R files and pasting in the readme. >> >> I would especially like to know if I'm doing anything that breaks >> .Random.seed or other things that R's parallel uses in the >> environment. >> >> In case you don't want to wrestle with attachments, the same files are >> online in our SVN >> >> >> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/ >> >> >> ## Paul E. Johnson CRMDA >> ## Portable Parallel Seeds Project. >> ## 2012-02-18 >> >> Portable Parallel Seeds Project >> >> This is how I'm going to recommend we work with random number seeds in >> simulations. It enhances work that requires runs with random numbers, >> whether runs are in a cluster computing environment or in a single >> workstation. >> >> It is a solution for two separate problems. >> >> Problem 1. I scripted up 1000 R runs and need high quality, >> unique, replicable random streams for each one. Each simulation >> runs separately, but I need to be confident their streams are >> not correlated or overlapping. For replication, I need to be able to >> select any run, say 667, and restart it exactly as it was. >> >> Problem 2. I've written a Parallel MPI (Message Passing Interface) >> routine that launches 1000 runs and I need to assure each has >> a unique, replicatable, random stream. I need to be able to >> select any run, say 667, and restart it exactly as it was. >> >> This project develops one approach to create replicable simulations. >> It blends ideas about seed management from John M. Chambers >> Software for Data Analysis (2008) with ideas from the snowFT >> package by Hana Sevcikova and Tony R. Rossini. >> >> >> Here's my proposal. >> >> 1. Run a preliminary program to generate an array of seeds >> >> run1: seed1.1 seed1.2 seed1.3 >> run2: seed2.1 seed2.2 seed2.3 >> run3: seed3.1 seed3.2 seed3.3 >> ... ... ... 
>> run1000 seed1000.1 seed1000.2 seed1000.3 >> >> This example provides 3 separate streams of random numbers within each >> run. Because we will use the L'Ecuyer "many separate streams" >> approach, we are confident that there is no correlation or overlap >> between any of the runs. >> >> The projSeeds has to have one row per project, but it is not a huge >> file. I created seeds for 2000 runs of a project that requires 2 seeds >> per run. The saved size of the file 104443kb, which is very small. By >> comparison, a 1400x1050 jpg image would usually be twice that size. >> If you save 10,000 runs-worth of seeds, the size rises to 521,993kb, >> still pretty small. >> >> Because the seeds are saved in a file, we are sure each >> run can be replicated. We just have to teach each program >> how to use the seeds. That is step two. >> >> >> 2. Inside each run, an initialization function runs that loads the >> seeds file and t
Re: [Rd] portable parallel seeds project: request for critiques
On Fri, Feb 17, 2012 at 5:06 PM, Petr Savicky wrote: > On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote: > Hi. > > Some of the random number generators allow as a seed a vector, > not only a single number. This can simplify generating the seeds. > There can be one seed for each of the 1000 runs and then, > the rows of the seed matrix can be > > c(seed1, 1), c(seed1, 2), ... > c(seed2, 1), c(seed2, 2), ... > c(seed3, 1), c(seed3, 2), ... > ... > Yes, I understand. The seed things I'm using are the 6 integer values from the L'Ecuyer. If you run the example script, the verbose option causes some to print out. The first 3 seeds in a saved project seeds file looks like: > projSeeds[[1]] [[1]] [1] 407 376488316 1939487821 1433925148 -1040698333 579503880 [7] -624878918 [[2]] [1] 407 -1107332181 854177397 1773099324 1774170776 -266687360 [7] 816955059 [[3]] [1] 407 936506900 -1924631332 -1380363206 2109234517 1585239833 [7] -1559304513 The 407 in the first position is an integer R uses to note the type of stream for which the seed is intended, in this case R'Lecuyer. > There could be even only one seed and the matrix can be generated as > > c(seed, 1, 1), c(seed, 1, 2), ... > c(seed, 2, 1), c(seed, 2, 2), ... > c(seed, 3, 1), c(seed, 3, 2), ... > > If the initialization using the vector c(seed, i, j) is done > with a good quality hash function, the runs will be independent. > I don't have any formal proof that a "good quality hash function" would truly create seeds from which independent streams will be drawn. There is, however, the proof in the L'Ecuyer paper that one can take the long stream and divide it into sections. That's the approach I'm taking here. Its the same approach the a parallel package in R follows, and parallel frameworks like snow. The different thing in my approach is that I'm saving one row of seeds per simulation "run". So each run can be replicated exactly. I hope. pj pj > What is your opinion on this? > > An advantage of seeding with a vector is also that there can > be significantly more initial states of the generator among > which we select by the seed than 2^32, which is the maximum > for a single integer seed. > > Petr Savicky. > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
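In case it helps to see the mechanics of those 407-type seed vectors, this is roughly what initSeedStreams() and useStream() do (a simplified sketch of my code, not a verbatim copy):

initSeedStreams <- function(run) {
    startSeeds <<- projSeeds[[run]]
    currentSeeds <<- startSeeds
    currentStream <<- 1L
    assign(".Random.seed", startSeeds[[1L]], envir = .GlobalEnv)
}

useStream <- function(n, origin = FALSE) {
    ## remember where the stream we are leaving stopped
    currentSeeds[[currentStream]] <<- get(".Random.seed", envir = .GlobalEnv)
    currentStream <<- n
    newSeed <- if (origin) startSeeds[[n]] else currentSeeds[[n]]
    assign(".Random.seed", newSeed, envir = .GlobalEnv)
}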
Re: [Rd] portable parallel seeds project: request for critiques
Greetings. Answers below. On Tue, Feb 21, 2012 at 7:04 AM, Petr Savicky wrote: > > Hi. > > In the description of your project in the file > > http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/README > > you argue as follows > > Question: Why is this better than the simple old approach of > setting the seeds within each run with a formula like > > set.seed(2345 + 10 * run) > > Answer: That does allow replication, but it does not assure > that each run uses non-overlapping random number streams. It > offers absolutely no assurance whatsoever that the runs are > actually non-redundant. > > The following demonstrates that the function set.seed() for > the default generator indeed allows to have correlated streams. > > step <- function(x) > { > x[x < 0] <- x[x < 0] + 2^32 > x <- (69069 * x + 1) %% 2^32 > x[x > 2^31] <- x[x > 2^31] - 2^32 > x > } > > n <- 1000 > seed1 <- 124370417 # any seed > seed2 <- step(seed1) > > set.seed(seed1) > x <- runif(n) > set.seed(seed2) > y <- runif(n) > > rbind(seed1, seed2) > table(x[-1] == y[-n]) > > The output is > > [,1] > seed1 124370417 > seed2 205739774 > > FALSE TRUE > 5 994 > > This means that if the streams x, y are generated from the two > seeds above, then y is almost exactly equal to x shifted by 1. > > What is the current state of your project? > > Petr Savicky. Thanks for the example. That is exactly the point I've been making about the old, ordinary way of setting seeds. I continue testing on my plan, I believe it does work. The example implementation is successful. I have not received any critiques that make me think it is theoretically wrong, but I also have not received feedback from any body who has very detailed knowledge of R's parallel package to tell me that my approach is good. In order for this to be easy for users, I need to put the init streams and set current stream functions into a package, and then streamline the process of creating the seed array. My opinion is that CRAN is now overflowed with too many "one function" packages, I don't want to create another one just for these two little functions, and I may roll it into my general purpose regression package called "rockchalk". If I worked in a company and had money riding on this, I'd need to hire someone from R Core to stare at the way I'm setting, saving, and re-setting the .Random.seed variable to be completely sure there are not complications. Because parallel package is so new, there is relatively little documentation on how it is supposed to work. I'm just gazing at the source code and trying not to break anything it does, and trying to stop it from breaking anything I do. One technical issue that has been raised to me is that R parallel's implementation of the L'Ecuyer generator is based on integer valued variables, whereas the original L'Ecuyer code uses double real variables. But I'm trusting the R Core on this, if they code the generator in a way that is good enough for R itself, it is good enough for me. (And if that's wrong, we'll all find out together :) ). Josef Leydold (the rstream package author) has argued that R's implementation runs more slowly than it ought to. We had some correspondence and I tracked a few threads in forums. It appears the approach suggested there is roadblocked by some characteristics deep down in R and the way random streams are managed. Packages have only a limited, tenuous method to replace R's generators with their own generators. 
Here is the comment that L'Ecuyer and Leydold make in the document that is distributed with rstream.. "rstream: Streams of Random Numbers for Stochastic Simulation" "There is one drawback in the implementation. R uses a static table to implement access to different kinds of random number generators. Only one slot (RNGkind(kind="user-supplied")) has been provided for users who want to add her own generator. This slot relies on a pointer to a function called user_unif_rand. Thus rstream has to abuse this slot to add additional random number generators. However, there are other packages on CRAN that use the same trick (randaes, rlecuyer, rsprng, SuppDists) or load one of these packages (Rmpi, rpvm, rpart, snow, varSelRF). Thus if a user loads two of these packages or adds her own uniform random number generator, the behavior of at least one of the packages is broken. This problem could be fixed by some changes in the R core system, either by adding a package variableto RNGkind(kind="user-supplied") similar to the .Call function, or by replacing the static table by a dynamic one. I interpret all of this to mean that, although other approaches may work as well or better, the only approach that can "defend itself " in the ecology of a running R program is an approach that fiddles with R's own generators, rather than tries to replace them. And that's what I'm doing. pj ps Just this moment, while googling for Leydold's comments,
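As a footnote to Petr's shift demonstration quoted above: the same test, run on two adjacent L'Ecuyer-CMRG streams produced by parallel::nextRNGStream(), should show no such pattern. A quick check one can run (I am stating the expected result, not pasting actual output):

RNGkind("L'Ecuyer-CMRG")
set.seed(124370417)
s1 <- .Random.seed
s2 <- parallel::nextRNGStream(s1)

n <- 1000
.Random.seed <- s1
x <- runif(n)
.Random.seed <- s2
y <- runif(n)
table(x[-1] == y[-n])   # expect FALSE for (essentially) every comparison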
[Rd] Unexpected behavior in factor level ordering
Hello, Everybody: This may not be a "bug", but for me it is an unexpected outcome. A factor variable's levels do not retain their ordering after the levels function is used. I supply an example in which a factor with values "BC" "AD" (in that order) is unintentionally re-alphabetized by the levels function. To me, this is very bad behavior. Would you agree? # Paul Johnson 2012-02-05 x <- c("AD","BC","AD","BC","AD","BC") xf <- factor(x, levels=c("BC", "AD"), labels=c("Before Christ","After Christ")) y <- rnorm(6) m1 <- lm (y ~ xf ) plot(y ~ xf) abline (m1) ## Just a little problem the line does not "go through" the box ## plot in the right spot because contrasts(xf) is 0,1 but ## the plot uses xf in 1,2. xlevels <- levels(xf) newdf <- data.frame(xf=xlevels) ypred <- predict(m1, newdata=newdf) ##Watch now: the plot comes out "reversed", AC before BC plot(ypred ~ newdf$xf) ## Ah. Now I see: levels(newdf$xf) ## Why doesnt newdf$xf respect the ordering of the levels? -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
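Following up my own question: the reversal seems to come from data.frame(xf = xlevels) rebuilding the factor from a character vector, which re-alphabetizes the levels. Carrying the ordering along explicitly avoids it (a suggested workaround, not a claim about what plot or predict ought to do):

newdf <- data.frame(xf = factor(xlevels, levels = xlevels))
levels(newdf$xf)            # "Before Christ" "After Christ", as originally specified
ypred <- predict(m1, newdata = newdf)
plot(ypred ~ newdf$xf)      # boxes now appear in the intended order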
[Rd] --as-cran and need to ignore.svn directories
Good morning: I submitted a package update to CRAN and got a bounce because I had not run R CMD check with "--as-cran". I'd not heard of that before, but I'm glad to know about it now. I see it warns when my functions do use partial argument matching, and I like that advice very much. Also I see this warning * checking package subdirectories ... WARNING Found the following directory(s) with names of version control directories: ./.svn ./R/.svn ./data/.svn ./inst/.svn ./inst/doc/.svn ./inst/examples/.svn ./vignettes/.svn These should not be in a package tarball. Is there a way to cause R to ignore the .svn folders while running R CMD check --as-cran or R CMD build? It seems a little tedious to have to copy the whole directory tree to some other place and remove the .svn folders before building. I can do it, but it just seems, well, tedious. I have the feeling that you "frequent flyers" would have worked around this already. -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] --as-cran and need to ignore.svn directories
On Mon, Mar 12, 2012 at 12:47 PM, Martin Maechler wrote: >> Jari Oksanen [snip] > > > Paul, > > > I think the best solution is to 'svn export' svn directory to a > > temporary directory/folder: > > No, I don't think that should not be needed. > > When building the tarball, 'R CMD build' *does* ignore the .svn > folders, at least for me. > My tarball was created by R CMD build, I don't know why the .svn folders were included. There are no .svn folders in the packages I made with 2.14.1, so I suppose that is the first problem I have to figure out. This was my first build with R-2.14.2. If I get an answer, I will write back. > Paul's problem was using --as-cran on his SVN *directory* > but that's a slight contradiction in itself, as the CRAN maintainers > always run the checks on a tarball... I don't have an argument against what you say, and I'm glad to do it however you recommend. But I don't understand this. How do you get the tarball without first doing R CMD check and R CMD build on the working directory where the package lives? I just do what seems to be recommended in the tutorial by Friedrich Leisch "Creating R Packages: A Tutorial" and the R extensions document. Until now, I have done this in the shell to create the package. R CMD check rockchalk R CMD build rockchalk You mean everybody doesn't do it that way? >From now on, I'm adding --as-cran with check, but what else? After that finishes, I should be running check on the tarball that was created by build? While I'm asking basic packaging questions, I have a vignette packaging question. My vignette is in vignettes/outreg.Rnw The R build process runs latex to make sure the vignette can be built. But, as far as I can tell, it does not copy the pdf file into inst/doc/outreg.pdf. So when building the package, I should still manually compile the pdf and move it over there? > > Martin > > > svn export my-svn-directory tmp-pkg-directory > > R CMD build tmp-pkg-directory > > R CMD check --as-cran ... > > > The two advantages of 'svn export' that it (1) strips the .svn specific > > files, and (2) it really exports only those files that really are under > > version control. More often than once I have had some non-svn files in > > my svn directory so that *my* version of the package works, but the one > > actually in subversion fails. > > > Cheers, Jari Oksanen > > > -- > > Jari Oksanen, Dept Biology, Univ Oulu, 90014 Finland > > jari.oksa...@oulu.fi, Ph. +358 400 408593, http://cc.oulu.fi/~jarioksa -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] overriding "summary.default" or "summary.data.frame". How?
I suppose everybody who makes a package for the first time thinks "I can change anything!" and then runs into this same question. Has anybody written out information on how a package can override functions in R base in the R 2.14 (mandatory NAMESPACE era)? Suppose I want to alphabetize variables in a summary.data.frame, or return the standard deviation with the mean in summary output. I'm pasting in a working example below. It has new "summary.factor" method. It also has a function summarize that I might like to use in place of summary.data.frame. How would my new methods "drop on top" of R base functions? It appears my functions (summarizeFactors) can find my summary.factor, but R's own summary uses its own summary.factor. ## summarizeNumerics takes a data frame or matrix, scans the columns ## to select only the numeric variables. By default it alphabetizes ## the columns (use alphaSort = FALSE to stop that). It then ## calculates the quantiles for each variable, as well as the mean, ## standard deviation, and variance, and then packs those results into ## a matrix. The main benefits from this compared to R's default ## summary are 1) more summary information is returned for each ## variable, and the results are returned in a matrix that is easy to ## use in further analysis. summarizeNumerics <- function(dat, alphaSort = TRUE, digits = max(3, getOption("digits") - 3)){ if (!is.data.frame(dat)) dat <- as.data.frame(dat) nums <- sapply(dat, is.numeric) datn <- dat[ , nums, drop = FALSE] if (alphaSort) datn <- datn[ , sort(colnames(datn)), drop = FALSE] sumdat <- apply(datn, 2, stats::quantile, na.rm=TRUE) sumdat <- rbind(sumdat, mean= apply(datn, 2, mean, na.rm=TRUE)) sumdat <- rbind(sumdat, sd= apply(datn, 2, sd, na.rm=TRUE)) sumdat <- rbind(sumdat, var= apply(datn, 2, var, na.rm=TRUE)) sumdat <- rbind(sumdat, "NA's"=apply(datn, 2, function(x) sum(is.na(x signif(sumdat, digits) } summary.factor <- function(y, numLevels) { ## 5 nested functions to be used later divr <- function(p=0){ ifelse ( p>0 & p < 1, -p * log2(p), 0) } entropy <- function(p){ sum ( divr(p) ) } maximumEntropy <- function(N) -log2(1/N) normedEntropy <- function(x) entropy(x)/ maximumEntropy(length(x)) nas <- is.na(y) y <- factor(y) ll <- levels(y) tbl <- table(y) tt <- c(tbl) names(tt) <- dimnames(tbl)[[1L]] o <- sort.list(tt, decreasing = TRUE) if (length(ll) > numLevels){ toExclude <- numLevels:length(ll) tt <- c(tt[o[-toExclude]], `(All Others)` = sum(tt[o[toExclude]]), `NA's`=sum(nas)) }else{ tt <- c(tt[o], `NA's`=sum(nas)) } props <- prop.table(tbl); tt <- c(tt, "Entropy"=entropy(props), "NormedEntropy"= normedEntropy(props)) } ## Takes a data frame or matrix, scans the columns to find the ## variables that are not numeric and keeps them. By default it ## alphabetizes them (alphaSort = FALSE to stop that). It then treats ## all non-numeric variables as if they were factors, and summarizes ## each in a say that I find useful. In particular, for each factor, ## it provides a table of the most frequently occurring values (the ## top "numLevels" values are represented). As a diversity indictor, ## it calculates the Entropy and NormedEntropy values for each ## variable. Note not all of this is original. 
It combines my code ## and R code from base/summary.R summarizeFactors <- function(dat = NULL, numLevels = 10, alphaSort = TRUE, digits = max(3, getOption("digits") - 3)) { ##copies from R base::summary.R summary.data.frame ncw <- function(x) { z <- nchar(x, type="w") if (any(na <- is.na(z))) { # FIXME: can we do better z[na] <- nchar(encodeString(z[na]), "b") } z } if (!is.data.frame(dat)) dat <- as.data.frame(dat) ##treat any nonnumeric as a factor factors <- sapply(dat, function(x) {!is.numeric(x) }) ##If only one factor, need drop=FALSE. datf <- dat[ , factors, drop = FALSE] if (alphaSort) datf <- datf[ , sort(colnames(datf)), drop = FALSE] z <- lapply(datf, summary.factor, numLevels=numLevels) nv <- length(datf) nm <- names(datf) lw <- numeric(nv) nr <- max(unlist(lapply(z, NROW))) for(i in 1L:nv) { sms <- z[[i]] lbs <- format(names(sms)) sms <- paste(lbs, ":", format(sms, digits = digits), " ", sep = "") lw[i] <- ncw(lbs[1L]) length(sms) <- nr z[[i]] <- sms } z <- unlist(z, use.names=TRUE) dim(z) <- c(nr, nv) if(any(is.na(lw))) warning("probably wrong encoding in names(.) of column ", paste(which(is.na(lw)), collapse = ", ")) blanks <- paste(character(max(lw, na.rm=TRUE) + 2L), collapse = " ") pad <- floor(lw - ncw(nm)/2) nm <- paste(substring(blanks, 1, pad), nm, sep = "") dimnames(z) <- list(rep.int("", nr), nm) attr(z, "class") <- c("table") z } ## ## want to override summary.data.frame, but confusing. When ## will R find my summary.data.frame, when will it find the one in base. ## us
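One workaround I am considering, rather than fighting base's own summary.factor, is to export a separate generic from the package and register methods for it in the NAMESPACE file. A sketch (the names are mine; nothing here is required by R):

## In R/summarize.R
summarize <- function(object, ...) UseMethod("summarize")

summarize.data.frame <- function(object, ...) {
    list(numerics = summarizeNumerics(object, ...),
         factors  = summarizeFactors(object, ...))
}

## In NAMESPACE
##   export(summarize)
##   S3method(summarize, data.frame)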
[Rd] I wish xlim=c(0, NA) would work. How about I send you a patch?
I'm looking for an R mentor. I want to propose a change in management of plot options xlim and ylim. Did you ever want to change one coordinate in xlim or ylim? It happens to me all the time. x <- rnorm(100, m=5, s=1) y <- rnorm(100, m=6, s=1) plot(x,y) ## Oh, I want the "y axis" to show above x=0. plot(x,y, xlim=c(0, )) ##Output: Error in c(0, ) : argument 2 is empty plot(x,y, xlim=c(0,NA )) ## Output: Error in plot.window(...) : need finite 'xlim' values I wish that plot would let me supply just the min (or max) and then calculate the other value where needed. It is a little bit tedious for the user to do that for herself. The user must be knowledgeable enough to know where the maximum (MAX) is supposed to be, and then supply xlim=c(0, MAX). I can't see any reason for insisting users have that deeper understanding of how R calculates ranges for plots. Suppose the user is allowed to supply NA to signal R should fill in the blanks. plot(x,y, xlim=c(0, NA)) In plot.default now, I find this code to manage xlim xlim <- if (is.null(xlim)) range(xy$x[is.finite(xy$x)]) And I would change it to be something like ##get default range nxlim <- range(xy$x[is.finite(xy$x)]) ## if xlim is NULL, so same as current xlim <- if (is.null(xlim)) nxlim ## Otherwise, replace NAs in xlim with values from nxlim else xlim[ is.na(xlim) ] <- nxlim[ is.na(xlim) ] Who is the responsible party for plot.default. How about it? Think of how much happier users would be! pj -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
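To make the idea easy to try without patching plot.default itself, here is the same logic as a small standalone helper (a sketch of the proposed behavior, not the actual patch):

fillLim <- function(lim, v) {
    vlim <- range(v[is.finite(v)])
    if (is.null(lim)) return(vlim)
    lim[is.na(lim)] <- vlim[is.na(lim)]
    lim
}

x <- rnorm(100, m = 5, s = 1)
fillLim(c(0, NA), x)   # keeps the 0, fills the max from the data
fillLim(NULL, x)       # identical to the current default, range(x)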
[Rd] Proposal: model.data
Greetings: I'm still working on functions to make it easier for students to interpret regression predictions. I am working out a scheme to more easily create "newdata" objects (for use in predict functions). This has been done before in packages like Zelig, Effects, rms, and others. Nothing is "exactly right," though. Stata users keep flashing their "predicted probabity tables" at me and I'd like something like that to use with R. I'm proposing here a function model.data that receives a regression and creates a dataframe of raw data from which newdata objects can be constructed. This follows a suggestion that Bill Dunlap made to me in response to a question I posted in r-help. While studying termplot code, I saw the "carrier" function approach to deducing the "raw" predictors. However, it does not always work. Here is one problem. termplot mistakes 10 in log(10 + x1) for a variable Example: dat <- data.frame(x1 = rnorm(100), x2 = rpois(100, lambda=7)) STDE <- 10 dat$y <- 1.2 * log(10 + dat$x1) + 2.3 * dat$x2 + rnorm(100, sd = STDE) m1 <- lm( y ~ log(10 + x1) + x2, data=dat) termplot(m1) ## See the trouble? termplot thinks 10 is the term to plot. Another problem is that predict( type="terms") does not behave sensibly sometimes. RHS of formula that have nonlinear transformations are misunderstood as separate terms. ##Example: dat$y2 <- 1.2 * log(10 + dat$x1) + 2.3 * dat$x1^2 + rnorm(100, sd = STDE) m2 <- lm( y2 ~ log(10 + x1) + sin(x1), data=dat) summary(m2) predict(m2, type="terms") ## Output: ## log(10 + x1) sin(x1) ## 1 1.50051781 -2.04871711 ## 2-0.14707391 0.31131124 What I wish would happen instead is one "correct" prediction for each value of x1. This should be the output: predict(m2, newdata = data.frame(x1 = dat$x1)) ## > predict(m2, newdata = data.frame(x1 = dat$x1)) ## 12345678 ## 17.78563 18.49806 17.50719 19.70093 17.45071 19.69718 18.84137 18.89971 The "fix" I'm testing now is the following new function, "model.data". which tries to re-create the data object that would be consistent with a fitted model. This follows a suggestion from Bill Dunlap in r-help on 2012-04-22 ##' Creates a "raw" (UNTRANSFORMED) data frame equivalent ##' to the input data that would be required to fit the given model. ##' ##' Unlike model.frame and model.matrix, this does not return transformed ##' variables. ##' ##' @param model A fitted regression model in which the data argument ##' is specified. This function will fail if the model was not fit ##' with the data option. ##' @return A data frame ##' @export ##' @author Paul E. 
Johnson ##' @example inst/examples/model.data-ex.R model.data <- function(model){ fmla <- formula(model) allnames <- all.vars(fmla) ## all variable names ## indep variables, includes d in poly(x,d) ivnames <- all.vars(formula(delete.response(terms(model ## datOrig: original data frame datOrig <- eval(model$call$data, environment(formula(model))) if (is.null(datOrig))stop("model.data: input model has no data frame") ## datOrig: almost right, but includes d in poly(x, d) dat <- get_all_vars(fmla, datOrig) ## Get rid of "d" and other "non variable" variable names that are not in datOrig: keepnames <- intersect(names(dat), names(datOrig)) ## Keep only rows actually used in model fit, and the correct columns dat <- dat[ row.names(model$model) , keepnames] ## keep ivnames that exist in datOrig attr(dat, "ivnames") <- intersect(ivnames, names(datOrig)) invisible(dat) } This works for the test cases like log(10+x) and so forth: ## Examples: head(m1.data <- model.data(m1)) head(m2.data <- model.data(m2)) ## > head(m1.data <- model.data(m1)) ## y x1 x2 ## 1 18.53846 0.46176539 8 ## 2 28.24759 0.09720934 7 ## 3 23.88184 0.67602556 9 ## 4 23.50130 -0.74877054 8 ## 5 25.81714 1.02555255 5 ## 6 24.75052 -0.69659539 6 ## > head(m2.data <- model.data(m2)) ## y x1 ## 1 18.53846 0.46176539 ## 2 28.24759 0.09720934 ## 3 23.88184 0.67602556 ## 4 23.50130 -0.74877054 ## 5 25.81714 1.02555255 ## 6 24.75052 -0.69659539 d <- 2 m4 <- lm(y ~ poly(x1,d), data=dat) head(m4.data <- model.data(m4)) ## y x1 ## 1 18.53846 0.46176539 ## 2 28.24759 0.09720934 ## 3 23.88184 0.67602556 Another strength of this approach is that the return object has an attribute "ivnames". If R's termplot used model.dat instead of the carrier functions, this would make for a much tighter set of code. What flaws do you see in this? One flaw is that I did not know how to re-construct data from the parent environment, so I insist the regression model has to have a data argument. Is this necessary, or can one of the R experts help. Another possible flaw: I'm keeping the columns from the data frame that are needed to re-construct the model.frame, and to match the rows, I'm using row.names for the model
Re: [Rd] Proposal: model.data
Greetings: On Thu, May 3, 2012 at 11:36 AM, Brian G. Peterson wrote: > On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote: >> If somebody in R Core would like this and think about putting it, or >> something like it, into the base, then many chores involving predicted >> values would become much easier. >> > Why does this need to be in base? Implement it in a package. >> > If it works, and is additive, people will use it. Look at 'reshape' or > 'xts' or 'Matrix' just to name a few examples of widely used packages. > I can't use it to fix termplot unless it is in base. Or are you suggesting I create my own "termplot" replacement? > Regards, > > - Brian > > -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Proposal: model.data
On Thu, May 3, 2012 at 12:19 PM, Brian G. Peterson wrote: > On Thu, 2012-05-03 at 12:09 -0500, Paul Johnson wrote: >> Greetings: >> >> On Thu, May 3, 2012 at 11:36 AM, Brian G. Peterson >> wrote: >> > On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote: >> >> If somebody in R Core would like this and think about putting it, or >> >> something like it, into the base, then many chores involving predicted >> >> values would become much easier. >> >> >> > Why does this need to be in base? Implement it in a package. >> >> So, nobody agrees with me that R base should have model.data? Too bad! You could save us a lot of effort. I've found different efforts to get the same work done in termplot and all of the implementations of predict. And just about every regression related package has its own approach. If there were one good way that would always work, just think how convenient it would be for all those function & package writers. Oh, well. I'm not saying that my model.data is perfect, just that I wish there were a perfect one :) Yesterday, I realized that predict.nls probably has to deal with this problem so I studied that and have yet another version of model.data to propose to you. I'm using this in my regression-support package rockchalk, so you don't need to give me Brian Peterson's advice to "put it in a package". The idea here is to take variables from a fitted model's data if it can find them, and then grab variables from the parent environment IF they have the correct length. This means we ignore variables like d in poly(x, d) because the variable d is not of the same length as the variables in the model. ##' Creates a "raw" (UNTRANSFORMED) data frame equivalent ##' to the input data that would be required to fit the given model. ##' ##' Unlike model.frame and model.matrix, this does not return transformed ##' variables. ##' ##' @param model A fitted regression model in which the data argument ##' is specified. This function will fail if the model was not fit ##' with the data option. ##' @return A data frame ##' @export ##' @author Paul E. Johnson ##' @example inst/examples/model.data-ex.R model.data <- function(model){ #from nls, returns -1 for missing variables lenVar <- function(var, data) tryCatch(length(eval(as.name(var), data, env)), error = function(e) -1) fmla <- formula(model) varNames <- all.vars(fmla) ## all variable names ## varNames includes d in poly(x,d), possibly other "constants" ## varNamesRHS <- all.vars(formula(delete.response(terms(model ## previous same as nls way? fmla2 <- fmla fmla2[[2L]] <- 0 varNamesRHS <- all.vars(fmla2) varNamesLHS <- setdiff(varNames, varNamesRHS) env <- environment(fmla) if (is.null(env)) env <- parent.frame() dataOrig <- eval(model$call$data, environment(formula(model))) rndataOrig <- row.names(dataOrig) n <- sapply(varNames, lenVar, data=dataOrig) targetLength <- length(eval(as.name(varNamesLHS[1]), dataOrig, env)) varNames <- varNames[ n == targetLength ] ldata <- lapply(varNames, function(x) eval(as.name(x), dataOrig, env)) names(ldata) <- varNames data <- data.frame(ldata[varNames]) if (!is.null(rndataOrig)) row.names(data) <- rndataOrig ## remove rows listed in model's na.action ## TODO: question: what else besides OMIT might be called for? 
if ( !is.null(model$na.action)){ data <- data[ -as.vector(model$na.action), ] } ## keep varNamesRHS that exist in datOrig attr(data, "varNamesRHS") <- setdiff(colnames(data), varNamesLHS) invisible(data) } And some example output: > ## check if model.data works when there is no data argument > set.seed(12345) > x1 <- rpois(100, l=6) > x2 <- rnorm(100, m=50, s=10) > x3 <- rnorm(100) > y <- rnorm(100) > m0 <- lm(y ~ x1 + x2 + x3) > m0.data <- model.data(m0) > x1[4:5] <- NA > m0 <- lm(y ~ x1 + x2 + x3) > m0.data <- model.data(m0) > head(m0.data) y x1 x2 x3 1 -0.8086741 7 44.59614 -1.6193283 2 1.0011198 9 69.47693 0.5483979 3 0.4560525 8 50.53590 0.1952822 6 0.6417692 4 52.77954 -0.2509466 7 -0.4150210 5 56.91171 1.6993467 8 -0.4595757 6 58.23795 -0.3442988 > x1 <- rpois(100, l=6) > x2 <- rnorm(100, m=50, s=10) > x3 <- rnorm(100) > xcat1 <- gl(2,50, labels=c("M","F")) > xcat2 <- cut(rnorm(100), breaks=c(-Inf, 0, 0.4, 0.9, 1, Inf), labels=c("R", > "M", "D", "P", "G")) > dat <- data.frame(x1, x2, x3, xcat1, xcat2) > rm(
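To show the intended payoff, here is one way (my illustration only, carrying on from the m0 example above) the returned data frame and its "varNamesRHS" attribute could feed a newdata object for predict():

rhs <- attr(m0.data, "varNamesRHS")
## hold every predictor at its mean, then let x1 sweep through its quartiles
centers <- lapply(m0.data[rhs], mean, na.rm = TRUE)
newdf <- as.data.frame(lapply(centers, rep, 3))
newdf$x1 <- quantile(m0.data$x1, probs = c(0.25, 0.50, 0.75))
predict(m0, newdata = newdf)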
[Rd] Requests for vignette clarification (re: Writing R Extensions)
I apologize that these questions are stupid and literal. I write to ask for clarification of comments in the R extensions manual about vignettes. I'm not great at LaTeX, but I'm not a complete novice either, and some of the comments are puzzling to me. 1. I'm stumbling over this line: "Make sure all files needed to run the R code in the vignette (data sets, ...) are accessible by either placing them in the inst/doc hierarchy of the source package or by using calls to system.file()." Where it says "inst/doc", can I interpret it to mean "vignettes"? The vignette files are under vignettes. Why wouldn't those other files be in there? Or does that mean I'm supposed to copy the style and bib files from the vignettes folder to the inst/doc folder? Or none of the above :) 2. I'm also curious about the implications of the parenthesized section of this comment: "By default R CMD build will run Sweave on all files in Sweave format in vignettes, or if that does not exist, inst/doc (but not in sub-directories)." At first I though that meant it will search vignettes and subdirectories under vignettes, or it will look under inst/doc, but no subdirectories under inst/doc. So I created vignettes in subdirectories under vignettes and they are ignored by the build process, so that was obviously wrong. For clarification, it would help me if the manual said "By default R CMD build will run Sweave on all files in Sweave format in vignettes (but not in sub-directories), or if that does not exist, inst/doc ." In this list I've read several questions/complaints from people who don't want their vignettes rebuild during the package check or build process, and I wondered if there is a benefit to having vignettes in subdirectories. Could inclusion of troublesome vignettes in subdirectories be a way that people can circumvent the rebuilding and re-checking of vignettes during build, check, or install? If I build my vignettes manually and copy the pdf output over to inst/doc, will those pdf files be "legitimate" vignette files as far as CRAN is concerned? The writeup in R Extensions is a little bit confusing on that point. "By including the PDF version in the package sources it is not necessary that the vignette PDFs can be re-built at install time, i.e., the package author can use private R packages, screen snapshots and LaTeX extensions which are only available on his machine." Its just confusing, that's all I can say about it. I could learn how to do this from some examples of packages that manage vignettes the "right way", if you could tell me which ones are "right" :) I'd like to see one that has a Makefile, uses a bib file, and, if possible, one that imports a pdf file from elsewhere. If there is one that uses subdirectories under vignettes to keep separate the content of vignettes, that would be extra helpful. I'm eager to do this in the correct way, just point me at some that are proper. pj -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Requests for vignette clarification (re: Writing R Extensions)
On Sun, Jun 3, 2012 at 2:02 PM, Paul Gilbert wrote: > I'll make a guess at some parts of this. > > > On 12-06-01 02:53 PM, Paul Johnson wrote: >> >> I apologize that these questions are stupid and literal. >> >> I write to ask for clarification of comments in the R extensions >> manual about vignettes. I'm not great at LaTeX, but I'm not a >> complete novice either, and some of the comments are puzzling to me. >> >> 1. I'm stumbling over this line: >> >> "Make sure all files needed to run the R code in the vignette (data >> sets, ...) are accessible by either placing them in the inst/doc >> hierarchy of the source package or by using calls to system.file()." >> >> Where it says "inst/doc", can I interpret it to mean "vignettes"? The >> vignette files are under vignettes. Why wouldn't those other files be >> in there? Or does that mean I'm supposed to copy the style and bib >> files from the vignettes folder to the inst/doc folder? Or none of >> the above :) > > > I think the idea is that a user looking at an installed version of the > package will be able to see things that are in the doc/ directory of the > installed package. This automatically includes the source files (eg *.Stex) > from vignettes/ and also the generated *.pdf and the *.R files stripped from > the *.Stex files. If you want them to have access to other files then you > should put those somewhere so they get installed, such as in the source > package /inst/doc directory so they get put in the doc/ directory of the > installed package. That should probably include anything else that is > important to reproduce the results in the vignette, but I do not count the > .bib file in that list (so I have it in vignettes/ and users would need to > look at the package source to find it). > >> >> 2. I'm also curious about the implications of the parenthesized >> section of this comment: >> >> "By default R CMD build will run Sweave on all files in Sweave format >> in vignettes, or if that does not exist, inst/doc (but not in >> sub-directories)." >> >> At first I though that meant it will search vignettes and >> subdirectories under vignettes, or it will look under inst/doc, but no >> subdirectories under inst/doc. So I created vignettes in >> subdirectories under vignettes and they are ignored by the build >> process, so that was obviously wrong. For clarification, it would >> help me if the manual said >> >> "By default R CMD build will run Sweave on all files in Sweave format >> in vignettes (but not in sub-directories), or if that does not exist, >> inst/doc ." >> >> In this list I've read several questions/complaints from people who >> don't want their vignettes rebuild during the package check or build >> process, and I wondered if there is a benefit to having vignettes in >> subdirectories. Could inclusion of troublesome vignettes in >> subdirectories be a way that people can circumvent the rebuilding and >> re-checking of vignettes during build, check, or install? If I build >> my vignettes manually and copy the pdf output over to inst/doc, will >> those pdf files be "legitimate" vignette files as far as CRAN is >> concerned? The writeup in R Extensions is a little bit confusing on >> that point. >> >> "By including the PDF version in the package sources it is not >> necessary that the vignette PDFs can be re-built at install time, >> i.e., the package author can use private R packages, screen snapshots >> and LaTeX extensions which are only available on his machine." >> >> Its just confusing, that's all I can say about it. 
> > > There was at least one earlier R-devel discussion of this, in which I > contributed an incorrect understanding, but was generally straightened out > by Uwe. I hope I have a correct understanding now. > > You can put a pdf file in inst/doc and specify "BuildVignettes: false" in > the DESCRIPTION file, in which case the already constructed pdf from > inst/doc will be used. The purpose of this is to allow vignettes which would > not be completely constructed from sources, for example, because certain > data or other resources may not be generally available. However, R CMD check > will still try to parse the Sweave file and run the R code, and fail if it > does not run. So, when the resources to build the vignette are not generally > available this does require some special attention, often with try(), in the > code for your vignette. > > It is possible to clai
Re: [Rd] stumped on re-building package vignette
On Wed, Jun 6, 2012 at 8:21 AM, Michael Friendly wrote: > [Env: Win Xp / StatET 2.0 / R 2.15.0] > > In my heplots package I extended the HE-examples.Rnw vignette under > inst/doc. The package passes R CMD check > on my machine: > > * using log directory 'C:/eclipse-3.7/heplots.Rcheck' > * using R version 2.15.0 (2012-03-30) > * using platform: i386-pc-mingw32 (32-bit) > ... > * checking sizes of PDF files under 'inst/doc' ... OK > * checking installed files from 'inst/doc' ... OK > * checking examples ... OK > * checking for unstated dependencies in vignettes ... OK > * checking package vignettes in 'inst/doc' ... OK > * checking running R code from vignettes ... > 'HE-examples.Rnw' ... OK > 'repeated.Rnw' ... OK > ... 2 suggestions. First, On your own workstation, run the cmd check like so R CMD check --as-cran package.tar.gz That will give you results more like CRAN or Rforge. Second, how did your pdf get into the folder "fig"? It makes me suspect you are using some forbidden Sweave settings. Oh, Third, here's a suggestion. Build your own vignette, copy from vignettes to inst/doc, and then in the DESCRIPTION file include the option to ask the R system not to rebuild your vignette. BuildVignettes: false I think you will see this discussed/debated in R-devel and the situation is somewhat fluid. pj > > However, on R-Forge and on CRAN, the following error/warning is generated: > > Mon Jun 4 20:18:22 2012: Checking package heplots (SVN revision 136) ... > * using log directory > ‘/mnt/building/build_2012-06-04-20-02/RF_PKG_CHECK/PKGS/heplots.Rcheck’ > * using R version 2.15.0 Patched (2012-06-03 r59505) > * using platform: x86_64-unknown-linux-gnu (64-bit) > ... > * checking for unstated dependencies in vignettes ... OK > * checking package vignettes in ‘inst/doc’ ... OK > * checking running R code from vignettes ... > ‘HE-examples.Rnw’ ... [4s/4s] OK > ‘repeated.Rnw’ ... [4s/4s] OK > [9s/9s] OK > * checking re-building of vignette PDFs ... NOTE > Error in re-building vignettes: > ... > Loading required package: car > Loading required package: MASS > Loading required package: nnet > Error: processing vignette 'HE-examples.Rnw' failed with diagnostics: > cannot open file 'fig/plot-plastic1.pdf' > Execution halted > > * checking PDF version of manual ... OK > > I am unable to determine why the file fig/plot-plastic1.pdf cannot be > opened. It is in my inst/doc/fig directory & is regenerated > by the .Rnw file. What could cause this? > > Second, I have tried manually running tools::compactPDF("HE-examples.pdf) on > the .pdf file under inst/doc in the package, > but no change is made to the .pdf file. I can't see any way to correct this. > > -- > Michael Friendly Email: friendly AT yorku DOT ca > Professor, Psychology Dept. > York University Voice: 416 736-5115 x66249 Fax: 416 736-5814 > 4700 Keele Street Web: http://www.datavis.ca > Toronto, ONT M3J 1P3 CANADA > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
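A guess about the fig/ issue: if the .Rnw sets a figure prefix such as fig/plot, so that Sweave tries to write fig/plot-plastic1.pdf, that subdirectory will not exist in the clean tree CRAN rebuilds from. If that really is the cause (an assumption on my part, not a diagnosis of heplots), an early invisible chunk that creates the directory is a cheap fix:

## in an early chunk of the vignette, before any figure chunks
if (!file.exists("fig")) dir.create("fig")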
[Rd] Working on a Vignette called Rchaeology
Greetings, R developers Here are my requests today. 1. Would you care to review this vignette http://pj.freefaculty.org/R/Rchaeology.pdf and tell me if you think it is wrong headed, and 2. Supposing you do not think I'm completely wrong, would you care to point me at more examples of R idioms that lead to deep insights into the nature of R programming? Longer more boring explanation: In the rockchalk package, I have functions to make it easier to teach regression to students who aren't R programmers. I just uploaded package version 1.6 to CRAN. While I do this, I'm reading R source code all the time because I need to understand how functions like summary and anova receive R objects, take them apart, and do specific chores. I've been very glad that R is open source while I do this. While working on this, an inspiration hit me! I want to call "Rchaeology" the study of R idioms and customs deduced from the R source code written by the experts. Practicing this makes me an Rchaeologist. I've started a vignette, it's in rockchalk and up here: http://pj.freefaculty.org/R/Rchaeology.pdf I plan to work out sections that explore some particular usage examples that touch on things that expose special features of the R language. Right now, I've only got one worked out, it explores this idiom that Gabor G. explained to me, which receives a formula object and replaces a variable "x1" with "x1c". do.call("substitute", list(newFmla, setNames(list(as.name("x1c")), "x1"))) I think that one is fabulous, It sheds a lot of light when you break it down to pieces. I'd like to build up a good list of "Do This, Not That" bits, but it is hard to find particular approaches that almost all of you will agree to. Almost everybody agrees that calling rbind over and over again is slow, and do.call with rbind and a collection of stackable things is better: http://pj.freefaculty.org/R/WorkingExamples/stackListItems.R If you have ideas for "Do This, Not That", I'd be glad to hear them and I'll look for example applications that make them relevant. pj -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
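In case the idiom sounds abstract, here is a toy instance of it (my own example, not taken from rockchalk):

newFmla <- y ~ x1 + log(10 + x1) + x2
fmla2 <- do.call("substitute",
                 list(newFmla, setNames(list(as.name("x1c")), "x1")))
fmla2
## y ~ x1c + log(10 + x1c) + x2 : every x1 is replaced, other symbols untouched

Breaking that apart, why do.call() is needed at all, what setNames() contributes, and why as.name() rather than a character string, is exactly the kind of Rchaeology the vignette tries to do.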
[Rd] Problem in vignette packaging of Sweave in utils package
In ?Sweave, it refers to Sweave User Manual. In the doc folder of utils package, I see "Sweave.pdf". However, I can't find it from within R > vignette("Sweave User Manual") Warning message: vignette ‘Sweave User Manual’ not found > browseVignettes("utils") No vignettes found by browseVignettes("utils") > library(help=utils) does not refer to any vignettes. The vignette does not appear in the main page for utils in help.start(). I checked the source code for the Sweave vignette, but I don't see anything wrong. It has all of the required elements: %\VignetteIndexEntry{Sweave User Manual} %\VignettePackage{utils} %\VignetteDepends{tools} %\VignetteDepends{datasets} %\VignetteDepends{stats} Am I accessing it incorrectly, or is there something wrong in my install of R-2.15.1? > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.15.1 -- Paul E. Johnson Professor, Political ScienceAssoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.orghttp://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Problem in vignette packaging of Sweave in utils package
On Tue, Jul 3, 2012 at 12:54 PM, Yihui Xie wrote: > Strange enough; I just noticed the HTML index pages of several base > packages were gone (e.g. base, stats, tools, utils) under Ubuntu. Not > sure if this is a problem of Debian packages or R itself. > It is not only on Debian where I see: > vignette("Sweave") Warning message: vignette ‘Sweave’ not found This afflicts RPM based installs as well. It is not a new problem, AFAICT. Here's output from R-2.15.0 on a Centos Linux cluster node.: > vignette("Sweave") Warning message: vignette 'Sweave' not found > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C [3] LC_TIME=en_US.iso885915LC_COLLATE=en_US.iso885915 [5] LC_MONETARY=en_US.iso885915LC_MESSAGES=en_US.iso885915 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.15.0 > -- > Yihui Xie > Phone: 515-294-2465 Web: http://yihui.name > Department of Statistics, Iowa State University > 2215 Snedecor Hall, Ames, IA > -- Paul E. Johnson Professor, Political ScienceAssoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.orghttp://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
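For anyone who wants to check their own installation: as far as I can tell, vignette() and browseVignettes() consult the Meta/vignette.rds index of the installed package, so the problem can be narrowed down like this (on my machines the pdf is present but, I suspect, the index is missing or empty):

utilsdir <- find.package("utils")
file.exists(file.path(utilsdir, "doc", "Sweave.pdf"))     # is the pdf installed?
file.exists(file.path(utilsdir, "Meta", "vignette.rds"))  # was the index written?
readRDS(file.path(utilsdir, "Meta", "vignette.rds"))      # what the index lists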
[Rd] Are R packages supposed to be "relocatable"? (avoiding BioConductor scripts...)
I've asked a question in the BioConductor list about package management. My solution depends on your answer to the following question. Are installed R packages "relocatable"? I mean relocatable in the same sense that files in a RedHat RPM file might be "relocatable" after compiling (http://www.rpm.org/max-rpm/ch-rpm-reloc.html). This allows one to build a package as the ordinary user and then the root user can take the result and put it wherever it fits well in the path (say, /usr/local/lib/R/library). Here is why I asked. I manage packages in our cluster and some users have asked me to install some BioConductor packages. BioConductor install documents expect me to run a script as root that does a bunch of changes, and I'm just unwilling to do that. If I have to do something as root, it has to be something more focused like running a particular R command (install.packages(), for example). It seems insane to me that they expect (http://www.bioconductor.org/install) a root user to run source("http://bioconductor.org/biocLite.R") biocLite("limma") If I could do the installs as me with their script, and then copy the install folder into the system, then it would be OK, if the packages would work. Or perhaps post-processing is required to fiddle some paths inside package files? pj -- Paul E. Johnson Professor, Political Science  Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org  http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
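To be concrete about the workflow I am hoping is legitimate (the paths are placeholders, and whether the copied tree really works afterwards is exactly my question):

## as an unprivileged user: install into a staging library
stagingLib <- "~/R/staging"
dir.create(stagingLib, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(stagingLib, .libPaths()))     # staging library becomes the default target
source("http://bioconductor.org/biocLite.R")
biocLite("limma")                         # should now land in ~/R/staging
## then, as root, copy the installed directory into place, e.g.
##   cp -a ~/R/staging/limma /usr/local/lib/R/library/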
Re: [Rd] Problem in vignette packaging of Sweave in utils package
New help request below On Sat, Jul 7, 2012 at 7:25 AM, Duncan Murdoch wrote: > On 12-07-03 1:21 PM, Paul Johnson wrote: >> >> In ?Sweave, it refers to Sweave User Manual. In the doc folder of >> utils package, I see "Sweave.pdf". >> >> However, I can't find it from within R >> >> >>> vignette("Sweave User Manual") >> >> Warning message: >> vignette ‘Sweave User Manual’ not found > > > Turns out there was a bug in the code to install from tarballs, so base > package vignettes weren't fully installed. I've fixed it; installs from > R-patched or R-devel revisions after 59750 should be okay. > > Duncan Murdoch > > Thanks, Duncan Can I ask you to check this one again? I just built RPMS for R-2.15.1-patched 2012-07-25 on a Rocks Cluster (Centos 5.3). I still see the trouble that the vignette for Sweave is missing from the vignette index. Below I'm showing everything I can think of that is related to this, hoping it will give you an idea if the problem is on my end. > sessionInfo() R version 2.15.1 Patched (2012-07-25 r59963) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.15.1 > browseVignettes("utils") No vignettes found by browseVignettes("utils") > vignette("Sweave") Warning message: vignette ‘Sweave’ not found As far as I can tell, the vignette WAS built and installed, as evidenced: $ pwd /usr/lib64/R/library/utils/doc [pauljohn@devel doc]$ ls Sweave.pdf But it is not found by the vignette indexing system, it's not in the output from help(package="utils") Maybe I did something to break the install when making the RPM. But I don't know what:) I also built R the "old fashioned way" in the shell and the result is the same. For the RPM, I used a spec file I've slightly adapted from Tom "Spot" Calloway's spec file that is used to build R for EPEL and Fedora. Here's a copy: http://pj.freefaculty.org/misc/R-2.15.1-patch.spec And the output (stdout and stderr) from the build process, obtained from $ rpmbuild -ba R-2.15.1-patch.spec --define "dist kurocks53" > R-2.15.1-patch.txt 2>&1 http://pj.freefaculty.org/misc/R-2.15.1-patch.txt In order to make this work, I did one unsavory thing. I took the contents of this R-patched_2012-07-25.tar.gz and re-named the directory R-2.15.1 and then repackaged that as R-2.15.1.tar.gz. I know this is unsavory to rename somebody else's packaging, but it was the only way I could get the RPM build system to accept the version number and file name. I put the date in the release version to reduce confusion. The RPMs have names like: R-core-2.15.1-20120725kurocks53.x86_64.rpm Oh, heck. I'm just uploading them all to the same folder as the spec file, on the off chance you can see what's wrong. libRmath-2.15.1-20120725kurocks53.x86_64.rpm libRmath-devel-2.15.1-20120725kurocks53.x86_64.rpm libRmath-static-2.15.1-20120725kurocks53.x86_64.rpm R-2.15.1-20120725kurocks53.x86_64.rpm R-core-2.15.1-20120725kurocks53.x86_64.rpm R-debuginfo-2.15.1-20120725kurocks53.x86_64.rpm R-devel-2.15.1-20120725kurocks53.x86_64.rpm And the SRPM: R-2.15.1-20120725kurocks53.src.rpm pj -- Paul E. Johnson Professor, Political ScienceAssoc. 
Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.orghttp://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
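One way to check from R whether the vignette index (as opposed to the PDF alone) made it into the installed utils package. This is a sketch; at least in recent R versions the index that vignette() and browseVignettes() consult is the Meta/vignette.rds file of the installed package.
```
utilsDir <- system.file(package = "utils")
list.files(file.path(utilsDir, "doc"))                    # is Sweave.pdf there?
## If the index file is missing, the PDF can be present on disk yet
## invisible to vignette() and browseVignettes().
file.exists(file.path(utilsDir, "Meta", "vignette.rds"))
```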
Re: [Rd] Problem in vignette packaging of Sweave in utils package
On Fri, Jul 27, 2012 at 9:54 AM, Duncan Murdoch wrote: > On 12-07-27 1:23 AM, Paul Johnson wrote: >> >> New help request below >> >> On Sat, Jul 7, 2012 at 7:25 AM, Duncan Murdoch >> wrote: >>> >>> On 12-07-03 1:21 PM, Paul Johnson wrote: >>>> >>>> >>>> In ?Sweave, it refers to Sweave User Manual. In the doc folder of >>>> utils package, I see "Sweave.pdf". >>>> >>>> However, I can't find it from within R >>>> >>>> >>>>> vignette("Sweave User Manual") >>>> >>>> >>>> Warning message: >>>> vignette ‘Sweave User Manual’ not found >>> >>> >>> >>> Turns out there was a bug in the code to install from tarballs, so base >>> package vignettes weren't fully installed. I've fixed it; installs from >>> R-patched or R-devel revisions after 59750 should be okay. >>> >>> Duncan Murdoch >>> >>> >> >> Thanks, Duncan >> >> Can I ask you to check this one again? > > > Sorry, I can't help with this. If you build from svn or from a tarball, > then I could help, but someone else produces the rpms, and I don't have > access to a Linux system to try them. > > Duncan Murdoch > > Sorry I was unclear. I did build from R-patched tarball, from here: ftp://ftp.stat.math.ethz.ch/Software/R/R-patched_2012-07-25.tar.gz I get the same result whether I build that into an RPM or if I build and install the "old fashioned" way (configure, make, make install). The Sweave pdf is created, but not found inside R. I just re-ran the compilation in an Emacs shell and you can see everything that goes on: http://pj.freefaculty.org/misc/R-patched-build-20120725.txt As you will see, I run into problems with pdf creation in make pdf dealing with refman. I think that links back to the ancient tetex toolkit and the lack of incolsolata font. If that is the reason R can't find the vignette, I would be glad to know. I do get the Sweave.pdf vignette created, just not indexed. pj -- Paul E. Johnson Professor, Political ScienceAssoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.orghttp://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] tcltk capability
Sorry I did not see this sooner. Response below: On Thu, Aug 30, 2012 at 2:48 PM, Gabor Grothendieck wrote: > There are quite a few packages that make use of tcltk and although > > - most R distributions have tcltk capability > - its possible to query this via capabilities()[["tcltk"]] > - attempting to build a package that needs it on such an R > distribution will give a message letting one know > > users still seem to have problems with this. I agree that packages are using tcltk, but users do so at their peril. Just loading tcltk breaks multicore functions like mclapply. I don't understand why, you might be able to explain it to me. http://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15040 tcltk is discouraged, at least to some in R core. I would really appreciate some guidance on what graphics library I should use to develop GUI for R programs. There's no future in tcltk, apparently, but I don't know what to work with. pj pj > > There was recently an R bug report submitted that likely stemmed from > the user having such a build of R and recently there was also an R > help post that seemed to be related to this as well. As more and more > packages add GUI capability these incidents are bound to increase. > > The best situation would be to arrange that every distribution of R > have tcltk capability but if this is difficult to arrange then I > suggest there be a message at start up similar to the existing locale > message letting the user know that the R distribution they are running > lacks tcltk capability. Preferably the message would be somewhat > menacing so that users are encouraged to get a different distribution > of R right from the start. > > This might reduce such problem reports somewhat. > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Paul E. Johnson Professor, Political ScienceAssoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.orghttp://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
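A small illustration of the defensive check Gabor describes: guard any tcltk-dependent feature behind capabilities(), so an R build without Tcl/Tk support fails with a clear message rather than a load error. The widget shown is only a placeholder.
```
## Refuse politely when the running R has no Tcl/Tk support.
if (!capabilities()[["tcltk"]])
    stop("this feature requires an R build compiled with tcltk support")
library(tcltk)
tt <- tktoplevel()
tkpack(tkbutton(tt, text = "OK", command = function() tkdestroy(tt)))
```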
[Rd] Changing arguments inside .Call. Wise to encourage "const" on all arguments?
I'm continuing my work on finding speedups in generalized inverse calculations in some simulations. It leads me back to .C and .Call, and some questions I've never been able to answer for myself. It may be I can push some calculations to LAPACK in or C BLAS, that's why I realized again I don't understand the call by reference or value semantics of .Call Why aren't users of .Call encouraged to "const" their arguments, and why doesn't .Call do this for them (if we really believe in return by value)? R Gentleman's R Programming for Bioinformatics is the most understandable treatment I've found on .Call. It appears to me .Call leaves "wiggle room" where there should be none. Here's Gentleman on p. 185. "For .Call and .External, the return value is an R object (the C functions must return a SEXP), and for these functions the values that were passed are typically not modified. If they must be modified, then making a copy in R, prior to invoking the C code, is necessary." I *think* that means: .Call allows return by reference, BUT we really wish users would not use it. Users can damage R data structures that are pointed to unless they really truly know what they are doing on the C side. ?? This seems dangerous. Why allow return by reference at all? On p. 197, there's a similar comment "Any function that has been invoked by either .External or .Call will have all of its arguments protected already. You do not need to protect them. [T]hey were not duplicated and should be treated as read-only values." "should be ... read-only" concerns me. They are "protected" in the garbage collector sense, but they are not protected from "return by reference" damage. Right? Why doesn't the documentation recommend function writers to mark arguments to C functions as const? Isn't that what the return by value policy would suggest? Here's a troublesome example in R src/main/array.c: /* DropDims strips away redundant dimensioning information. */ /* If there is an appropriate dimnames attribute the correct */ /* element is extracted and attached to the vector as a names */ /* attribute. Note that this function mutates x. */ /* Duplication should occur before this is called. */ SEXP DropDims(SEXP x) { SEXP dims, dimnames, newnames = R_NilValue; int i, n, ndims; PROTECT(x); dims = getAttrib(x, R_DimSymbol); [... SNIP ] setAttrib(x, R_DimNamesSymbol, R_NilValue); setAttrib(x, R_DimSymbol, R_NilValue); setAttrib(x, R_NamesSymbol, newnames); [... SNIP ] return x; } Well, at least there's a warning comment with that one. But it does show .Call allows call by reference. Why allow it, though? DropDims could copy x, modify the copy, and return it. I wondered why DropDims bothers to return x at all. We seem to be using modify and return by reference there. I also wondered why x is PROTECTED, actually. Its an argument, wasn't it automatically protected? Is it no longer protected after the function starts modifying it? Here's an example with similar usage in Writing R Extensions, section 5.10.1 "Calling .Call". It protects the arguments a and b (needed ??), then changes them. 
#include <R.h> #include <Rdefines.h> SEXP convolve2(SEXP a, SEXP b) { R_len_t i, j, na, nb, nab; double *xa, *xb, *xab; SEXP ab; PROTECT(a = AS_NUMERIC(a)); /* PJ wonders, doesn't this alter "a" in calling code*/ PROTECT(b = AS_NUMERIC(b)); na = LENGTH(a); nb = LENGTH(b); nab = na + nb - 1; PROTECT(ab = NEW_NUMERIC(nab)); xa = NUMERIC_POINTER(a); xb = NUMERIC_POINTER(b); xab = NUMERIC_POINTER(ab); for(i = 0; i < nab; i++) xab[i] = 0.0; for(i = 0; i < na; i++) for(j = 0; j < nb; j++) xab[i + j] += xa[i] * xb[j]; UNPROTECT(3); return(ab); } Doesn't PROTECT(a = AS_NUMERIC(a)); alter the data structure "a" both inside the C function and in the calling R code as well? And, if a was PROTECTED automatically, could we do without that PROTECT()? pj -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
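To make the copy-versus-modify question concrete at the R level: ordinary R assignment is copy-on-modify, so callers are normally insulated from each other. The worry with .Call is exactly that C code writing through the data pointers bypasses that protection unless someone duplicates first. A small illustration (tracemem is only available when R was built with memory profiling):
```
x <- c(1, 2, 3)
y <- x              # no copy yet; x and y refer to the same vector
## tracemem(x)      # uncomment (if available) to watch the copy happen
y[1] <- 99          # copy-on-modify: R duplicates before writing
x                   # still 1 2 3
## C code reached via .Call that writes through REAL(a) gets no such
## automatic duplication -- hence Gentleman's advice to copy in R before
## invoking C code that might modify its arguments.
```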
[Rd] R-2.15.2 changes in computation speed. Numerical precision?
Speaking of optimization and speeding up R calculations... I mentioned last week I want to speed up calculation of generalized inverses. On Debian Wheezy with R-2.15.2, I see a huge speedup using a souped up generalized inverse algorithm published by V. N. Katsikis, D. Pappas, Fast computing of theMoore-Penrose inverse matrix, Electronic Journal of Linear Algebra, 17(2008), 637-650. I was so delighted to see the computation time drop on my Debian system that I boasted to the WIndows users and gave them a test case. They answered back "there's no benefits, plus Windows is faster than Linux". That sent me off on a bit of a goose chase, but I think I'm beginning to understand the situation. I believe R-2.15.2 introduced a tighter requirement for precision, thus triggering longer-lasting calculations in many example scripts. Better algorithms can avoid some of that slowdown, as you see in this test case. Here is the test code you can run to see: http://pj.freefaculty.org/scraps/profile/prof-puzzle-1.R It downloads a data file from that same directory and then runs some multiple imputations with the Amelia package. Here's the output from my computer http://pj.freefaculty.org/scraps/profile/prof-puzzle-1.Rout That includes the profile of the calculations that depend on the ordinary generalized inverse algorithm based on svd and the new one. See? The KP algorithm is faster. And just as accurate as Amelia:::mpinv or MASS::ginv (for details on that, please review my notes in http://pj.freefaculty.org/scraps/profile/qrginv.R). So I asked WIndows users for more detailed feedback, including sessionInfo(), and I noticed that my proposed algorithm is not faster on Windows--WITH OLD R! Here's the script output with R-2.15.0, shows no speedup from the KPginv algorithm http://pj.freefaculty.org/scraps/profile/prof-puzzle-1-Windows.Rout On the same machine, I updated to R-2.15.2, and we see the same speedup from the KPginv algorithm http://pj.freefaculty.org/scraps/profile/prof-puzzle-1-CRMDA02-WinR2.15.2.Rout After that, I realized it is an R version change, not an OS difference, I was a bit relieved. What causes the difference in this case? In the Amelia code, they try to avoid doing the generalized inverse by using the ordinary solve(), and if that fails, then they do the generalized inverse. In R 2.15.0, the near singularity of the matrix is ignored, but not in R 2.15.2. The ordinary solve is failing almost all the time, thus triggering the use of the svd based generalized inverse. Which is slower. The Katsikis and Pappas 2008 algorithm is the fastest one I've found after translating from Matlab to R. It is not so universally applicable as svd based methods, it will fail if there are linearly dependent columns. However, it does tolerate columns of all zeros, which seems to be the problem case in the particular application I am testing. I tried very hard to get the newer algorithm described here to go as fast, but it is way way slower, at least in the implementations I tried: ## KPP ## Vasilios N. Katsikis, Dimitrios Pappas, Athanassios Petralias. "An improved method for ## the computation of the Moore Penrose inverse matrix," Applied ## Mathematics and Computation, 2011 The notes on that are in the qrginv.R file linked above. 
The fact that I can't make that newer KPP algorithm go faster, although the authors show it can go faster in Matlab, leads me to a bunch of other questions and possibly the need to implement all of this in C with LAPACK or EIGEN or something like that, but at this point, I've got to return to my normal job. If somebody is good at R's .Call interface and can make a pure C implementation of KPP. I think the key thing is that with R-2.15.2, there is an svd-related bottleneck in the multiple imputation algorithms in Amelia. The replacement version of the function Amelia:::mpinv does reclaim a 30% time saving, while generating imputations that are identical, so far as i can tell. pj -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
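For readers who want the flavour of the fast pseudoinverse without fetching the linked scripts, here is a hedged sketch of the closed-form idea, not a verbatim port of Katsikis-Pappas: drop all-zero columns, apply (B'B)^{-1} B' to the full-column-rank remainder, and pad zero rows back in. The function name and tolerance are illustrative.
```
fastginv <- function(A, tol = 1e-12) {
    keep <- colSums(abs(A)) > tol              # columns that are not all zero
    B <- A[, keep, drop = FALSE]
    ## closed form for full column rank: B+ = (B'B)^{-1} B'
    Bplus <- solve(crossprod(B), t(B))
    out <- matrix(0, nrow = ncol(A), ncol = nrow(A))
    out[keep, ] <- Bplus
    out
}
## rough check against the SVD-based MASS::ginv on a matrix with a zero column
## A <- cbind(matrix(rnorm(200 * 10), 200, 10), 0)
## max(abs(fastginv(A) - MASS::ginv(A)))       # should be near machine precision
```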
Re: [Rd] R-2.15.2 changes in computation speed. Numerical precision?
On Thu, Dec 13, 2012 at 3:33 AM, Uwe Ligges wrote: > Long message, but as far as I can see, this is not about base R but the > contributed package Amelia: Please discuss possible improvements with its > maintainer. > Thanks for answering, but I'm really surprised by your answer. The package is a constant, but R got much slower between R-2.15.1 and R-2.15.2. The profile of functions called radically changed, svd gets called much more because solve() fails much more often. No change in R could account for it? I'm not saying R is wrong, it may be more accurate and better. After chasing the slowdown for a week, I need some comfort. Does the LINPACK -> LAPACK change play a role. The change I'm looking for is something that would substantially tune up mathematical precision so that matrices that did not seem singular before are now, in the eyes of functions like chol() and solve(). Whereas in R-2.15.1, a matrix can be inverted by solve(), for example, now R-2.15.2 rejects the matrix as singular. I will take the problem up with applications, of course. But you see how package writers might think its ridiculous. They argue, "I had a perfectly fine calculation against R-2.15.0 and R-2.15.1, and now with R-2.15.2 it takes three times as long? And you want me to revise my package?" Would you be persuaded there's an R base question if I showed you a particular matrix that can be decomposed or solved in R-2.15.1 but cannot be in R-2.15.2? I should have thought of that before, I suppose :) pj > Best, > Uwe Ligges > > > > On 12.12.2012 19:14, Paul Johnson wrote: >> >> Speaking of optimization and speeding up R calculations... >> >> I mentioned last week I want to speed up calculation of generalized >> inverses. On Debian Wheezy with R-2.15.2, I see a huge speedup using a >> souped up generalized inverse algorithm published by >> >> V. N. Katsikis, D. Pappas, Fast computing of theMoore-Penrose inverse >> matrix, Electronic Journal of Linear Algebra, >> 17(2008), 637-650. >> >> I was so delighted to see the computation time drop on my Debian >> system that I boasted to the WIndows users and gave them a test case. >> They answered back "there's no benefits, plus Windows is faster than >> Linux". >> >> That sent me off on a bit of a goose chase, but I think I'm beginning >> to understand the situation. I believe R-2.15.2 introduced a tighter >> requirement for precision, thus triggering longer-lasting calculations >> in many example scripts. Better algorithms can avoid some of that >> slowdown, as you see in this test case. >> >> Here is the test code you can run to see: >> >> http://pj.freefaculty.org/scraps/profile/prof-puzzle-1.R >> >> It downloads a data file from that same directory and then runs some >> multiple imputations with the Amelia package. >> >> Here's the output from my computer >> >> http://pj.freefaculty.org/scraps/profile/prof-puzzle-1.Rout >> >> That includes the profile of the calculations that depend on the >> ordinary generalized inverse algorithm based on svd and the new one. >> >> See? The KP algorithm is faster. And just as accurate as >> Amelia:::mpinv or MASS::ginv (for details on that, please review my >> notes in http://pj.freefaculty.org/scraps/profile/qrginv.R). >> >> So I asked WIndows users for more detailed feedback, including >> sessionInfo(), and I noticed that my proposed algorithm is not faster >> on Windows--WITH OLD R! 
>> >> Here's the script output with R-2.15.0, shows no speedup from the >> KPginv algorithm >> >> http://pj.freefaculty.org/scraps/profile/prof-puzzle-1-Windows.Rout >> >> On the same machine, I updated to R-2.15.2, and we see the same >> speedup from the KPginv algorithm >> >> >> http://pj.freefaculty.org/scraps/profile/prof-puzzle-1-CRMDA02-WinR2.15.2.Rout >> >> After that, I realized it is an R version change, not an OS >> difference, I was a bit relieved. >> >> What causes the difference in this case? In the Amelia code, they try >> to avoid doing the generalized inverse by using the ordinary solve(), >> and if that fails, then they do the generalized inverse. In R 2.15.0, >> the near singularity of the matrix is ignored, but not in R 2.15.2. >> The ordinary solve is failing almost all the time, thus triggering the >> use of the svd based generalized inverse. Which is slower. >> >> The Katsikis and Pappas 2008 algorithm is the fastest one I've found >> after translating from Matlab to R. It is not so universally >> applicable as svd based methods, it
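A self-contained probe of the precision question raised in the reply above, which anyone can run under both R versions: build a nearly singular matrix and see whether solve() and chol() accept it. Whether they do is exactly what may differ across versions and BLAS/LAPACK builds; this snippet (not taken from the thread) only makes the comparison easy.
```
set.seed(42)
x <- matrix(rnorm(25), 5, 5)
x[, 5] <- x[, 4] + 1e-12 * rnorm(5)        # two nearly collinear columns
kappa(x)                                   # condition number: very large
tryCatch(solve(x),
         error = function(e) conditionMessage(e))
tryCatch(chol(crossprod(x)),
         error = function(e) conditionMessage(e))
```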
Re: [Rd] R-2.15.2 changes in computation speed. Numerical precision?
On Thu, Dec 13, 2012 at 9:01 PM, Yi (Alice) Wang wrote: > I have also encountered a similar problem. My mvabund package runs much > faster on linux/OSX than on windows with both R/2.15.1 and R/2.15.2. For > example, with mvabund_3.6.3 and R/2.15.2, > system.time(example(anova.manyglm)) > Hi, Alice You have a different problem than I do. The change from R-2.15.1 to R-2.15.2 makes the program slower on all platforms. The slowdown that emerges in R-2.15.2 on all types of hardware concerns me. It only seemed like a "Windows is better" issue when all the Windows users who tested my program were using R-2.15.0 or R-2.15.1. As soon as they update R, then they have the slowdown as well. pj > on OSX returns > >user system elapsed > 3.351 0.006 3.381 > > but on windows 7 it returns > >user system elapsed > 13.13 0.00 13.14 > > I also used svd frequently in my c code though by calling the gsl functions > only. In my memory, I think the comp time difference is not that significant > with earlier R versions. So maybe it is worth an investigation? > > Many thanks, > Yi Wang > > > On Thu, Dec 13, 2012 at 5:33 PM, Uwe Ligges > wrote: >> >> Long message, but as far as I can see, this is not about base R but the >> contributed package Amelia: Please discuss possible improvements with its >> maintainer. >> >> Best, >> Uwe Ligges >> >> >> On 12.12.2012 19:14, Paul Johnson wrote: >>> >>> Speaking of optimization and speeding up R calculations... >>> >>> I mentioned last week I want to speed up calculation of generalized >>> inverses. On Debian Wheezy with R-2.15.2, I see a huge speedup using a >>> souped up generalized inverse algorithm published by >>> >>> V. N. Katsikis, D. Pappas, Fast computing of theMoore-Penrose inverse >>> matrix, Electronic Journal of Linear Algebra, >>> 17(2008), 637-650. >>> >>> I was so delighted to see the computation time drop on my Debian >>> system that I boasted to the WIndows users and gave them a test case. >>> They answered back "there's no benefits, plus Windows is faster than >>> Linux". >>> >>> That sent me off on a bit of a goose chase, but I think I'm beginning >>> to understand the situation. I believe R-2.15.2 introduced a tighter >>> requirement for precision, thus triggering longer-lasting calculations >>> in many example scripts. Better algorithms can avoid some of that >>> slowdown, as you see in this test case. >>> >>> Here is the test code you can run to see: >>> >>> http://pj.freefaculty.org/scraps/profile/prof-puzzle-1.R >>> >>> It downloads a data file from that same directory and then runs some >>> multiple imputations with the Amelia package. >>> >>> Here's the output from my computer >>> >>> http://pj.freefaculty.org/scraps/profile/prof-puzzle-1.Rout >>> >>> That includes the profile of the calculations that depend on the >>> ordinary generalized inverse algorithm based on svd and the new one. >>> >>> See? The KP algorithm is faster. And just as accurate as >>> Amelia:::mpinv or MASS::ginv (for details on that, please review my >>> notes in http://pj.freefaculty.org/scraps/profile/qrginv.R). >>> >>> So I asked WIndows users for more detailed feedback, including >>> sessionInfo(), and I noticed that my proposed algorithm is not faster >>> on Windows--WITH OLD R! 
>>> >>> Here's the script output with R-2.15.0, shows no speedup from the >>> KPginv algorithm >>> >>> http://pj.freefaculty.org/scraps/profile/prof-puzzle-1-Windows.Rout >>> >>> On the same machine, I updated to R-2.15.2, and we see the same >>> speedup from the KPginv algorithm >>> >>> >>> http://pj.freefaculty.org/scraps/profile/prof-puzzle-1-CRMDA02-WinR2.15.2.Rout >>> >>> After that, I realized it is an R version change, not an OS >>> difference, I was a bit relieved. >>> >>> What causes the difference in this case? In the Amelia code, they try >>> to avoid doing the generalized inverse by using the ordinary solve(), >>> and if that fails, then they do the generalized inverse. In R 2.15.0, >>> the near singularity of the matrix is ignored, but not in R 2.15.2. >>> The ordinary solve is failing almost all the time, thus triggering the >>> use of the svd based generalized inverse. Which is slower. >>> >>> The Katsiki
[Rd] Found explanation for R-2.15.2 slowdown in one case; caution for any users of La_chol
2 days ago, I posted my long message about the observed slowdown in a package between R-2.15.0 and R-2.15.2. Uwe Ligges urged me to make a self-contained R example. That was the encouragement I needed. I tracked the problem down to a failing use of a LAPACK routine. R's LAPACK C interface changed one variable in one function. But it turned out to be an important change. In case others have code that is behaving in unexpected ways, I'd urge package writers to double-check their usage of the Cholesky inverse. Here are details: In R 2.15.0, src/main/lapack.c, we have the prototype: SEXP La_chol (SEXP A) BUT in R 2.15.2, the prototype changed: SEXP La_chol (SEXP A, SEXP pivot) In the problem case I was studying, the effort to use La_chol was wrapped in a "try" catch framework, and when Cholesky failed, it fell back to a singular value decomposition. That's much slower, of course. Hence the program seemed slower under R-2.15.2, but it was really failing in a way that I had not noticed. pj -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
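The portable fix implied by this post is to stay on the documented R-level functions instead of .Call-ing base's own entry points, which are not part of the API and can change between releases, as La_chol did here. A sketch (the function name is made up): chol2inv() recovers the inverse from the Cholesky factor, so the pair replaces the old La_chol/La_chol2inv calls.
```
solveViaChol <- function(m) {
    R <- chol(m)        # errors if m is not positive definite, so wrap in try() if needed
    chol2inv(R)         # inverse of m computed from its Cholesky factor
}
## old, fragile pattern (do not copy): .Call("La_chol", m, PACKAGE = "base")
```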
Re: [Rd] Found explanation for R-2.15.2 slowdown in one case; caution for any users of La_chol
On Sat, Dec 15, 2012 at 11:44 AM, Uwe Ligges wrote: > > > On 14.12.2012 23:31, Paul Johnson wrote: >> > That is the reason why R CMD check gives a WARNING in the checks of your > package for quite some time now: > > > checking foreign function calls ... WARNING > > Foreign function calls with 'PACKAGE' argument in a base package: > .Call("La_chol", ..., PACKAGE = "base") > .Call("La_chol2inv", ..., PACKAGE = "base") > Packages should not make .C/.Call/.Fortran calls to base packages. They > are not part of the API, for use only by R itself and subject to change > without notice. > Just to be clear. Its not *my* package. It is a package we use A LOT I had to get to the bottom of the problem. I'm *very glad* to understand it. >From a user perspective, I wonder if CRAN could warn me somehow? Would it be a reasonable wishlist request to have the build process take a snapshot of warnings and then have the installer play them back to the user when the package is installed? That would have saved me some time this week. Now, when I install that, I don't see a warning. (see below). > install.packages("Amelia") Installing package(s) into ‘/home/pauljohn/R/x86_64-pc-linux-gnu-library/2.15’ (as ‘lib’ is unspecified) --- Please select a CRAN mirror for use in this session --- Loading Tcl/Tk interface ... done trying URL 'http://rweb.quant.ku.edu/cran/src/contrib/Amelia_1.6.3.tar.gz' Content type 'application/x-gzip' length 1060878 bytes (1.0 Mb) opened URL == downloaded 1.0 Mb * installing *source* package ‘Amelia’ ... ** package ‘Amelia’ successfully unpacked and MD5 sums checked ** R ** data ** inst ** preparing package for lazy loading ** help *** installing help indices ** building package indices ** installing vignettes ‘amelia.Rnw’ ** testing if installed package can be loaded * DONE (Amelia) The downloaded source packages are in ‘/tmp/Rtmpi0CLZe/downloaded_packages’ > > So time to fix that. See also: > http://cran.r-project.org/web/checks/check_results_Amelia.html > > > Best, > Uwe Ligges > -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] weird bug with parallel, RSQlite and tcltk
Happy new year. On Mon, Dec 31, 2012 at 12:08 PM, Karl Forner wrote: > Hello, > > I spent a lot of a time on a weird bug, and I just managed to narrow it down. > > In parallel code (here with parallel::mclappy, but I got it > doMC/multicore too), if the library(tcltk) is loaded, R hangs when > trying to open a DB connection. > I got the same behaviour on two different computers, one dual-core, > and one 2 xeon quad-core. > I believe this is the same problem discussed here: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15040 I am interested to know the long term implications of the discussion going on here, but can I interject to ask the short term question: With R-2.15.2, is there something I can do to defend myself before calling mclapply? Suppose I don't intentionally use tcltk, but some package I use has loaded it. Or somebody ran install.packages(). What to do? I thought this would work: detach(package:tcltk, unload=TRUE) I ran one test case where it seemed to work, but then started a several new sessions and got a big crashes. Example 1. detach(package:tcltk) without unload=TRUE hangs > library(parallel) > library(tcltk) Loading Tcl/Tk interface ... done > detach(package:tcltk) > example(mclapply) mclppl> ## No test: mclppl> simplify2array(mclapply(rep(4, 5), rnorm)) [,1] [,2] [,3] [,4][,5] [1,] -0.1499368 -0.7511576 0.7675234 -0.1324973 -0.01118655 [2,] -0.5989764 0.6605780 -0.9417879 0.2295157 -0.16750436 [3,] 1.4585402 0.2513995 1.6857897 -0.1200986 1.19094290 [4,] -0.1459071 -1.7570450 0.2968794 1.1964827 1.87066283 mclppl> # use the same random numbers for all values mclppl> set.seed(1) mclppl> simplify2array(mclapply(rep(4, 5), rnorm, mc.preschedule = FALSE, mclppl+ mc.set.seed = FALSE)) Example 2. detach with unload=TRUE dies > library(tcltk) Loading Tcl/Tk interface ... done > library(parallel) > detach(package:tcltk, unload=TRUE) > example(mclapply) called Tcl_FindHashEntry on deleted table Aborted Ouch. Back to the shell, completely unceremonious. Does it matter in which order they are loaded? > library(parallel) > library(tcltk) Loading Tcl/Tk interface ... done > detach(package:tcltk, unload=TRUE) > example(mclapply) called Tcl_FindHashEntry on deleted table Aborted In the short term, It would really help if we just had a foolproof way to close, kill, remove, delete, detach and otherwise block tcltk, its functions, its anything! pj > Here's the code: > > library(parallel) > library(RSQLite) > library(tcltk) > #unloadNamespace("tcltk") > > res <- mclapply(1:2, function(x) { > db <- DBI::dbConnect("SQLite", ":memory:") > }, mc.cores=2) > print("Done") > > When I execute it (R --vanilla < test_parallel_db.R), it hangs > forever, and I have to type several times CTRL+C to interrupt it. I > then get this message: > > Warning messages: > 1: In selectChildren(ac, 1) : error 'Interrupted system call' in select > 2: In selectChildren(ac, 1) : error 'Interrupted system call' in select > > Then, just remove library(tcltk), or uncomment > unloadNamespace("tcltk"), and it works fine again. > > I guess there's a bug somewhere, but where exactly ? > > Best, > > Karl Forner > > Further info: > > > R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows" > Copyright (C) 2012 The R Foundation for Statistical Computing > ISBN 3-900051-07-0 > Platform: x86_64-unknown-linux-gnu (64-bit) > > ubuntu 12.04 and 12.10 > > ubuntu package tk8.5 > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Paul E. Johnson Professor, Political Science Assoc. 
Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] best practice for packages using mclapply to avoid tcltk
On Sun, Feb 3, 2013 at 1:34 PM, Simon Urbanek wrote: > As Peter pointed out earlier, this is better addressed by disabling the > Tcl/Tk event loop in forked processes. > Dear Simon: I don't understand. Can you please try to say it again? I find Peter's comment (on Jan 3, 2013, thread title: weird bug with parallel, RSQlite and tcltk): "More likely, the wisdom of calling R_ProcessEvents and R_PolledEvents in parallel processes should be questioned. I tend to think that they should both be disabled completely conditionally on R_isForkedChild. At least in the Tk loop, some of the events are generated as responses to specific queries, and having one process ask for something and another one handling the reply, leaving the first one waiting indefinitely, is just Not Right." That suggested to me the problem is in R itself, or the tcltk package If package writers don't follow my suggestion, what do they think they should do instead? pj > Cheers, > Simon > > On Feb 2, 2013, at 5:02 PM, Paul Johnson wrote: > >> Dear R-devel friends: >> >> I'm back to bother you again about the conflict between mclapply and >> tcltk. I've been >> monitoring several packages that want to use mclapply to parallelize >> computations and >> need to figure out what should be done. >> >> It appears tcltk cannot be safely unloaded, so the best we can do is >> check for the presence of tcltk and stop if it is found before >> mclapply() is used. >> >> I wish you would please review my suggestion below. Maybe checkForTcltk() >> could be used in the parallel package. Otherwise, we are letting >> people run with scissors. >> >> There's a warning about this in ?mclapply >> >> It is _strongly discouraged_ to use these functions in GUI or >>embedded environments, because it leads to several processes >>sharing the same GUI which will likely cause chaos (and possibly >>crashes). Child processes should never use on-screen graphics >>devices.(Warning in ?mclapply) >> >> Bug report: (https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15040 ) >> >> By the way, what is "embedded environments" in ?mclapply >> >> ## I don't want to crash your system, but if you want to see a freeze-up: >> ## change 0 to 1 and run this: >> if (0){ >> library(parallel) >> library(tcltk) >> example(mclapply) >> } >> >> ## What are packagers supposed to do if they want to call mclapply? >> ## It appears to me the best a package can do is scan for tcltk and >> ## stop. Here's a function that does so. >> >> checkForTcltk <- function(){ >>if ("tcltk" %in% loadedNamespaces()){ >>stop("This function cannot be used because the R tcltk >>package is loaded. It is necessary to Close R, and re-run >>the program making sure that tcltk is never loaded before >>this function is called.") >>} >> } >> >> ## put that to use. >> MCLApply <- function(){ >>if (!require(parallel)) stop ("parallel wouldn't load") >>checkForTcltk() >>example(mclapply) >> } >> >> ## test that: >> >> checkForTcltk() >> MCLApply() >> >> library(tcltk) >> checkForTcltk() >> >> >> ## Why can't tcltk just be unloaded? I don't understand, but it is a deep >> ## problem. >> >> ## Consider the ominous warnings in R detach's help: >> ## >> ## "Unloading some namespaces has undesirable side effects: e.g. >> ## unloading ‘grid’ closes all graphics devices, and on most systems >> ## ‘tcltk’ cannot be reloaded once it has been unloaded and may crash >> ## R if this is attempted." (Note: section of ?detach). >> ## >> ## To be fair, that does not say unloading tcltk is fatal, but >> ## reloading it is fatal. 
And I've seen plenty of times when >> ## unloading it is fatal. >> >> ## Example 1. Crash on library.dynam.unload() >> detach("package:tcltk", unload = TRUE) >> library.dynam.unload("tcltk", system.file(package = "tcltk")) >> >> ## Output >> ## > library.dynam.unload("tcltk", system.file(package = "tcltk")) >> ## > >> ## *** caught segfault *** >> ## address 0x7f69c9d99580, cause 'memory not mapped' >> >> ## Possible actions: >> ## 1: abort (with core dump, if enabled) >> ## 2: normal R exit >> ## 3: exit R without saving workspace >> ## 4: exit R saving workspace >>
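A gentler variant of the guard above, sketched here rather than taken from the original message: instead of stopping, fall back to serial lapply() when tcltk is already loaded, so callers still get an answer, just without the speedup. The function name is hypothetical.
```
safeMclapply <- function(X, FUN, ..., mc.cores = 2L) {
    if ("tcltk" %in% loadedNamespaces()) {
        warning("tcltk is loaded; running serially to avoid fork problems")
        return(lapply(X, FUN, ...))
    }
    parallel::mclapply(X, FUN, ..., mc.cores = mc.cores)
}
```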
[Rd] ifelse can't return a list? Please explain (R-2.15.3)
I hope you are doing well. For me, this was an unexpected problem. I've hoped for quite a few wrong things today, but I'm only asking you about this one. Why does ifelse(1, list(a, b, c), list(x, y, z)) return a list with only a, not list(a, b, c) as I hoped. I wish it would either cause an error or return the whole list, not just the first thing. Working example: > x <- 1 > y <- 2 > z <- 3 > a <- 4 > b <- 5 > c <- 6 > list(x,y,z) [[1]] [1] 1 [[2]] [1] 2 [[3]] [1] 3 > list(a,b,c) [[1]] [1] 4 [[2]] [1] 5 [[3]] [1] 6 > ifelse(1, list(a,b,c), list(x,y,z)) [[1]] [1] 4 > ifelse(0, list(a,b,c), list(x,y,z)) [[1]] [1] 1 > sessionInfo() R version 2.15.3 (2013-03-01) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] rockchalk_1.5.5.10 car_2.0-16 nnet_7.3-5 MASS_7.3-23 loaded via a namespace (and not attached): [1] compiler_2.15.3 tools_2.15.3 I realize I can code around this, but I'm just curious about why ifelse hates me so much :( > if (1) myvar <- list(a, b, c) else myvar <- list(x, y, z) > myvar [[1]] [1] 4 [[2]] [1] 5 [[3]] [1] 6 > myvar <- if (1) list(a, b, c) else list(x, y, z) > myvar [[1]] [1] 4 [[2]] [1] 5 [[3]] [1] 6 -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
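What is happening above: ifelse() is vectorized over its first argument, and its value always has the length and shape of `test`, assembled element-wise from `yes` and `no`. With a length-one test you therefore get back only the first element of the chosen list. A short demonstration, reusing the variables from the message:
```
a <- 4; b <- 5; c <- 6; x <- 1; y <- 2; z <- 3
ifelse(c(TRUE, FALSE, TRUE), list(a, b, c), list(x, y, z))
## returns list(4, 2, 6): positions 1 and 3 come from `yes`, position 2 from `no`
## For a single scalar condition, the `if () ... else ...` form shown at the
## end of the message is the right tool.
```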
[Rd] Patch proposal for R style consistency (concerning deparse.c)
Hello, everybody. I recognize I'm asking you to deal with a not-very-important problem. But it's important to me :) I've noticed a little inconsistency in the print.function() output. I traced the matter to deparse.c, for which I attach a patch that addresses 3 separate things. There's one important thing that I'm willing to stand behind and 2 little things that I think would be good, but won't argue too much about. I ask if this is a legitimate wishlist item in your bug tracker. Here are the changes I'm trying to achieve. 1. I want "} else {" on one line, and not the current layout, which puts "}" and "else {" on separate lines. 2 & 3. I want to omit space after if and for. Since if and for are functions in R, not keywords, I suggest that there should not be a space before the opening parentheses. Item 1 brings the output into line with the advice of the help page for "if". The output looks more like a randomly selected R file from the R source or the recommended packages. I'm teaching new users that they should study the R source code itself as a guide for programming (http://pj.freefaculty.org/R/Rstyle.pdf). My argument there would go more smoothly if print.function obeyed the "} else {" policy that most of your source code and packages adhere to. We do see r-help questions about the problem that user code can't be run line-by-line, even when it is styled like the print.function() output. As you know, this is discussed in R Extensions Section 3.1 on tidying R code. In my opinion, the current approach makes code less tidy, rather than more. Consider an excerpt from the top of predict.glm.R. This R code is correctly formatted, it is consistent with most of the rest of R: if (!se.fit) { ## No standard errors if(missing(newdata)) { pred <- switch(type, link = object$linear.predictors, response = object$fitted.values, terms = predict.lm(object, se.fit=se.fit, scale = 1, type="terms", terms=terms) ) if(!is.null(na.act)) pred <- napredict(na.act, pred) } else { But the output from '> predict.glm' is not as good: if (!se.fit) { if (missing(newdata)) { pred <- switch(type, link = object$linear.predictors, response = object$fitted.values, terms = predict.lm(object, se.fit = se.fit, scale = 1, type = "terms", terms = terms)) if (!is.null(na.act)) pred <- napredict(na.act, pred) } else { I have applied the attached patch with R 3.0.0. I don't see any damage from it, just tidier output: > predict.glm function (object, newdata = NULL, type = c("link", "response", "terms"), se.fit = FALSE, dispersion = NULL, terms = NULL, na.action = na.pass, ...) { type <- match.arg(type) na.act <- object$na.action object$na.action <- NULL if(!se.fit) { if(missing(newdata)) { pred <- switch(type, link = object$linear.predictors, response = object$fitted.values, terms = predict.lm(object, se.fit = se.fit, scale = 1, type = "terms", terms = terms)) if(!is.null(na.act)) pred <- napredict(na.act, pred) } else { pred <- predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == "link", "response", type), terms = terms, na.action = na.action) switch(type, response = { pred <- family(object)$linkinv(pred) }, link = , terms = ) } } else { Of course, you know much better than I do what else might break as a result of this change. In case the email list processor scrubs the patch, please try here: http://pj.freefaculty.org/scraps/deparse-pj-20130417.patch pj -- Paul E. Johnson Professor, Political Science Assoc.
Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
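The line-by-line problem mentioned in the proposal, reduced to a runnable parse() example: once the brace that closes the if-body ends a top-level line, the `else` that starts the next line is a syntax error, which is why the "} else {" layout matters for code that gets pasted into the console or round-tripped through deparse.
```
bad <- "x <- 1
if (x > 0) {
  y <- 'pos'
}
else {
  y <- 'neg'
}"
try(parse(text = bad))                        # unexpected 'else'
good <- sub("}\nelse", "} else", bad, fixed = TRUE)
eval(parse(text = good))
y                                             # "pos"
```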
Re: [Rd] Patch proposal for R style consistency (concerning deparse.c)
OK, I concede that. Now, how about "} else {" I will provide patch that does only that change. ? On Thu, Apr 18, 2013 at 3:05 AM, peter dalgaard wrote: > > On Apr 18, 2013, at 05:39 , Paul Johnson wrote: > >> 2 & 3. I want to omit space after if and for. Since if and for are >> functions in R, not keywords, I suggest that there should not be a >> space before the opening parentheses. > > Wrong. They are part of language constructs (and they _are_ keywords, not > names, that's why ?for won't work). The function calls are `if`(fee, {foo}, > {fie}) and something rebarbative for `for`(). > > Besides, both constructs are harder to read without the spaces. > > -- > Peter Dalgaard, Professor > Center for Statistics, Copenhagen Business School > Solbjerg Plads 3, 2000 Frederiksberg, Denmark > Phone: (+45)38153501 > Email: pd@cbs.dk Priv: pda...@gmail.com > -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel