Martin Maechler writes:

> There may be one small problem: IIUC, the wayback machine is a
> +- private endeavor and really great and phantastic but it does
> need (US? tax deductible) donations, https://archive.org/donate/,
> to continue thriving.
> This makes me hesitate a bit to link to it within the "base R"
> documentation. But that may be wrong -- and I should really use
> it to *help* the project ?

I agree that the Wayback Machine is a private endeavor. After reviewing other base library documentation, I have concluded that referencing it in the base documentation would nonetheless be consistent with current practice. I share your concern regarding the support of other institutions, and I have found some references that are more problematic to me than the one of present interest. I would thus support an initiative to consider the social implications of the different references and to adjust the references accordingly.

Below I start by making a distinction between two types of references that I think should be treated differently in terms of your concern. Next, I assess whether there is a precedent for inclusion of references to private publishers, as in the present patch; I conclude that there is such a precedent. Then I present my opinion regarding the present patch. Finally, I present some other considerations that I find relevant to the discussion.

Distinguishing between two link types
-------------------------------------

For discussion of this issue, I think it is helpful to distinguish between references to sources and references to other materials. In the case of references to sources, there is little choice but to reference the publisher, even though the overwhelming majority of referenced publishers are private companies that impose restrictive licenses on their journals and books and cannot reasonably be trusted to maintain access to the materials or availability of webpages. With other references, it is possible to replace the reference with a different document that contains similar information.

For example, if a function implements a method based on a particular journal article, that article's citation needs to stay, even if the journal is published by a private institution. On the other hand, if the reference just provides context or suggestions related to usage, then the reference is provided just as information and can be replaced.

Precedent for inclusion of private non-source materials
-------------------------------------------------------

The dead link of interest is only informational, not a citation of a source, and so it could be replaced. I therefore assessed whether including it would match current practice, and I concluded that there is substantial precedent for inclusion of private reference materials other than strict sources.

Not having access to a good library at the moment, I have limited my research on this matter to website references. In SVN revision 73164, \url calls are distributed among 148 files, from 1 call to 13 calls per file, with mean of 1.75 and median of 1.

    grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | sort -n

Total number of library documentation files is 1419.

    find src/library/ -name \*.Rd | wc -l

I randomly selected 20 matching files for further study.

    % grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | sort -R | head -n 20 | tee /tmp/rd
          2 src/library/grDevices/man/pdf.Rd
          1 src/library/base/man/taskCallbackNames.Rd
          1 src/library/stats/man/shapiro.test.Rd
          1 src/library/tcltk/man/TkWidgets.Rd
          2 src/library/graphics/man/assocplot.Rd
          1 src/library/base/man/sprintf.Rd
          6 src/library/base/man/regex.Rd
          3 src/library/datasets/man/HairEyeColor.Rd
          1 src/library/stats/man/optimize.Rd
          1 src/library/datasets/man/UKDriverDeaths.Rd
          1 src/library/utils/man/object.size.Rd
          1 src/library/utils/man/unzip.Rd
          1 src/library/base/man/dcf.Rd
          1 src/library/base/man/DateTimeClasses.Rd
          3 src/library/stats/man/GammaDist.Rd
          2 src/library/utils/man/maintainer.Rd
          2 src/library/base/man/libcurlVersion.Rd
          2 src/library/base/man/eigen.Rd
          2 src/library/base/man/chol2inv.Rd
          1 src/library/tools/man/update_pkg_po.Rd

From these 20 I composed a table with statistical unit of \url call and with variables filename, url, type of reference, and type of publisher. The following commands were helpful.

    sed -e 's/^[ 0-9]*//' /tmp/rd | xargs grep \\\\url | sed -e 's/$/::/' -e 's/:.*\\url./:/' > urls.csv
    sed 's/^[ 0-9]*//' /tmp/rd | xargs grep -A5 -B5 \\\\url | less

I realized that I need to be a bit more precise about what I mean by a "source". I wound up grouping the type of reference for \url calls into the following categories.

1 Necessary sources, such as the specific file from which an algorithm or dataset was copied (as in stats/man/optimize.Rd)
2 Upstream documentation for bound libraries (tcltk/man/TkWidgets.Rd)
3 Extra information, such as tutorials on portable programming referenced in base/man/sprintf.Rd
4 Ambiguous, such as a general introduction on the topic that may have been used during the development of the function or may have been added just as further documentation (as in grDevices/man/pdf.Rd). These references did not include the date on which the webpage was accessed, so they aren't clear enough to count as source references even if they were in fact used during the development of the function.
5 Comments (stats/man/shapiro.test.Rd) and duplicates (stats/man/GammaDist.Rd)

Earlier, I distinguished between references to sources and references to other materials. I think that the first and second categories should be considered the source type references and the third and fourth should be considered the non-source type references.

I separated publisher types into the following

* academic (I think that they were all public universities, but I did not check very thoroughly.)
* government
* private
* R project

Resulting categorization was as follows (attached urls.csv and urls.r).

                 publisher
    source        academic government private r-project
      1 necessary        8          0       5         3
      2 upstream         3          0       6         0
      3 extra            0          0       1         1
      4 ambiguous        0          1       4         0
      5 ignore           0          1       1         1

The references of concern are the replaceable references (types 3 and 4) to private publishers, which account for 5 out of the 35 \url calls and 5 of the 20 files.
To fit the table in an email, I have truncated the URLs to their domains.

    filename                              domain
    src/library/grDevices/man/pdf.Rd      en.wikipedia.org
    src/library/base/man/sprintf.Rd       developer.r-project.org
    src/library/base/man/regex.Rd         perldoc.perl.org
    src/library/utils/man/object.size.Rd  en.wikipedia.org
    src/library/stats/man/GammaDist.Rd    en.wikipedia.org

Note that the sprintf documentation is in fact a link to an r-project page (https://developer.r-project.org/Portability.html) that has lots of other links on it, including links to fortran.com, en.wikipedia.org, pubs.opengroup.org, and people.redhat.com.

So we see that several \url calls reference private publishers even though the links could be replaced with alternatives. By my categorization, 5 out of 20 sampled files (95% confidence interval of 9 files to 65 files of the population of 148 matching files, based on a bespoke t-test with finite population correction because I had trouble compiling the sampling package, see urls.r) include a replaceable reference to a private publisher.

I briefly looked through the full population of Rd files, and I got the impression that this sort of private reference may be restricted to just a few publishers, with Wikimedia possibly being the most prominent.

To summarize, several other documentation files already reference private publishers, and the set of publishers is small enough that it would be feasible to review each publisher in order to consider whether the references to it should be replaced with alternatives.

Opinion regarding the present patch
-----------------------------------

I think that linking to the Wayback Machine, by the Internet Archive, is consistent with the practice in many other base libraries and that it is thus acceptable.

At present, base makes no references to the Wayback Machine but makes several references to English Wikipedia.
An even more consistent option is thus to link to the English Wikipedia article for Great Basin bristlecone pine (https://en.wikipedia.org/wiki/Pinus_longaeva) or for Methuselah (https://en.wikipedia.org/wiki/Methuselah_(tree)) instead of the Wayback Machine page that I reference in the patch. (Note that the treering data are from a tree in the Methuselah Walk but not from Methuselah itself.)

On the other hand, if we are to avoid referencing private institutions unnecessarily, we should create a broader initiative to replace private non-source references in base documentation. For me, more worrisome than references to the Open Group or to Wikimedia are the references to the private company GitHub, as in utils/man/tar.Rd; aside from the social implications of supporting a private company whose repository hosting service has been assessed by the Free Software Foundation as unethical (https://www.gnu.org/software/repo-criteria-evaluation.html), I do not even trust in the long-term availability of its webpages.

And of course, if we do not correct the dead link in treering, I think we should remove the dead link. We can optionally replace it with a very short description of Great Basin bristlecone pines.

Further discussion
------------------

RESTRICTION CRITERIA

If R is to have formal restrictions as to what sorts of references may be included in the base documentation, I think that private versus public is not an appropriate criterion. To start, private universities may be similarly acceptable to public universities, and certain government institutions may be problematic. Also, most of the present references to software specifications refer to private institutions. Considering the goals of the R project and its status as a component of the GNU project, I think that it would make more sense for the criteria to be based on the license of the referenced work, rather than on characteristics of the legal entity that has published it.

AVOIDING LINKS

For practical reasons, I think it would be nice to avoid the sort of link that we are presently discussing and instead to distribute the contents of that link. If the contents are incorporated into R, then dead links are not an issue, we are free to edit the extra documentation that otherwise would have been linked, and users can view the documentation without an internet connection. I think that the datasets documentation, in particular, could benefit substantially from a few sentences of context being added to each documentation file.

That said, it is possible that this would be enough work that it would not be worthwhile; this extra documentation could easily become much larger than the rest of the R source code, especially if images are included as in the case of the Methuselah Walk photographs, so implementing this would be more involved than simply obtaining acceptable licenses on the extra documentation and copying passages to Rd files.

filename,url,source,publisher
src/library/grDevices/man/pdf.Rd,https://en.wikipedia.org/wiki/CMYK_color_model#Mapping_RGB_to_CMYK,4,private
src/library/grDevices/man/pdf.Rd,https://www.r-project.org/doc/Rnews/Rnews_2006-2.pdf,1,r-project
src/library/base/man/taskCallbackNames.Rd,https://developer.r-project.org/TaskHandlers.pdf,3,r-project
src/library/stats/man/shapiro.test.Rd,http://lib.stat.cmu.edu/apstat/R94,5,private
src/library/tcltk/man/TkWidgets.Rd,http://www.tkdocs.com,2,private
src/library/graphics/man/assocplot.Rd,http://www.math.yorku.ca/SCS/sugi/sugi17-paper.html,1,academic
src/library/graphics/man/assocplot.Rd,http://epub.wu.ac.at/dyn/openURL?id=oai:epub.wu-wien.ac.at:epub-wu-01_8a1,1,academic
src/library/base/man/sprintf.Rd,https://developer.r-project.org/Portability.html,3,private
src/library/base/man/regex.Rd,http://www.pcre.org,2,private
src/library/base/man/regex.Rd,http://www.pcre.org/original/doc/html/,2,private
src/library/base/man/regex.Rd,http://laurikari.net/tre/documentation/regex-syntax/,2,private
src/library/base/man/regex.Rd,http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html,1,private
src/library/base/man/regex.Rd,http://www.pcre.org/original/pcre.txt,1,private
src/library/base/man/regex.Rd,http://perldoc.perl.org/perlre.html,4,private
src/library/datasets/man/HairEyeColor.Rd,http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas,1,academic
src/library/datasets/man/HairEyeColor.Rd,http://www.math.yorku.ca/SCS/sugi/sugi17-paper.html,1,academic
src/library/datasets/man/HairEyeColor.Rd,http://www.math.yorku.ca/SCS/Papers/asa92.html,1,academic
src/library/stats/man/optimize.Rd,http://www.netlib.org/fmm/fmin.f,1,academic
src/library/datasets/man/UKDriverDeaths.Rd,http://www.ssfpack.com/dkbook/,1,private
src/library/utils/man/object.size.Rd,https://en.wikipedia.org/wiki/Binary_prefix,4,private
src/library/utils/man/unzip.Rd,http://zlib.net,1,private
src/library/base/man/dcf.Rd,https://www.debian.org/doc/debian-policy/ch-controlfields.html,1,private
src/library/base/man/DateTimeClasses.Rd,https://www.r-project.org/doc/Rnews/Rnews_2001-2.pdf,1,r-project
src/library/stats/man/GammaDist.Rd,https://en.wikipedia.org/wiki/Incomplete_gamma_function,4,private
src/library/stats/man/GammaDist.Rd,http://dlmf.nist.gov/8.2#i,4,government
src/library/stats/man/GammaDist.Rd,http://dlmf.nist.gov/,5,government
src/library/utils/man/maintainer.Rd,https://stat.ethz.ch/pipermail/r-help/2010-February/230027.html,1,r-project
src/library/utils/man/maintainer.Rd,http://n4.nabble.com/R-help-question-How-can-we-enable-useRs-to-contribute-corrections-to-help-files-faster-tp1572568p1572868.html,5,r-project
src/library/base/man/libcurlVersion.Rd,http://curl.haxx.se/docs/sslcerts.html,2,private
src/library/base/man/libcurlVersion.Rd,http://curl.haxx.se/docs/ssl-compared.html,2,private
src/library/base/man/eigen.Rd,http://www.netlib.org/lapack,1,academic
src/library/base/man/eigen.Rd,http://www.netlib.org/lapack/lug/lapack_lug.html,2,academic
src/library/base/man/chol2inv.Rd,http://www.netlib.org/lapack,1,academic
src/library/base/man/chol2inv.Rd,http://www.netlib.org/lapack/lug/lapack_lug.html,2,academic
src/library/tools/man/update_pkg_po.Rd,https://www.stats.ox.ac.uk/pub/Rtools/goodies/gettext-tools.zip,2,academic

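# Read the categorized \url calls (one row per call).
urls <- read.csv('urls.csv')

# Cross-tabulate type of reference against type of publisher.
urls.tab <- function(urls) {
  urls$source <- factor(urls$source)
  levels(urls$source) <- paste(levels(urls$source),
    c('necessary', 'upstream', 'extra', 'ambiguous', 'ignore'))
  print(table(urls[c('source', 'publisher')]))
}
urls.tab(urls)

# Replaceable references (types 3 and 4) to private publishers.
interesting <- (urls$source==3|urls$source==4) & (urls$publisher=='private')
urls.interesting <- urls[interesting,1:2]

# t-based 95% interval for the number of affected files in the
# population, with finite population correction.
N <- 148                                       # matching files in the population
n <- length(unique(urls$filename))             # sampled files; works for strings and factors
x <- length(unique(urls.interesting$filename)) # sampled files with a reference of concern
p <- x/n
fpc <- sqrt((N-n)/(N-1))
se <- (sqrt(p*(1-p))/sqrt(n)) * fpc
t <- qt(1-.025, n-1)
print(round(N*(p+c(-1,1)*t*se)))
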
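The interval computed at the end of urls.r can also be wrapped in a function, which makes it easier to check against the figures quoted in the message. This is only a minimal sketch and not a replacement for the sampling package; the name fpc.interval and its arguments are hypothetical and introduced here purely for illustration.

```r
# Hypothetical helper (not part of urls.r): t-based interval for a
# population count, with finite population correction.
#   x: sampled units with the property of interest
#   n: sample size
#   N: population size
fpc.interval <- function(x, n, N, level = 0.95) {
  p <- x / n                              # sample proportion
  fpc <- sqrt((N - n) / (N - 1))          # finite population correction
  se <- sqrt(p * (1 - p) / n) * fpc       # standard error of the proportion
  crit <- qt(1 - (1 - level) / 2, n - 1)  # two-sided t critical value
  round(N * (p + c(-1, 1) * crit * se))   # scale the bounds to a count of files
}

# Figures from the message: 5 of 20 sampled files, 148 matching files.
print(fpc.interval(x = 5, n = 20, N = 148))
```

With these inputs the bounds come out as 9 and 65 files, which agrees with the interval reported in the message.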
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel