?read.csv leads to the help page for read.table, of which read.csv is a special case. In the description, the first argument is called 'file', which suggests to the unwary reader that it can only name a file. But if you persist and read the description of the arguments, you learn that "file can be a readable text-mode connection" and in the next paragraph "file can also be a complete URL."
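To make the "connection" part concrete, here is a minimal sketch of read.csv() being handed a text-mode connection instead of a file name. The CSV text and its values are invented for illustration (only the column names echo the data discussed in this thread); a complete URL string would go in the same argument position, but it is left out so the example runs offline.

```r
# Made-up CSV content; the values are purely illustrative.
csv_text <- c("Group,BMGain",
              "Light,9.9",
              "Dark,5.0")

# 'file' receives a text-mode connection rather than a file name.
con <- textConnection(csv_text)
d <- read.csv(con)
close(con)   # the connection was opened by us, so we close it ourselves

str(d)
```

The same call shape works with file("some-local-file.csv") or url("https://...") in place of textConnection(), where both names are placeholders.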
Chances are, I'm going to make mistakes. I'm going to want to fix my script and try again. And again. And again. If I download a file, all these attempts go to my local SSD. If I use a URL, all these attempts reach out over the network, and if my connection goes down (I've had a pile of washing fall on my modem, which then overheated and shut down) I can't continue.

When is it a good idea to use read.table or any of its wrappers with a URL as first argument?

On Tue, 18 Mar 2025 at 03:39, Kevin Zembower <ke...@zembower.org> wrote:
>
> Hello, all, thanks, again for the detailed comments and suggestions.
> This is one reason I really enjoy this group: the lively and
> knowledgeable discussions that questions generate. I'm a little
> hesitant that a future reader, just skimming the subject lines, will
> miss the true breadth of this discussion.
>
> I'd like to clarify my use of the tidyverse library. I used it so that
> I could use read_csv(). I was under the mistaken understanding that
> read.csv() would not fetch a file from the internet, using a
> 'https://...' URL, that only read_csv() would do that. I'm pretty sure
> that at some point in the 15 years that I've been aware of and using R,
> read.csv() would not do that. I didn't do a lot with R during the time
> that the tidyverse was developed and became popular; I had to learn it
> fresh just a few years ago, when I kind of came back to R.
>
> In this project, I was doing a 'data exploration' of sorts. I wasn't
> concerned with optimizing anything but getting a correct answer. I
> didn't explore whether other functions would also fetch internet files,
> I didn't compare the execution speeds of apply() versus rowMeans()
> (although, since I didn't know about rowMeans(), I'm glad Tim mentioned
> it; I'll be sure to file this away for the next time it comes up).
>
> Almost all respondents to my original question about sample() pointed
> out my example didn't use the 'size=' parameter.
> I constructed my first
> MWE (beyond the first one-line snippet I originally posted) to explain
> how I couldn't just use 'size' to fill a matrix, like the bootstrap
> model I was working from, because I needed permutations of the original
> dataset. (I had never heard of a permutation test. I think that's
> beyond the scope of Stats101.) I wanted to make sure that anyone who
> wished to participate in this discussion could do it in the easiest way
> possible, without loading a library that they would otherwise have no
> use for.
>
> I asked for any suggestions for my R coding style, and I appreciate all
> the respondents who went way above the call and researched the sources
> I was working from and made suggestions and improvements. I'm still
> reading through these to fully understand them, but I'm very grateful
> that you took the time to try to help me.
>
> Thank you all, again, for your efforts, and sharing your knowledge and
> experience with all of us on this list.
>
> -Kevin
>
> On Sun, 2025-03-16 at 16:04 -0700, Jeff Newmiller wrote:
> > The original question was about sample, a base R function. Dragging
> > in tidyverse along the way could be regarded as complicating the
> > question unnecessarily, but in some cases there can be undesirable or
> > simply unexpected interactions between functions drawn from different
> > packages. Such complications can turn out to be intrinsic to the
> > question being posed, in which case it will be necessary to have
> > things in their example just as they are in the original environment.
> > In this case that does not seem to be the case... and OP may get
> > fewer responses to their question because some people don't keep
> > tidyverse installed and may not want to add it just to answer a
> > question... leading to fewer responses. In some cases no one may
> > respond, and OP would be left with no help.
> >
> > In this case it all turned out fine... so this debate is getting
> > stale, and there are reasons why including or excluding tidyverse
> > might have been better. But in general, building a true Minimum
> > Reproducible Example (MRE) will help communicate most clearly
> > (consider using the reprex package to verify the example) and
> > minimizing unnecessary packages (reprex can help paring things down)
> > may avoid the dreaded "crickets" on the mailing list in the future.
> > And sometimes building an MRE will help OP answer their own question.
> >
> > On March 16, 2025 12:52:07 PM PDT, avi.e.gr...@gmail.com wrote:
> > > Thanks for the clarification, Richard, as I clearly made the wrong
> > > guess of what you meant.
> > >
> > > Your idea or objection was that you see the included read.csv
> > > function as adequate and see no incentive to use read_csv, and
> > > especially not if that is the only function being used. I only
> > > partially agree.
> > >
> > > As usual, I look at things from multiple overlapping perspectives.
> > >
> > > There are actually more ways to read in a CSV or other such data
> > > files including fread from data.table and another called feather
> > > and other base functions. Some people choose ONE and use it
> > > whenever possible and your choice might be the base version and
> > > mine would not be.
> > >
> > > So, one perspective is that the base version is in some sense
> > > pre-loaded and any other must be pre-downloaded and added with a
> > > library statement. I am not sure how much that costs or if the base
> > > version is also only partially preloaded and gotten only as needed.
> > > But it can be a valid concern, especially as some people write
> > > defensive code so that if it is not already installed, they first
> > > fetch it.
> > >
> > > Another perspective, especially for larger files, is speed. One
> > > article I have suggests the base version is quite SLOW.
> > >
> > > https://www.r-bloggers.com/2017/04/fast-data-loading-from-files-to-r/
> > >
> > > But that was in 2017, and using such concerns, you may be better
> > > off with data.table ...
> > >
> > > Another issue is that some people have found it handy to deal with
> > > tibbles rather than unenhanced data.frames and if you read it in
> > > using the base, you may end up converting it later so the
> > > underscore version saves a small step. The OP clearly does not need
> > > this as no other tidyverse functions are used. Others may care.
> > >
> > > But related to this are things like not converting strings to
> > > factors by default or playing around with column names. It can be
> > > time consuming to read in data and then use multiple commands to
> > > change it to the way you want it, such as undoing the factors
> > > (albeit you can just set the default in the base too) or converting
> > > a column it guessed was integer to Boolean and so on.
> > >
> > > And I note I have used other features that I like and base does not
> > > support. But, again, if the OP does not have any plans on using any
> > > such features or defaults and is reading fairly small amounts of
> > > data and running it once, there is no special reason to make it
> > > worth leaving the base. If they may later want to use additional
> > > tidyverse functionality, switching to use this by default may be
> > > wise.
> > >
> > > My philosophy is to keep things as simple as reasonable but no
> > > simpler than reasonable. In programming languages, it is to use a
> > > simple consistent set of tools that gets me what I want with
> > > accuracy and thus it can be simpler to use the tidyverse a lot as
> > > my default. To each their own.
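On the point above about factors and column types, here is a small sketch of how base read.csv() can be told these things at read time, so less fixing-up is needed afterwards. The CSV text and the column names (id, group, treated) are invented for illustration.

```r
# Invented data: an id with leading zeros, a text column, and a 0/1 flag.
csv_text <- "id,group,treated
001,Light,1
002,Dark,0"

d <- read.csv(text = csv_text,
              stringsAsFactors = FALSE,          # already the default since R 4.0.0
              colClasses = c(id = "character"))  # keep the leading zeros in id

# The 0/1 column was read as integer; convert it explicitly.
d$treated <- as.logical(d$treated)

str(d)
```

Columns not named in colClasses are still type-guessed as usual, which is why only the flag column needs a follow-up conversion here.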
> > >
> > > -----Original Message-----
> > > From: Richard O'Keefe <rao...@gmail.com>
> > > Sent: Sunday, March 16, 2025 7:53 AM
> > > To: avi.e.gr...@gmail.com
> > > Cc: Kevin Zembower <ke...@zembower.org>; r-help@r-project.org
> > > Subject: Re: [R] What don't I understand about sample()?
> > >
> > > I think you think I mistook read_csv for read.csv. Not so. The
> > > point was that base R with no additional packages loaded already
> > > contains a CSV reader which is entirely adequate for the task at
> > > hand. When you are already struggling with the basics of a system
> > > (like how often and when arguments are evaluated), I think it's
> > > wisest to stick with basic tools. When they taught me carpentry at
> > > school, they had me on chisels before getting to lathes (and in
> > > fact never did get to lathes at my school).
> > >
> > > Sure, R isn't perfect. But whenever I open the SAS manuals I
> > > remember that things could be much worse.
> > >
> > > On Sun, 16 Mar 2025 at 17:51, <avi.e.gr...@gmail.com> wrote:
> > > >
> > > > Richard,
> > > >
> > > > The function with a period as a separator that you cite,
> > > > read.csv, is part of normal base R.
> > > >
> > > > We have been discussing a different function named just a tad
> > > > differently that uses an underscore as a separator, read_csv that
> > > > is similar but has some changes in how it works and the options
> > > > supported and is considered part of the tidyverse grouping of
> > > > packages and can also be gotten more compactly by importing
> > > > package "readr" ...
> > > >
> > > > The OP, for reasons of their own, wanted to use read_csv and did
> > > > not want or need anything else in the related packages.
> > > >
> > > > Of course, nobody is required to use other packages, albeit, as
> > > > you noted, many packages you may choose to use have some
> > > > dependencies on others you don't.
> > > >
> > > > Like many good things, added functionality available to you does
> > > > add complexity and room for failures. But when a package is
> > > > useful enough to be very useful, it can develop enough momentum
> > > > that some functionality might well be a good idea to move into
> > > > base R. As an example I already mentioned, of the various pipe
> > > > implementations, a version has been added to base R and I suspect
> > > > many older packages, including in the tidyverse, can adjust their
> > > > code in new releases to use it but with CARE. Anyone still using
> > > > older versions of R will experience failures in such a scenario.
> > > >
> > > > Luckily, many uses within a package are likely to be safe if done
> > > > properly. Can anyone share if any such methods are in use?
> > > >
> > > > I mean, as an example, could a package early on check if the R
> > > > version being used is later than the introduction, or some other
> > > > way to check if a |> operation is supported? Could they then
> > > > somehow introduce an operator that is either bound to |> or
> > > > perhaps %>% and use that in any places in the code where both
> > > > work the same, and only use the magrittr pipe when doing
> > > > something it does differently such as needing to use a period to
> > > > specify which argument in a function is receiving the pipelined
> > > > data.
> > > >
> > > > There are programs people want to keep frozen so they only use
> > > > the versions of R and packages that existed at some moment so you
> > > > avoid some inevitable conflicts. So, I despair that older
> > > > versions of R may stick around way too long and break with any
> > > > newer packages.
> > > >
> > > > But languages cannot remain totally static or chances are people
> > > > will move on to newer languages that offer things they want. Then
> > > > again, there seem to still be COBOL programs out there.
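On the question of checking for |> at run time: the native pipe (added in R 4.1.0) is parser syntax rather than a function, so it cannot be bound to another name, and a source file that contains it will not parse at all on older versions, whatever checks the code makes. A package that wants it can declare Depends: R (>= 4.1.0) in its DESCRIPTION; otherwise an ordinary function-based operator works everywhere. The sketch below shows a version check plus a hypothetical %then% operator. The name %then% is my own invention for illustration and is not part of any package.

```r
# TRUE when the running R is new enough to parse the native pipe.
has_native_pipe <- getRversion() >= "4.1.0"

# Hypothetical fallback: an ordinary binary operator that applies the
# right-hand function to the left-hand value. It parses on any R version.
`%then%` <- function(lhs, rhs) rhs(lhs)

x <- c(1, 4, 9)
result <- x %then% sqrt %then% sum   # same as sum(sqrt(x))
print(result)
```

This only covers the simple case where the piped value is the sole argument; anything like magrittr's period placeholder would need more machinery.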
> > > >
> > > > -----Original Message-----
> > > > From: Richard O'Keefe <rao...@gmail.com>
> > > > Sent: Sunday, March 16, 2025 12:32 AM
> > > > To: avi.e.gr...@gmail.com
> > > > Cc: Kevin Zembower <ke...@zembower.org>; r-help@r-project.org
> > > > Subject: Re: [R] What don't I understand about sample()?
> > > >
> > > > Rgui 4.4.3 on Windows. When I start it up, read.csv is just
> > > > *there*. I don't need to load any package to get it.
> > > >
> > > > I have three reasons for being very sparing in the packages I
> > > > use.
> > > > 1. It took me long enough to get my head around R. More packages
> > > > = more things to learn. I *still* have major trouble grasping
> > > > tidyverse, and as far as I can see it doesn't solve any problem
> > > > that *I* have. I install a package only when I have a specific
> > > > need for something it does, like spatial statistics. (And yet I
> > > > have hundreds of packages installed, because packages depend on
> > > > other packages.)
> > > > 2. Everything changes, and they don't all change coherently. A
> > > > package I've used for years may not be available in the next
> > > > release. This is not a theoretical possibility; it has happened
> > > > to me often. "If I don't use it I can't lose it." Sometimes
> > > > things break because something else on the system (tcl/tk, or
> > > > the C or Fortran compiler) has changed. I'm tired of things
> > > > breaking because the C or Fortran compiler is now stricter.
> > > > 3. The universe of R packages is vast and constantly expanding.
> > > > This makes it *impossible* for anyone to test every possible
> > > > combination. I used to teach software engineering, and we had a
> > > > slogan "if it isn't tested it doesn't work". Base R plus package
> > > > X? Probably tested. Base R plus package Y? Probably tested.
> > > > Base R plus X plus Y?
> > > > Not unless X requires Y or Y requires X.
> > > >
> > > > There is also the didactic point that the more you work with base
> > > > R the better you will understand it, which you will need to
> > > > understand other things like tidyverse. It's like mastering the
> > > > alphabet before you learn shorthand.
> > > >
> > > >
> > > > On Sun, 16 Mar 2025 at 06:55, <avi.e.gr...@gmail.com> wrote:
> > > > >
> > > > > Kevin & Richard, and of course everyone,
> > > > >
> > > > > As the main topic here is not the tidyverse, I will mention the
> > > > > perils of loading in more than needed in general.
> > > > >
> > > > > If you want to use one or a very few functions, it can be more
> > > > > efficient and safe to load exactly what is needed. In the case
> > > > > of wanting to use read_csv(), I think this suffices:
> > > > >
> > > > > library(readr)
> > > > >
> > > > > If you instead use:
> > > > >
> > > > > library(tidyverse)
> > > > >
> > > > > You load a varying number of packages (it may change) including
> > > > > some like lubridate or forcats or ggplot2 that you may not be
> > > > > even thinking of using or never heard of.
> > > > >
> > > > > The bigger problem is shadowing that happens. For example, you
> > > > > may be getting warning messages like:
> > > > >
> > > > > ✖ dplyr::filter() masks stats::filter()
> > > > > ✖ dplyr::lag() masks stats::lag()
> > > > >
> > > > > This can interfere with some other package you had already
> > > > > loaded unless it uses a notation like mypackage::filter(...) in
> > > > > their code to avoid being easily replaced but even then, if you
> > > > > yourself called what you thought was filter() from base R or
> > > > > some package, you have a problem unless you invoke it like
> > > > > stats::filter(...)
> > > > >
> > > > > The order packages like this load can matter as well as when
> > > > > you define a function of your own.
> > > > > So, it may be worth some
> > > > > effort to zoom in and call exactly what you need and only when
> > > > > you need it. I have seen code that only needs a package in rare
> > > > > conditions and only loads the package in one branch of an IF
> > > > > statement right before using it.
> > > > >
> > > > > Packages can also be unloaded after use.
> > > > >
> > > > > From what you describe, none of this is crucially important as
> > > > > you are using R for your own purposes in your own RMarkDown
> > > > > file that you may not be distributing. And, when I write
> > > > > programs where I keep adjusting and adding things from the
> > > > > tidyverse, it is indeed much easier to just get the grouping on
> > > > > top and forget about it. That is, until I decide to do
> > > > > something with functional programming that uses
> > > > > reduce/filter/map... and have an odd error!
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: R-help <r-help-boun...@r-project.org> On Behalf Of Kevin
> > > > > Zembower via R-help
> > > > > Sent: Saturday, March 15, 2025 1:29 PM
> > > > > To: r-help@r-project.org
> > > > > Subject: Re: [R] What don't I understand about sample()?
> > > > >
> > > > > Hi, Richard, thanks for replying. I should have mentioned the
> > > > > third edition, which we're using. The data file didn't change
> > > > > between the second and third editions, and the data on Body
> > > > > Mass Gain was the same as in the first edition, although the
> > > > > first edition data file contained additional variables.
> > > > >
> > > > > According to my text, the BMGain was measured in grams. Thanks
> > > > > for pointing out that my statement of the problem lacked
> > > > > crucial information.
> > > > >
> > > > > The matrix in my example comes from an example in
> > > > > https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where
> > > > > the author created a bootstrap example with a matrix that
> > > > > consisted of one row for every sample in the bootstrap, and one
> > > > > column for each mean in the original data. This allowed him to
> > > > > find the mean for each row to create the bootstrap statistics.
> > > > >
> > > > > The only need for the tidyverse is to use the read_csv()
> > > > > function. I'm regrettably lazy in not determining which of the
> > > > > multiple functions in the tidyverse library loads read_csv(),
> > > > > and just using that one.
> > > > >
> > > > > Thanks, again, for helping me to further understand R and this
> > > > > problem.
> > > > >
> > > > > -Kevin
> > > > >
> > > > > On Sat, 2025-03-15 at 12:00 +0100,
> > > > > r-help-requ...@r-project.org wrote:
> > > > > > Not having the book (and which of the three editions are you
> > > > > > using?), I downloaded the data and played with it for a bit.
> > > > > > dotchart() showed the Dark and Light conditions looked quite
> > > > > > different, but also showed that there are not very many
> > > > > > cases.
> > > > > > After trying t.test, it occurred to me that I did not know
> > > > > > whether "BMGain" means gain in *grams* or gain in *percent*.
> > > > > > Reflection told me that for a growth experiment, percent made
> > > > > > more sense, which reminded me of one of my first student
> > > > > > advising experiences, where I said "never give the computer
> > > > > > percentages; let IT calculate the percentages from the
> > > > > > baseline and outcome, because once you've thrown away
> > > > > > information, the computer can't magically get it back."
> > > > > > In particular, in the real world I'd be worried about the
> > > > > > possibility that there was some confounding going on, so I
> > > > > > would much rather have initial weight and final weight as
> > > > > > variables.
> > > > > > If BMGain is an absolute measure, the p value for a t test is
> > > > > > teeny tiny.
> > > > > > If BMGain is a percentage, the p value for a sensible t test
> > > > > > is about 0.03.
> > > > > >
> > > > > > A permutation test went like this.
> > > > > > is.light <- d$Group == "Light"
> > > > > > is.dark <- d$Group == "Dark"
> > > > > > score <- function (g) mean(g[is.light]) - mean(g[is.dark])
> > > > > > base.score <- score(d$BMGain)
> > > > > > perm.scores <- sapply(1:997, function (i)
> > > > > >     score(sample(d$BMGain)))
> > > > > > sum(perm.scores >= base.score) / length(perm.scores)
> > > > > >
> > > > > > I don't actually see where matrix() comes into it, still less
> > > > > > anything in the tidyverse.
> > > > > >
> > > > > >
> > > > > ______________________________________________
> > > > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more,
> > > > > see
> > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > PLEASE do read the posting guide
> > > > > https://www.R-project.org/posting-guide.html
> > > > > and provide commented, minimal, self-contained, reproducible
> > > > > code.
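For anyone who wants to run the permutation test quoted at the bottom of the thread without the textbook's data file, here is a self-contained sketch. The data frame is invented (random numbers, not the study's values), vapply() replaces sapply() so the result type is checked, and the p-value adds 1 to numerator and denominator to count the observed arrangement itself, which is a common variant and not what the quoted code does.

```r
set.seed(1)  # only so the illustration is repeatable

# Invented stand-in for the real data; the values are not from the study.
d <- data.frame(
  Group  = rep(c("Light", "Dark"), each = 8),
  BMGain = c(rnorm(8, mean = 11, sd = 3), rnorm(8, mean = 6, sd = 3))
)

is.light <- d$Group == "Light"
is.dark  <- d$Group == "Dark"

# Test statistic: difference in mean gain, Light minus Dark.
score <- function(g) mean(g[is.light]) - mean(g[is.dark])

base.score <- score(d$BMGain)

# Shuffle the outcomes across the fixed group labels and re-score each time.
perm.scores <- vapply(1:999, function(i) score(sample(d$BMGain)), numeric(1))

# One-sided p-value, counting the observed arrangement as one permutation.
p.value <- (sum(perm.scores >= base.score) + 1) / (length(perm.scores) + 1)
print(p.value)
```

Note that sample() is called with only the vector, so each call returns a full permutation of BMGain; no size= argument and no matrix are needed.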