Re: [Rd] download.file does not process gz files correctly (truncates them?)

2018-05-09 Thread Tomas Kalibera

On 05/08/2018 05:15 PM, Hadley Wickham wrote:

On Thu, May 3, 2018 at 11:34 PM, Tomas Kalibera wrote:

On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:

Also, as mentioned in my earlier post,
https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when
the mode argument is not specified, the default on Windows is mode = "w"
*except* for certain case-sensitive filename extensions:

  if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
      mode <- "wb"

Just like the need for mode = "wb" on Windows, the above
special-file-extension hack only happens on Windows, and is only
documented in ?download.file if you're on Windows; so someone on
Linux/macOS trying to help someone on Windows may not be aware of
it. This adds even more confusion, e.g. "works for me" reports.

If we were designing the API today, it would probably make more sense not to
convert any line endings by default. Today's editors _usually_ can cope with
different line endings and it is probably easier to detect that a text file
has incorrect line endings rather than detecting that a binary file has been
corrupted by an attempt to convert line endings. But whether to change
existing, documented behavior is a different question. In order to help
users and programmers who do not read the documentation carefully we would
create problems for users and programmers who do. The current heuristic/hack
is in line with the compatibility approach: it detects files that are
obviously binary, so it changes the default behavior only for cases when it
would obviously cause damage.

From a purely utilitarian standpoint, there are far more users who do
not carefully read the documentation than users who do ;)

And for that reason the behavior should be as intuitive as possible when
designed. What was intuitive 15-20 years ago may not be intuitive now,
but that should probably not be a justification for a change in
documented behavior.

(I'd also argue that basing the decision on the file extension is
suboptimal, and it would be better to use the mime type if provided by
the server)

Yes, that would be nice. Also some binary files could be detected via
magic numbers (yet not all, e.g. RDS do not have them). It won't be as
trivial as decoding the URL, though.


Tomas



Hadley



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] download.file does not process gz files correctly (truncates them?)

2018-05-09 Thread Duncan Murdoch

On 08/05/2018 4:47 PM, Hadley Wickham wrote:

Also note that MS just announced support for unix line endings in notepad

https://blogs.msdn.microsoft.com/commandline/2018/05/08/extended-eol-in-notepad/


Perhaps soon RStudio will follow Notepad's lead, and not convert line 
endings when it saves a non-native file.


Duncan Murdoch



Re: [Rd] download.file does not process gz files correctly (truncates them?)

2018-05-09 Thread peter dalgaard
There was a hint in the Twitterverse that Excel has issues with line endings in 
.csv. Can anyone elaborate on that? Then again, Excel goes belly-up on comma 
separators in central European locales anyway...

-pd

> On 8 May 2018, at 22:47, Hadley Wickham wrote:
> 
> 
> Also note that MS just announced support for unix line endings in notepad
> 
> https://blogs.msdn.microsoft.com/commandline/2018/05/08/extended-eol-in-notepad/
> 
> Hadley
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com



Re: [Rd] unlist errors on a nested list of empty lists

2018-05-09 Thread Steven Nydick
I do not have access to the bug reporting system. If somebody can get me
access, I can create a formal bug report.

The latter issues seem like duplicates of:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=12572 (with slightly
different output), but as that bug was reported nearly 10 years ago, it
might be worth creating an updated report under R version 3. I could not
find the first issue (which I ran into when trying to parse JSON files)
when searching the bug reports, which is why I posted on r-devel.

On Tue, May 8, 2018 at 7:51 PM Duncan Murdoch wrote:

> On 08/05/2018 4:50 PM, Steven Nydick wrote:
> > It also does the same thing if the factor is not on the first level of
> > the list, which seems to be due to the fact that the islistfactor is
> > recursive, but if a list is a list-factor, the first level lists are
> > coerced into character strings.
> >
> >  > x <- list(list(factor(LETTERS[1])))
> >  > unlist(x)
> > Error in as.character.factor(x) : malformed factor
> >
> > However, if one of the factors is at the top level, and one is nested,
> > then the result is:
> >
> >  > x <- list(list(factor(LETTERS[1])), factor(LETTERS[2]))
> >  > unlist(x)
> >
> > [1] <NA> B
> > Levels: B
> >
> > ... which does not seem to me to be desired behavior.
>
> The patch I suggested doesn't help with either of these.  I'd suggest
> collecting examples, and posting a bug report to bugs.r-project.org.
>
> Duncan Murdoch
>
>
> >
> >
> > On Tue, May 8, 2018 at 2:22 PM Duncan Murdoch wrote:
> >
> > On 08/05/2018 2:58 PM, Duncan Murdoch wrote:
> >  > On 08/05/2018 1:48 PM, Steven Nydick wrote:
> >  >> Reproducible example:
> >  >>
> >  >> x <- list(list(list(), list()))
> >  >> unlist(x)
> >  >>
> >  >> *> Error in as.character.factor(x) : malformed factor*
> >  >
> >  > The error comes from the line
> >  >
> >  > structure(res, levels = lv, names = nm, class = "factor")
> >  >
> >  > which is called because unlist() thinks that some entry is a factor,
> >  > with NULL levels and NULL names.  It's not legal for a factor to have
> >  > NULL levels.  Probably it should never get here; the earlier test
> >  >
> >  > if (.Internal(islistfactor(x, recursive))) {
> >  >
> >  > should have been false, and then the result would have been
> >  >
> >  > .Internal(unlist(x, recursive, use.names))
> >  >
> >  > (with both recursive and use.names being TRUE), which returns NULL.
> >
> > And the problem is in the islistfactor function in src/main/apply.c,
> > which looks like this:
> >
> > static Rboolean islistfactor(SEXP X)
> > {
> >     int i, n = length(X);
> >
> >     switch(TYPEOF(X)) {
> >     case VECSXP:
> >     case EXPRSXP:
> >         if(n == 0) return NA_LOGICAL;
> >         for(i = 0; i < LENGTH(X); i++)
> >             if(!islistfactor(VECTOR_ELT(X, i))) return FALSE;
> >         return TRUE;
> >         break;
> >     }
> >     return isFactor(X);
> > }
> >
> > One of those deeply nested lists is length 0, so at the lowest level it
> > returns NA_LOGICAL.  But then it does C-style logical testing on the
> > results.  I think to C NA_LOGICAL counts as true, so at the next level
> > up we get the wrong answer.
> >
> > A fix would be to rewrite it like this:
> >
> > static Rboolean islistfactor(SEXP X)
> > {
> >     int i, n = length(X);
> >     Rboolean result = NA_LOGICAL, childresult;
> >     switch(TYPEOF(X)) {
> >     case VECSXP:
> >     case EXPRSXP:
> >         for(i = 0; i < LENGTH(X); i++) {
> >             childresult = islistfactor(VECTOR_ELT(X, i));
> >             if(childresult == FALSE) return FALSE;
> >             else if(childresult == TRUE) result = TRUE;
> >         }
> >         return result;
> >         break;
> >     }
> >     return isFactor(X);
> > }
> >
> >
> >
> > --
> > Steven Nydick
> > PhD, Quantitative Psychology
> > M.A., Psychology
> > M.S., Statistics
> > --
> > "Beware of the man who works hard to learn something, learns it, and
> > finds himself no wiser than before, Bokonon tells us. He is full of
> > murderous resentment of people who are ignorant without having come by
> > their ignorance the hard way."
> > -Kurt Vonnegut
>
>

-- 
Steven Nydick
PhD, Quantitative Psychology
M.A., Psychology
M.S., Statistics
--
"Beware of the man who works hard to learn something, learns it, and finds
himself no wiser than before, Bokonon tells us. He is full of murderous
resentment of people who are ignorant without having come by their
ignorance the hard way."
-Kurt Vonnegut




Re: [Rd] download.file does not process gz files correctly (truncates them?)

2018-05-09 Thread Dirk Eddelbuettel

On 9 May 2018 at 10:37, Tomas Kalibera wrote:
| And for that reason the behavior should be as intuitive as possible when 
| designed. What was intuitive 15-20 years ago may not be intuitive now, 
| but that should probably not be a justification for a change in 
| documented behavior.

Is it time for downloadFile() (or download_file()) to complement the existing
download.file() by providing what we now think of as intuitive behaviour?

Dirk

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
