Re: [R-pkg-devel] URL checks

2022-06-30 Thread Ivan Krylov
On Thu, 30 Jun 2022 10:00:53 +1000
Greg Hunt  wrote:

> With a little more experimenting, the 503 response from the wiley DOI
> lookup does seem to come from CloudFlare: there is a "server:
> cloudflare" header.

Unfortunately, CloudFlare also returns the 503 status code with the
"server: cloudflare" header in case of connectivity issues between
CloudFlare and the upstream server:
https://support.cloudflare.com/hc/en-us/articles/115003011431/#503error
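
For what it's worth, both pieces of information are easy to inspect from
R; here is a minimal sketch with base R's curlGetHeaders() (the DOI below
is a placeholder, not the wiley one under discussion):

h <- curlGetHeaders("https://doi.org/10.1000/182")  # placeholder DOI
attr(h, "status")            # e.g. 503 when CloudFlare blocks the request
any(grepl("^server: cloudflare", h, ignore.case = TRUE))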

-- 
Best regards,
Ivan



Re: [R-pkg-devel] URL checks

2022-06-30 Thread Ivan Krylov
Greg,

I realise you are trying to solve the problem and I thank you for
trying to make the URL checks better for everyone. I probably sound
defeatist in my e-mails; sorry about that.

On Thu, 30 Jun 2022 17:49:49 +1000
Greg Hunt  wrote:

> Do you have evidence that, even without the use of HEAD,
> CloudFlare is rejecting the CRAN checks?

Unfortunately, yes, I think it's possible:

$ curl -v https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages
# ...skipping TLS logs...
> GET /hc/en-us/articles/219949047-Installing-older-versions-of-packages HTTP/2 
> Host: support.rstudio.com
> User-Agent: curl/7.64.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 403
< date: Thu, 30 Jun 2022 08:13:01 GMT

CloudFlare blocks are probabilistic. I *think* the reason I got a 403
is because I didn't visit the page with my browser first. Switching
from HEAD to GET might also increase the traffic flow, leading to more
blocks from hosts not previously blocking the HEAD requests.

CloudFlare's suggested solution would be Private Access Tokens [*], but
that looks hard to implement (who would agree to sign those tokens?) and
does nothing about other CDNs.

> The CDN rejecting requests or flagging the service as temporarily
> unavailable when there is a communication failure with its upstream
> server is much the same behaviour that you would expect to see from
> the usual kinds of protection that you'd apply to a web server (some
> kind of filter/proxy/firewall) even without a CDN in place.

My point was different. If the upstream is actually down, the page
can't be served even to "valid" users, and the 503 error from
CloudFlare should fail the URL check. On the other hand, if the 503
error is due to the check tripping a bot detector, it could be
reasonable to give that page a free pass.

How can we distinguish those two situations? Could CloudFlare ask for a
CAPTCHA first, then realise that the upstream is down and return
another 503?

Yes, this is a sensitivity vs specificity question, and we can trade
some false positives (that we get now) for some false negatives
(letting a legitimate error status from a CDN pass the test) to make
life easier for package maintainers. Your suggestions are a step in the
right direction, but there has to be a way to make them less fragile.

-- 
Best regards,
Ivan

[*]
https://blog.cloudflare.com/eliminating-captchas-on-iphones-and-macs-using-new-standard/



Re: [R-pkg-devel] URL checks

2022-06-30 Thread Greg Hunt
Ivan,
I am sure that we can make this work a bit better, and I do agree that
it working perfectly isn't going to happen.  I don't think that the
behaviour you're seeing is likely to be stateful, i.e. recording the
fact that you have made a previous request from a browser.  That type of
protection is implemented for some DDoS attacks, but it soaks up
resources (money/speed) to little point when there is no DDoS or when
the DDoS is too large.  Remembering and looking up previous requests has
a cost and isn't really scalable.

I got errors for the DOI and the .eu examples earlier today (but I
didn't hit the rstudio link), and I had never accessed those pages with
a browser.  Removing the -I from the curl request allowed them to
succeed.
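
A rough R equivalent of that curl experiment, in case anyone wants to
reproduce it with the curl package (the DOI below is only a placeholder):

library(curl)
url <- "https://doi.org/10.1000/182"    # placeholder DOI
head_status <- curl_fetch_memory(url, new_handle(nobody = TRUE,
                                                 followlocation = FALSE))$status_code
get_status  <- curl_fetch_memory(url, new_handle(nobody = FALSE,
                                                 followlocation = FALSE))$status_code
# per the observation above, HEAD may be rejected while GET returns a
# 302 redirect
c(HEAD = head_status, GET = get_status)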

Without the HTTP HEAD method I got 302 (redirect) responses from doi.org,
which seems to indicate that the ID exists, and 404 (not found) for an ID
which does not.  For DOI checks I suggest removing the "nobody" setting
and treating anything other than a 404 (not found) from the doi.org web
server as success (or perhaps more precisely, regarding a 302 redirect as
success, a 404 not found as a failure, and anything else as potentially
ambiguous; at the least we'd need to categorise those as temporary or
permanent, and I am not sure how much better that extra complexity makes
things).  That would be an improvement over the current behaviour for
most references to doi.org.
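
A minimal sketch of that classification with the curl package (these are
the rules I am proposing, not what the CRAN checks currently do; the DOI
is a placeholder):

library(curl)
check_doi <- function(url) {
  # GET rather than HEAD, and do not follow the redirect
  res <- curl_fetch_memory(url, new_handle(nobody = FALSE,
                                           followlocation = FALSE))
  switch(as.character(res$status_code),
         "302" = "success",   # doi.org redirects IDs that exist
         "404" = "failure",   # ID not found
         "ambiguous")         # anything else needs a closer look
}
check_doi("https://doi.org/10.1000/182")   # placeholder DOI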

I get the same 403 code from rstudio that you do; I suspect that they
are checking the browser in a way that doi.org doesn't.  That's probably
to protect their site text from content scraping, and getting into an
arms race with them is likely to be pointless.  Forbidden does mean that
the server is there, but we can't tell the difference between not found
and any other condition.  I'd suggest that a 403 (which means actively
declined by the web server) should be treated as success IF there is a
cloudflare server header as well, with more CDNs added to the check over
time.  You aren't going to get access anyway.  It looks like the top
three CDN vendors are CloudFlare, Amazon and Akamai; covering those
three would get you about 90% coverage of CDN-fronted hosts, and
CloudFlare is the overwhelming market leader.

In summary:

   - removing "nobody", which selects the HEAD method, may allow the
   composite indicators .eu sites to work, meaning that sites that have
   removed support for HEAD (not an uncommon thing to do at the prompting
   of IT auditors) will start to work.
   - removing "nobody" and then not following the redirect may allow the
   doi.org requests to work.
   - seeing a 403 code and a cloudflare server header in response to a
   request should be regarded as success; it's as positive a signal as
   you are likely to see (a sketch follows this list).
   - check what the responses from Amazon and Akamai look like to
   identify them (Amazon responses have a bunch of X-amzn-* headers in
   them, and I looked at an Akamai site which included an
   x-akamai-transformed header in its response).  I would add logging to
   the build environment to collect the requests and response headers
   from failed requests to see what the overall behaviour is.
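
As a hedged sketch of that header-based test (again with the curl
package; the header names are only the ones observed above, not a
complete inventory, and the rule is my proposal rather than current CRAN
behaviour):

library(curl)
blocked_by_cdn <- function(res) {
  h <- parse_headers_list(res$headers)   # named list, lower-case names
  server <- if (is.null(h[["server"]])) "" else h[["server"]]
  res$status_code == 403 &&
    (grepl("cloudflare", server, ignore.case = TRUE) ||  # CloudFlare
     any(startsWith(names(h), "x-amzn-")) ||             # Amazon
     "x-akamai-transformed" %in% names(h))               # Akamai
}
res <- curl_fetch_memory("https://support.rstudio.com/",
                         new_handle(followlocation = FALSE))
blocked_by_cdn(res)   # TRUE would mean: treat the URL as reachable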

I think it's worth exploring this to remove a bunch of the recurring
questions about URL lookup.  The question is whether the servers running
the CRAN checks see the same behaviour that I am seeing.  If we can get,
say, two thirds of the errors resolved this way then everyone is better
off.

None of this will increase the rate of requests from CRAN, and frankly I
doubt that you are going to generate enough traffic to show up in
anyone's traffic analysis anyway.  The hits on doi.org are likely to be
the single largest group, and doi.org clearly expect that people will
access the site from scripts, so I doubt that this will cause more
explicit blocking.  For myself, I tend to get a bit antsy about sites
that submit failed requests over and over, or ones that seem to be
systematically scraping a site (meaning many thousands of requests
and/or a very organised pattern).


Greg



Re: [R-pkg-devel] Slowdown running examples since 4.2 on Windows

2022-06-30 Thread Barbara Lerner
Thanks for these suggestions.  I attempted to submit a slightly earlier
version of my package a few weeks ago and saw the slowdown on
win-builder then, too.  I did not look into it any further at that
point.

I have tested it on a local Windows computer running 4.2.0 and I did not 
see the slowdown there.  Here are those results:

name         user    system    elapsed
prov.json    1.11    0.24      2.19
prov.run     0.99    0.06      1.93

version   R version 4.2.0 (2022-04-22 ucrt)
os        Windows 10 x64 (build 19044)
system    x86_64, mingw32


Tomas Kalibera wrote on 6/30/22 10:09 AM:
>
> On 6/29/22 22:25, Barbara Lerner wrote:
>> I have a package that I want to submit an updated version for but the
>> examples run too slowly on win-builder since 4.2 came out.  I just
>> submitted the exact same tar.gz file to all 3 versions of R available on
>> win-builder and got the results shown below.  Notice the dramatic
>> slowdown from 4.1.3 to 4.2.1.
>>
>> I don't know how to go about tracking down the cause of this slowdown.
>> The examples are quite small.  I am reluctant to use \dontrun, but I am
>> not sure what else to do.
>
> Could you perhaps submit to win-builder several times (with some
> non-trivial delay between the runs) to see if the very long execution
> time is reproducible?
>
> If so, the next step could be trying to reproduce it on a Windows
> machine with interactive access and, if it is still so slow, checking
> where the time is spent, using an R profiler or a C profiler (e.g.
> Very Sleepy is free), possibly comparing to 4.1.3. It might be useful
> or necessary to also do the profiling with a debug build of R and/or
> the package: while the performance numbers will be skewed, one would
> see the symbol names.
>
> If you want specific help, please send a reproducible example:
> instructions for how to run the code, and which code to run.
>
> Best
> Tomas
>
>>
>> June 29 2:33 PM  - old release
>> * using R version 4.1.3 (2022-03-10)
>> i386 timings
>> name    user    system    elapsed
>> prov.json    3.28    0.33    5.19
>> prov.run    2.70    0.27    4.18
>>
>> x64 timings
>> name    user    system    elapsed
>> prov.json    3.51    0.27    4.93
>> prov.run    3.05    0.28    4.48
>>
>>
>> June 29 2:19 PM  - release
>> * using R version 4.2.1 (2022-06-23 ucrt)
>> * using platform: x86_64-w64-mingw32 (64-bit)
>> name    user    system    elapsed
>> prov.json    16.98    8.81    26.82
>> prov.run    3.53    0.42    4.89
>>
>>
>> June 29  1:52 PM - devel
>> * using R Under development (unstable) (2022-06-28 r82534 ucrt)
>> * using platform: x86_64-w64-mingw32 (64-bit)
>> name         user     system    elapsed
>> prov.json    16.60    9.09      26.66
>> prov.run     3.57     0.22      4.70
>>
>> I then ran the same timing script as win-builder uses on my Mac,
>> using Rscript, and got these results:
>>
>> name         user     system    elapsed
>> prov.json    1.105    0.159     1.329
>> prov.run     0.890    0.103     1.053
>>
>> session_info reports:
>> version   R version 4.2.0 (2022-04-22)
>> os        macOS Catalina 10.15.7
>> system    x86_64, darwin17.0
>>
>> I then installed 4.2.1 on my Mac. The time is a little slower but
>> nothing like the slowdown on Windows.
>>
>> name         user     system    elapsed
>> prov.json    1.286    0.230     3.080
>> prov.run     0.940    0.108     1.131
>>
>> version   R version 4.2.1 (2022-06-23)
>> os        macOS Catalina 10.15.7
>> system    x86_64, darwin17.0
>>
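
For the profiling step Tomas suggests, this is the sort of minimal
sketch I would start from, using R's built-in sampling profiler (the
output file name is arbitrary, and it assumes the package that defines
prov.run is attached):

Rprof("prov-prof.out", interval = 0.01)      # sample the call stack every 10 ms
example("prov.run")                          # run the slow example
Rprof(NULL)                                  # stop profiling
head(summaryRprof("prov-prof.out")$by.self)  # where the time actually goes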

-- 
Barbara Lerner (she / her)
Professor
Computer Science Department
Mount Holyoke College



