Re: [R] md5sum issues

Ivan Calandra Thu, 04 Feb 2021 01:05:25 -0800

Dear Tim, Jeff, Duncan and Ivan,

Thank you all for your input! Actually, I am already doing what Tim suggested, and as Jeff said, using the checksums is redundant since I use Git already (which does a better job).

So I've decided to just remove the checksums from my scripts, and to revert the RStudio settings to default but use `git config --global core.autocrlf true` as mentioned in the Git guide for better compatibility with other platforms. I might change that in the future but I prefer adjusting Git rather than RStudio because I do not only use RStudio (I use it for R, but not necessarily or exclusively for other projects).


Best wishes,
Ivan

--
Dr. Ivan Calandra
TraCEr, laboratory for Traceology and Controlled Experiments
MONREPOS Archaeological Research Centre and
Museum for Human Behavioural Evolution
Schloss Monrepos
56567 Neuwied, Germany
+49 (0) 2631 9772-243
https://www.researchgate.net/profile/Ivan_Calandra

On 03/02/2021 18:07, Ebert,Timothy Aaron wrote:

Dear Ivan,
Why not put your data file and analysis program into one folder and then 
provide enough description (including raw output) to enable someone to run the 
program (a read me file)? This includes all the packages loaded into your 
version of R, file names, file types, and the logic behind the program. The raw 
output will help people check that their version is working (as you have 
indicated). If your data or results are critical then someone will figure out 
an update to keep things working on the computers of the future. Good 
documentation and labeling is a better path to long term reproducibility than 
trying to make everything generic. I tend to add documentation within a program 
as much as writing separate read me files.
    R lends itself to this paradigm more than others, but many programs require 
that the person first have a copy of some proprietary software. I am not about 
to buy SPSS, but I will agree on principle that the SPSS code represents 
reproducible science. If I really care I will translate the SPSS code into a 
language that I can use.

Tim

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Ivan Calandra
Sent: Wednesday, February 3, 2021 11:26 AM
To: r-help@r-project.org
Subject: Re: [R] md5sum issues

Dear Jeff,

If I understood you correctly, it makes sense that I explain more about my goal 
here:

I am trying to find ways to have analyses that are as reproducible as possible 
(knowing that it is not going to be perfect). One part is to show which file(s) 
I use as input and what output was created, so that potential readers/users of 
my analysis can check that the file they have is indeed the same that I use 
(and not a corrupted or modified version).
Does that make sense?
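
The check described above (confirming that a reader's copy of an input file is the same as the one used in the analysis) could be sketched as follows. This is only an illustration in Python, not the code used in this thread; the function names `file_md5` and `matches_recorded` are hypothetical. The file is read in binary mode so that the digest reflects the exact bytes on disk, line endings included.

```python
import hashlib

def file_md5(path) -> str:
    """MD5 of a file's raw bytes, read in binary mode in 64 KiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_recorded(path, recorded_md5: str) -> bool:
    """True if the file on disk has the digest that was published with the analysis."""
    return file_md5(path) == recorded_md5.strip().lower()
```

A reader would call `matches_recorded()` with their local path and the digest listed in the analysis report. Because the comparison is byte-exact, a copy whose line endings were converted will be reported as different.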

And for this purpose, I originally used file information (like creation time 
and so on), but I quickly realized this doesn't help much. Then I tried with 
MD5 and I thought it was solved, but it was obviously not solved.

Duncan's solution seems to work (I have not fully checked yet, though), but I am 
really open to other, more robust alternatives.

Thanks for the input!
Ivan

On 03/02/2021 17:15, Jeff Newmiller wrote:

This CR vs LF vs CRLF newline discrepancy has been around since the 70s and the 
CP/M operating system. And it remains an issue in over-the-wire internet text 
protocols today, which actually use the CRLF version like Windows. Sorry, 
UNIX... world domination of LF encoding failed.

The problem with pretending there is no issue, as Duncan is advocating, is that text is 
treated differently than binary, and every time you pretend it isn't, it comes back to 
bite you. Applying binary algorithms like MD5 to text is one of these areas where your 
expectation that this will be successful is what creates the problem in the first place. 
A similar issue occurs in file encoding: two files may both contain the word 
"Hello", but if they are encoded in UTF-16 and UTF-8 respectively then the MD5 
results will be different.
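
A minimal sketch of this point, using Python's standard `hashlib` purely for illustration (the helper `md5_hex` is a hypothetical name, and little-endian UTF-16 without a byte order mark is an assumption): MD5 only sees bytes, so the same visible text produces different digests once the encoding or the newline convention changes.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of raw bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

# The same word in two encodings: the bytes differ, so the digests differ.
hello_utf8 = "Hello".encode("utf-8")       # 5 bytes
hello_utf16 = "Hello".encode("utf-16-le")  # 10 bytes, no byte order mark

# The same line of text with two newline conventions.
line_lf = b"Hello\n"      # LF
line_crlf = b"Hello\r\n"  # CR LF

print(len(hello_utf8), len(hello_utf16))            # 5 10
print(md5_hex(hello_utf8) == md5_hex(hello_utf16))  # False
print(md5_hex(line_lf) == md5_hex(line_crlf))       # False
```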

Git does not (currently) support differences in encoding, but it does support 
text vs non-text (newline) differences because they are unavoidable. Pushing 
forward with your expectation that text files should compare the same in binary 
by assuming text will always be like UNIX text just defers the problem for 
another day.

Since I don't know what problem you are actually trying to solve, I cannot 
offer a concrete solution. But I would begin by not assuming that MD5 works the 
same on text and binary files... because it doesn't.

On February 3, 2021 2:48:56 AM PST, Duncan Murdoch <murdoch.dun...@gmail.com> 
wrote:

On 03/02/2021 4:42 a.m., Ivan Calandra wrote:

Thank you Ivan and Duncan for your help.

I understand your point Duncan, but the thing is that I do have an issue
here.
Is it then due to RStudio or even Windows? If it is, I can forget about
a solution on that end, so I would focus on what I can do, and this Git
setting seems to be the best place to start.

In my opinion, you should run

   git config --global core.autocrlf false

in an RStudio terminal session.  That will set the git options so
they don't mess up the md5sum values.

You should also go to the RStudio options, and in the Code section,
Saving tab, under Serialization, set line ending conversion to Posix (LF)
and default text encoding to UTF-8.

Unfortunately, RStudio will still mess up the .Rproj file (see
https://github.com/rstudio/rstudio/issues/1929 ); there's not much you
can do about that.  Just try not to commit the Windows version to
the repository if any non-Windows users are sharing it.

But do note that other people have different opinions.  They argue
that files should be converted to Windows native format by git.  That
works in some narrow use cases, but as soon as you try to extract a
file from git on one system and work on it on another, it breaks.

Duncan Murdoch

Or am I missing something (I am still a newbie on these things...)?

Ivan C

On 03/02/2021 10:06, Duncan Murdoch wrote:

On 03/02/2021 2:14 a.m., Ivan Krylov wrote:

On Tue, 2 Feb 2021 17:01:05 +0100
Ivan Calandra <calan...@rgzm.de> wrote:

This happens to all text-based files (Rmd, MD, CSV...) but not to
non-editable files (PDF, XLSX...).

This is probably caused by Git helpfully converting text files
from LF (0x0A) line endings to CR LF (0x0D 0x0A) when checking out the
repository clone on Windows (and back when checking in).
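
The effect of such a conversion on a checksum can be sketched as follows. This is an illustration only, in Python rather than R; the helpers `to_crlf` and `to_lf` are hypothetical stand-ins for what a line-ending conversion does to the bytes, not Git's actual implementation.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of raw bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

def to_crlf(data: bytes) -> bytes:
    """Mimic an LF -> CR LF conversion on checkout (hypothetical helper)."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

def to_lf(data: bytes) -> bytes:
    """Mimic the reverse CR LF -> LF conversion (hypothetical helper)."""
    return data.replace(b"\r\n", b"\n")

committed = b"x,y\n1,2\n3,4\n"    # text file as stored with LF endings
checked_out = to_crlf(committed)  # what a converted working copy would contain

print(checked_out)                                        # b'x,y\r\n1,2\r\n3,4\r\n'
print(md5_hex(committed) == md5_hex(checked_out))         # False
print(md5_hex(committed) == md5_hex(to_lf(checked_out)))  # True
```

Files that are not treated as text are not converted, which is consistent with PDF and XLSX files keeping the same checksum.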

This configuration option is described in Pro Git:

https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration#_core_autocrlf

I agree with Ivan K, but don't agree with the advice in that book.

It's best to just leave files alone, not to convert between LF and
CR-LF.  I don't think this confuses many Windows editors these
days, but if your editor forces files into CR-LF form, you should
fix the editor, not try to work around it.

In my opinion everyone should run

    git config --global core.autocrlf false

Some more arguments for this (in the context of GitHub Actions) are
here:

https://github.community/t/git-config-core-autocrlf-should-default-to-false/16140

Duncan Murdoch

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
