Dear Jeff,
If I understood you correctly, it makes sense that I explain more about
my goal here:
I am trying to find ways to have analyses that are as reproducible as
possible (knowing that it is not going to be perfect). One part is to
show which file(s) I use as input and what output was created, so that
potential readers/users of my analysis can check that the file they have
is indeed the same that I use (and not a corrupted or modified version).
Does that make sense?
And for this purpose, I originally used file information (like creation
time and so on), but I quickly realized this doesn't help much. Then I
tried with MD5 and I thought it was solved, but it obviously was not.
Duncan's solution seems to work (I have not fully checked yet, though),
but I am really open to other, more robust alternatives.
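
One possible alternative, sketched in Python purely as an illustration (nobody in this thread proposed it): hash text files only after normalizing their line endings to LF, so that a CRLF checkout and an LF checkout of the same file give the same digest. The helper name `text_md5` is hypothetical, and the sketch assumes text in an ASCII-compatible encoding such as UTF-8. Binary files (PDF, XLSX...) would still be hashed byte for byte.

```python
import hashlib


def text_md5(data: bytes) -> str:
    """MD5 of text content with CRLF and lone CR normalized to LF.

    Hypothetical helper for illustration only; assumes an
    ASCII-compatible encoding such as UTF-8.
    """
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return hashlib.md5(normalized).hexdigest()


# Made-up sample content: the same CSV as stored on UNIX and on Windows.
unix_copy = b"x,y\n1,2\n"
windows_copy = b"x,y\r\n1,2\r\n"

# The raw digests differ, the normalized digests agree.
raw_equal = hashlib.md5(unix_copy).hexdigest() == hashlib.md5(windows_copy).hexdigest()
normalized_equal = text_md5(unix_copy) == text_md5(windows_copy)
print(raw_equal, normalized_equal)  # prints: False True
```

For a real file the bytes would come from `open(path, "rb").read()`. The trade-off is that a reader must use the same normalization to reproduce the value, so a plain `md5sum` on the command line would no longer match for CRLF files.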
Thanks for the input!
Ivan
--
Dr. Ivan Calandra
TraCEr, laboratory for Traceology and Controlled Experiments
MONREPOS Archaeological Research Centre and
Museum for Human Behavioural Evolution
Schloss Monrepos
56567 Neuwied, Germany
+49 (0) 2631 9772-243
https://www.researchgate.net/profile/Ivan_Calandra
On 03/02/2021 17:15, Jeff Newmiller wrote:
This CR vs LF vs CRLF newline discrepancy has been around since the 70s and the
CP/M operating system. And it remains an issue in over-the-wire internet text
protocols today, which actually use the CRLF version like Windows. Sorry,
UNIX... world domination of LF encoding failed.
The problem with pretending there is no issue as Duncan is advocating is that text is
treated differently than binary, and every time you pretend it isn't it comes back to
bite you. Applying binary algorithms like MD5 to text is one of these areas where your
expectation that this will be successful is what creates the problem in the first place.
A similar issue occurs with file encoding: two files may both contain the word
"Hello", but if they are encoded in UTF-16 and UTF-8 respectively then the MD5
results will be different.
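
The encoding point can be checked with a few lines of Python. This is only an illustrative sketch: it uses little-endian UTF-16 without a byte order mark as one example of a 16-bit encoding, which is an assumption about what a given editor would actually write.

```python
import hashlib

text = "Hello"
utf8_bytes = text.encode("utf-8")       # one byte per character here
utf16_bytes = text.encode("utf-16-le")  # two bytes per character, no byte order mark

md5_utf8 = hashlib.md5(utf8_bytes).hexdigest()
md5_utf16 = hashlib.md5(utf16_bytes).hexdigest()

print(len(utf8_bytes), len(utf16_bytes))  # prints: 5 10
print(md5_utf8 == md5_utf16)              # prints: False
```

The word is the same, but MD5 only ever sees the bytes, and those differ.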
Git does not (currently) support differences in encoding, but it does support
text vs non-text (newline) differences because they are unavoidable. Pushing
forward with your expectation that text files should compare the same in binary
by assuming text will always be like UNIX text just defers the problem for
another day.
Since I don't know what problem you are actually trying to solve, I cannot
offer a concrete solution. But I would begin by not assuming that MD5 works the
same on text and binary files... because it doesn't.
On February 3, 2021 2:48:56 AM PST, Duncan Murdoch <murdoch.dun...@gmail.com>
wrote:
On 03/02/2021 4:42 a.m., Ivan Calandra wrote:
Thank you Ivan and Duncan for your help.
I understand your point Duncan, but the thing is that I do have an issue
here.
Is it then due to RStudio or even Windows? If it is, I can forget about
a solution on that end, so I would focus on what I can do, and this Git
setting seems to be the best place to start.
In my opinion, you should run
git config --global core.autocrlf false
in an RStudio terminal session. That will set the git options so they
don't mess up the md5sum values.
You should also go to the RStudio options, and in the Code section,
Saving tab, choose Serialization to be Posix (LF) and default text
encoding to be UTF-8.
Unfortunately, RStudio will still mess up the .Rproj file (see
https://github.com/rstudio/rstudio/issues/1929); there's not much you
can do about that. Just try not to commit the Windows version to the
repository if any non-Windows users are sharing it.
But do note that other people have different opinions. They argue that
files should be converted to Windows native format by git. That works
in some narrow use cases, but as soon as you try to extract a file from
git on one system and work on it on another, it breaks.
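
To see what a checkout or an editor actually did to a given file, one option is to count the line-ending styles in its raw bytes. A minimal Python sketch follows; `count_line_endings` is a hypothetical helper, not part of git or RStudio, and the sample bytes are made up.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF and bare CR line endings in raw file content."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LF not preceded by CR
    cr = data.count(b"\r") - crlf  # CR not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


# Made-up sample with mixed endings, as can happen after editing a file
# on two different systems.
sample = b"one\r\ntwo\nthree\r\n"
print(count_line_endings(sample))  # prints: {'CRLF': 2, 'LF': 1, 'CR': 0}
```

For a file on disk, pass `open(path, "rb").read()`; opening in binary mode matters, because text mode may translate the line endings before they can be counted.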
Duncan Murdoch
Or am I missing something (I am still a newbie on these things...)?
Ivan C
On 03/02/2021 10:06, Duncan Murdoch wrote:
On 03/02/2021 2:14 a.m., Ivan Krylov wrote:
On Tue, 2 Feb 2021 17:01:05 +0100
Ivan Calandra <calan...@rgzm.de> wrote:
This happens to all text-based files (Rmd, MD, CSV...) but not to
non-editable files (PDF, XLSX...).
This is probably caused by Git helpfully converting text files from LF
(0x0A) line endings to CR LF (0x0D 0x0A) when checking out the
repository clone on Windows (and back when checking in).
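
To make the byte-level difference concrete, here is a small Python sketch with made-up sample content. It simulates the conversion by replacing each LF with CR LF, which is a simplification of what git does on checkout, and shows that the two versions differ in length and in MD5.

```python
import hashlib

lf_text = b"a,b\n1,2\n"
crlf_text = lf_text.replace(b"\n", b"\r\n")  # simulated Windows checkout

print(lf_text.hex(" "))    # prints: 61 2c 62 0a 31 2c 32 0a
print(crlf_text.hex(" "))  # prints: 61 2c 62 0d 0a 31 2c 32 0d 0a
print(len(lf_text), len(crlf_text))  # prints: 8 10

same_digest = hashlib.md5(lf_text).hexdigest() == hashlib.md5(crlf_text).hexdigest()
print(same_digest)  # prints: False
```

Files such as PDF or XLSX are not treated as text by git, so their bytes, and therefore their checksums, stay the same.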
This configuration option is described in Pro Git:
https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration#_core_autocrlf
I agree with Ivan K, but don't agree with the advice in that book.
It's best to just leave files alone, not to convert between LF and
CR-LF. I don't think this confuses many Windows editors these days,
but if your editor forces files into CR-LF form, you should fix the
editor, not try to work around it.
In my opinion everyone should run
git config --global core.autocrlf false
Some more arguments for this (in the context of GitHub Actions) are here:
https://github.community/t/git-config-core-autocrlf-should-default-to-false/16140
Duncan Murdoch
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.