Well, you can use binary input files such as RDS, qs, or Parquet. But you already
have your code and data in Git, so checking your input is redundant... just add
a binary reference output file and a test that verifies your output against it.
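
A minimal sketch of that reference-file idea, written in Python only to keep it self-contained (the same shape would apply to an RDS file in R). The file names and the helpers `file_md5` and `matches_reference` are made up for illustration, and the temporary directory stands in for a real repository:

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file's raw bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def matches_reference(output_path, reference_path):
    """True if the fresh output is byte-for-byte identical to the reference."""
    return file_md5(output_path) == file_md5(reference_path)


with tempfile.TemporaryDirectory() as tmp:
    reference = Path(tmp) / "reference.bin"  # hypothetical committed reference
    fresh = Path(tmp) / "fresh.bin"          # hypothetical regenerated output
    payload = bytes([0, 1, 2, 253, 254, 255])

    reference.write_bytes(payload)
    fresh.write_bytes(payload)
    same = matches_reference(fresh, reference)

    # Simulate a changed or corrupted output by appending one byte.
    fresh.write_bytes(payload + b"\x00")
    changed = matches_reference(fresh, reference)

print(same, changed)
```

Because the files are opened as raw bytes and are not text, nothing along the way has a reason to rewrite line endings, so the digest comparison should behave the same on every platform.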

On February 3, 2021 8:25:33 AM PST, Ivan Calandra <calan...@rgzm.de> wrote:
>Dear Jeff,
>
>If I understood you correctly, it makes sense that I explain more about
>my goal here:
>
>I am trying to find ways to have analyses that are as reproducible as 
>possible (knowing that it is not going to be perfect). One part is to 
>show which file(s) I use as input and what output was created, so that 
>potential readers/users of my analysis can check that the file they have
>is indeed the same as the one I used (and not a corrupted or modified
>version).
>Does that make sense?
>
>And for this purpose, I originally used file information (like creation
>time and so on), but I quickly realized this doesn't help much. Then I
>tried with MD5 and I thought it was solved, but it was obviously not
>solved.
>
>Duncan's solution seems to work (I have not fully checked yet, though),
>but I am really open to other, more robust alternatives.
>
>Thanks for the input!
>Ivan
>
>--
>Dr. Ivan Calandra
>TraCEr, laboratory for Traceology and Controlled Experiments
>MONREPOS Archaeological Research Centre and
>Museum for Human Behavioural Evolution
>Schloss Monrepos
>56567 Neuwied, Germany
>+49 (0) 2631 9772-243
>https://www.researchgate.net/profile/Ivan_Calandra
>
>On 03/02/2021 17:15, Jeff Newmiller wrote:
>> This CR vs LF vs CRLF newline discrepancy has been around since the
>> 70s and the CP/M operating system. And it remains an issue in
>> over-the-wire internet text protocols today, which actually use the
>> CRLF version like Windows. Sorry, UNIX... world domination of LF
>> encoding failed.
>>
>> The problem with pretending there is no issue as Duncan is advocating
>> is that text is treated differently than binary, and every time you
>> pretend it isn't it comes back to bite you. Applying binary algorithms
>> like MD5 to text is one of these areas where your expectation that this
>> will be successful is what creates the problem in the first place. A
>> similar issue occurs in file encoding: two files may both contain the
>> word "Hello" but if they are encoded in UTF-16 and UTF-8 respectively
>> then the MD5 results will be different.
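
To make the point above concrete, here is a small sketch in Python (chosen only so that it is self-contained; nothing in it is specific to R or Git). It hashes the same word stored with different line endings and encodings, then hashes a canonical form. The helper names `md5_hex` and `canonical_md5` are made up for this illustration, and the canonical form (LF newlines, UTF-8) is just one possible convention:

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """MD5 hex digest of raw bytes, exactly as a checksum tool would see them."""
    return hashlib.md5(data).hexdigest()


def canonical_md5(raw: bytes, encoding: str) -> str:
    """Decode, normalise line endings to LF, re-encode as UTF-8, then hash."""
    text = raw.decode(encoding)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return md5_hex(text.encode("utf-8"))


# The same one-line text, stored three different ways.
utf8_lf = "Hello\n".encode("utf-8")
utf8_crlf = "Hello\r\n".encode("utf-8")
utf16_lf = "Hello\n".encode("utf-16")  # Python's "utf-16" codec prepends a BOM

# Hashing the raw bytes gives three different digests.
raw_digests = {md5_hex(utf8_lf), md5_hex(utf8_crlf), md5_hex(utf16_lf)}

# Hashing the canonical text gives a single digest.
canonical_digests = {
    canonical_md5(utf8_lf, "utf-8"),
    canonical_md5(utf8_crlf, "utf-8"),
    canonical_md5(utf16_lf, "utf-16"),
}

print(len(raw_digests), len(canonical_digests))
```

The catch is that the canonical hash needs to know each file's encoding up front, which is exactly the information a plain byte-level MD5 does not have.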
>>
>> Git does not (currently) support differences in encoding, but it does
>> support text vs non-text (newline) differences because they are
>> unavoidable. Pushing forward with your expectation that text files
>> should compare the same in binary by assuming text will always be like
>> UNIX text just defers the problem for another day.
>>
>> Since I don't know what problem you are actually trying to solve, I
>> cannot offer a concrete solution. But I would begin by not assuming
>> that MD5 works the same on text and binary files... because it doesn't.
>>
>> On February 3, 2021 2:48:56 AM PST, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>>> On 03/02/2021 4:42 a.m., Ivan Calandra wrote:
>>>> Thank you Ivan and Duncan for your help.
>>>>
>>>> I understand your point Duncan, but the thing is that I do have an issue
>>>> here.
>>>> Is it then due to RStudio or even Windows? If it is, I can forget about
>>>> a solution on that end, so I would focus on what I can do, and this Git
>>>> setting seems to be the best place to start.
>>> In my opinion, you should run
>>>
>>>   git config --global core.autocrlf false
>>>
>>> in an RStudio terminal session.  That will set the git options so they
>>> don't mess up the md5sum values.
>>>
>>> You should also go to the RStudio options, and in the Code section,
>>> Saving tab, under Serialization, set line ending conversion to Posix (LF)
>>> and default text encoding to UTF-8.
>>>
>>> Unfortunately, RStudio will still mess up the .Rproj file (see
>>> https://github.com/rstudio/rstudio/issues/1929); there's not much you
>>> can do about that.  Just try not to commit the Windows version to the
>>> repository if any non-Windows users are sharing it.
>>>
>>> But do note that other people have different opinions.  They argue that
>>> files should be converted to Windows native format by git.  That works
>>> in some narrow use cases, but as soon as you try to extract a file from
>>> git on one system and work on it on another, it breaks.
>>>
>>> Duncan Murdoch
>>>
>>>
>>>> Or am I missing something (I am still a newbie on these things...)?
>>>>
>>>> Ivan C
>>>>
>>>> On 03/02/2021 10:06, Duncan Murdoch wrote:
>>>>> On 03/02/2021 2:14 a.m., Ivan Krylov wrote:
>>>>>> On Tue, 2 Feb 2021 17:01:05 +0100
>>>>>> Ivan Calandra <calan...@rgzm.de> wrote:
>>>>>>
>>>>>>> This happens to all text-based files (Rmd, MD, CSV...) but not to
>>>>>>> non-editable files (PDF, XLSX...).
>>>>>> This is probably caused by Git helpfully converting text files from LF
>>>>>> (0x0A) line endings to CR LF (0x0D 0x0A) when checking out the
>>>>>> repository clone on Windows (and back when checking in).
>>>>>>
>>>>>> This configuration option is described in Pro Git:
>>>>>>
>>>>>> https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration#_core_autocrlf
>>>>> I agree with Ivan K, but don't agree with the advice in that book.
>>>>>
>>>>> It's best to just leave files alone, not to convert between LF and
>>>>> CR-LF.  I don't think this confuses many Windows editors these days,
>>>>> but if your editor forces files into CR-LF form, you should fix the
>>>>> editor, not try to work around it.
>>>>>
>>>>> In my opinion everyone should run
>>>>>
>>>>>    git config --global core.autocrlf false
>>>>>
>>>>> Some more arguments for this (in the context of GitHub Actions) are here:
>>>>>
>>>>> https://github.community/t/git-config-core-autocrlf-should-default-to-false/16140
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.

-- 
Sent from my phone. Please excuse my brevity.

