Well, you can use binary input files like RDS, qs, or parquet. But you already have your code and data in Git, so checking your input is redundant... just put in a binary output reference file and a test that verifies it.
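
A minimal sketch of such a test, written in Python purely for illustration (the same idea carries over to R). The file names and the helpers file_digest and matches_reference are hypothetical, and SHA-256 is an assumption standing in for whatever checksum you prefer. The files are read as raw bytes, so no newline or encoding translation is involved.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash the raw bytes of a file (no newline or encoding translation)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def matches_reference(output: Path, reference: Path) -> bool:
    """True when the output is byte-for-byte identical to the reference."""
    return file_digest(output) == file_digest(reference)


# Self-contained demonstration; both file names are hypothetical.
workdir = Path(tempfile.mkdtemp())
reference = workdir / "reference_output.bin"  # would be committed to Git
output = workdir / "new_output.bin"           # produced by re-running the analysis

payload = bytes([0, 1, 2, 13, 10, 255])       # arbitrary binary content
reference.write_bytes(payload)
output.write_bytes(payload)
same = matches_reference(output, reference)

output.write_bytes(payload + b"\x00")         # simulate a changed result
changed = matches_reference(output, reference)

print(same, changed)
```

In a real project the reference file would live in the repository and the comparison would sit in the test suite, so a changed result shows up as a failing test rather than a surprise.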
On February 3, 2021 8:25:33 AM PST, Ivan Calandra <calan...@rgzm.de> wrote:
>Dear Jeff,
>
>If I understood you correctly, it makes sense that I explain more about
>my goal here:
>
>I am trying to find ways to have analyses that are as reproducible as
>possible (knowing that it is not going to be perfect). One part is to
>show which file(s) I use as input and what output was created, so that
>potential readers/users of my analysis can check that the file they have
>is indeed the same that I use (and not a corrupted or modified version).
>Does that make sense?
>
>And for this purpose, I originally used file information (like creation
>time and so on), but I quickly realized this doesn't help much. Then I
>tried with MD5 and I thought it was solved, but it was obviously not
>solved.
>
>Duncan's solution seems to work (I have not fully checked yet, though),
>but I am really open to other, more robust alternatives.
>
>Thanks for the input!
>Ivan
>
>--
>Dr. Ivan Calandra
>TraCEr, laboratory for Traceology and Controlled Experiments
>MONREPOS Archaeological Research Centre and
>Museum for Human Behavioural Evolution
>Schloss Monrepos
>56567 Neuwied, Germany
>+49 (0) 2631 9772-243
>https://www.researchgate.net/profile/Ivan_Calandra
>
>On 03/02/2021 17:15, Jeff Newmiller wrote:
>> This CR vs LF vs CRLF newline discrepancy has been around since the
>> 70s and the CP/M operating system. And it remains an issue in
>> over-the-wire internet text protocols today, which actually use the
>> CRLF version like Windows. Sorry, UNIX... world domination of LF
>> encoding failed.
>>
>> The problem with pretending there is no issue, as Duncan is
>> advocating, is that text is treated differently than binary, and
>> every time you pretend it isn't, it comes back to bite you. Applying
>> binary algorithms like MD5 to text is one of these areas where your
>> expectation that this will be successful is what creates the problem
>> in the first place. A similar issue occurs in file encoding: two
>> files may both contain the word "Hello", but if they are encoded in
>> UTF-16 and UTF-8 respectively then the MD5 results will be different.
>>
>> Git does not (currently) support differences in encoding, but it does
>> support text vs non-text (newline) differences because they are
>> unavoidable. Pushing forward with your expectation that text files
>> should compare the same in binary by assuming text will always be
>> like UNIX text just defers the problem for another day.
>>
>> Since I don't know what problem you are actually trying to solve, I
>> cannot offer a concrete solution. But I would begin by not assuming
>> that MD5 works the same on text and binary files... because it
>> doesn't.
>>
>> On February 3, 2021 2:48:56 AM PST, Duncan Murdoch
>> <murdoch.dun...@gmail.com> wrote:
>>> On 03/02/2021 4:42 a.m., Ivan Calandra wrote:
>>>> Thank you Ivan and Duncan for your help.
>>>>
>>>> I understand your point Duncan, but the thing is that I do have an
>>>> issue here.
>>>> Is it then due to RStudio or even Windows? If it is, I can forget
>>>> about a solution on that end, so I would focus on what I can do,
>>>> and this Git setting seems to be the best place to start.
>>>
>>> In my opinion, you should run
>>>
>>>     git config --global core.autocrlf false
>>>
>>> in an RStudio terminal session. That will set the git options so
>>> they don't mess up the md5sum values.
>>>
>>> You should also go to the RStudio options, and in the Code section,
>>> Saving tab, choose Serialization to be Posix (LF) and default text
>>> encoding to be UTF-8.
>>>
>>> Unfortunately, RStudio will still mess up the .Rproj file (see
>>> https://github.com/rstudio/rstudio/issues/1929); there's not much
>>> you can do about that. Just try not to commit the Windows version to
>>> the repository if any non-Windows users are sharing it.
>>>
>>> But do note that other people have different opinions. They argue
>>> that files should be converted to Windows native format by git. That
>>> works in some narrow use cases, but as soon as you try to extract a
>>> file from git on one system and work on it on another, it breaks.
>>>
>>> Duncan Murdoch
>>>
>>>> Or am I missing something (I am still a newbie on these things...)?
>>>>
>>>> Ivan C
>>>>
>>>> On 03/02/2021 10:06, Duncan Murdoch wrote:
>>>>> On 03/02/2021 2:14 a.m., Ivan Krylov wrote:
>>>>>> On Tue, 2 Feb 2021 17:01:05 +0100
>>>>>> Ivan Calandra <calan...@rgzm.de> wrote:
>>>>>>
>>>>>>> This happens to all text-based files (Rmd, MD, CSV...) but not
>>>>>>> to non-editable files (PDF, XLSX...).
>>>>>>
>>>>>> This is probably caused by Git helpfully converting text files
>>>>>> from LF (0x0A) line endings to CR LF (0x0D 0x0A) when checking
>>>>>> out the repository clone on Windows (and back when checking in).
>>>>>>
>>>>>> This configuration option is described in Pro Git:
>>>>>> https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration#_core_autocrlf
>>>>>
>>>>> I agree with Ivan K, but don't agree with the advice in that book.
>>>>>
>>>>> It's best to just leave files alone, not to convert between LF and
>>>>> CR-LF. I don't think this confuses many Windows editors these
>>>>> days, but if your editor forces files into CR-LF form, you should
>>>>> fix the editor, not try to work around it.
>>>>>
>>>>> In my opinion everyone should run
>>>>>
>>>>>     git config --global core.autocrlf false
>>>>>
>>>>> Some more arguments for this (in the context of GitHub Actions)
>>>>> are here:
>>>>>
>>>>> https://github.community/t/git-config-core-autocrlf-should-default-to-false/16140
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.

--
Sent from my phone. Please excuse my brevity.
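
The newline and encoding effects discussed in the quoted thread can be reproduced in a few lines. This is only an illustrative sketch in Python: the helper names md5_hex and md5_text are hypothetical, and normalizing line endings before hashing is one possible workaround, not something proposed in the thread. The normalization assumes an ASCII-compatible encoding such as UTF-8; it would not be correct for UTF-16 bytes.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """MD5 of the raw bytes, exactly as stored on disk."""
    return hashlib.md5(data).hexdigest()


def md5_text(data: bytes) -> str:
    """MD5 after converting CRLF and lone CR to LF.

    Assumes the bytes are text in an ASCII-compatible encoding (e.g. UTF-8).
    """
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return md5_hex(normalized)


# The same two lines of text with UNIX (LF) and Windows (CRLF) line endings.
lf = b"Hello\nworld\n"
crlf = b"Hello\r\nworld\r\n"

raw_differs = md5_hex(lf) != md5_hex(crlf)           # raw bytes differ
normalized_equal = md5_text(lf) == md5_text(crlf)    # same text once normalized

# The same word in two encodings also hashes differently.
utf8 = "Hello".encode("utf-8")       # 5 bytes
utf16 = "Hello".encode("utf-16-le")  # 10 bytes
encoding_differs = md5_hex(utf8) != md5_hex(utf16)

print(raw_differs, normalized_equal, encoding_differs)
```

Whether to normalize before hashing or to keep the bytes fixed (for example with core.autocrlf set to false) is a design choice; the sketch only shows why the raw digests disagree.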