On 2020-05-14 03:47, Albretch Mueller wrote:
> The thing is that I have to call, say sha256sum, on millions of files
> Probably debian admin people dealing with packaging have to deal with
> the same kinds of issues.
> lbrtchx
The need to checksum files is common; it is a good test case for trying
out different computing paradigms and/or programming languages.
As other people have mentioned, using find(1) or xargs(1) from the
command line to invoke sha256sum(1) is one possibility. All of these
tools are mature and should produce predictable results. When used
correctly, their performance is good. For ad-hoc tasks, this is how
it's done.
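
For example, assuming GNU findutils and coreutils, and with
/path/to/files standing in for the real directory, the one-off
invocation might look like:

    # one checksum line per file, written to stdout
    find /path/to/files -type f -exec sha256sum {} \;

    # or via xargs; -print0/-0 keep odd file names intact
    find /path/to/files -type f -print0 | xargs -0 sha256sum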
If you find that you need to parameterize the invocation, such as to use
one set of arguments and/or options for one set of files and another set
for other files, you can cut and paste an example invocation into a text
file, parameterize it with variables, and add code to make the file into
a script. I would start with a Bourne shell script. Of course, there
are many other scripting languages to choose from; pick your favorite.
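
A minimal sketch of such a script -- the name checksum.sh, the SUM
variable, and its default are all illustrative choices of mine, not
a finished tool:

    #!/bin/sh
    # checksum.sh -- checksum every regular file under a directory.
    # Usage: ./checksum.sh [directory]
    ROOT="${1:-.}"              # directory to scan; default is cwd
    SUM="${SUM:-sha256sum}"     # checksum command; override via env
    find "$ROOT" -type f -exec "$SUM" {} +

Run it as './checksum.sh /srv/archive', or swap the algorithm with
'SUM=md5sum ./checksum.sh /srv/archive'.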
Even if you do not need parameterization, typing './myscript' requires
fewer keystrokes and less mental effort than recalling a find(1)
incantation over and over again. And it provides consistency. These
considerations are important when you are brain fried and heading for
log off, or crawling through the files months later.
As you plan to perform the SHA256 computation a great many times, you
should consider the cost of Unix process creation and tear-down -- e.g.
CPU cycles (time) and memory usage. If you write a program that
computes many checksums per process, it will have less overhead and
should finish in less time than a program that creates one process per
input file. Benchmarking will tell.
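
One way to measure that difference, again with /path/to/files as a
placeholder, is to time the two -exec forms of find(1) on the same
tree:

    # one sha256sum process per file: pays fork/exec for each
    time find /path/to/files -type f -exec sha256sum {} \; > /dev/null

    # many files per process: the '+' form batches arguments
    time find /path/to/files -type f -exec sha256sum {} + > /dev/null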
The process-overhead question ties into the desired output format.
Obvious choices
include one checksum file for all input files vs. one checksum file per
input file. The plus sign in the '-exec command {} +' option to find(1)
facilitates the former, and should be efficient.
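
Concretely, a single collected checksum file could look like this
(the name SHA256SUMS is just a convention I am borrowing from
release archives):

    # collect every checksum into one file...
    find /path/to/files -type f -exec sha256sum {} + > SHA256SUMS

    # ...and verify the whole tree later
    sha256sum -c SHA256SUMS

One catch: write SHA256SUMS somewhere outside /path/to/files, or a
later run will checksum the checksum file too.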
Also, where to put the output file(s) -- in the current working
directory, within the input tree, within a parallel tree, or someplace
else? One output file for everything is easiest, but my archive and
image scripts checksum the input files individually and touch(1) the
checksum file modification times to match.
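
A rough sketch of that per-file scheme -- the .sha256 suffix is my
own choice, and the read loop assumes file names without embedded
newlines:

    find /path/to/files -type f ! -name '*.sha256' |
    while IFS= read -r f; do
        sha256sum "$f" > "$f.sha256"   # one checksum file per input
        touch -r "$f" "$f.sha256"      # match the input's mtime
    done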
Another consideration is concurrency. If you have a multi-core
processor and implement a solution that puts two or more cores to work
at the same time, a concurrent program should finish sooner than a
sequential program. Again, benchmarking.
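
With GNU xargs, the -P option gives you that concurrency without
writing any threaded code yourself; the 4 and 64 below are guesses
to tune against your core count and tree:

    # up to 4 sha256sum processes at once, 64 files per batch
    find /path/to/files -type f -print0 |
        xargs -0 -P 4 -n 64 sha256sum > SHA256SUMS

Be aware that lines from the parallel processes can come out in any
order, and the GNU xargs documentation warns they may interleave,
so sort or post-process the result if order matters.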
I find that Bourne shell scripts are comfortable only up to a certain
level of complexity. Above that, I use Perl. That said, Go would be
well suited to this task and should be faster. Then there is C,
assembly, and/or hardware acceleration.
David