On 2020-05-14 03:47, Albretch Mueller wrote:
> The thing is that I have to call, say sha256sum, on millions of files
> Probably debian admin people dealing with packaging have to deal with
> the same kinds of issues.
> lbrtchx
The need to checksum files is common; it is a good test case for trying
out different computing paradigms and/or programming languages.
As other people have mentioned, using find(1) or xargs(1) from the
command line to invoke sha256sum(1) is one possibility. All of these
tools are mature and should produce predictable results. When used
correctly, their performance is good. For ad-hoc tasks, this is how
it's done.
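
For example, assuming GNU findutils and coreutils, and with
/path/to/files standing in for the real directory, the one-off
invocation might look like:

    # one checksum line per file, written to stdout
    find /path/to/files -type f -exec sha256sum {} \;

    # or via xargs; -print0/-0 keep odd file names intact
    find /path/to/files -type f -print0 | xargs -0 sha256sum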
If you find that you need to parameterize the invocation, such as to use
one set of arguments and/or options for one set of files and another set
for other files, you can cut and paste an example invocation into a text
file, parameterize it with variables, and add code to make the file into
a script. I would start with a Bourne shell script. Of course, there
are many other scripting languages to choose from; pick your favorite.
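
A minimal sketch of such a script -- the name checksum.sh, the SUM
variable, and its default are all illustrative choices of mine, not
a finished tool:

    #!/bin/sh
    # checksum.sh -- checksum every regular file under a directory.
    # Usage: ./checksum.sh [directory]
    ROOT="${1:-.}"              # directory to scan; default is cwd
    SUM="${SUM:-sha256sum}"     # checksum command; override via env
    find "$ROOT" -type f -exec "$SUM" {} +

Run it as './checksum.sh /srv/archive', or swap the algorithm with
'SUM=md5sum ./checksum.sh /srv/archive'.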
Even if you do not need parameterization, typing './myscript' requires
fewer keystrokes and less mental effort than recalling a find(1)
incantation over and over again. And it provides consistency. These
considerations are important when you are brain fried and heading for
log off, or crawling through the files months later.
As you plan to perform the SHA256 computation a great many times, you
should consider the cost of Unix process creation and tear-down -- e.g.
CPU cycles (time) and memory usage. If you write a program that
computes many checksums per process, it will have less overhead and
should finish in less time than a program that creates one process per
input file. Benchmarking will tell.
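
One way to measure that difference, again with /path/to/files as a
placeholder, is to time the two -exec forms of find(1) on the same
tree:

    # one sha256sum process per file: pays fork/exec for each
    time find /path/to/files -type f -exec sha256sum {} \; > /dev/null

    # many files per process: the '+' form batches arguments
    time find /path/to/files -type f -exec sha256sum {} + > /dev/null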
The process-overhead question ties into the desired output format.
Obvious choices
include one checksum file for all input files vs. one checksum file per
input file. The plus sign in the '-exec command {} +' option to find(1)
facilitates the former, and should be efficient.
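
Concretely, a single collected checksum file could look like this
(the name SHA256SUMS is just a convention I am borrowing from
release archives):

    # collect every checksum into one file...
    find /path/to/files -type f -exec sha256sum {} + > SHA256SUMS

    # ...and verify the whole tree later
    sha256sum -c SHA256SUMS

One catch: write SHA256SUMS somewhere outside /path/to/files, or a
later run will checksum the checksum file too.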
Also, where to put the output file(s) -- in the current working
directory, within the input tree, within a parallel tree, or someplace
else? One output file for everything is easiest, but my archive and
image scripts checksum the input files individually and touch(1) the
checksum file modification times to match.
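
A rough sketch of that per-file scheme -- the .sha256 suffix is my
own choice, and the read loop assumes file names without embedded
newlines:

    find /path/to/files -type f ! -name '*.sha256' |
    while IFS= read -r f; do
        sha256sum "$f" > "$f.sha256"   # one checksum file per input
        touch -r "$f" "$f.sha256"      # match the input's mtime
    done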
Another consideration is concurrency. If you have a multi-core
processor and implement a solution that puts two or more cores to work
at the same time, a concurrent program should finish sooner than a
sequential program. Again, benchmarking.
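
With GNU xargs, the -P option gives you that concurrency without
writing any threaded code yourself; the 4 and 64 below are guesses
to tune against your core count and tree:

    # up to 4 sha256sum processes at once, 64 files per batch
    find /path/to/files -type f -print0 |
        xargs -0 -P 4 -n 64 sha256sum > SHA256SUMS

Be aware that lines from the parallel processes can come out in any
order, and the GNU xargs documentation warns they may interleave,
so sort or post-process the result if order matters.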
I find that Bourne shell scripts are comfortable only up to a certain
level of complexity. Above that, I use Perl. That said, Go would be
well suited to this task and should be faster. Then there is C,
assembly, and/or hardware acceleration.
David