I'm a few months into learning Clojure, and thought I'd put this
function out for comment.
I need to take a message digest of files on disk, and I'm using
java.security.MessageDigest to do this. The class has an update method
which accepts an array of bytes and updates the hash. This calls for
the common read-update pattern, but in Clojure. So I decided to try my
hand at a lazy sequence of byte arrays:
(defn stream-block-seq
  "A lazy sequence of blocks read from the given input-stream. Each
  block is returned as a separately allocated Java byte array. The
  maximum block size is given as the optional second argument; the
  default is 1024. A returned block may be shorter than the blocksize.
  Usually, the last block will be short. If the stream is exhausted,
  the result is nil."
  ([s blocksize]
   (let [buf (byte-array blocksize)
         readlen (.read s buf)]
     (if (>= readlen 0)
       (lazy-seq
        (let [newbuf (if (< readlen blocksize)
                       (copy-array buf (byte-array readlen) readlen)
                       buf)]
          (cons newbuf (stream-block-seq s blocksize)))))))
  ([s] (stream-block-seq s 1024)))
Here's copy-array:
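(defn copy-array
  "Copies len elements from src (starting at srcpos) into dest
  (starting at destpos) and returns dest. The three-argument form
  copies from position 0 to position 0."
  ([src srcpos dest destpos len]
   (System/arraycopy src srcpos dest destpos len)
   dest)
  ([src dest len]
   (copy-array src 0 dest 0 len)))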
And here's the message-digest function that uses it:
(defn message-digest
  "Generates a digest of the given input plaintext. Input must be a
  Java byte array or a Java ByteBuffer. Options are keyword arguments:
  :hash names the algorithm (default \"SHA-256\") and :blocksize sets
  the read buffer size in bytes (default 32768). The result is a
  vector of bytes.
  See
  http://download.oracle.com/javase/1.5.0/docs/guide/security/CryptoSpec.html#AppA
  for more information on the available hashes."
  ([input & opts]
   (let [opts (merge {:hash "SHA-256" :blocksize 32768}
                     (apply hash-map opts))
         hashname (opts :hash)
         blocksize (opts :blocksize)
         md (MessageDigest/getInstance hashname)]
     (doseq [buf (stream-block-seq (input-stream input) blocksize)]
       (.update md buf))
     (vec (.digest md)))))
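To show how the result might be consumed: the digest comes back as a vector of signed bytes, which is awkward to compare against published checksums. Below is a rough sketch of rendering it as hex. bytes->hex is a made-up helper name, not part of the code above, and the digest var is computed directly from MessageDigest as a stand-in for a call to message-digest, so that the snippet runs on its own.

```clojure
(import 'java.security.MessageDigest)

(defn bytes->hex
  "Hypothetical helper: renders a sequence of bytes as a lowercase hex
  string. Each byte is masked to 0..255 first, because JVM bytes are
  signed."
  [bs]
  (apply str (map #(format "%02x" (bit-and (long %) 0xff)) bs)))

;; Stand-in for (message-digest ...): the SHA-256 of the bytes of "abc",
;; wrapped in vec to match the return type of message-digest.
(def digest
  (vec (.digest (MessageDigest/getInstance "SHA-256")
                (.getBytes "abc" "UTF-8"))))

(println (bytes->hex digest))
```

The same helper should work on any vector that message-digest returns, since it only assumes a sequence of bytes.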
This all seems to work, and the performance seems acceptable: with a
32k buffer size, on my Core 2 Duo MacBook it takes about 50ms to hash
a 1 MiB file from disk, and 20ms from filesystem cache. However, I'm
sure there's plenty of room for improvement. Is there a cleaner or
more efficient way to do this?
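For comparison, here is a minimal sketch of the plain read-update loop, without the lazy sequence. It reuses a single buffer and passes the read length to the three-argument update, so no per-block allocation or copy is needed. The name digest-stream and its argument order are made up for illustration, and this variant has not been benchmarked against the code above.

```clojure
(import '(java.io ByteArrayInputStream InputStream)
        'java.security.MessageDigest)

(defn digest-stream
  "Hypothetical non-lazy variant: reads the stream s block by block into
  one reusable buffer and feeds each block to the digest. Returns the
  digest as a vector of bytes."
  ([s] (digest-stream s "SHA-256" 32768))
  ([^InputStream s hashname blocksize]
   (let [md (MessageDigest/getInstance hashname)
         ^bytes buf (byte-array blocksize)]
     (loop []
       (let [n (.read s buf)]
         ;; .read returns -1 at end of stream
         (when (>= n 0)
           ;; only the first n bytes of buf are valid for this block
           (.update md buf 0 n)
           (recur))))
     (vec (.digest md)))))

;; Usage on an in-memory stream, with a tiny block size to force
;; several reads:
(digest-stream (ByteArrayInputStream. (.getBytes "abc" "UTF-8"))
               "SHA-256" 2)
```

The trade-off is that the blocks are no longer available as a sequence for other consumers; the loop is tied to the digest.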
I found two previous threads which deal with similar puzzles:
* Resource cleanup when lazy sequences are finalized:
http://groups.google.com/group/clojure/browse_thread/thread/caece062119de072/13c15c62c3397597?lnk=gst&q=lazy+buffered#13c15c62c3397597
* contrib mmap/duck_streams for binary data:
http://groups.google.com/group/clojure/browse_thread/thread/f5239c7e66e7fb54/813b70b68081456d?lnk=gst&q=lazy+binary+stream#813b70b68081456d
I must say that I find the lazy-sequence approach conceptually quite
attractive here, but my taste may not yet be properly formed :)
Comments welcome!
thanks
Michael Ashton.