Code clinic: Idiomatic and performance; finding the closest match of words in two phases independent of order

Yu Shen Wed, 15 Oct 2014 01:13:21 -0700

In the following code, I'm struggling to be idiomatic and at the same time 
to consider save times of iterations. I sometimes find it conflicts between 
the two objectives of expressiveness and performance, for example, in order 
to save time of iterations over a sequence, I have to use (loop, and recur) 
a lot, but for readability and expressiveness, some of such loop structure 
could be more elegantly expressed as list expressions of filters, etc.


I'd like to seek your advise, whether my following code could be improved 
in both purposes?

Thanks a lot for your help!

Yu Shen 


(ns dw-pos-study.pairwise-match
  (:use incanter.stats)
  (:use utils-yushen.util)
  (:use clojure.test))


;; This package tries to find the most likely match between words 
independent of order in two phrases. 
;; Care is taken to be efficient in order to support plausible approximate 
match of names between users' text nad a database. 


(defn switch [[x y]]
  [y x])


(defn pairwise-distances-sorted
  "Returns sorted the pairwise distance between words of search and 
candidate in increasing distance order.
  The keyword parameter perfect-match is used to control how strict two 
words are considered to be perfect match.
  By default, it's 0, thus identical words are perfect match. Relax the 
degree for perfect match may reduce the number of pairwise distances, and 
the sorting them."


  [search candidate & {:keys [perfect-match] :or {perfect-match 0 }}]


  (->>
   (loop [pairs (for [x search y candidate] [x y])
          distance-map {}]
     (if (empty? pairs)
       distance-map
       (let [pair (first pairs)]
        (cond
        (get distance-map pair) (recur (rest pairs) distance-map)
        (get distance-map (switch pair)) (recur (rest pairs) (merge 
distance-map {pair (get distance-map (switch pair))}))
        :else (let [distance (apply incanter.stats/levenshtein-distance 
pair)
                    relative-distance (/ distance (apply max (map count 
pair))) ; use relative distance to allow comparable tolerance in 
determining prefect match across different pair.
                    ; the maximum length of the words in pair is the 
maximum of the distance possible for the pair.
                    perfect-match? (<= relative-distance perfect-match) ; 
try to optimize when there is perfect match found
                    ]
                (recur (if perfect-match? (filter (fn [[x y]] (and (not= x 
(first pair)) (not= y (second pair)))) (rest pairs)) (rest pairs)) ; 
further reduce the search space of pairwise distances.
                       (merge distance-map {pair distance})))))))
   (into [], ) ; use sequence, as map's order is not satble
   (sort-by second, )))


;; Copided from 
http://programming-puzzler.blogspot.com/2010/07/translating-code-from-python-and-scheme.html
 
thanks to mbAugust 2, 2010 at 10:59 PM
(defn remove-first
  [item coll]
  (lazy-seq
    (when-let [s (seq coll)] ; use when-let, it automatically takes care of 
returning nil when there are no more items in the sequence.
      ; caching of the result of the seq call is that you save one level of 
indirection for any following first, next or rest call.
      (let [fst (first s)]
        (if (= fst item)
          (next s)
          (cons fst (remove-first item (rest coll))))))))


(defn matched-closest
  "Find the closest matches of words between search and candidate 
independent of word order, returning the matches, and the residual of not 
matching.
  The match data contains the matched word pair, the levenshtein distance 
between them, and the maximum of word lengths of the words in the pair,
  which serves as the context to tell how significant the distance is.
  The keyword parameter perfect-match is used to control how strict two 
words are considered to be perfect match.
  By default, it's 0, thus identical words are perfect match."


  [search candidate & {:keys [perfect-match] :or {perfect-match 0}}]


  (let [word-distances-sorted (pairwise-distances-sorted search candidate 
:perfect-match perfect-match)]
    (loop [matched []
           filtered word-distances-sorted
           [not-matched-search not-matched-candidate] [search candidate]]
      (if (empty? filtered)
        [matched [not-matched-search not-matched-candidate]]
        (let [[[a b] d] (first filtered)
              matched-updated (conj matched [[a b] d (max (count a) (count 
b))])
              ; remove any element with key containing a or b in their 
respective position, as [a b] is already considered, there is no point to 
cornsider them anymore.
              filtered-updated (filter (fn [[[x y] _]] (and (not= a x) 
(not= b y)))
                                       filtered)]
          (recur matched-updated
                 filtered-updated
                 [(remove-first a not-matched-search) (remove-first b 
not-matched-candidate)]))))))


(is (matched-closest (clojure.string/split "garbage real stuff" #"\s") 
(clojure.string/split "real stuff" #"\s")) 
    [[[["stuff" "stuff"] 0 5] [["real" "real"] 0 4]] ['("garbage") '()]])


(is (matched-closest (clojure.string/split "garbage real stuff" #"\s") 
(clojure.string/split "real stuff nonsense" #"\s"))
 [[[["stuff" "stuff"] 0 5] [["real" "real"] 0 4] [["garbage" "nonsense"] 7 
8]] ['() '()]])


(is (matched-closest (clojure.string/split "garbage real staff" #"\s") 
(clojure.string/split "reel stuff nonsense" #"\s") :perfect-match 0.5)
 [[[["staff" "stuff"] 1 5] [["real" "reel"] 1 4] [["garbage" "nonsense"] 7 
8]] ['() '()]])


-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Code clinic: Idiomatic and performance; finding the closest match of words in two phases independent of order

Reply via email to