Here is the link to the gist for the
code https://gist.github.com/yubrshen/63ffda973aff27d39868
On Wednesday, October 15, 2014 4:12:56 PM UTC+8, Yu Shen wrote:
>
> In the following code, I'm struggling to be idiomatic and at the same time
> to consider save times of iterations. I sometimes find it conflicts between
> the two objectives of expressiveness and performance, for example, in order
> to save time of iterations over a sequence, I have to use (loop, and recur)
> a lot, but for readability and expressiveness, some of such loop structure
> could be more elegantly expressed as list expressions of filters, etc.
>
> I'd like to seek your advise, whether my following code could be improved
> in both purposes?
>
> Thanks a lot for your help!
>
> Yu Shen
>
>
> (ns dw-pos-study.pairwise-match
> (:use incanter.stats)
> (:use utils-yushen.util)
> (:use clojure.test))
>
>
> ;; This package tries to find the most likely match between words
> independent of order in two phrases.
> ;; Care is taken to be efficient in order to support plausible
> approximate match of names between users' text nad a database.
>
>
> (defn switch [[x y]]
> [y x])
>
>
> (defn pairwise-distances-sorted
> "Returns sorted the pairwise distance between words of search and
> candidate in increasing distance order.
> The keyword parameter perfect-match is used to control how strict two
> words are considered to be perfect match.
> By default, it's 0, thus identical words are perfect match. Relax the
> degree for perfect match may reduce the number of pairwise distances, and
> the sorting them."
>
>
> [search candidate & {:keys [perfect-match] :or {perfect-match 0 }}]
>
>
> (->>
> (loop [pairs (for [x search y candidate] [x y])
> distance-map {}]
> (if (empty? pairs)
> distance-map
> (let [pair (first pairs)]
> (cond
> (get distance-map pair) (recur (rest pairs) distance-map)
> (get distance-map (switch pair)) (recur (rest pairs) (merge
> distance-map {pair (get distance-map (switch pair))}))
> :else (let [distance (apply incanter.stats/levenshtein-distance
> pair)
> relative-distance (/ distance (apply max (map count
> pair))) ; use relative distance to allow comparable tolerance in
> determining prefect match across different pair.
> ; the maximum length of the words in pair is the
> maximum of the distance possible for the pair.
> perfect-match? (<= relative-distance perfect-match) ;
> try to optimize when there is perfect match found
> ]
> (recur (if perfect-match? (filter (fn [[x y]] (and (not= x
> (first pair)) (not= y (second pair)))) (rest pairs)) (rest pairs)) ;
> further reduce the search space of pairwise distances.
> (merge distance-map {pair distance})))))))
> (into [], ) ; use sequence, as map's order is not satble
> (sort-by second, )))
>
>
> ;; Copided from
> http://programming-puzzler.blogspot.com/2010/07/translating-code-from-python-and-scheme.html
>
> thanks to mbAugust 2, 2010 at 10:59 PM
> (defn remove-first
> [item coll]
> (lazy-seq
> (when-let [s (seq coll)] ; use when-let, it automatically takes care
> of returning nil when there are no more items in the sequence.
> ; caching of the result of the seq call is that you save one level
> of indirection for any following first, next or rest call.
> (let [fst (first s)]
> (if (= fst item)
> (next s)
> (cons fst (remove-first item (rest coll))))))))
>
>
> (defn matched-closest
> "Find the closest matches of words between search and candidate
> independent of word order, returning the matches, and the residual of not
> matching.
> The match data contains the matched word pair, the levenshtein distance
> between them, and the maximum of word lengths of the words in the pair,
> which serves as the context to tell how significant the distance is.
> The keyword parameter perfect-match is used to control how strict two
> words are considered to be perfect match.
> By default, it's 0, thus identical words are perfect match."
>
>
> [search candidate & {:keys [perfect-match] :or {perfect-match 0}}]
>
>
> (let [word-distances-sorted (pairwise-distances-sorted search candidate
> :perfect-match perfect-match)]
> (loop [matched []
> filtered word-distances-sorted
> [not-matched-search not-matched-candidate] [search candidate]]
> (if (empty? filtered)
> [matched [not-matched-search not-matched-candidate]]
> (let [[[a b] d] (first filtered)
> matched-updated (conj matched [[a b] d (max (count a) (count
> b))])
> ; remove any element with key containing a or b in their
> respective position, as [a b] is already considered, there is no point to
> cornsider them anymore.
> filtered-updated (filter (fn [[[x y] _]] (and (not= a x)
> (not= b y)))
> filtered)]
> (recur matched-updated
> filtered-updated
> [(remove-first a not-matched-search) (remove-first b
> not-matched-candidate)]))))))
>
>
> (is (matched-closest (clojure.string/split "garbage real stuff" #"\s")
> (clojure.string/split "real stuff" #"\s"))
> [[[["stuff" "stuff"] 0 5] [["real" "real"] 0 4]] ['("garbage") '()]])
>
>
> (is (matched-closest (clojure.string/split "garbage real stuff" #"\s")
> (clojure.string/split "real stuff nonsense" #"\s"))
> [[[["stuff" "stuff"] 0 5] [["real" "real"] 0 4] [["garbage" "nonsense"] 7
> 8]] ['() '()]])
>
>
> (is (matched-closest (clojure.string/split "garbage real staff" #"\s")
> (clojure.string/split "reel stuff nonsense" #"\s") :perfect-match 0.5)
> [[[["staff" "stuff"] 1 5] [["real" "reel"] 1 4] [["garbage" "nonsense"] 7
> 8]] ['() '()]])
>
>
>
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.