I have a plain text file containing an English-language essay that I'd like
to split into sentences, based on the presence of punctuation.
I wrote this function to determine if a given character is an English
punctuation mark:
(defn ispunc? [c]
  (> (count (filter #(= % c) '("." "!" "?" ";"))) 0))
I know that this method is not grammatically perfect, in that acronyms such
as "U.S." will get mis-parsed, etc., but this is just an experiment and
does not need that level of precision.
Then, I tried applying it with partition-by on a file I've slurped:
(def my-text (slurp "mytext.txt"))
(def my-sentences (partition-by ispunc? my-text))
Unfortunately, this returns a sequence of length 1, whose first and only
element contains the entire text, since ispunc? depends on looking at a
single character.
So I tried producing a list of chars from the string and passing it to
partition-by with ispunc? like this:
(def my-text-chars (partition (count my-text) my-text))
(def my-sentences (partition-by ispunc? (first my-text-chars)))
That worked, in that it's logically "correct", but when I try to access any
of the elements in my-sentences I get a java.lang.OutOfMemoryError (the
source text file, "mytext.txt", is 1.3 MB in size).
So is there a simpler and more idiomatic way of doing this without using up
all the heap space?
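
For reference, here is a minimal sketch of one possible approach. It has not
been run against the original "mytext.txt". Two observations motivate it.
First, partition-by already walks a string one character at a time, so the
extra partition step is not needed. Second, in Clojure a character such as \.
is never equal to the one-character string ".", so ispunc? as written returns
false for every character, which would explain why everything lands in a
single group. The names punctuation? and split-sentences are hypothetical,
introduced only for this illustration, and the regular expression is just one
way to express "a run of non-punctuation followed by its punctuation".

```clojure
(require '[clojure.string :as str])

;; A set of characters can be called as a predicate: (punctuation? \.) is truthy.
;; Note the character literals (\.) rather than strings ("."):
;; (= "." \.) is false in Clojure.
(def punctuation? #{\. \! \? \;})

(defn split-sentences
  "Splits text into trimmed sentences, each keeping its trailing punctuation.
  Hypothetical helper: like ispunc?, it ignores acronyms such as U.S."
  [text]
  (->> (re-seq #"[^.!?;]+[.!?;]*" text) ; lazy seq of matched substrings
       (map str/trim)
       (remove str/blank?)))

;; Usage on a small string:
(split-sentences "Hello there. How are you? Fine; thanks!")
;; => ("Hello there." "How are you?" "Fine;" "thanks!")

;; partition-by also works directly on the string once the predicate
;; compares characters; punctuation then ends up in groups of its own:
(->> "Hi. Bye!"
     (partition-by (comp boolean punctuation?))
     (map #(apply str %)))
;; => ("Hi" "." " Bye" "!")
```

The regex version works on substrings instead of building a sequence of
individual characters, so it may be gentler on the heap, though that is an
assumption rather than something measured here.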
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en