On Mon, Jul 19, 2010 at 9:24 AM, David Virebayre
<[email protected]> wrote:

> A minor point: instead of removing the punctuation, you maybe should
> convert it to whitespace.
>
> Otherwise in texts like "there was a quick,brown fox" (notice the
> missing space after the comma) you'll have the word "quickbrown"
> instead of 2 words "quick" and "brown".

If you remove punctuation you

- run the risk of joining two valid words into one invalid word:
  "quick,brown" -> "quickbrown"

- run the risk of converting one word into a different word:
 "can't" -> "cant"
 "won't" -> "wont"

If you split at punctuation you create more semi-words:
 "can't" -> "can", "t"
 "shouldn't" -> "shouldn", "t"

Might it be better to regard in-word apostrophes as letters in this case?
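
To make the comparison concrete, here is a rough sketch of the three
options. The function names are made up for illustration, and it assumes
plain ASCII apostrophes and that "in-word" means "directly between two
letters"; typographic quotes and trailing apostrophes (as in "dogs'")
would need more thought.

```haskell
import Data.Char (isAlpha, isPunctuation)

-- Option 1: delete punctuation, then split on whitespace.
removePunct :: String -> [String]
removePunct = words . filter (not . isPunctuation)

-- Option 2: turn every punctuation character into a space, then split.
punctToSpace :: String -> [String]
punctToSpace = words . map (\c -> if isPunctuation c then ' ' else c)

-- Option 3: like option 2, but keep an apostrophe that has a letter
-- on both sides, so it is treated as part of the word.
keepInWordApostrophes :: String -> [String]
keepInWordApostrophes s =
    words (zipWith3 classify (' ' : s) s (drop 1 s ++ " "))
  where
    -- prev and next are the neighbours of c, padded with spaces
    -- at both ends of the input.
    classify prev c next
      | c == '\'' && isAlpha prev && isAlpha next = c
      | isPunctuation c                           = ' '
      | otherwise                                 = c
```

In GHCi, keepInWordApostrophes "can't stop,won't" gives
["can't","stop","won't"], whereas removePunct joins "stop" and "won't"
into one token and punctToSpace leaves stray "t" tokens.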

-- 
Dougal Stanton
[email protected] // http://www.dougalstanton.net