Package: libghc-regex-pcre-dev
Version: 0.94.4-7
Severity: normal

When I try to run a program that searches for an (ASCII) string in a
UTF-8-encoded string, I get a weird "off-by-one" error:

I expect
    title is |Page Title|
I get
    title is |age Title'|
I can't find any docs that say I'm doing it wrong, or that I have to do
things a specific way to get Unicode support.

A test-case Haskell script is (hopefully) attached.

-- System Information:
Debian Release: 7.7
  APT prefers stable
  APT policy: (990, 'stable'), (990, 'oldstable')
Architecture: i386 (i686)

Kernel: Linux 3.16.0-4-686-pae (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages libghc-regex-pcre-dev depends on:
ii  ghc [libghc-containers-dev-0.5.6.2-5879d]           7.10.3-9
ii  libc6                                               2.19-18
pn  libghc-array-dev-0.5.1.0-98220                      <none>
pn  libghc-base-dev-4.8.2.0-a3ce8                       <none>
pn  libghc-bytestring-dev-0.10.6.0-89a6f                <none>
ii  libghc-regex-base-dev [libghc-regex-base-dev-0.93.2-b11ef]  0.93.2-8
ii  libpcre3                                            2:8.35-3.3
ii  libpcre3-dev                                        2:8.35-3.3

libghc-regex-pcre-dev recommends no packages.

Versions of packages libghc-regex-pcre-dev suggests:
pn  libghc-regex-pcre-doc   <none>
pn  libghc-regex-pcre-prof  <none>

-- no debconf information
#!/usr/bin/runghc
module Main where

import Text.Regex.PCRE

get_subgroup :: (String, String, String, [String]) -> Int -> String
get_subgroup (_, m, _, _) 0 = m
get_subgroup (_, _, _, tl) i = tl !! (i - 1)

main = do
    let i = "<label for='blah'>Â </label><img title='Page Title' src='img.jpg'/>"
    let title = get_subgroup (i =~ "title='([^']*)'") 1
    putStrLn ("title is |" ++ title ++ "|")
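
The shifted result looks like what you would get if match offsets were counted in UTF-8 bytes but then applied to a String, which is indexed by characters. This is only a guess; nothing in this report confirms that regex-pcre works this way. The sketch below uses only base and does not call regex-pcre at all. The helpers utf8Width, charOffset and byteOffset are hypothetical names written for illustration, and the input assumes the non-ASCII character is U+00C2 as it appears in the script above (a non-breaking space would also occupy two bytes).

```haskell
module Main where

import Data.Char (ord)
import Data.List (isPrefixOf)

-- Number of bytes a character occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Character index of the first occurrence of needle in the haystack.
charOffset :: String -> String -> Maybe Int
charOffset needle = go 0
  where
    go i s
      | needle `isPrefixOf` s = Just i
      | otherwise = case s of
          []         -> Nothing
          (_ : rest) -> go (i + 1) rest

-- Byte offset in the UTF-8 encoding that corresponds to character index i.
byteOffset :: String -> Int -> Int
byteOffset s i = sum (map utf8Width (take i s))

main :: IO ()
main = do
  -- Same input as the test script; "\xC2" is the assumed two-byte character.
  let input  = "<label for='blah'>\xC2 </label><img title='Page Title' src='img.jpg'/>"
      wanted = "Page Title"
      slice off = take (length wanted) (drop off input)
  case charOffset wanted input of
    Nothing -> putStrLn "not found"
    Just ci -> do
      -- Slicing by character offset gives the expected text.
      putStrLn ("title is |" ++ slice ci ++ "|")
      -- Slicing the String by the byte offset skips one character too many.
      putStrLn ("title is |" ++ slice (byteOffset input ci) ++ "|")
```

With one two-byte character ahead of the match, the byte offset is one larger than the character offset, so the second line prints |age Title'|, matching the output reported above.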