Re: [Virtuoso-users] Dumb strings in free text search

Ivan Mikhailov Tue, 03 Feb 2009 16:54:45 +0000

Kjetil,

One can declare a new "language": a named set of callbacks that split
the text into words, normalize those words, and decide which words
should be indexed and which should not. We did not reinvent the wheel in
this area, so what was sufficient for other developers in other projects
is probably useful for you as well. The related code is in the
libsrc/langfunc directory of the VOS source tree. Unfortunately, that is
C code, which makes things a bit more costly than Virtuoso/PL coding.
Even worse, there is no way to change the language used by the default
RDF storage, so for now the trick is useful only for plain old
application-specific tables.
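
To give a rough idea of the shape of such a callback set, here is a
minimal sketch in Python rather than C. It is not the Virtuoso API: the
Language class, the "x-dumb" name, the stop-word list and every function
name below are hypothetical and only illustrate the split / normalize /
filter roles described above. The real handlers in libsrc/langfunc will
look different.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Language:
    """Hypothetical stand-in for a named set of text-indexing callbacks."""
    name: str
    split: Callable[[str], Iterable[str]]  # text -> raw words
    normalize: Callable[[str], str]        # raw word -> index key
    is_indexable: Callable[[str], bool]    # index key -> keep or drop


def split_words(text: str) -> List[str]:
    # \w+ matches runs of Unicode letters, digits and underscores.
    return re.findall(r"\w+", text)


def strip_accents(word: str) -> str:
    # Decompose to NFD, drop combining marks, then lower-case.
    # Letters without a decomposition (for example U+00F8, o with stroke)
    # pass through unchanged and would need an explicit mapping.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


# Illustrative stop-word list, an assumption made for this sketch.
STOP_WORDS = frozenset({"a", "the", "was"})


def not_stop_word(word: str) -> bool:
    return word not in STOP_WORDS


# "x-dumb" is an invented language name.
DUMB = Language("x-dumb", split_words, strip_accents, not_stop_word)


def index_terms(lang: Language, text: str) -> List[str]:
    """Run the three callbacks in order and return the keys to index."""
    keys = (lang.normalize(word) for word in lang.split(text))
    return [key for key in keys if lang.is_indexable(key)]
```

The same Language object would have to be applied to the query string
as well, otherwise the indexed keys and the search terms will not match.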


Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

On Tue, 2009-02-03 at 17:01 +0100, Kjetil Kjernsmo wrote:
> All,
> 
> I'd like to put one of the harder problems we're struggling with to you all:
> 
> In freetext queries, our experience is that people tend to write the
> "dumbest" version of the string they search for. For example, they are
> likely to write "Gothe" or "Goethe" rather than "Göthe". The problem is
> no smaller with accents: people tend to ignore them or get them wrong.
> This is something we need to take into account, but we are unsure about
> how to do it.
> 
> We could dumb down all strings when indexing, so that an "ö" becomes
> "oe", but an example of where this would be wrong is not hard to find:
> was Göthe a great pöt, or was Goethe a great poet?
> 
> Has anyone else encountered the same problem, and if so, what is your
> take on it?
> 
> While acknowledging the obvious problems, our customer still feels that 
> dumbing down certain characters in both ends is the best solution. Therefore 
> right now, in our Jena-based solution, we have implemented a solution where 
> they can create a hash that makes it possible to map e.g. "ø" and "ö" to "o", 
> so that only Gothe is indexed. Then, if a user searches for Göthe, the query 
> will be written as Gothe. This sort of does the job, and it is relatively
> simple to map several characters to one, but mapping one character to
> several is harder.
> 
> We are in the process of migrating the whole solution and taking Jena
> out of the mix for most components, so we are looking for a better
> solution to this problem than our own. Additionally, our own solution
> requires Jena, so we would prefer a Virtuoso-only solution.
> 
> Kind regards 
> 
> Kjetil Kjernsmo
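
The character-folding approach described in the quoted message (one
mapping table applied both when indexing and when querying) can be
sketched roughly as follows. This is an illustration in Python, not the
Jena-based implementation; the FOLD table, the FoldedIndex class and the
whitespace tokenization are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Set

# Illustrative folding table; which characters to fold is up to the customer.
FOLD = str.maketrans({
    "\u00f8": "o",  # o with stroke
    "\u00f6": "o",  # o with diaeresis
    "\u00e9": "e",  # e with acute
})


def fold(text: str) -> str:
    """Lower-case the text and apply the folding table."""
    return text.lower().translate(FOLD)


class FoldedIndex:
    """Toy inverted index that folds text at index time and at query time."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        # Only the folded form of each word is stored.
        for word in fold(text).split():
            self._postings[word].add(doc_id)

    def search(self, query: str) -> Set[str]:
        # The query goes through the same folding as the indexed text.
        words = fold(query).split()
        if not words:
            return set()
        hits = [self._postings.get(word, set()) for word in words]
        return set.intersection(*hits)
```

With this table a document containing "Göthe" is found by both "Göthe"
and "Gothe", but not by "Goethe", which is the limitation the message
points out.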
