Kjetil,

One can declare a new "language" as a named set of callbacks that split the text into words, normalize those words, and decide which words should be indexed and which should not. We did not reinvent the wheel in this area, and what was sufficient for other developers in other projects is probably useful for you as well. The related code is in the libsrc/langfunc directory of the VOS source tree.

Unfortunately, that's C code, which makes things a bit more costly than Virtuoso/PL coding. Even worse, there's no way to change the language used by the default RDF storage, so for now the trick is useful only for plain old application-specific tables.
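To make the idea of a "language" as a set of callbacks concrete, here is a minimal sketch. It is written in Python only for brevity and is not the Virtuoso API: the real callbacks are C functions in libsrc/langfunc, and every name below (Language, FOLD, STOP_WORDS, "x-folded") is hypothetical. The folding table is an assumption based on the mapping you describe, not a complete list.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical folding table: one character may map to one or several.
FOLD = {"ö": "o", "ø": "o", "é": "e", "ß": "ss"}

# Hypothetical stop-word list: words that should never be indexed.
STOP_WORDS = {"a", "the", "was"}


def split_words(text: str) -> List[str]:
    """Callback 1: split the text into words."""
    return re.findall(r"\w+", text)


def normalize(word: str) -> str:
    """Callback 2: lowercase the word and fold accented characters."""
    # NFC first, so "o" + combining diaeresis is treated like a single "ö".
    composed = unicodedata.normalize("NFC", word.lower())
    return "".join(FOLD.get(ch, ch) for ch in composed)


def is_indexable(word: str) -> bool:
    """Callback 3: decide whether a normalized word should be indexed."""
    return word not in STOP_WORDS


@dataclass(frozen=True)
class Language:
    """A named set of the three callbacks."""
    name: str
    split: Callable[[str], List[str]]
    normalize: Callable[[str], str]
    is_indexable: Callable[[str], bool]

    def terms(self, text: str) -> List[str]:
        """Run the callbacks in order and return the terms to index."""
        words = (self.normalize(w) for w in self.split(text))
        return [w for w in words if self.is_indexable(w)]


FOLDED = Language("x-folded", split_words, normalize, is_indexable)

if __name__ == "__main__":
    print(FOLDED.terms("Was Göthe a great poet?"))
```

The same Language has to be applied to the indexed text and to the query string, so that "Göthe" and "Gothe" end up as the same term at both ends.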
Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

On Tue, 2009-02-03 at 17:01 +0100, Kjetil Kjernsmo wrote:
> All,
>
> I'd like to put one of the harder problems we're struggling with to you all:
>
> In freetext queries, our experience is that people tend to write the
> "dumbest" version of the string they are searching for. For example, they
> are likely to write "Gothe" or "Goethe" rather than "Göthe". The problem is
> not smaller with accents; people tend to ignore them or get them wrong.
> This is something we need to take into account, but we are unsure about how
> to do it.
>
> We could dumb down all strings when indexing, so that an "ö" becomes "oe",
> but an example of where this would be wrong is not hard to find: was Göthe
> a great pöt or was Goethe a great poet?
>
> Has anyone else encountered the same problem, and if so, what is your take
> on it?
>
> While acknowledging the obvious problems, our customer still feels that
> dumbing down certain characters in both ends is the best solution. Therefore
> right now, in our Jena-based solution, we have implemented a solution where
> they can create a hash that makes it possible to map e.g. "ø" and "ö" to
> "o", so that only Gothe is indexed. Then, if a user searches for Göthe, the
> query will be written as Gothe. This sort of does the job, and it is
> relatively simple to map several characters to one, but mapping one
> character is harder.
>
> We are in the process of migrating the whole solution and taking Jena out
> of the mix for most components, so we are looking for a better solution to
> this problem than we have ourselves. Additionally, our own solution
> requires Jena, so we would prefer a Virtuoso-only solution.
>
> Kind regards
>
> Kjetil Kjernsmo