On 2008-10-16, at 15:54, Erick Erickson wrote:

Well, let me see. Your customers are telling you, in essence,
"for any random input, you cannot return false positives". Which
is nonsense, so I'd say you need to negotiate with your
customers. I flat guarantee that, for any algorithm you try,
you can write a counter-example in, oh, 15 seconds or so <G>.

They arrived at these expectations after seeing Solr's own spellcheck at work: if it can suggest correct spellings, it should be able to sanitize the broken words in documents and then search them using sanitized input. To me this seemed a reasonable request (provided, of course, that it can be achieved by reasonably abusing Solr's spellcheck component).

FuzzySearch tries to do some of this work for you, and that may be
acceptable, as this is a common issue. But it'll never be
perfect.
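
To make the trade-off concrete, here is a minimal sketch of what fuzzy matching amounts to, in plain Python rather than Lucene's own implementation. The helper names `fold` and `edit_distance` are made up for illustration, and folding diacritics before comparing is an assumption, not something the thread prescribes.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip diacritics (hypothetical helper).

    'ł' has no Unicode decomposition, so it is mapped to 'l' by hand.
    """
    text = text.lower().replace("ł", "l")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


doc = fold("włatcy móch")
for term in ("włatcy much", "wlatcy moch", "wladcy much"):
    print(term, edit_distance(doc, fold(term)))
```

Any threshold loose enough to accept "wladcy much" (two edits away from the folded document text) also accepts every other string within two edits, which is where the false positives come from.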

You might get some joy from ngrams, but I haven't
worked with it myself, just seen it recommended by people
whose opinions I respect...
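
A rough sketch of the n-gram idea, again in plain Python and not tied to any Solr tokenizer or filter: split each word into overlapping character bigrams and compare the sets. The function names, the `_` padding character and the use of Jaccard similarity are all assumptions chosen for illustration.

```python
def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of text, padded with '_' at both ends."""
    padded = f"_{text}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: str, b: str) -> float:
    """Share of n-grams the two strings have in common (1.0 means identical sets)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


for term in ("much", "moch", "mucha"):
    print(term, round(jaccard("much", term), 2))
```

Misspelled variants still share most of their n-grams with the intended word, but so do unrelated words that merely look similar, so a similarity cutoff trades missed matches against false positives in the same way an edit-distance threshold does.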

Thank you for these suggestions.



Best
Erick


2008/10/16 Jarek Zgoda <[EMAIL PROTECTED]>

Hello, group.

I'm trying to create a search facility for documents in "broken" Polish (by
broken I mean "not compliant with the rules of the language"), searchable by
terms that are also in "broken" Polish, but broken in different ways than the
documents. See this example:

document text: "włatcy móch" (in proper Polish this would be "władcy much")
example terms that should match: "włatcy much", "wlatcy moch", "wladcy much"

This double brokenness rules out every Polish stemmer currently available for
Lucene, and now I am back at point 0. The search results do not have to be
100% accurate: some missing results are acceptable, but "false positives" are
not. Is this at all possible using the machinery provided by Solr (I do not
hold a PhD in linguistics), or should I ask the business to lower their
expectations?

--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
[EMAIL PROTECTED]



