Message written on 2008-10-16, at 15:54, by Erick Erickson:
Well, let me see. Your customers are telling you, in essence,
"for any random input, you cannot return false positives". Which
is nonsense, so I'd say you need to negotiate with your
customers. I flat guarantee that, for any algorithm you try,
you can write a counter-example in, oh, 15 seconds or so <G>.
They arrived at these expectations after seeing Solr's own spellcheck
at work: if it can suggest correct versions, it should be able to
sanitize broken words in documents and search them using sanitized
input. To me, this seemed a reasonable request (provided, of course,
that it can be achieved by reasonably abusing Solr's spellcheck
component).
FuzzySearch tries to do some of this work for you, and that may be
acceptable, as this is a common issue. But it'll never be
perfect.
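To make the trade-off concrete, here is a minimal sketch of the edit-distance idea that fuzzy matching is built on. It is plain Python, not Lucene's or Solr's implementation; the function name edit_distance and the word "mech" are illustrative assumptions, while the document text and query terms are the ones from the original question quoted below.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    # Normalize so that "ó" is one code point regardless of how it was typed.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


document = "włatcy móch"
queries = ["włatcy much", "wlatcy moch", "wladcy much"]
for query in queries:
    print(query, edit_distance(document, query))

# A one-letter change can also turn one valid word into another one
# ("mech" is used here only as an illustrative neighbour of "much").
print(edit_distance("much", "mech"))
```

The three query terms sit at distances 1, 2 and 3 from the document text, so any threshold loose enough to accept all of them also accepts other strings that are just as close, which is where false positives come from.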
You might get some joy from ngrams, but I haven't
worked with it myself, just seen it recommended by people
whose opinions I respect...
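As a rough illustration of the n-gram idea, the sketch below splits strings into character bigrams and scores their overlap with Jaccard similarity. This is an assumption about how such matching could be scored, written in plain Python; it does not reproduce Solr's n-gram analysis or its ranking, and the names char_ngrams and jaccard are made up for the example. The sample strings come from the original question quoted below.

```python
import unicodedata


def char_ngrams(text: str, n: int = 2) -> set:
    """Return the set of character n-grams in text (spaces included)."""
    text = unicodedata.normalize("NFC", text)
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Shared n-grams divided by all distinct n-grams of both strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have n-grams count as equal
    return len(grams_a & grams_b) / len(grams_a | grams_b)


document = "włatcy móch"
queries = ["włatcy much", "wlatcy moch", "wladcy much"]
for query in queries:
    print(query, round(jaccard(document, query), 3))
```

The score drops as more letters differ, so a cutoff is still needed to decide what counts as a match, and choosing that cutoff is again a trade between missed results and false positives.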
Thank you for these suggestions.
Best
Erick
2008/10/16 Jarek Zgoda <[EMAIL PROTECTED]>
Hello, group.
I'm trying to create a search facility for documents in "broken" Polish
(by broken I mean "not compliant with the rules of the language"),
searchable by terms that are also in "broken" Polish, but broken in
different ways than the documents. See this example:
document text: "włatcy móch" (in proper Polish this would be "władcy much")
example terms that should match: "włatcy much", "wlatcy moch", "wladcy much"
This double brokenness ruled out any Polish stemmers currently available
for Lucene, and now I am back at square one. The search results do not
have to be 100% accurate - some missing results are acceptable, but
"false positives" are not. Is it at all possible using the machinery
provided by Solr (I do not have a PhD in linguistics), or should I ask
the business to lower their expectations?
--
We read Knuth so you don't have to. - Tim Peters
Jarek Zgoda, R&D, Redefine
[EMAIL PROTECTED]