I presume that by full-text search you mean the Lucene search engine, which
uses the Vector Space Model?
If you know a bit about Lucene, you won't be surprised by what they've
done.
If you log the output of lsearchd, you can see how it expands a query:
original query: prefrontal cortex hippocampus
query=[prefrontal cortex hippocampus] parsed=[
  (+(contents:prefrontal contents:prefront^0.5)
   +contents:cortex +contents:hippocampus)
  ((+title:prefrontal^6.0 +title:cortex^6.0 +title:hippocampus^6.0)
   (+(stemtitle:prefrontal^2.0 stemtitle:prefront^0.8)
    +stemtitle:cortex^2.0 +stemtitle:hippocampus^2.0))
  ((+alttitle1:prefrontal^4.0 +alttitle1:cortex^4.0 +alttitle1:hippocampus^4.0)
   (+alttitle2:prefrontal^4.0 +alttitle2:cortex^4.0 +alttitle2:hippocampus^4.0)
   (+alttitle3:prefrontal^4.0 +alttitle3:cortex^4.0 +alttitle3:hippocampus^4.0))]
Compare that with what PubMed does for the same query:
("prefrontal cortex"[MeSH Terms]
  OR ("prefrontal"[All Fields] AND "cortex"[All Fields])
  OR "prefrontal cortex"[All Fields])
AND ("hippocampus"[MeSH Terms] OR "hippocampus"[All Fields])
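
For contrast, a rough sketch of how a translation of that shape could be
produced: greedily match the longest phrase against a controlled vocabulary,
then OR the vocabulary hit with plain all-fields matches. This is only my
guess at the general idea, not PubMed's actual algorithm; the `translate`
function and the tiny vocabulary set are hypothetical stand-ins for a real
MeSH lookup.

```python
def translate(query, vocabulary):
    """Map a query onto vocabulary phrases, falling back to all-fields terms."""
    words = query.split()
    groups = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at word i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in vocabulary:
                break
        else:
            # No vocabulary hit: treat the single word as a plain term.
            j = i + 1
            phrase = words[i]

        if phrase in vocabulary:
            parts = [f'"{phrase}"[MeSH Terms]']
            if j - i > 1:
                # Multi-word phrase: also accept the words individually.
                each = " AND ".join(f'"{w}"[All Fields]' for w in words[i:j])
                parts.append(f"({each})")
            parts.append(f'"{phrase}"[All Fields]')
            groups.append("(" + " OR ".join(parts) + ")")
        else:
            groups.append(f'"{phrase}"[All Fields]')
        i = j
    return " AND ".join(groups)


if __name__ == "__main__":
    vocab = {"prefrontal cortex", "hippocampus"}  # hypothetical mini vocabulary
    print(translate("prefrontal cortex hippocampus", vocab))
```

The point of the comparison: here the query is grouped by concept (two
groups), whereas the Lucene expansion above repeats every word in every
field.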
On Thu, Jan 8, 2009 at 11:22 AM, Brion Vibber <[email protected]> wrote:
> On 1/8/09 7:47 AM, Uwe Baumbach wrote:
> > Hi,
> >
> > is there a comprehensive, reliable, more profound description of the
> > logical steps the internal search engine (or parser before the engine)
> > undertakes to define:
> > - what is recognized as a single word in an entered search string
> > (blanks - OK, but what about slash, back slash, hyphen, period?) ?
>
> Check MySQL's documentation; also try diving through SearchMySQL.php to
> check how it's breaking up the input when rendering its output. Also
> check Language.php for the horrid search tweaking code.
>
> > - what are "similar words" (closeness of words) ?
>
> No such metric exists afaik.
>
> > Different sources (www.mediawiki.org, xy.wikipedia.org/wiki/Help:Search,
> > ...) tell more or less and then different things too.
>
> Note that Wikimedia's sites use a different search engine (MWSearch
> extension plus our Lucene-based backend), so descriptions of their
> behavior would not necessarily be what you want if you're looking for
> descriptions of the default MySQL backend. Note also that the PostgreSQL
> backend is different.
>
> -- brion
>
> _______________________________________________
> MediaWiki-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>
--
You have successfully failed!