Re: Spell Check Handler

climbingrose Fri, 10 Aug 2007 20:50:20 -0700

OK, I just need to define 2 spellcheckers in solrconfig.xml for my purpose.
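
A minimal sketch of such a setup, assuming the two source fields are named "description" and "location" as in the message quoted below. The handler names and index directory names here are made up for illustration; the parameters are the ones from the handler configuration quoted further down the thread:

```xml
<!-- Sketch only: handler names and spellcheckerIndexDir values are illustrative assumptions -->
<requestHandler name="spellchecker_description" class="solr.SpellCheckerRequestHandler" startup="lazy">
  <lst name="defaults">
    <int name="suggestionCount">1</int>
    <float name="accuracy">0.5</float>
  </lst>
  <!-- separate spelling index built only from the "description" field -->
  <str name="spellcheckerIndexDir">spell_description</str>
  <str name="termSourceField">description</str>
</requestHandler>

<requestHandler name="spellchecker_location" class="solr.SpellCheckerRequestHandler" startup="lazy">
  <lst name="defaults">
    <int name="suggestionCount">1</int>
    <float name="accuracy">0.5</float>
  </lst>
  <!-- separate spelling index built only from the "location" field -->
  <str name="spellcheckerIndexDir">spell_location</str>
  <str name="termSourceField">location</str>
</requestHandler>
```

Each handler would then be selected through its own qt value (for example qt=spellchecker_description), and each index would presumably need its own cmd=rebuild request before it returns suggestions.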


On 8/11/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> After looking at the SpellChecker code, I realised that it only supports
> single-word queries. I made a very naive modification of SpellCheckerHandler to get
> multi-word support. Now the other problem that I have is how to have
> different fields in the SpellChecker index. For example, since my query has two
> parts: "description" and "location", I don't want to build a spellchecker
> index which combines both "description" and "location" into one
> termSourceField. I want to check "description" part with the "description"
> field in the spellchecker index and "location" part with "location" field in
> the index. Otherwise I might have irrelevant suggestions for the "location"
> part since the number of terms in "location" is generally much smaller
> compared with that of "description". Any ideas?
>
> Thanks.
>
> On 8/11/07, climbingrose <[EMAIL PROTECTED]> wrote:
> >
> > The spellchecker handler doesn't seem to work with multi-word queries. For
> > example, when I tried to spellcheck "Java developar", it returned nothing,
> > whereas when I tried just "developar", the spellchecker correctly returned
> > "developer". I followed the setup on the wiki.
> >
> > Regards,
> >
> > Cuong Hoang
> >
> > On 7/10/07, Charles Hornberger < [EMAIL PROTECTED]> wrote:
> > >
> > > For what it's worth, I recently did a quick implementation of the
> > > spellchecker feature, and I simply created another field in my schema
> > > (like 'spell' in Tristan's example below). After feeding content into
> > > my search index, I used the spell field to add one single-field
> > > document for every distinct word in my document collection (I'm
> > > assuming the content folks have run spell-checkers :-)). E.g.:
> > >
> > > <doc><field name="spell">aardvark</field></doc>
> > > <doc><field name="spell">abacus</field></doc>
> > > <doc><field name="spell">abbot</field></doc>
> > > <doc><field name="spell">acacia</field></doc>
> > > etc.
> > >
> > > I also added some extra documents for proper names that appear in my
> > > documents. For instance, there are a couple of fields that have
> > > comma-separated lists of names, so for each of those -- in addition
> > > to documents for "john", "doe", and "jane", which were generated by
> > > the naive word-splitting done in the first pass -- I added documents
> > > like so:
> > >
> > > <doc><field name="spell">john doe</field></doc>
> > > <doc><field name="spell">jane doe</field></doc>
> > > etc.
> > >
> > > You could do the same for other searchable multi-word tokens in your
> > > input -- song/album/book/movie titles, publisher names, geographic
> > > names (cities, neighborhoods, etc.), product names, and so on.
> > >
> > > -Charlie
> > >
> > > On 7/9/07, Tristan Vittorio <[EMAIL PROTECTED]> wrote:
> > > > I think there is some confusion regarding how the spell checker actually
> > > > uses the termSourceField.  It is suggested that you use a simple field type
> > > > such as "string"; however, since this field type does not tokenize or split
> > > > words, it is only useful in situations where the whole field is considered a
> > > > dictionary "word":
> > > >
> > > > <add>
> > > > <doc>
> > > > <field name="title">Accountant</field>
> > > > <field name="title">Auditor</field>
> > > > <field name="title">Solicitor</field>
> > > > </doc>
> > > > </add>
> > > >
> > > > The following example will not work with the spell checker, since the whole
> > > > field is considered a single word or string:
> > > >
> > > > <add>
> > > > <doc>
> > > > <field name="title">Accountant reveals that Accounting is boring</field>
> > > > </doc>
> > > > </add>
> > > >
> > > > I might suggest that you create an additional field in your schema that
> > > > takes advantage of the StandardTokenizer and StandardFilter, which don't
> > > > perform a great deal of processing on the field yet should provide decent
> > > > results when used with the spell checker:
> > > >
> > > > <fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
> > > >   <analyzer type="index">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> > > >     <filter class="solr.StandardFilterFactory"/>
> > > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > >   </analyzer>
> > > >   <analyzer type="query">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> > > >     <filter class="solr.StandardFilterFactory"/>
> > > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > >   </analyzer>
> > > > </fieldType>
> > > >
> > > > If you want this field to be automatically populated with the contents of
> > > > the title field when a document is added to the index, simply use a
> > > > copyField:
> > > >
> > > > <copyField source="title" dest="spell"/>
> > > >
> > > > Hope this helps; let me know if this is still not clear. I will probably add
> > > > it to the wiki page soon.
> > > >
> > > > cheers,
> > > > Tristan
> > > >
> > > >
> > > >
> > > > On 7/9/07, climbingrose <[EMAIL PROTECTED] > wrote:
> > > > >
> > > > > Thanks for the quick reply. However, I'm still not able to set up the
> > > > > spellchecker. Solr does create the spell directory under data but doesn't seem
> > > > > to build the spellchecker index. Here are snippets of my schema.xml and
> > > > > solrconfig.xml:
> > > > >
> > > > > <field name="title" type="string" indexed="true" stored="true"/>
> > > > >
> > > > > <requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
> > > > >      <!-- default values for query parameters -->
> > > > >      <lst name="defaults">
> > > > >        <int name="suggestionCount">1</int>
> > > > >        <float name="accuracy">0.5</float>
> > > > >      </lst>
> > > > >
> > > > >      <!-- Main init params for handler -->
> > > > >
> > > > >      <!-- The directory where your SpellChecker Index should live.   -->
> > > > >      <!-- May be absolute, or relative to the Solr "dataDir" directory. -->
> > > > >      <!-- If this option is not specified, a RAM directory will be used -->
> > > > >      <str name="spellcheckerIndexDir">spell</str>
> > > > >
> > > > >      <!-- the field in your schema that you want to be able to build -->
> > > > >      <!-- your spell index on. This should be a field that uses a very -->
> > > > >      <!-- simple FieldType without a lot of Analysis (ie: string) -->
> > > > >      <str name="termSourceField">title</str>
> > > > >
> > > > > </requestHandler>
> > > > >
> > > > > I tried this URL:
> > > > >
> > > > > http://localhost:8984/solr/select/?q=Accountent&qt=spellchecker&cmd=rebuild
> > > > >
> > > > > and received this:
> > > > >
> > > > > <response>
> > > > > <lst name="responseHeader">
> > > > > <int name="status">0</int>
> > > > > <int name="QTime">2</int>
> > > > > </lst>
> > > > > <str name="cmdExecuted">rebuild</str>
> > > > > <arr name="suggestions"/>
> > > > > </response>
> > > > >
> > > > >
> > > > > On 7/9/07, Tristan Vittorio < [EMAIL PROTECTED] > wrote:
> > > > > >
> > > > > > The spellchecker should be available in the 1.2 release. Your query is
> > > > > > incorrect; try the following:
> > > > > >
> > > > > > http://localhost:8984/solr/select/?q=java&qt=spellchecker&termSourceField=title_text&cmd=rebuild
> > > > > >
> > > > > > The 'q' parameter must only contain the word being checked; you must specify
> > > > > > the field separately.  You can set "termSourceField" in your solrconfig.xml
> > > > > > file so you do not need to explicitly set it each time you want to run a
> > > > > > spell check query. Also make sure your field isn't heavily processed (e.g.
> > > > > > with porter stemmer analyzers), otherwise the suggestions will look a bit
> > > > > > weird / mangled.  Take a look at the wiki page for more info:
> > > > > >
> > > > > > http://wiki.apache.org/solr/SpellCheckerRequestHandler
> > > > > >
> > > > > > cheers,
> > > > > > Tristan
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 7/9/07, climbingrose < [EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > > Hi Tristan,
> > > > > > >
> > > > > > > Is this spellchecker available in the 1.2 release, or do I have to build the
> > > > > > > trunk?
> > > > > > > I tried your instructions but Solr returns nothing:
> > > > > > >
> > > > > > > http://localhost:8984/solr/select/?q=title_text:java&qt=spellchecker&cmd=rebuild
> > > > > > >
> > > > > > > Result:
> > > > > > >
> > > > > > > <response>
> > > > > > > <lst name="responseHeader">
> > > > > > > <int name="status">0</int>
> > > > > > > <int name="QTime">3</int>
> > > > > > > </lst>
> > > > > > > <str name="cmdExecuted">rebuild</str>
> > > > > > > <arr name="suggestions"/>
> > > > > > > </response>
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > >
> > > > > > > On 7/8/07, Tristan Vittorio < [EMAIL PROTECTED]> wrote:
> > > > > > > >
> > > > > > > > Hi Otis,
> > > > > > > >
> > > > > > > > I have written a draft wiki entry for the spell checker:
> > > > > > > > http://wiki.apache.org/solr/SpellCheckerRequestHandler
> > > > > > > >
> > > > > > > > I've learned that my initial observation about the suggestion ordering was
> > > > > > > > incorrect: it does in fact order the results by popularity (or term
> > > > > > > > frequency) of the word in the termSourceField. The problem I experienced was
> > > > > > > > caused by setting termSourceField to a field of type "text", which heavily
> > > > > > > > stemmed and analyzed the words.  I found that using the StandardTokenizer
> > > > > > > > and StandardFilter and removing the PorterStemmer and LowerCaseFilter from
> > > > > > > > the field schema really improved the spell checker performance.
> > > > > > > >
> > > > > > > > I haven't included this info on the wiki page yet; I'll try to update it
> > > > > > > > soon when I have a bit more time.
> > > > > > > >
> > > > > > > > cheers,
> > > > > > > > Tristan
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On 7/8/07, Otis Gospodnetic < [EMAIL PROTECTED]> wrote:
> > > > > > > > >
> > > > > > > > > Tristan - good summary - want to copy that to the Solr Wiki?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Otis
> > > > > > > > >
> > > > > > > > > . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > > > > > > > > Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> > > > > > > > >
> > > > > > > > > ----- Original Message ----
> > > > > > > > > From: Tristan Vittorio < [EMAIL PROTECTED]>
> > > > > > > > > To: solr-user@lucene.apache.org
> > > > > > > > > Sent: Saturday, July 7, 2007 1:51:15 AM
> > > > > > > > > Subject: Re: Spell Check Handler
> > > > > > > > >
> > > > > > > > > I couldn't find any documentation on the spell check handler either, but
> > > > > > > > > found enough information in the solrconfig.xml file; simply search for
> > > > > > > > > "SpellCheckerRequestHandler" (online version here):
> > > > > > > > >
> > > > > > > > > http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solrconfig.xml
> > > > > > > > >
> > > > > > > > > You can view the original development discussion in JIRA (not sure how
> > > > > > > > > helpful that will be for you though):
> > > > > > > > > https://issues.apache.org/jira/browse/SOLR-81
> > > > > > > > >
> > > > > > > > > In a nutshell, the configuration parameters available are:
> > > > > > > > >
> > > > > > > > > suggestionCount: determines how many spelling suggestions are returned.
> > > > > > > > > accuracy: a float value between 0.0 and 1.0 specifying how closely the
> > > > > > > > > suggested words should match the original word being checked.
> > > > > > > > > spellcheckerIndexDir and termSourceField: check solrconfig.xml for a full
> > > > > > > > > explanation.
> > > > > > > > >
> > > > > > > > > In order to use the spell checking handler for the first time, you need to
> > > > > > > > > explicitly build the spelling index with a sample query something like
> > > > > > > > > this:
> > > > > > > > >
> > > > > > > > > http://localhost:8080/solr/select/?q=macrosoft&qt=spellchecker&cmd=rebuild
> > > > > > > > >
> > > > > > > > > Depending on how large your main index is, this rebuild operation could take
> > > > > > > > > a while.  Subsequent queries can omit '&cmd=rebuild' and will return results
> > > > > > > > > much faster:
> > > > > > > > >
> > > > > > > > > http://localhost:8080/solr/select/?q=macrosoft&qt=spellchecker
> > > > > > > > >
> > > > > > > > > The order of the suggestions returned seems to be based on the accuracy
> > > > > > > > > figure (i.e. how closely it matches the original word). It would be great to
> > > > > > > > > be able to sort these suggested results based on term frequency / document
> > > > > > > > > frequency of the suggested word in the main index, since the most accurate
> > > > > > > > > suggestion may not always be the most relevant.
> > > > > > > > >
> > > > > > > > > As far as I can tell there is currently no way of doing this using the
> > > > > > > > > spellchecker handler alone (you could always run separate standard queries
> > > > > > > > > on each word suggestion and order by numDocs, but that would be very
> > > > > > > > > inefficient). Has anybody else tried to achieve this?
> > > > > > > > >
> > > > > > > > > cheers,
> > > > > > > > > Tristan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 7/7/07, Andrew Nagy < [EMAIL PROTECTED] > wrote:
> > > > > > > > > >
> > > > > > > > > > Hello, is there any documentation on how to use the new spell check
> > > > > > > > > > module?
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Andrew
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > >
> > > > > > > Cuong Hoang
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Cuong Hoang
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Cuong Hoang
>
>
>
>
> --
> Regards,
>
> Cuong Hoang




-- 
Regards,

Cuong Hoang
