I'm in the process of incorporating Solr spellchecking into our product.
For that, I've created a new field:

    <field name="spell" type="spell" indexed="true" stored="true"
           required="false" multiValued="false"/>

    <copyField source="name" dest="spell" maxChars="30000" />

And in the fieldType definitions:

    <fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

Then I feed the names of products into the corresponding
core. They can have a lot of words (examples):

    door lock rear left
    Door brake, door in front + rear fitting.

However, the names get pretty long, and in the source data they have
been truncated. This sometimes leaves parts of words at the end:

    The water pump can evacuate some coo

I have created a spellcheck component, feeding off the `spell` field
defined earlier. Now for the problem.
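
For reference, a spellcheck component of this kind is usually declared in
`solrconfig.xml` along the following lines. This is only a sketch modeled
on the stock Solr example configuration: the component name, the
dictionary names `default` and `wordbreak`, the choice of checkers, and all
parameter values are assumptions, not the configuration actually in use
here. Only the `spell` field and the `/spell` handler path come from this
post.

    <!-- Hypothetical sketch; names and values are assumptions. -->
    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <!-- Analyze the query with the same field type as the spell field. -->
      <str name="queryAnalyzerFieldType">spell</str>

      <!-- Term-level checker working directly on the indexed terms. -->
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spell</str>
        <str name="classname">solr.DirectSolrSpellChecker</str>
      </lst>

      <!-- Word-break checker: suggests by joining adjacent query terms
           or by splitting a single term into several indexed terms. -->
      <lst name="spellchecker">
        <str name="name">wordbreak</str>
        <str name="classname">solr.WordBreakSolrSpellChecker</str>
        <str name="field">spell</str>
        <str name="combineWords">true</str>
        <str name="breakWords">true</str>
        <int name="maxChanges">10</int>
      </lst>
    </searchComponent>

    <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
      <lst name="defaults">
        <str name="spellcheck">on</str>
        <!-- Both dictionaries are consulted when listed here. -->
        <str name="spellcheck.dictionary">default</str>
        <str name="spellcheck.dictionary">wordbreak</str>
        <!-- Extended results add the per-suggestion "freq" values. -->
        <str name="spellcheck.extendedResults">true</str>
        <str name="spellcheck.count">10</str>
      </lst>
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>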

Sometimes, when I look up a slightly misspelled word, I get results I do
not expect. Example request:

    http://solr.url:8983/solr/en/spell?q=coole

This is (part of) the response:

    <str name="word">cooler</str><int name="freq">21</int>
    <str name="word">coo le</str><int name="freq">2</int>
    <str name="word">cable</str><int name="freq">334</int>
    <str name="word">co o le</str><int name="freq">4</int>
    [...]

Now, as you can see, the misspelled `coole` should have been `cooler`,
and that is the first suggestion. However, the second and fourth
suggestions baffle me. After a bit of research, I found them to be
multiple words lumped together. As I described above, `coo` was part of a
name that was truncated. I found `co` the same way, and the source data
contains a small number of `o` characters on their own (product number
names).

Now, my question is: why is Solr suggesting multiple words pasted
together as a spellcheck suggestion for a single word? Is there a way to
prevent Solr from pasting word parts together to forge suggestions?
 
