RE: Text Analysis and copyField

Herman Kiefus Thu, 25 Aug 2011 10:56:04 -0700

It had crossed my mind, but for now we have a 'DictionarySource' field whose 
type uses KeepWordFilterFactory with a text file containing all correctly 
spelled words (thanks to Scrabble), location, last, and first names (courtesy 
of the US Census Bureau), and a few other additions (month and day names).  A 
file this large does not seem to have a material impact on indexing.
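
A minimal sketch of what such a field type could look like. Only the field 
name 'DictionarySource' and the use of KeepWordFilterFactory come from the 
description above; the type name, the tokenizer and lowercase filter, the 
field attributes, and the file name 'dictionary.txt' are assumptions made for 
illustration:

```xml
<!-- Hypothetical type name; tokenizer and lowercasing are assumed -->
<fieldType name="textDictionarySource" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keep only tokens listed in the word file (assumed file name) -->
    <filter class="solr.KeepWordFilterFactory"
            words="dictionary.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<!-- Field attributes are assumed, not taken from the actual schema -->
<field name="DictionarySource" type="textDictionarySource"
       indexed="true" stored="false" multiValued="true"/>
```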


We also have a field 'TermsMisspelled' that uses the same text file with 
StopFilterFactory; what we're seeing in it now is almost entirely misspellings 
plus some contractions (can't, won't, don't, etc.).
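
A minimal sketch of the complementary field, under the same caveats: only the 
field name 'TermsMisspelled' and the use of StopFilterFactory with the shared 
word file come from the description above; the type name, tokenizer, 
lowercase filter, field attributes, and the file name 'dictionary.txt' are 
assumptions:

```xml
<!-- Hypothetical type name; tokenizer and lowercasing are assumed -->
<fieldType name="textMisspelled" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop every token found in the word file (assumed file name),
         so only words missing from the list are indexed -->
    <filter class="solr.StopFilterFactory"
            words="dictionary.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<!-- Field attributes are assumed, not taken from the actual schema -->
<field name="TermsMisspelled" type="textMisspelled"
       indexed="true" stored="false" multiValued="true"/>
```

With the word list used as a stop list, whatever survives is whatever the 
list lacks, which would be consistent with contractions showing up alongside 
the misspellings.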

Thank you, everyone, for your help here; this is a truly fine community.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, August 24, 2011 1:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Text Analysis and copyField

Have you considered having two dictionaries and using ajax to query them both 
and intermingling the results in your suggestions? It'd be some work, but I 
think it might accomplish what you want.

Best
Erick

On Tue, Aug 23, 2011 at 1:48 PM, Herman Kiefus <herm...@angieslist.com> wrote:
> To close, I found this article from Hoss: 
> http://lucene.472066.n3.nabble.com/CopyField-into-another-CopyField-td3122408.html
>
> Since I cannot use one copyField directive to copy from another copyField's 
> dest[ination], I cannot achieve what I desire: some terms that are subject to 
> KeepWordFilterFactory and some that are not.
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Monday, August 22, 2011 1:16 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Text Analysis and copyField
>
> I suspect that the things going into TermsDictionary are from fields other 
> than CorrectlySpelledTerms.
>
> In other words I don't think that anything is getting into TermsDictionary 
> from CorrectlySpelledTerms...
>
> Be careful to remove the index between schema changes, just to be sure that 
> you're not seeing old data.
>
> Best
> Erick
>
> On Mon, Aug 22, 2011 at 11:41 AM, Herman Kiefus <herm...@angieslist.com> 
> wrote:
>> That's what I thought, but my experiments show differently.  In actuality:
>>
>> I have a number of fields that are of type "text" (the default as it is 
>> packaged).
>>
>> I have a type 'textCorrectlySpelled' that utilizes KeepWordFilterFactory in 
>> index-time analysis, using a file of terms which are known to be correctly 
>> spelled.
>>
>> I have a type 'textDictionary' that has no index-time analysis.
>>
>> I have the fields:
>> <field name="CorrectlySpelledTerms" type="textCorrectlySpelled"
>>        indexed="false" stored="false" multiValued="true"/>
>> <field name="TermsDictionary" type="textDictionary" indexed="true"
>>        stored="false" multiValued="true"/>
>>
>> I want 'TermsDictionary' to contain only those terms from some fields that 
>> are correctly spelled plus those terms from a couple other fields 
>> (CompanyName and ContactName) as is.  I use several copyField directives as 
>> follows:
>>
>> <copyField source="Field1" dest="CorrectlySpelledTerms"/>
>> <copyField source="Field2" dest="CorrectlySpelledTerms"/>
>> <copyField source="Field3" dest="CorrectlySpelledTerms"/>
>>
>> <copyField source="Name" dest="TermsDictionary"/>
>> <copyField source="Contact" dest="TermsDictionary"/>
>> <copyField source="CorrectlySpelledTerms" dest="TermsDictionary"/>
>>
>> If I query 'Field1' for a term that I know is misspelled (electical) it 
>> yields results.
>> If I query 'TermsDictionary' for the same term it yields no results.
>>
>> It would seem from these results that 'TermsDictionary' contains only those 
>> terms with misspellings stripped as a result of the text analysis on the 
>> field 'CorrectlySpelledTerms'.
>>
>> Asked another way, I think you can see what I'm getting at: a source for the 
>> spellchecker that contains only correctly spelled terms plus proper names. 
>> Should I have gone about this in a different way?
>>
>> -----Original Message-----
>> From: Stephen Duncan Jr [mailto:stephen.dun...@gmail.com]
>> Sent: Monday, August 22, 2011 9:30 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Text Analysis and copyField
>>
>> On Mon, Aug 22, 2011 at 9:25 AM, Herman Kiefus <herm...@angieslist.com> 
>> wrote:
>>> Is my thinking correct?
>>>
>>> I have a field 'F1' of type 'T1' whose index time analysis employs the 
>>> StopFilterFactory.
>>>
>>> I also have a field 'F2' of type 'T2' whose index time analysis does NOT 
>>> employ the StopFilterFactory.
>>>
>>> There is a copyField directive source="F1" dest="F2"
>>>
>>> F2 will not contain any stop words because they were filtered out as F1 was 
>>> populated.
>>>
>>
>> No, F2 will contain stop words.  copyField does not process input through 
>> a chain; it sends the original content to each field, so each field's 
>> analysis is totally independent.
>>
>> --
>> Stephen Duncan Jr
>> www.stephenduncanjr.com
>>
>
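
The F1/F2 case from the oldest message in the thread can be sketched as 
follows. The names F1, F2, T1, and T2 come from that message; the tokenizer, 
the field attributes, and the file name 'stopwords.txt' are assumptions made 
for illustration:

```xml
<!-- T1 removes stop words at index time -->
<fieldType name="T1" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<!-- T2 tokenizes only; no stop word removal -->
<fieldType name="T2" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="F1" type="T1" indexed="true" stored="false"/>
<field name="F2" type="T2" indexed="true" stored="false"/>

<!-- F2 receives the original input submitted for F1, not F1's analyzed
     tokens, and is then analyzed by T2 alone, so stop words reach F2 -->
<copyField source="F1" dest="F2"/>
```

Because the copy happens before analysis, a field that is itself only a 
copyField destination has no analyzed output to pass along, which matches the 
conclusion reached later in the thread that copyField directives cannot be 
chained.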
