Re: termFreq always = 1 ?

Otis Gospodnetic Thu, 02 Oct 2008 08:50:42 -0700

You have:
;arm;arms;elbow;elbows;man;men;male;males;indoors;one;person;Men's;moods;


Note these two:
men
Men's

You probably tokenize, lowercase, and stem that field, so you end up 
with 2 "men" tokens:
men ==> men
Men's ==> men

Hence your term freq of 2.  You could:
1) lowercase outside of Solr, before indexing
2) feed text with sorted words to Solr
3) use the token filter that removes duplicates (RemoveDuplicatesTokenFilterFactory) after stemming

That could work.
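A rough sketch of doing the cleanup outside of Solr, before indexing. This is Python and everything in it is made up for illustration: normalize() is only a crude stand-in for the analyzer (lowercase, strip a trailing 's), not the real stemmer, and dedupe_keywords() / term_freq() are hypothetical helpers, not Solr APIs. The sqrt() at the end is the tf factor visible in your explain output (1.4142135 for termFreq=2).

```python
import math


def normalize(keyword: str) -> str:
    """Crude stand-in for the analyzer (assumption): lowercase, drop a trailing 's."""
    word = keyword.strip().lower()
    if word.endswith("'s"):
        word = word[:-2]
    return word


def dedupe_keywords(raw: str, sep: str = ";") -> str:
    """Keep only the first keyword for each normalized form, preserving order."""
    seen = set()
    kept = []
    for keyword in raw.split(sep):
        if not keyword.strip():
            continue  # skip the empty pieces from leading/trailing separators
        key = normalize(keyword)
        if key not in seen:
            seen.add(key)
            kept.append(keyword.strip())
    return sep.join(kept)


def term_freq(raw: str, term: str, sep: str = ";") -> int:
    """Count keywords whose normalized form equals the given term."""
    return sum(1 for k in raw.split(sep) if k.strip() and normalize(k) == term)


raw = ";arm;arms;elbow;elbows;man;men;male;males;indoors;one;person;Men's;moods;"
cleaned = dedupe_keywords(raw)

freq_before = term_freq(raw, "men")      # "men" and "Men's" both normalize to "men"
freq_after = term_freq(cleaned, "men")   # only the first one survives

tf_before = math.sqrt(freq_before)       # tf factor as shown in the explain output
tf_after = math.sqrt(freq_after)
```

A real stemmer would also collapse pairs like arm/arms, so the normalization you dedupe with has to match whatever your analyzer chain actually does, otherwise duplicates can still slip through at index time.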

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: KLessou <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, October 2, 2008 4:41:12 AM
> Subject: Re: termFreq always = 1 ?
> 
> Yes, each one is a document.
> 
> A real example :
> 
> k1_en:men
> 
> 
>     0.81426066
> ...
>     846
> ...
>     
> 
> ;arm;arms;elbow;elbows;man;men;male;males;indoors;one;person;Men's;moods;
>     
> ...
> 
> 
> ...
> 
> 
> 0.6232885
> 
> ...
> 
>   652
> 
>   
>       
> ;portrait;portraits;young;adult;young;adults;*man*;*men*;male;males;male;males;young;*men*;young;*man*;identity;identities;self-confidence;assertiveness;male;beauty;masculine;beauty;*men's*;beauty;indoors;inside;day;daytime;one;person;one;individual;northern;european;caucasian
>   
> 
> ...
> 
> 
> 
> .;.
> 
> 
>   
> 0.81426066 = (MATCH) weight(k1_en:men in 35050), product of:
>   0.99999994 = queryWeight(k1_en:men), product of:
>     2.3030772 = idf(docFreq=17576, numDocs=64694)
>     0.43420166 = queryNorm
>   0.8142607 = (MATCH) fieldWeight(k1_en:men in 35050), product of:
>     *1.4142135 = tf(termFreq(k1_en:men)=2)*
>     2.3030772 = idf(docFreq=17576, numDocs=64694)
>     0.25 = fieldNorm(field=k1_en, doc=35050)
> 
> ...
> 
> 0.62328845 = (MATCH) weight(k1_en:men in 13312), product of:
>   0.99999994 = queryWeight(k1_en:men), product of:
>     2.3030772 = idf(docFreq=17576, numDocs=64694)
>     0.43420166 = queryNorm
>   0.6232885 = (MATCH) fieldWeight(k1_en:men in 13312), product of:
>     *1.7320508 = tf(termFreq(k1_en:men)=3)*
>     2.3030772 = idf(docFreq=17576, numDocs=64694)
>     0.15625 = fieldNorm(field=k1_en, doc=13312)
> 
> ...
> 
> You can see here for the first document termFreq = 2 and for the
> second document termFreq = 3 ...
> 
> And I would like to have termFreq = 1 in each case for this field (k1_en).
> 
> Thanks in advance for your help,
> 
> 
> 
> 
> 
> 
> 
> On Wed, Oct 1, 2008 at 8:45 PM, Otis Gospodnetic wrote:
> 
> > In each of your examples (is each one a document?) I see only 1 "men"
> > instance, so "men" term frequency should be 1 for that document.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > ----- Original Message ----
> > > From: KLessou 
> > > To: [email protected]
> > > Sent: Wednesday, October 1, 2008 11:43:59 AM
> > > Subject: Re: termFreq always = 1 ?
> > >
> > > Yes this may be my problem,
> > >
> > > > But is there any solution to have only one "men" keyword indexed when I've
> > > > got something like this:
> > >
> > > 1 - k1_en = men;business;Men
> > > or :
> > > 2 - k1_en = man,business,men
> > > or :
> > > 3 - k1_en = Man,men,business,Men,man
> > > ...
> > >
> > > Thx in advance,
> > >
> > > > On Wed, Oct 1, 2008 at 5:12 PM, Otis Gospodnetic wrote:
> > >
> > > > Hi,
> > > >
> > > > > Note that RemoveDuplicatesTokenFilterFactory "filters out any tokens
> > > > > which are at the same logical position in the tokenstream as a previous
> > > > > token with the same text."
> > > >
> > > > So if you have "men in black are real men" then
> > > > RemoveDuplicatesTokenFilterFactory will not remove duplicate "men".
> > > >
> > > > This may or may not be your problem.
> > > >
> > > > Otis
> > > > --
> > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > >
> > > >
> > > >
> > > > ----- Original Message ----
> > > > > From: KLessou
> > > > > To: [email protected]
> > > > > Sent: Wednesday, October 1, 2008 9:48:28 AM
> > > > > Subject: termFreq always = 1 ?
> > > > >
> > > > > Hi,
> > > > >
> > > > > I want to index a list of keywords.
> > > > >
> > > > > When I search "k1_en:men", I find a lot of documents like that :
> > > > >
> > > > > DocA :
> > > > > (k1_en = man;men;Men;business... termFreq=2)
> > > > > DocB :
> > > > > (k1_en = man;Men;business... termFreq=1)
> > > > > DocC :
> > > > > ...
> > > > > DocD :
> > > > > ...
> > > > > DocE :
> > > > > ...
> > > > >
> > > > > But I don't want to have a different termFreq for DocA & DocB.
> > > > >
> > > > > > I tried RemoveDuplicatesTokenFilterFactory but it doesn't seem to help me :-/
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ignoreCase="true"/>
> > > > >
> > > > > protected="protwords.txt" />
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >                     generateWordParts="0"
> > > > >                     generateNumberParts="0"
> > > > >                     catenateWords="0"
> > > > >                     catenateNumbers="0"
> > > > >                     catenateAll="0"
> > > > >                     />
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > />
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ignoreCase="true"/>
> > > > >
> > > > > protected="protwords.txt" />
> > > > >
> > > > >
> > > > >
> > > > >                     generateWordParts="0"
> > > > >                     generateNumberParts="0"
> > > > >                     catenateWords="0"
> > > > >                     catenateNumbers="0"
> > > > >                     catenateAll="0"
> > > > >                     />
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ...
> > > > >
> > > > >
> > > > >
> > > > > required="false" />
> > > > >
> > > > >
> > > > > If you have any idea, thx in advance.
> > > > >
> > > > > --
> > > > > ~~~~~
> > > > > | klessou |
> > > > > ~~~~~
> > > >
> > > >
> > >
> > >
> > > --
> > > ~~~~~
> > > | klessou |
> > > ~~~~~
> >
> >
> 
> 
> -- 
> ~~~~~
> | klessou |
> ~~~~~
