Re: Terms and termscomponent questions

Erick Erickson Thu, 03 Feb 2011 05:14:56 -0800

There are a couple of things going on here. First, WordDelimiterFilterFactory
is splitting things up on letter/number boundaries. Take a look at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
for a list of *some* of the available tokenizers. You may want to just use
one of the others, or change the parameters to WordDelimiterFilterFactory
so that it does not split the way it does now.
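
As a rough sketch of the "change the parameters" option, a field type along
these lines might keep tokens such as word11 whole. The name "text_suggest"
is made up for illustration, dropping the stop word, synonym and stemming
filters is a design assumption for an auto-suggest field, and splitOnNumerics
is only available if your Solr version supports it. Verify the result on the
analysis page before relying on it.

```xml
<!-- Hypothetical field type for suggestions; the name is illustrative only. -->
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the input on whitespace only. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Still strips punctuation such as a trailing comma, but:
         catenateWords/catenateNumbers/catenateAll="0" emit no joined terms,
         splitOnNumerics="0" does not split at letter/number boundaries,
         splitOnCaseChange="0" does not split at case changes. -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A field would then reference it with type="text_suggest" in place of
type="text", followed by a full reindex.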

See the page http://localhost:8983/solr/admin/analysis.jsp and check the
"verbose" box to see what the effects of the various elements in your
analysis chain are. This is a very important page for understanding the
analysis part of the whole operation.

Second, if you've been trying different things out, you may well have some
old stuff in your index. When you delete documents, the terms are still in
the index until an optimize. I'd advise starting with a clean slate for your
experiments each time. The cheap way to do this is to stop your server and
delete <solr_home>/data/index. Delete the index directory itself, not just
its contents. So it's possible your TermsComponent is returning data from
previous attempts, because I sure don't see how the concatenated terms would
be in this index given the definition you've posted.
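
If stopping the server is inconvenient, an alternative sketch is to clear
the index through XML update messages. This assumes the standard XML update
handler is enabled (for example at http://localhost:8983/solr/update, a URL
guessed from the analysis page address above). Each element below is a
separate message, posted on its own.

```xml
<!-- Message 1: delete every document with a match-all query. -->
<delete><query>*:*</query></delete>

<!-- Message 2: make the deletion visible to searchers. -->
<commit/>

<!-- Message 3: merge segments so terms of deleted documents are purged. -->
<optimize/>
```

Deleting the index directory remains the more thorough reset. This variant
only saves the restart.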

And if none of that works, well, we'll try something else <G>..

Best
Erick

On Tue, Feb 1, 2011 at 10:07 AM, openvictor Open <openvic...@gmail.com> wrote:

> Dear Erick,
>
> Thank you for your answer, here is my fieldtype definition. I took the
> standard one because I don't need a better one for this field
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.SnowballPorterFilterFactory" language="English"
> protected="protwords.txt"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.SnowballPorterFilterFactory" language="English"
> protected="protwords.txt"/>
> </analyzer>
> </fieldType>
>
> Now my field :
>
> <field name="p_field" type="text" indexed="true" stored="true"/>
>
> But I have a doubt now... Do I really put a space between words, or is it
> just a comma? If I only put a comma, is the whole process going to be
> impacted? What I don't really understand is that I find the separate words,
> but also their concatenation (but again in one direction only). Let me
> explain: if I have "man" "bear" "pig" I will find
> "manbearpig" and "bearpig", but never "pigman" or any other combination in a
> different order.
>
> Thank you very much
> Best Regards,
> Victor
>
> 2011/2/1 Erick Erickson <erickerick...@gmail.com>
>
> > Nope, this isn't what I'd expect. There are a couple of possibilities:
> > 1> check out what WordDelimiterFilterFactory is doing, although
> >     if you're really sending spaces that's probably not it.
> > 2> Let's see the <field> and <fieldType> definitions for the field
> >     in question. type="text" doesn't say anything about analysis,
> >     and that's where I'd expect you're having trouble. In particular
> >     if your analysis chain uses KeywordTokenizerFactory for instance.
> > 3> Look at the admin/schema browse page, look at your field and
> >     see what the actual tokens are. That'll tell you what TermsComponents
> >     is returning, perhaps the concatenation is happening somewhere
> >     else.
> >
> > Bottom line: Solr will not concatenate terms like this unless you tell it
> > to, so I suspect you're telling it to, you just don't realize it <G>...
> >
> > Best
> > Erick
> >
> > On Tue, Feb 1, 2011 at 1:33 AM, openvictor Open <openvic...@gmail.com> wrote:
> >
> > > Dear Solr users,
> > >
> > > I am currently using Solr and TermsComponent to build an auto-suggest
> > > for my website.
> > >
> > > I have a field called p_field, indexed and stored, with type="text" in
> > > the schema XML. Nothing out of the ordinary.
> > > I feed Solr a set of words separated by a comma and a space, such as
> > > (for two documents):
> > >
> > > Document 1:
> > > word11, word12, word13. word14
> > >
> > > Document 2:
> > > word21, word22, word23. word24
> > >
> > >
> > > When I use my newly designed field, I get the following for the prefix
> > > "word1":
> > > word11, word12, word13. word14 word11word12 word11word13 etc...
> > > Is it normal to have the concatenation of words indexed, and not only
> > > the words themselves? Did I miss something about Terms?
> > >
> > > Thank you very much,
> > > Best regards all,
> > > Victor
> > >
> >
>
