Re: Terms and termscomponent questions

Erick Erickson Thu, 03 Feb 2011 10:00:22 -0800

Ah, good. Good luck with the rest of your app! WordDelimiterFilterFactory
is powerful, but tricky <G>...


Best
Erick

On Thu, Feb 3, 2011 at 9:51 AM, openvictor Open <openvic...@gmail.com> wrote:

> Dear Erick,
>
> You were totally right: I didn't use any space to separate the words,
> which caused Solr to concatenate them!
> Everything is solved now. Thank you very much for your help!
>
> Best regards,
> Victor Kabdebon
>
> 2011/2/3 Erick Erickson <erickerick...@gmail.com>
>
> > There are a couple of things going on here. First,
> > WordDelimiterFilterFactory is splitting things up on letter/number
> > boundaries. Take a look at:
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> >
> > for a list of *some* of the available tokenizers. You may want to just use
> > one of the others, or change the parameters to WordDelimiterFilterFactory
> > so that it doesn't split the way it currently does.
> >
> > See the page: http://localhost:8983/solr/admin/analysis.jsp and check the
> > "verbose" box to see what the effects of the various elements in your
> > analysis chain are. This is a very important page for understanding the
> > analysis part of the whole operation.
> >
> > Second, if you've been trying different things out, you may well have some
> > old stuff in your index. When you delete documents, the terms are still in
> > the index until an optimize. I'd advise starting with a clean slate for
> > your experiments each time. The cheap way to do this is stop your server
> > and delete <solr_home>/data/index. Delete the index directory too, not just
> > the contents. So it's possible your TermsComponent is returning data from
> > previous attempts, because I sure don't see how the concatenated terms
> > would be in this index given the definition you've posted.
> >
> > And if none of that works, well, we'll try something else <G>..
> >
> > Best
> > Erick
> >
> > On Tue, Feb 1, 2011 at 10:07 AM, openvictor Open <openvic...@gmail.com> wrote:
> >
> > > Dear Erick,
> > >
> > > Thank you for your answer. Here is my fieldType definition. I took the
> > > standard one because I don't need a better one for this field.
> > >
> > > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >             words="stopwords.txt" enablePositionIncrements="true"/>
> > >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> > >             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > >             catenateAll="0" splitOnCaseChange="1"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> > >             protected="protwords.txt"/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > >             ignoreCase="true" expand="true"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >             words="stopwords.txt" enablePositionIncrements="true"/>
> > >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> > >             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> > >             catenateAll="0" splitOnCaseChange="1"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> > >             protected="protwords.txt"/>
> > >   </analyzer>
> > > </fieldType>
> > >
> > > Now my field:
> > >
> > > <field name="p_field" type="text" indexed="true" stored="true"/>
> > >
> > > But I have a doubt now... Do I really put a space between the words, or is
> > > it just a comma? If I only put a comma, is the whole process going to be
> > > affected? What I don't really understand is that I find the separate
> > > words, but also their concatenation (but again in one direction only). Let
> > > me explain: if I have "man" "bear" "pig" I will find
> > > "manbearpig" and "bearpig", but never "pigman" or any other combination in
> > > a different order.
> > >
> > > Thank you very much
> > > Best Regards,
> > > Victor
> > >
> > > 2011/2/1 Erick Erickson <erickerick...@gmail.com>
> > >
> > > > Nope, this isn't what I'd expect. There are a couple of possibilities:
> > > > 1> check out what WordDelimiterFilterFactory is doing, although
> > > >     if you're really sending spaces that's probably not it.
> > > > 2> Let's see the <field> and <fieldType> definitions for the field
> > > >     in question. type="text" doesn't say anything about analysis,
> > > >     and that's where I'd expect you're having trouble. In particular
> > > >     if your analysis chain uses KeywordTokenizerFactory for instance.
> > > > 3> Look at the admin/schema browser page, look at your field and
> > > >     see what the actual tokens are. That'll tell you what TermsComponent
> > > >     is returning, perhaps the concatenation is happening somewhere
> > > >     else.
> > > >
> > > > Bottom line: Solr will not concatenate terms like this unless you tell it
> > > > to, so I suspect you're telling it to, you just don't realize it <G>...
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On Tue, Feb 1, 2011 at 1:33 AM, openvictor Open <openvic...@gmail.com> wrote:
> > > >
> > > > > Dear Solr users,
> > > > >
> > > > > I am currently using Solr and TermsComponent to build an auto-suggest
> > > > > for my website.
> > > > >
> > > > > I have a field called p_field, indexed and stored, with type="text" in
> > > > > the schema XML. Nothing out of the ordinary.
> > > > > I feed Solr a set of words separated by a comma and a space, such as
> > > > > (for two documents):
> > > > >
> > > > > Document 1:
> > > > > word11, word12, word13. word14
> > > > >
> > > > > Document 2:
> > > > > word21, word22, word23. word24
> > > > >
> > > > >
> > > > > When I use my newly designed field I get these terms for the prefix
> > > > > "word1":
> > > > > word11, word12, word13. word14 word11word12 word11word13 etc...
> > > > > Is it normal to get the concatenation of words and not only the
> > > > > indexed words? Did I miss something about Terms?
> > > > >
> > > > > Thank you very much,
> > > > > Best regards all,
> > > > > Victor
> > > > >
> > > >
> > >
> >
>
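
For readers following the thread, here is a minimal sketch of the index-time
filter setting it revolves around, assuming the fieldType posted by Victor.
With catenateWords="1", a single whitespace-delimited token such as
"man,bear,pig" is split into its parts and the joined form is indexed as well,
which would explain "manbearpig" showing up in TermsComponent output. Turning
the catenate options off on the index analyzer is one way to avoid that. This
is an illustration only; neither poster confirmed using it, and the fix
reported in the thread was simply to put spaces between the words.

```xml
<!-- Sketch only: the same index-time filter as in the posted fieldType,
     with catenation switched off. Not tested against the poster's data. -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="1"/>
```

After a change like this, the field has to be reindexed, ideally into a clean
index as Erick suggests, since terms from earlier attempts can linger.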
