Hello Tanguy, I guess you're right; maybe this shouldn't be done in Solr but inside the front-end.
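For what it's worth, that front-end rewrite could be as small as the sketch below: scan the incoming query for known multi-word expressions and wrap each one in double quotes so the search server receives a phrase query. The phrase list, function name, and matching strategy are all made up for illustration; this is not Solr API code.

```python
# Illustrative front-end rewrite: turn known multi-word expressions
# into phrase queries before the query reaches Solr.
# PROTECTED would normally be loaded from a file such as
# protoexpressions.txt; the entries here are examples.
PROTECTED = ["hotel de ville", "gare du nord"]

def quote_protected(query: str) -> str:
    """Wrap each protected expression found in the query in double quotes."""
    result = query
    # Try longer expressions first so they are not shadowed by shorter ones.
    for phrase in sorted(PROTECTED, key=len, reverse=True):
        idx = result.lower().find(phrase)
        if idx != -1:
            result = (result[:idx] + '"' + result[idx:idx + len(phrase)]
                      + '"' + result[idx + len(phrase):])
    return result

# quote_protected("hotel de ville paris") -> '"hotel de ville" paris'
```

A real front-end would want case-insensitive, whole-word matching and handling of already-quoted input, but the shape of the idea is the same.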
Thanks a lot for your answer.
Elisabeth

2012/5/22 Tanguy Moal <tanguy.m...@gmail.com>

> Hello Elisabeth,
>
> Wouldn't it be simpler to have a custom component in the front-end to
> your search server that would transform a query like <<hotel de ville
> paris>> into <<"hotel de ville" paris>> (i.e. turning each occurrence
> of the sequence "hotel de ville" into a phrase query)?
>
> Concerning protected expressions inside the tokenizer, I think that is
> actually not possible. The main reason is that the QueryParser breaks
> the query on each space before passing each query part through the
> analysis chain of every searched field. Hence the smart things you do
> at indexing time to wrap a sequence of tokens into a single one are
> not reproducible at query time.
>
> Please someone correct me if I'm wrong!
>
> Alternatively, I think you could do this with a custom query parser
> (so that phrases are sent to the analyzers instead of single words).
> But since tokenizers have no support for a protected-words list, you
> would also need a custom token filter that consumes the token stream
> and annotates the tokens matching an entry in the protection list.
> Unfortunately, if your protected list is long, you will run into
> performance issues unless you rely on a dedicated data structure,
> such as a Trie-based structure (Patricia trie, ...). You can find
> solid implementations on the Internet (see
> https://github.com/rkapsi/patricia-trie).
>
> Then you could make your filter consume a "sliding window" of tokens
> for as long as the window matches a path in your trie. Once you have a
> complete match, the filter can set an attribute of the type of your
> choice (e.g. MyCustomKeywordAttribute) on the first matching token and
> make the attribute carry the complete match (e.g. "hotel de ville").
> If you don't have a complete match, discard the window and emit the
> unmatched tokens unmodified.
>
> I hope this helps...
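To make the sliding-window idea concrete, here is a standalone sketch of the matching logic, not actual Lucene TokenFilter code: a plain nested-dict trie stands in for the Patricia trie, and the window is extended while it still matches a trie path; on a complete match the window is merged into one keyword token, otherwise the tokens pass through unmodified. All names are invented for the example.

```python
# Illustrative sliding-window trie match over a token stream.
# A nested dict stands in for a Patricia trie of protected expressions.
END = "__end__"  # marks a complete protected expression in the trie

def build_trie(expressions):
    """Build a word-level trie from expressions like 'hotel de ville'."""
    trie = {}
    for expr in expressions:
        node = trie
        for word in expr.split():
            node = node.setdefault(word, {})
        node[END] = True
    return trie

def annotate(tokens, trie):
    """Merge runs of tokens that form a protected expression."""
    out, i = [], 0
    while i < len(tokens):
        node, j, last_match = trie, i, None
        # Extend the window while it still matches a path in the trie.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                last_match = j  # remember the longest complete match so far
        if last_match is not None:
            # Complete match: emit the whole window as a single token.
            out.append(" ".join(tokens[i:last_match]))
            i = last_match
        else:
            # Partial or no match: emit the token unmodified.
            out.append(tokens[i])
            i += 1
    return out
```

A real Lucene filter would carry the match on an attribute rather than rewriting the token text, but the window-advance logic would look much like this.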
> --
> Tanguy
>
>
> 2012/5/22 elisabeth benoit <elisaelisael...@gmail.com>
>
> > Hello,
> >
> > Does someone know if there is a way to configure a tokenizer to
> > split on white spaces, all words excluding a set of expressions
> > listed in a file?
> >
> > For instance, if I want "hotel de ville" not to be split into
> > words, a request like "hotel de ville paris" would be split into
> > two tokens:
> >
> > "hotel de ville" and "paris"
> >
> > instead of four tokens:
> >
> > "hotel"
> > "de"
> > "ville"
> > "paris"
> >
> > I imagine something like
> >
> > <tokenizer class="solr.StandardTokenizerFactory"
> >   protected="protoexpressions.txt"/>
> >
> > Thanks a lot,
> > Elisabeth