Hello Tanguy, I guess you're right; maybe this shouldn't be done in Solr but inside the front-end.
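For what it's worth, that front-end rewrite could be as small as the sketch below: scan the incoming query for known multi-word expressions and wrap each one in double quotes so the search server receives a phrase query. The phrase list, function name, and matching strategy are all made up for illustration; this is not Solr API code.

```python
# Illustrative front-end rewrite: turn known multi-word expressions
# into phrase queries before the query reaches Solr.
# PROTECTED would normally be loaded from a file such as
# protoexpressions.txt; the entries here are examples.
PROTECTED = ["hotel de ville", "gare du nord"]

def quote_protected(query: str) -> str:
    """Wrap each protected expression found in the query in double quotes."""
    result = query
    # Try longer expressions first so they are not shadowed by shorter ones.
    for phrase in sorted(PROTECTED, key=len, reverse=True):
        idx = result.lower().find(phrase)
        if idx != -1:
            result = (result[:idx] + '"' + result[idx:idx + len(phrase)]
                      + '"' + result[idx + len(phrase):])
    return result

# quote_protected("hotel de ville paris") -> '"hotel de ville" paris'
```

A real front-end would want case-insensitive, whole-word matching and handling of already-quoted input, but the shape of the idea is the same.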
Thanks a lot for your answer.
Elisabeth

2012/5/22 Tanguy Moal <tanguy.m...@gmail.com>

> Hello Elisabeth,
>
> Wouldn't it be simpler to have a custom component in the front-end to
> your search server that would transform a query like <<hotel de ville
> paris>> into <<"hotel de ville" paris>> (i.e. turning each occurrence
> of the sequence "hotel de ville" into a phrase query)?
>
> Concerning protected expressions inside the tokenizer, I think that is
> actually not possible. The main reason is that the QueryParser breaks
> the query on each space before passing each query part through the
> analysis chain of every searched field. Hence the smart things you do
> at indexing time to wrap a sequence of tokens into a single one are
> not reproducible at query time.
>
> Please someone correct me if I'm wrong!
>
> Alternatively, I think you could do this with a custom query parser
> (so that phrases are sent to the analyzers instead of single words).
> But since tokenizers have no support for a protected-words list, you
> would also need a custom token filter that consumes the token stream
> and annotates the tokens matching an entry in the protection list.
> Unfortunately, if your protected list is long, you will run into
> performance issues unless you rely on a dedicated data structure,
> such as a Trie-based structure (Patricia trie, ...). You can find
> solid implementations on the Internet (see
> https://github.com/rkapsi/patricia-trie).
>
> Then you could make your filter consume a "sliding window" of tokens
> for as long as the window matches a path in your trie. Once you have a
> complete match, the filter can set an attribute of the type of your
> choice (e.g. MyCustomKeywordAttribute) on the first matching token and
> make the attribute carry the complete match (e.g. "hotel de ville").
> If you don't have a complete match, discard the window and emit the
> unmatched tokens unmodified.
>
> I hope this helps...
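To make the sliding-window idea concrete, here is a standalone sketch of the matching logic, not actual Lucene TokenFilter code: a plain nested-dict trie stands in for the Patricia trie, and the window is extended while it still matches a trie path; on a complete match the window is merged into one keyword token, otherwise the tokens pass through unmodified. All names are invented for the example.

```python
# Illustrative sliding-window trie match over a token stream.
# A nested dict stands in for a Patricia trie of protected expressions.
END = "__end__"  # marks a complete protected expression in the trie

def build_trie(expressions):
    """Build a word-level trie from expressions like 'hotel de ville'."""
    trie = {}
    for expr in expressions:
        node = trie
        for word in expr.split():
            node = node.setdefault(word, {})
        node[END] = True
    return trie

def annotate(tokens, trie):
    """Merge runs of tokens that form a protected expression."""
    out, i = [], 0
    while i < len(tokens):
        node, j, last_match = trie, i, None
        # Extend the window while it still matches a path in the trie.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                last_match = j  # remember the longest complete match so far
        if last_match is not None:
            # Complete match: emit the whole window as a single token.
            out.append(" ".join(tokens[i:last_match]))
            i = last_match
        else:
            # Partial or no match: emit the token unmodified.
            out.append(tokens[i])
            i += 1
    return out
```

A real Lucene filter would carry the match on an attribute rather than rewriting the token text, but the window-advance logic would look much like this.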
> --
> Tanguy
>
>
> 2012/5/22 elisabeth benoit <elisaelisael...@gmail.com>
>
> > Hello,
> >
> > Does someone know if there is a way to configure a tokenizer to
> > split on white spaces, all words excluding a set of expressions
> > listed in a file?
> >
> > For instance, if I want "hotel de ville" not to be split into
> > words, a request like "hotel de ville paris" would be split into
> > two tokens:
> >
> > "hotel de ville" and "paris"
> >
> > instead of four tokens:
> >
> > "hotel"
> > "de"
> > "ville"
> > "paris"
> >
> > I imagine something like
> >
> > <tokenizer class="solr.StandardTokenizerFactory"
> >   protected="protoexpressions.txt"/>
> >
> > Thanks a lot,
> > Elisabeth