Hello Elisabeth,

Wouldn't it be simpler to have a custom component in the front-end to your search server that transforms a query like <<hotel de ville paris>> into <<"hotel de ville" paris>> (i.e. turning each occurrence of the sequence "hotel de ville" into a phrase query)?
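Such a front-end rewrite could be sketched as below. PhraseProtector and the regex approach are my own illustration, not existing Solr machinery, and a production version would need to handle spans the user already quoted:

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical front-end component: wrap known multi-word expressions in
// double quotes so the query parser sees them as phrase queries.
public class PhraseProtector {
    private final List<String> phrases;

    public PhraseProtector(List<String> phrases) {
        this.phrases = phrases;
    }

    public String rewrite(String query) {
        String result = query;
        for (String phrase : phrases) {
            // Naive case-insensitive replacement on word boundaries; a real
            // component would also skip spans the user already quoted.
            result = result.replaceAll(
                "(?i)\\b" + Pattern.quote(phrase) + "\\b",
                "\"" + phrase + "\"");
        }
        return result;
    }
}
```

With the phrase list loaded from a file, `rewrite("hotel de ville paris")` would yield `"hotel de ville" paris`.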
Concerning protected expressions inside the tokenizer, I don't think that is currently possible. The main reason is that the QueryParser breaks the query on each space before passing each query part through the analysis chain of every searched field. Hence none of the smart work you do at indexing time to wrap a sequence of tokens into a single token can be reproduced at query time. Please someone correct me if I'm wrong!

Alternatively, I think you could do it with a custom query parser (so that whole phrases are sent to the analyzers instead of single words). But since tokenizers have no support for a protected-words list, you would also need a custom token filter that consumes the token stream and annotates the tokens matching an entry in the protection list. Unfortunately, if your protected list is long, you will run into performance issues unless you rely on a dedicated data structure, such as a trie (e.g. a Patricia trie); you can find solid implementations on the Internet (see https://github.com/rkapsi/patricia-trie).

Your filter would then consume a "sliding window" of tokens for as long as the window matches a path in your trie. Once you have a complete match, the filter can set an attribute of the type of your choice (e.g. MyCustomKeywordAttribute) on the first matching token and store the complete match (e.g. "Hotel de ville") in that attribute. If you don't have a complete match, emit the unmatched tokens unmodified.

I hope this helps...

--
Tanguy

2012/5/22 elisabeth benoit <elisaelisael...@gmail.com>

> Hello,
>
> Does someone know if there is a way to configure a tokenizer to split on
> white spaces, all words excluding a bunch of expressions listed in a file?
>
> For instance, if I want "hotel de ville" not to be split into words, a
> request like "hotel de ville paris" would be split into two tokens:
>
> "hotel de ville" and "paris"
>
> instead of four tokens:
>
> "hotel"
> "de"
> "ville"
> "paris"
>
> I imagine something like
>
> <tokenizer class="solr.StandardTokenizerFactory"
>     protected="protoexpressions.txt"/>
>
> Thanks a lot,
> Elisabeth
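The sliding-window trie matching I described could be sketched like this in plain Java. ProtectedPhraseFilter and TrieNode are hypothetical names, and I am using plain token lists for readability; a real Lucene implementation would extend TokenFilter and set attributes on the token stream instead:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of sliding-window matching of protected phrases against a trie.
public class ProtectedPhraseFilter {
    // Minimal trie node: children keyed by word, flag marking a phrase end.
    static class TrieNode {
        Map<String, TrieNode> children = new HashMap<>();
        boolean terminal;
    }

    private final TrieNode root = new TrieNode();

    public void addPhrase(String phrase) {
        TrieNode node = root;
        for (String word : phrase.toLowerCase().split("\\s+")) {
            node = node.children.computeIfAbsent(word, k -> new TrieNode());
        }
        node.terminal = true;
    }

    // Slide a window over the tokens; on a complete trie match, emit the
    // whole phrase as one token, otherwise emit unmatched tokens unmodified.
    public List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            TrieNode node = root;
            int matchEnd = -1;
            for (int j = i; j < tokens.size(); j++) {
                node = node.children.get(tokens.get(j).toLowerCase());
                if (node == null) break;
                if (node.terminal) matchEnd = j; // remember longest match
            }
            if (matchEnd >= 0) {
                out.add(String.join(" ", tokens.subList(i, matchEnd + 1)));
                i = matchEnd + 1;
            } else {
                out.add(tokens.get(i++));
            }
        }
        return out;
    }
}
```

With "hotel de ville" registered, the token stream ["hotel", "de", "ville", "paris"] comes out as ["hotel de ville", "paris"], while a partial match like ["hotel", "de", "paris"] passes through unchanged.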