Hi Erick,

Thanks for your comments! In fact, I started with Solr one month ago, so I am still learning! =)
I understand the differences between the Solr tokenizers, but there are so many options that it takes some time to find the one that fits our needs.

I found a solution to my problem with the configuration below:

<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory" />
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
  <filter class="solr.ReversedWildcardFilterFactory" />
</analyzer>

A search using the Complex Phrase Query Parser, like the one below, now returns the desired document:

{!complexphrase df=title}"teste de texto*"

I think the problem with my previous field setup was the StopFilterFactory, as the Complex Phrase Query Parser documentation states: "It is recommended not to use stopword elimination with this query parser." [1]

I've done some tests and, so far, this setup fits my needs (queries). As I mentioned, I am new to Solr, so I would like your input, and the Solr community's, on whether there is a better or different way to achieve the same result, or whether you see any problem with the setup above.

Thanks a lot for your help!

Regards,
Felipe.

[1] https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

On Tue, Jun 28, 2016 at 2:22 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> OK, you really have to get familiar with the
> admin/analysis page. Whitespace tokenizer
> is really simple, it breaks up on whitespace. So
> punctuation is kept in the index. Which is very
> rarely what you want. Use something like
> StandardTokenizer or maybe a filter that
> removes all non-alpha-num characters (
> see one of the regex filters).
>
> ComplexPhrase should do what you want, but if
> (and only if) you've indexed stuff appropriately. So
> I'd concentrate on getting the indexing to do
> what you need, then worry about querying.
>
> KeywordTokenizer is pretty much inappropriate for
> any kind of free-text search, it doesn't break the input
> up at _all_.
>
> And you need to completely re-index all your docs when
> you change the schema. There are a _few_ cases
> where that's not necessary, but until you're very
> familiar with the nuances it's much safer just
> to re-index from scratch. It _will_ work to
>
> shut down Solr
>
> rm -r the_data_directory
>
> restart solr
>
> That'll wipe everything out. If you're in Solr Cloud
> I'd recommend deleting and recreating the collection
> on schema change.
>
> Best,
> Erick
>
> On Mon, Jun 27, 2016 at 2:21 PM, Felipe Vinturini
> <felipe.vintur...@gmail.com> wrote:
>
> > Hi *all*!
> >
> > First time posting! I have been struggling with Solr v4.10.2 with a
> > PhraseQuery with wildcard!
> >
> > My field definition is below:
> >
> > <!-- Search field -->
> > <field name="title" type="text_pt_en" indexed="true" stored="true" />
> >
> > <!-- Field definition -->
> > <fieldType name="text_pt_en" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <analyzer type="index">
> >     <charFilter class="solr.HTMLStripCharFilterFactory" />
> >
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="lang/stopwords_pt.txt" format="snowball"
> >             enablePositionIncrements="true" />
> >     <tokenizer class="solr.WhitespaceTokenizerFactory" />
> >     <!-- <tokenizer class="solr.KeywordTokenizerFactory" /> -->
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
> >     <filter class="solr.ReversedWildcardFilterFactory" />
> >   </analyzer>
> >
> >   <analyzer type="query">
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="lang/stopwords_pt.txt" format="snowball"
> >             enablePositionIncrements="true" />
> >     <tokenizer class="solr.WhitespaceTokenizerFactory" />
> >     <!-- <tokenizer class="solr.KeywordTokenizerFactory" /> -->
> >     <filter class="solr.LowerCaseFilterFactory" />
> >     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
> >   </analyzer>
> > </fieldType>
> >
> > Let's suppose I have the following value added to the index of the field
> > above (portuguese):
> >
> >   Teste de texto; Será quebrado em espaços em branco!
> >
> > And the values added to the index, based on the analyzer chain will be
> > (from Solr "Analysis"):
> >
> >   etset teste ;otxet texto; odarbeuq quebrado socapse espacos !ocnarb branco!
> >
> > Today, I can search, for example:
> >
> >   title:teste
> >   title:(teste texto)
> >   title:(teste de texto)
> >   title:("teste de texto;") // (PhraseQuery) matches because of ";" in the
> >                             // end of the string
> >
> > But, if I try to search (PhraseQuery):
> >
> >   title:("teste de texto")
> >     "parsedquery": "PhraseQuery(title:\"teste ? texto\")"
> >   title:("teste de texto*")
> >     "parsedquery": "PhraseQuery(title:\"teste ? texto*\")"
> >
> > No results are returned.
> >
> > I have read about possible solutions to this, but none of them seems to
> > work:
> >   MultitermQueryAnalysis
> >   Complex Phrase Query Parser
> >
> > And I just can't understand why the query with the wildcard in the end: "*"
> > does not work, no results are returned.
> >
> > Some comments:
> > - I don't have control over what is entered in the search, I would like it
> >   to work like a "file listing", like a "glob";
> > - Today I can't change my tokenizer to: "StandardTokenizerFactory" (that in
> >   this case would work), because I need to search for e-mails, words with
> >   colon, for example;
> > - I tried the: "KeywordTokenizer", but I have the same behavior as above;
> > - I read about: "ShingleFilterFactory", but my index would be huge, because
> >   I need to index full texts (with more than 30000 chars);
> > - One person in stackoverflow pointed me to the documentation where it says
> >   it is not possible to use a wildcard in a phrase query using the standard
> >   query parser.
> >   I tried to use the complexphrase: {!complexphrase}title:"teste de texto*",
> >   but no results still. Am I doing something wrong? Is there
> >   anything wrong with my schema analysis?
> > - I could make it work using: "KeywordTokenizerFactory", but it only works
> >   with "RegexpQuery": title:(/.*teste de texto.*/). Do I have other options?
> >
> > Could you please help me understand what happens, if there is a way to make
> > a PhraseQuery with a wildcard work and what are my options?
> >
> > Please, let me know if you need further information and thanks a lot for
> > your attention and help!
> > *Felipe*.
> >
> > PS: I have added the same question to stackoverflow:
> > http://stackoverflow.com/questions/38061980/solr-phrasequery-with-wildcard
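
A footnote to the thread, offered as a rough Python sketch and not as anything Solr itself runs. It approximates the index-time chain quoted above (split on whitespace, optional stopword removal, lowercase, accent folding) to show why "texto;" stays a single token, so the exact phrase "teste de texto" misses while the trailing-wildcard form can match. The function names, the three-word stopword set, and the simplified phrase matcher are assumptions made for illustration only; the reversed-wildcard tokens and position increments are ignored.

```python
import unicodedata

# Hypothetical stand-in for lang/stopwords_pt.txt: only the words this example needs.
STOPWORDS_PT = frozenset({"de", "em", "será"})


def fold_ascii(token):
    """Rough approximation of ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, stopwords=frozenset()):
    """Split on whitespace only, drop stopwords case-insensitively, lowercase, fold."""
    kept = [t for t in text.split() if t.lower() not in stopwords]
    return [fold_ascii(t.lower()) for t in kept]


def phrase_matches(doc_tokens, query_tokens, prefix_last=False):
    """True if query_tokens occur consecutively in doc_tokens.

    With prefix_last=True the final query token is treated as a prefix,
    loosely imitating a trailing "*" inside a phrase.
    """
    n = len(query_tokens)
    if n == 0:
        return False
    for start in range(len(doc_tokens) - n + 1):
        window = doc_tokens[start:start + n]
        if window[:-1] != query_tokens[:-1]:
            continue
        last, wanted = window[-1], query_tokens[-1]
        if prefix_last:
            matched = last.startswith(wanted)
        else:
            matched = last == wanted
        if matched:
            return True
    return False


TEXT = "Teste de texto; Será quebrado em espaços em branco!"
doc = analyze(TEXT)  # chain without the stop filter
print(doc)
print(phrase_matches(doc, ["teste", "de", "texto"]))                    # exact phrase
print(phrase_matches(doc, ["teste", "de", "texto"], prefix_last=True))  # like "teste de texto*"
```

Passing `stopwords=STOPWORDS_PT` to `analyze` imitates the earlier field setup, where "de" never reaches the index. The matcher above is too simple to model how Solr handles the resulting position gap, so it only illustrates the punctuation issue.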