Solr PhraseQuery With Wildcard

Felipe Vinturini Mon, 27 Jun 2016 14:22:38 -0700

Hi *all*!

First time posting! I have been struggling with a PhraseQuery that contains
a wildcard in Solr v4.10.2.


My field definition is below:
<!-- Search field -->
<field name="title" type="text_pt_en" indexed="true" stored="true" />
<!-- Field definition -->
<fieldType name="text_pt_en" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory" />

    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_pt.txt" format="snowball"
            enablePositionIncrements="true" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- <tokenizer class="solr.KeywordTokenizerFactory" /> -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
    <filter class="solr.ReversedWildcardFilterFactory" />
  </analyzer>

  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_pt.txt" format="snowball"
            enablePositionIncrements="true" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- <tokenizer class="solr.KeywordTokenizerFactory" /> -->
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
  </analyzer>
</fieldType>

Suppose the following value (in Portuguese) is added to the index for the
field above:
Teste de texto; Será quebrado em espaços em branco!

According to the Solr "Analysis" page, the analyzer chain produces these
indexed terms:
etset teste ;otxet texto; odarbeuq quebrado socapse espacos !ocnarb branco!

Today, I can search, for example:
title:teste
title:(teste texto)
title:(teste de texto)
title:("teste de texto;") // (PhraseQuery) matches because of the ";" at the
end of the string
But if I try the following PhraseQuery searches:
title:("teste de texto")
"parsedquery": "PhraseQuery(title:\"teste ? texto\")"
title:("teste de texto*")
"parsedquery": "PhraseQuery(title:\"teste ? texto*\")"
No results are returned.
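
To make the mismatch concrete, here is a minimal Python sketch (not Solr code) that roughly mimics the chain above: whitespace tokenization, stopword removal that leaves position gaps, lowercasing and accent folding. The names `STOPWORDS`, `fold`, `analyze` and `phrase_matches` are hypothetical, the stopword set is a tiny stand-in for lang/stopwords_pt.txt, and the reversed tokens are ignored. Under those assumptions, the phrase seems to fail because the indexed term is "texto;" rather than "texto", and a "*" inside the quotes is kept as a literal character of the term:

```python
import unicodedata

# Assumption: a tiny stand-in for lang/stopwords_pt.txt (the real list is much longer).
STOPWORDS = {"de", "em", "será"}

def fold(token):
    """Lowercase and strip accents, roughly like LowerCaseFilter + ASCIIFoldingFilter."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text):
    """Return (position, term) pairs; removed stopwords leave gaps in the positions."""
    terms = []
    for position, raw in enumerate(text.split()):
        # Case-insensitive stopword check first, mirroring ignoreCase="true".
        if raw.lower() in STOPWORDS:
            continue
        terms.append((position, fold(raw)))
    return terms

def phrase_matches(indexed, query):
    """True if every query term occurs in the index at the same relative position."""
    positions = {}
    for position, term in indexed:
        positions.setdefault(term, set()).add(position)
    first_pos, first_term = query[0]
    for start in positions.get(first_term, ()):
        if all(start + (pos - first_pos) in positions.get(term, ())
               for pos, term in query):
            return True
    return False

indexed = analyze("Teste de texto; Será quebrado em espaços em branco!")
print(indexed)
print(phrase_matches(indexed, analyze("teste de texto;")))  # True
print(phrase_matches(indexed, analyze("teste de texto")))   # False
print(phrase_matches(indexed, analyze("teste de texto*")))  # False
```

The position gap left by "de" corresponds to the "?" shown in the parsed query.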

I have read about possible solutions to this, but none of them seems to
work:
MultitermQueryAnalysis
Complex Phrase Query Parser

I also can't understand why the query with a trailing wildcard ("*") does
not work: no results are returned.
Some comments:
- I don't have control over what is entered in the search; I would like it
to work like a "file listing", like a "glob";
- Today I can't change my tokenizer to "StandardTokenizerFactory" (which
would work in this case), because I need to search for e-mail addresses and
words containing colons, for example;
- I tried the "KeywordTokenizer", but I get the same behavior as above;
- I read about "ShingleFilterFactory", but my index would be huge, because
I need to index full texts (more than 30000 chars);
- One person on Stack Overflow pointed me to the documentation, which says
it is not possible to use a wildcard in a phrase query with the standard
query parser.
I tried the complexphrase parser: {!complexphrase}title:"teste de texto*",
but still got no results. Am I doing something wrong? Is there
anything wrong with my schema analysis?
- I could make it work using "KeywordTokenizerFactory", but only
with a "RegexpQuery": title:(/.*teste de texto.*/). Do I have other options?
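
As a rough illustration of the last point, the sketch below (plain Python, not Solr) treats the whole field value as a single lowercased, accent-folded term, which is approximately what a keyword tokenizer followed by the lowercase and ASCII-folding filters would produce. The helper names `keyword_term` and `regexp_matches` are hypothetical, and Python's `re.fullmatch` is only an approximation of a Lucene regexp, which has to match the entire term. That would explain why the leading and trailing ".*" are needed:

```python
import re
import unicodedata

def keyword_term(text):
    """Whole value as one lowercased, accent-stripped term (keyword-tokenizer-like)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def regexp_matches(term, pattern):
    """Approximate a regexp query: the pattern must cover the entire term."""
    return re.fullmatch(pattern, term) is not None

term = keyword_term("Teste de texto; Será quebrado em espaços em branco!")
print(term)
print(regexp_matches(term, r".*teste de texto.*"))  # True
print(regexp_matches(term, r"teste de texto"))      # False
```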

Could you please help me understand what is happening, whether there is a
way to make a PhraseQuery with a wildcard work, and what my options are?

Please let me know if you need further information, and thanks a lot for
your attention and help!
*Felipe*.

PS: I have added the same question to Stack Overflow:
http://stackoverflow.com/questions/38061980/solr-phrasequery-with-wildcard
