Re: Solr PhraseQuery With Wildcard

Erick Erickson Mon, 27 Jun 2016 22:23:30 -0700

OK, you really need to get familiar with the
admin/analysis page. WhitespaceTokenizer is very
simple: it splits on whitespace and nothing else, so
punctuation stays attached to the tokens in the index
("texto;" rather than "texto"), which is very rarely
what you want. Use something like StandardTokenizer,
or add a filter that removes non-alphanumeric
characters (see one of the regex filters).
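
As a rough sketch of the regex-filter route (untested against your
version; the field type name "text_ws_clean" and the pattern are
assumptions for illustration only), something like the following keeps
splitting on whitespace but trims punctuation from the edges of each
token. The intent is that "texto;" is indexed as "texto", while e-mail
addresses and words with a colon in the middle are left alone:

```xml
<!-- Sketch only: "text_ws_clean" is a made-up name and the regex is an
     assumption; verify the resulting tokens on the admin/analysis page. -->
<fieldType name="text_ws_clean" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Remove ASCII punctuation at the start or end of a token only,
         so punctuation inside a token (user@example.com, a:b) survives. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^\p{Punct}+|\p{Punct}+$" replacement="" replace="all"/>
    <!-- Drop tokens that were pure punctuation and are now empty. -->
    <filter class="solr.LengthFilterFactory" min="1" max="255"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Run your sample text through admin/analysis with this type before
relying on it, and re-index after any schema change.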


ComplexPhrase should do what you want, but only if
you've indexed the content appropriately. So
I'd concentrate on getting the indexing to do
what you need, then worry about querying.

KeywordTokenizer is pretty much inappropriate for
any kind of free-text search: it doesn't break the input
up at _all_.

And you need to completely re-index all your docs when
you change the schema. There are a _few_ cases
where that's not necessary, but until you're very
familiar with the nuances it's much safer just
to re-index from scratch. It _will_ work to:
  1. shut down Solr
  2. rm -r the_data_directory
  3. restart Solr

That'll wipe everything out. If you're on SolrCloud,
I'd recommend deleting and recreating the collection
on schema change.

Best,
Erick

On Mon, Jun 27, 2016 at 2:21 PM, Felipe Vinturini
<felipe.vintur...@gmail.com> wrote:
> Hi *all*!
>
> First time posting! I have been struggling with Solr v4.10.2 with a
> PhraseQuery with wildcard!
>
> My field definition is below:
> <!-- Search field -->
> <field name="title" type="text_pt_en" indexed="true" stored="true" />
> <!-- Field definition -->
> <fieldType name="text_pt_en" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer type="index">
> <charFilter class="solr.HTMLStripCharFilterFactory" />
>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_pt.txt" format="snowball"
> enablePositionIncrements="true" />
> <tokenizer class="solr.WhitespaceTokenizerFactory" />
> <!-- <tokenizer class="solr.KeywordTokenizerFactory" /> -->
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
> <filter class="solr.ReversedWildcardFilterFactory" />
> </analyzer>
>
> <analyzer type="query">
> <charFilter class="solr.HTMLStripCharFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_pt.txt" format="snowball"
> enablePositionIncrements="true" />
> <tokenizer class="solr.WhitespaceTokenizerFactory" />
> <!-- <tokenizer class="solr.KeywordTokenizerFactory" /> -->
> <filter class="solr.LowerCaseFilterFactory" />
> <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
> </analyzer>
> </fieldType>
>
> Let's suppose I have the following value added to the index of the field
> above (Portuguese):
> Teste de texto; Será quebrado em espaços em branco!
>
> And the values added to the index, based on the analyzer chain will be
> (from Solr "Analysis"):
> etset teste ;otxet texto; odarbeuq quebrado socapse espacos !ocnarb branco!
> Today, I can search, for example:
> title:teste
> title:(teste texto)
> title:(teste de texto)
> title:("teste de texto;") // (PhraseQuery) matches because of ";" in the
> end of the string
> But, if I try to search (PhraseQuery):
> title:("teste de texto")
> "parsedquery": "PhraseQuery(title:\"teste ? texto\")"
> title:("teste de texto*")
> "parsedquery": "PhraseQuery(title:\"teste ? texto*\")"
> No results are returned.
>
> I have read about possible solutions to this, but none of them seems to
> work:
> MultitermQueryAnalysis
> Complex Phrase Query Parser
>
> And I just can't understand why the query with the wildcard "*" at the end
> does not work: no results are returned.
> Some comments:
> - I don't have control over what is entered in the search, I would like it
> to work like a "file listing", like a "glob";
> - Today I can't change my tokenizer to: "StandardTokenizerFactory" (that in
> this case would work), because I need to search for e-mails, words with
> colon, for example;
> - I tried the: "KeywordTokenizer", but I have the same behavior as above;
> - I read about: "ShingleFilterFactory", but my index would be huge, because
> I need to index full texts (with more than 30000 chars);
> - One person on Stack Overflow pointed me to the documentation, which says
> it is not possible to use a wildcard in a phrase query with the standard
> query parser.
> I tried the complexphrase parser: {!complexphrase}title:"teste de
> texto*", but still got no results. Am I doing something wrong? Is there
> anything wrong with my schema analysis?
> - I could make it work using "KeywordTokenizerFactory", but it only works
> with a "RegexpQuery": title:(/.*teste de texto.*/). Do I have other options?
>
> Could you please help me understand what happens, if there is a way to make
> a PhraseQuery with a wildcard work and what are my options?
>
> Please, let me know if you need further information and thanks a lot for
> your attention and help!
> *Felipe*.
>
> PS: I have posted the same question on Stack Overflow:
> http://stackoverflow.com/questions/38061980/solr-phrasequery-with-wildcard
