Re: Solr PhraseQuery With Wildcard

Felipe Vinturini Tue, 28 Jun 2016 06:33:03 -0700

Hi Erick,

Thanks for your comments! In fact, I started with Solr one month ago, so I
am still learning! =)


I understand the differences between the Solr tokenizers, but there are so
many options that it takes some time to find the one that fits our needs.

I found a solution to my problem with the configuration below:
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory" />
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
  <filter class="solr.ReversedWildcardFilterFactory" />
</analyzer>
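To get a feel for the tokens this chain produces, here is a rough Python approximation of three of its steps (whitespace split, lowercasing, accent folding). It is only a sketch: the function names are made up for illustration, accent stripping via Unicode decomposition is a simplification of what ASCIIFoldingFilterFactory does, and the HTML stripping and reversed-wildcard steps are left out. The Solr admin "Analysis" page remains the authoritative view.

```python
import unicodedata


def fold_ascii(token):
    """Strip diacritics (a simplified stand-in for ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Split on whitespace, lowercase, then fold accents.

    Punctuation is NOT removed, so it stays attached to the token.
    """
    return [fold_ascii(tok.lower()) for tok in text.split()]


tokens = analyze("Teste de texto; Será quebrado em espaços em branco!")
print(tokens)
```

Note that the third token keeps its semicolon ("texto;"), which is why a trailing wildcard is needed to match it in a phrase.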
Searching with the Complex Phrase Query Parser, as below, now returns
the desired document:
{!complexphrase df=title}"teste de texto*"
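
For anyone sending this query over HTTP instead of the admin UI, the local-params prefix and the quotes need URL encoding. A minimal sketch using only the Python standard library follows; the host, port and core name ("mycore") are assumptions, as is the helper name, and the helper does not escape quotes embedded in the phrase.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical host and core name; adjust to your own installation.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def complexphrase_url(phrase, field="title", base=SOLR_SELECT):
    """Build a /select URL that runs `phrase` through the complexphrase parser."""
    q = '{!complexphrase df=%s}"%s"' % (field, phrase)
    return base + "?" + urlencode({"q": q, "wt": "json"})


url = complexphrase_url("teste de texto*")
print(url)

# Round trip: decoding the query string gives back the original q parameter.
print(parse_qs(urlsplit(url).query)["q"][0])
```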

I think that the problem with my last field setup was the
StopFilterFactory, as the Complex Phrase Query Parser documentation states:
"It is recommended not to use stopword elimination with this query parser."
[1]
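
One way to picture why stopword removal could interfere is the toy model below. It is not how Lucene implements phrase matching: the STOPWORDS set is a tiny illustrative subset rather than the real stopwords_pt.txt, all names are hypothetical, and it assumes (my assumption, not something stated in the documentation) that the stopword is dropped from the query phrase without leaving a gap, while the index keeps the position gap.

```python
# Tiny illustrative subset, not the real lang/stopwords_pt.txt.
STOPWORDS = frozenset({"de", "em"})


def index_positions(text, stopwords=frozenset()):
    """Map position -> token; removed stopwords leave a position gap."""
    positions = {}
    for pos, tok in enumerate(text.lower().split()):
        if tok not in stopwords:
            positions[pos] = tok
    return positions


def term_matches(token, term):
    """Exact match, or prefix match when the term ends with '*'."""
    if token is None:
        return False
    if term.endswith("*"):
        return token.startswith(term[:-1])
    return token == term


def phrase_matches(positions, terms):
    """True if the terms occur at strictly consecutive positions."""
    return any(
        all(term_matches(positions.get(start + i), term)
            for i, term in enumerate(terms))
        for start in positions
    )


text = "Teste de texto; Será quebrado em espaços em branco!"
no_stop = index_positions(text)
with_stop = index_positions(text, STOPWORDS)

print(phrase_matches(no_stop, ["teste", "de", "texto*"]))  # True
print(phrase_matches(with_stop, ["teste", "texto*"]))      # False
```

In the second case "teste" sits at position 0 and "texto;" at position 2, so a phrase that expects them to be adjacent finds nothing.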

I've done some tests and, so far, this setup fits my needs (queries).

As I mentioned, I am new to Solr, so I would appreciate input from you and
the Solr community: is there a better or alternative way to achieve the
same result, and do you see any problem with the setup above?

Thanks a lot for your help!

Regards,
Felipe.

[1]
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser




On Tue, Jun 28, 2016 at 2:22 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> OK, you really have to get familiar with the
> admin/analysis page. Whitespace tokenizer
> is really simple, it breaks up on whitespace. So
> punctuation is kept in the index. Which is very
> rarely what you want. Use something like
> StandardTokenizer or maybe a filter that
> removes all non-alpha-num characters (
> see one of the regex filters).
>
> ComplexPhrase should do what you want, but if
> (and only if) you've indexed stuff appropriately. So
> I'd concentrate on getting the indexing to do
> what you need, then worry about querying.
>
> KeywordTokenizer is pretty much inappropriate for
> any kind of free-text search, it doesn't break the input
> up at _all_.
>
> And you need to completely re-index all your docs when
> you change the schema. There are a _few_ cases
> where that's not necessary, but until you're very
> familiar with the nuances it's much safer just
> to re-index from scratch. It _will_ work to
> > shut down Solr
> > rm -r the_data_directory
> > restart solr
>
> That'll wipe everything out. If you're in Solr Cloud
> I'd recommend deleting and recreating the collection
> on schema change.
>
> Best,
> Erick
>
> On Mon, Jun 27, 2016 at 2:21 PM, Felipe Vinturini
> <felipe.vintur...@gmail.com> wrote:
> > Hi *all*!
> >
> > First time posting! I have been struggling with Solr v4.10.2 with a
> > PhraseQuery with wildcard!
> >
> > My field definition is below:
> > <!-- Search field -->
> > <field name="title" type="text_pt_en" indexed="true" stored="true" />
> > <!-- Field definition -->
> > <fieldType name="text_pt_en" class="solr.TextField"
> > positionIncrementGap="100">
> > <analyzer type="index">
> > <charFilter class="solr.HTMLStripCharFilterFactory" />
> >
> > <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="lang/stopwords_pt.txt" format="snowball"
> > enablePositionIncrements="true" />
> > <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > <!-- <tokenizer class="solr.KeywordTokenizerFactory" /> -->
> > <filter class="solr.LowerCaseFilterFactory"/>
> > <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
> > <filter class="solr.ReversedWildcardFilterFactory" />
> > </analyzer>
> >
> > <analyzer type="query">
> > <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="lang/stopwords_pt.txt" format="snowball"
> > enablePositionIncrements="true" />
> > <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > <!-- <tokenizer class="solr.KeywordTokenizerFactory" /> -->
> > <filter class="solr.LowerCaseFilterFactory" />
> > <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
> > </analyzer>
> > </fieldType>
> >
> > Let's suppose I have the following value added to the index of the field
> > above (Portuguese):
> > Teste de texto; Será quebrado em espaços em branco!
> >
> > And the values added to the index, based on the analyzer chain will be
> > (from Solr "Analysis"):
> > etset teste ;otxet texto; odarbeuq quebrado socapse espacos !ocnarb branco!
> > Today, I can search, for example:
> > title:teste
> > title:(teste texto)
> > title:(teste de texto)
> > title:("teste de texto;") // (PhraseQuery) matches because of ";" in the
> > end of the string
> > But, if I try to search (PhraseQuery):
> > title:("teste de texto")
> > "parsedquery": "PhraseQuery(title:\"teste ? texto\")"
> > title:("teste de texto*")
> > "parsedquery": "PhraseQuery(title:\"teste ? texto*\")"
> > No results are returned.
> >
> > I have read about possible solutions to this, but none of them seems to
> > work:
> > MultitermQueryAnalysis
> > Complex Phrase Query Parser
> >
> > And I just can't understand why the query with the wildcard in the end: "*"
> > does not work, no results are returned.
> > Some comments:
> > - I don't have control over what is entered in the search, I would like it
> > to work like a "file listing", like a "glob";
> > - Today I can't change my tokenizer to: "StandardTokenizerFactory" (that in
> > this case would work), because I need to search for e-mails, words with
> > colon, for example;
> > - I tried the: "KeywordTokenizer", but I have the same behavior as above;
> > - I read about: "ShingleFilterFactory", but my index would be huge, because
> > I need to index full texts (with more than 30000 chars);
> > - One person in stackoverflow pointed me to the documentation where it says
> > it is not possible to use a wildcard in a phrase query using the standard
> > query parser.
> > I tried to use the complexphrase:
> > {!complexphrase}title:"teste de texto*"
> > but no results still. Am I doing something wrong? Is there
> > anything wrong with my schema analysis?
> > - I could make it work using: "KeywordTokenizerFactory", but it only works
> > with "RegexpQuery": title:(/.*teste de texto.*/). Do I have other options?
> >
> > Could you please help me understand what happens, if there is a way to make
> > a PhraseQuery with a wildcard work and what are my options?
> >
> > Please, let me know if you need further information and thanks a lot for
> > your attention and help!
> > *Felipe*.
> >
> > PS: I have added the same question to stackoverflow:
> > http://stackoverflow.com/questions/38061980/solr-phrasequery-with-wildcard
>
