Re: Solr PhraseQuery With Wildcard

Erick Erickson Tue, 28 Jun 2016 07:59:48 -0700

There certainly is a lot to learn!

Right, the only problem I have with your analysis chain is that
the WhitespaceTokenizer doesn't strip punctuation so you'll
have terms like "texto." (note the period).


Something like PatternReplaceFilterFactory would help here.
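
A minimal sketch of what that could look like, assuming the filter is placed
after the tokenizer and before ReversedWildcardFilterFactory. The regular
expression is illustrative only and has not been tested against this schema:
it strips punctuation at the start or end of each token and leaves interior
characters (such as the "@" and "." in an e-mail address) alone. Check the
result on the admin/analysis page before relying on it.

```xml
<!-- Hypothetical addition to the analyzer chain: strip punctuation only at
     the start or end of each token, keeping interior characters intact. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^\p{Punct}+|\p{Punct}+$"
        replacement=""
        replace="all" />
```

With a filter like this, a term such as "texto;" should be indexed as "texto".
A token made up only of punctuation would be reduced to an empty string, so
that case is worth checking as well.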

Best,
Erick

On Tue, Jun 28, 2016 at 6:30 AM, Felipe Vinturini
<felipe.vintur...@gmail.com> wrote:
> Hi Erick,
>
> Thanks for your comments! In fact, I started with Solr one month ago, so I
> am still learning! =)
>
> I understand the differences between the Solr tokenizers, but there are so
> many options that it takes some time to find the one that fits our needs.
>
> I found a solution to my problem with the configuration below:
> <analyzer type="index">
>   <charFilter class="solr.HTMLStripCharFilterFactory" />
>   <tokenizer class="solr.WhitespaceTokenizerFactory" />
>   <filter class="solr.LowerCaseFilterFactory"/>
>   <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
>   <filter class="solr.ReversedWildcardFilterFactory" />
> </analyzer>
> And a search using the Complex Phrase Query Parser, like the one below, now
> returns the desired document:
> {!complexphrase df=title}"teste de texto*"
>
> I think that the problem with my last field setup was the
> StopFilterFactory, as the Complex Phrase Query Parser documentation states:
> "It is recommended not to use stopword elimination with this query parser."
> [1]
>
> I've done some tests and, so far, this setup fits my needs (queries).
>
> As I commented, I am new to Solr, so I would like your/Solr community input
> to know if there is a better way/other way to achieve the same or if you
> see any problem with the setup above?!?
>
> Thanks a lot for your help!
>
> Regards,
> Felipe.
>
> [1]
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
>
>
>
>
> On Tue, Jun 28, 2016 at 2:22 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> OK, you really have to get familiar with the
>> admin/analysis page. Whitespace tokenizer
>> is really simple, it breaks up on whitespace. So
>> punctuation is kept in the index. Which is very
>> rarely what you want. Use something like
>> StandardTokenizer or maybe a filter that
>> removes all non-alphanumeric characters
>> (see one of the regex filters).
>>
>> ComplexPhrase should do what you want, but if
>> (and only if) you've indexed stuff appropriately. So
>> I'd concentrate on getting the indexing to do
>> what you need, then worry about querying.
>>
>> KeywordTokenizer is pretty much inappropriate for
>> any kind of free-text search, it doesn't break the input
>> up at _all_.
>>
>> And you need to completely re-index all your docs when
>> you change the schema. There are a _few_ cases
>> where that's not necessary, but until you're very
>> familiar with the nuances it's much safer just
>> to re-index from scratch. It _will_ work to:
>>   - shut down Solr
>>   - rm -r the_data_directory
>>   - restart Solr
>>
>> That'll wipe everything out. If you're in SolrCloud
>> I'd recommend deleting and recreating the collection
>> on schema change.
>>
>> Best,
>> Erick
>>
>> On Mon, Jun 27, 2016 at 2:21 PM, Felipe Vinturini
>> <felipe.vintur...@gmail.com> wrote:
>> > Hi *all*!
>> >
>> > First time posting! I have been struggling with Solr v4.10.2 with a
>> > PhraseQuery with wildcard!
>> >
>> > My field definition is below:
>> > <!-- Search field -->
>> > <field name="title" type="text_pt_en" indexed="true" stored="true" />
>> > <!-- Field definition -->
>> > <fieldType name="text_pt_en" class="solr.TextField"
>> >     positionIncrementGap="100">
>> >   <analyzer type="index">
>> >     <charFilter class="solr.HTMLStripCharFilterFactory" />
>> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >         words="lang/stopwords_pt.txt" format="snowball"
>> >         enablePositionIncrements="true" />
>> >     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>> >     <!-- <tokenizer class="solr.KeywordTokenizerFactory" /> -->
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
>> >     <filter class="solr.ReversedWildcardFilterFactory" />
>> >   </analyzer>
>> >
>> >   <analyzer type="query">
>> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >         words="lang/stopwords_pt.txt" format="snowball"
>> >         enablePositionIncrements="true" />
>> >     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>> >     <!-- <tokenizer class="solr.KeywordTokenizerFactory" /> -->
>> >     <filter class="solr.LowerCaseFilterFactory" />
>> >     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
>> >   </analyzer>
>> > </fieldType>
>> >
>> > Let's suppose I have the following value added to the index of the field
>> > above (Portuguese):
>> > Teste de texto; Será quebrado em espaços em branco!
>> >
>> > And the values added to the index, based on the analyzer chain will be
>> > (from Solr "Analysis"):
>> > etset teste ;otxet texto; odarbeuq quebrado socapse espacos !ocnarb branco!
>> > Today, I can search, for example:
>> > title:teste
>> > title:(teste texto)
>> > title:(teste de texto)
>> > title:("teste de texto;") // (PhraseQuery) matches because of the ";" at
>> > the end of the string
>> > But, if I try to search (PhraseQuery):
>> > title:("teste de texto")
>> > "parsedquery": "PhraseQuery(title:\"teste ? texto\")"
>> > title:("teste de texto*")
>> > "parsedquery": "PhraseQuery(title:\"teste ? texto*\")"
>> > No results are returned.
>> >
>> > I have read about possible solutions to this, but none of them seems to
>> > work:
>> > MultitermQueryAnalysis
>> > Complex Phrase Query Parser
>> >
>> > And I just can't understand why the query with the wildcard ("*") at the
>> > end does not work; no results are returned.
>> > Some comments:
>> > - I don't have control over what is entered in the search; I would like it
>> > to work like a "file listing", like a "glob";
>> > - Today I can't change my tokenizer to "StandardTokenizerFactory" (which in
>> > this case would work), because I need to search for e-mails or words with
>> > a colon, for example;
>> > - I tried the "KeywordTokenizer", but I get the same behavior as above;
>> > - I read about "ShingleFilterFactory", but my index would be huge, because
>> > I need to index full texts (with more than 30000 chars);
>> > - One person on Stack Overflow pointed me to the documentation, which says
>> > it is not possible to use a wildcard in a phrase query using the standard
>> > query parser.
>> > I tried to use complexphrase: {!complexphrase}title:"teste de texto*",
>> > but still got no results. Am I doing something wrong? Is there
>> > anything wrong with my schema analysis?
>> > - I could make it work using "KeywordTokenizerFactory", but it only works
>> > with "RegexpQuery": title:(/.*teste de texto.*/). Do I have other options?
>> >
>> > Could you please help me understand what happens, whether there is a way to
>> > make a PhraseQuery with a wildcard work, and what my options are?
>> >
>> > Please, let me know if you need further information and thanks a lot for
>> > your attention and help!
>> > *Felipe*.
>> >
>> > PS: I have added the same question to stackoverflow:
>> > http://stackoverflow.com/questions/38061980/solr-phrasequery-with-wildcard
>>
