Re: DisMaxQParserPlugin and Tokenization

Jan Kurella Thu, 25 Nov 2010 08:03:44 -0800

OK, I think I found it: the query parser used in the background "chunks" the input by whitespace (and {}). Each of these chunks is then treated as a "phrase". This is completely useless for languages that are not tokenized on whitespace.
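
To illustrate the problem, here is a minimal sketch in plain Java. It uses no Lucene or Solr classes; the class name WhitespaceChunks and the sample strings are made up for illustration, and the real parser also looks at quotes and operators, which this ignores. It only shows that a query written without spaces ends up as one single chunk:

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceChunks {
    // Hypothetical helper: splits the raw query on runs of whitespace,
    // roughly like the chunking described above (operators are ignored here).
    static List<String> chunk(String query) {
        return Arrays.asList(query.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Three whitespace-separated words give three chunks.
        System.out.println(chunk("solr dismax parser").size()); // prints 3

        // A Japanese phrase ("weather in Tokyo") written without spaces,
        // given as Unicode escapes: the whole phrase stays one chunk.
        String japanese = "\u6771\u4eac\u90fd\u306e\u5929\u6c17";
        System.out.println(chunk(japanese).size()); // prints 1
    }
}
```

So whatever analyzer is configured for the field only ever sees the whole unsplit phrase as one unit.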

So I started a simple DisMaxQueryParser. Can someone verify that this code produces a DisMaxQuery? (Theory taken from here: http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/)


{code}
            stream = analyzer.reusableTokenStream("all", input);
            TermAttribute oTermAtt = stream.addAttribute(TermAttribute.class);

            int clauses = 0;
            BooleanQuery result = new BooleanQuery();
            while (stream.incrementToken()) {
                // One DisjunctionMaxQuery per token, spanning all fields (tie breaker 0.1)
                DisjunctionMaxQuery clause = new DisjunctionMaxQuery(0.1f);
                String oTermText = oTermAtt.term();
                for (int iF = 0; iF < fields.length; ++iF) {
                    Query oQuery = new SpanTermQuery(new Term(fields[iF], oTermText));
                    clause.add(oQuery);
                }
                result.add(new BooleanClause(clause, Occur.SHOULD));
                // Count one optional clause per token, not per field:
                // mm refers to the SHOULD clauses of the BooleanQuery
                ++clauses;
            }

            result.setMinimumNumberShouldMatch((int) Math.ceil(0.75 * clauses)); // mm=75%

            return result;
{code}

Is this basically what the DisMaxQueryParser would do if it tokenized the full query without parsing for any of [+"{}]?
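
For reference, the two pieces of arithmetic behind the code above can be sketched in plain Java without Lucene. The class and method names are made up for illustration, and the scoring is a simplified model (best field score plus tie breaker times the remaining field scores) that ignores norms and query normalization:

```java
public class DisMaxScoreSketch {
    // Combines the per-field scores of one token the way a disjunction-max
    // clause does: the best field wins, every other matching field adds
    // only tieBreaker times its score.
    static float disMax(float[] fieldScores, float tieBreaker) {
        float max = 0f;
        float sum = 0f;
        for (float s : fieldScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    // Number of optional (SHOULD) clauses that have to match for a given
    // fraction of the tokens, rounded up as in the code above.
    static int minShouldMatch(int tokenCount, double fraction) {
        return (int) Math.ceil(fraction * tokenCount);
    }

    public static void main(String[] args) {
        // One token scored against three fields, tie breaker 0.1.
        System.out.println(disMax(new float[] {1.0f, 0.5f, 0.0f}, 0.1f));
        // Four tokens with mm=75%.
        System.out.println(minShouldMatch(4, 0.75)); // prints 3
    }
}
```

With four tokens and mm=75%, three of the four per-token clauses have to match, regardless of how many fields each clause covers.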


Jan


On 24.11.2010 09:20, ext jan.kure...@nokia.com wrote:

Sorry for the double post. Can someone point me to where the
original query given to the DisMaxHandler/QParser is split?

Jan

-----Original Message-----
From: Kurella Jan (Nokia-MS/Berlin)
Sent: Monday, 22 November 2010 14:49
To: solr-user@lucene.apache.org
Subject: DisMaxQParserPlugin and Tokenization

Hi,

Using the SearchHandler with the defType=dismax option enables the
DisMaxQParserPlugin. From my investigation, it seems to tokenize just by
whitespace.

However, looking at the code, I could not find the place where this behavior
is enforced. I only found that for each field the getFieldQuery() method is
called, which either throws an “unknownField” exception or returns the correct
analyzer, including tokenizer and filters, for the given field.

We want to use a fancier tokenizer/filter setup with the DisMaxQuery
stuff.

Where is the best place to hook in?

Jan
