Thanks, I tried your suggestion today:

1. Define a text_num fieldType

   <fieldType name="text_num" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.PatternTokenizerFactory"
                  pattern="\s*[0-9][0-9-\s]*[0-9]?\s*" group="0"/>
       <filter class="solr.TrimFilterFactory"/>
     </analyzer>
   </fieldType>

2. Define a new text field to capture the numerical data and link it to the
   text field via a copyField

   <field name="text_il" type="text_num" indexed="true" stored="false"
          multiValued="true"/>
   <copyField source="text" dest="text_il" maxChars="30000"/>

3. Restart the server and reindex my test data
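For reference, these are the two queries I then ran against the new field;
their full responses are pasted under "Results" below. I've spelled them out
here as request parameters (the host and core name are placeholders for my
local setup, and URL-encoding of the regex is omitted for readability):

   Default parser (no defType), regex scoped to text_il:

   http://localhost:8983/solr/collection1/select
       ?q=text_il:/.*[7-8].*/
       &wt=json&indent=true

   edismax parser, same regex with qf/pf pointing at text_il:

   http://localhost:8983/solr/collection1/select
       ?q=/.*[7-8].*/
       &defType=edismax&qf=text_il&pf=text_il
       &stopwords=true&lowercaseOperators=true
       &wt=json&indent=true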
As you can see from a simple analysis test on text copied from my test
document (see screenshot), the field and the regex work as expected:
http://i.imgur.com/o4y2Q9u.png

However, when I run the same query (against the text_il field alone, not even
trying to combine queries across fields) using the edismax parser, I don't get
any hits. Also, when I searched the forums and JIRA, I came across these two:

https://issues.apache.org/jira/browse/SOLR-6009
http://lucene.472066.n3.nabble.com/Regex-with-local-params-is-not-working-tt4138257.html

So my questions are:

1. Do the dismax / edismax parsers even support regex syntax?
2. Am I doing something wrong?

Results

A regex query using the default parser works:

{
  "responseHeader": {
    "status": 0,
    "QTime": 4,
    "params": {
      "indent": "true",
      "q": "text_il:/.*[7-8].*/",
      "_": "1403729219835",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "1",
        "content_type": "parentDocument",
        "_version_": 1471911225402065000
      }
    ]
  }
}

Whereas the edismax parser doesn't return any hits. I used this link as a
guide to forming my queries:
http://lucidworks.lucidimagination.com/display/solr/The+Extended+DisMax+Query+Parser

{
  "responseHeader": {
    "status": 0,
    "QTime": 3,
    "params": {
      "lowercaseOperators": "true",
      "pf": "text_il",
      "indent": "true",
      "q": "/.*[7-8].*/",
      "qf": "text_il",
      "_": "1403729594057",
      "stopwords": "true",
      "wt": "json",
      "defType": "edismax"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}

Debug-enabled query at https://gist.github.com/anonymous/625e7669918deba4a071

Thanks

On Tue, Jun 24, 2014 at 7:35 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> What about copyField'ing the content into the second field where you
> apply the alternative processing. Then eDismax searching both. Don't
> have to store the other field, just index it.
>
> Regards,
>    Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Wed, Jun 25, 2014 at 5:55 AM, Vinay B, <vybe3...@gmail.com> wrote:
> > Sorry, previous post got sent prematurely.
> >
> > Here is the complete post:
> >
> > This is easy if I only need to define a custom field to identify the
> > desired patterns (numbers, in my case). For example, I could define a
> > field thus:
> >
> > <!-- A text field that identifies numerical entities -->
> > <fieldType name="text_num" class="solr.TextField">
> >   <analyzer>
> >     <tokenizer class="solr.PatternTokenizerFactory"
> >                pattern="\s*[0-9][0-9-]*[0-9]?\s*" group="0"/>
> >   </analyzer>
> > </fieldType>
> >
> > Input:
> > hello, world bye 123-45 abcd 5555 sdfssdf --- aaa
> >
> > Output:
> > 123-45 , 5555
> >
> > However, I also want to retain the behavior of the default text_general
> > field, that is, recognize the usual text tokens (hello, world, bye, etc.).
> > What is the best way to achieve this?
> >
> > I've looked at PatternCaptureGroupFilterFactory (
> > http://lucene.apache.org/core/4_7_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.html
> > ) but I suspect that it too is subject to the behavior of the prior
> > tokenizer (which for text_general is StandardTokenizerFactory).
> >
> > Thanks