Thanks, I tried your suggestion today:

1. Define a text_num fieldType

   <fieldType name="text_num" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.PatternTokenizerFactory"
                  pattern="\s*[0-9][0-9-\s]*[0-9]?\s*" group="0"/>
       <filter class="solr.TrimFilterFactory"/>
     </analyzer>
   </fieldType>

2. Define a new text field to capture the numerical data and link it to the
   text field via a copyField

   <field name="text_il" type="text_num" indexed="true" stored="false"
          multiValued="true"/>
   <copyField source="text" dest="text_il" maxChars="30000"/>

3. Restart the server and reindex my test data
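For reference, these are the two queries I then ran against the new field;
their full responses are pasted under "Results" below. I've spelled them out
here as request parameters (the host and core name are placeholders for my
local setup, and URL-encoding of the regex is omitted for readability):

   Default parser (no defType), regex scoped to text_il:

   http://localhost:8983/solr/collection1/select
       ?q=text_il:/.*[7-8].*/
       &wt=json&indent=true

   edismax parser, same regex with qf/pf pointing at text_il:

   http://localhost:8983/solr/collection1/select
       ?q=/.*[7-8].*/
       &defType=edismax&qf=text_il&pf=text_il
       &stopwords=true&lowercaseOperators=true
       &wt=json&indent=true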
As you can see from a simple analysis test on text copied from my test
document (see screenshot), the field and the regex work as expected:
http://i.imgur.com/o4y2Q9u.png

However, when I run the same query (against the text_il field alone, not even
trying to combine queries across fields) using the edismax parser, I don't get
any hits. Also, when I searched the forums and JIRA, I came across these two:

https://issues.apache.org/jira/browse/SOLR-6009
http://lucene.472066.n3.nabble.com/Regex-with-local-params-is-not-working-tt4138257.html

So my questions are:

1. Do the dismax / edismax parsers even support regex syntax?
2. Am I doing something wrong?

Results

A regex query using the default parser works:

{
  "responseHeader": {
    "status": 0,
    "QTime": 4,
    "params": {
      "indent": "true",
      "q": "text_il:/.*[7-8].*/",
      "_": "1403729219835",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "1",
        "content_type": "parentDocument",
        "_version_": 1471911225402065000
      }
    ]
  }
}

Whereas the edismax parser doesn't return any hits. I used this link as a
guide to forming my queries:
http://lucidworks.lucidimagination.com/display/solr/The+Extended+DisMax+Query+Parser

{
  "responseHeader": {
    "status": 0,
    "QTime": 3,
    "params": {
      "lowercaseOperators": "true",
      "pf": "text_il",
      "indent": "true",
      "q": "/.*[7-8].*/",
      "qf": "text_il",
      "_": "1403729594057",
      "stopwords": "true",
      "wt": "json",
      "defType": "edismax"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}

Debug-enabled query at https://gist.github.com/anonymous/625e7669918deba4a071

Thanks

On Tue, Jun 24, 2014 at 7:35 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> What about copyField'ing the content into the second field where you
> apply the alternative processing. Then eDismax searching both. Don't
> have to store the other field, just index it.
>
> Regards,
>    Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Wed, Jun 25, 2014 at 5:55 AM, Vinay B, <vybe3...@gmail.com> wrote:
> > Sorry, previous post got sent prematurely.
> >
> > Here is the complete post:
> >
> > This is easy if I only need to define a custom field to identify the
> > desired patterns (numbers, in my case). For example, I could define a
> > field thus:
> >
> > <!-- A text field that identifies numerical entities -->
> > <fieldType name="text_num" class="solr.TextField">
> >   <analyzer>
> >     <tokenizer class="solr.PatternTokenizerFactory"
> >                pattern="\s*[0-9][0-9-]*[0-9]?\s*" group="0"/>
> >   </analyzer>
> > </fieldType>
> >
> > Input:
> > hello, world bye 123-45 abcd 5555 sdfssdf --- aaa
> >
> > Output:
> > 123-45 , 5555
> >
> > However, I also want to retain the behavior of the default text_general
> > field, that is, recognize the usual text tokens (hello, world, bye, etc.).
> > What is the best way to achieve this?
> >
> > I've looked at PatternCaptureGroupFilterFactory (
> > http://lucene.apache.org/core/4_7_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.html
> > ) but I suspect that it too is subject to the behavior of the prior
> > tokenizer (which for text_general is StandardTokenizerFactory).
> >
> > Thanks