RE: How to copy and extract information from a multi-line text before the tokenizer

Jaeger, Jay - DOT Thu, 25 Aug 2011 06:37:05 -0700

"A programmer had a problem. He tried to solve it with regular expressions.
Now he has two problems" :).


Awwww.  That just isn't fair...  8^)

(I can't think of very many things that have allowed me to perform more magic 
over my career than regular expressions, starting with SNOBOL.  Uh oh:  I just 
dated myself.  8^) ).

JRJ

-----Original Message-----
From: Erick Erickson [mailto:[email protected]] 
Sent: Thursday, August 25, 2011 7:54 AM
To: [email protected]
Subject: Re: How to copy and extract information from a multi-line text before 
the tokenizer

You could consider writing your own UpdateHandler. It allows you to
get access to the underlying SolrInputDocument, and freely modify
the fields before it even gets to the analysis chain defined in your
schema. So you can get your "AllData" out of the doc, split it apart as
many ways as you want and put fields back in the SolrInputDocument.
You can even remove the AllData field if you want and not even define
it in your schema....
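
As a rough sketch of the "split it apart" step, the helper below pulls the
Author line out of the multi-line AllData text in plain Java. The class and
method names (AuthorExtractor, extractAuthor) are hypothetical and not part of
Solr. It assumes each entry sits on its own line as "Author: value", and it
tolerates both \n and \r\n line endings. Wiring it into Solr (for example from
a custom update processor that reads AllData from the SolrInputDocument and
sets OnlyAuthor) is left out here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Solr class: extracts the "Author:" line
// from a multi-line text block such as the AllData field.
final class AuthorExtractor {

    // MULTILINE lets ^ and $ match at every line boundary, so the Author
    // line can sit anywhere in the text. [^\r\n]* keeps the capture on a
    // single line and works with both \n and \r\n line endings.
    private static final Pattern AUTHOR_LINE =
            Pattern.compile("^Author:[ \\t]*([^\\r\\n]*)$", Pattern.MULTILINE);

    /** Returns the trimmed author value, or null if no Author line exists. */
    static String extractAuthor(String allData) {
        if (allData == null) {
            return null;
        }
        Matcher m = AUTHOR_LINE.matcher(allData);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String allData = "Title: exampleTitle of the book\n"
                + "Author: Example Author\n"
                + "Date: 01.01.1980";
        System.out.println(extractAuthor(allData));
    }
}
```

Doing the extraction before the document reaches the index means the stored
value of OnlyAuthor is the author text itself, which is what faceting and
display would normally want.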

"A programmer had a problem. He tried to solve it with regular expressions.
Now he has two problems" :).

Best
Erick

On Tue, Aug 23, 2011 at 6:28 AM, Michael Kliewe <[email protected]> wrote:
> Hello all,
>
> I have a custom schema which has a few fields, and I would like to create a 
> new field in the schema that only has one special line of another field 
> indexed. Let's use this example:
>
> field AllData (TextField) has for example this data:
> Title: exampleTitle of the book
> Author: Example Author
> Date: 01.01.1980
>
> Each line is separated by a line break.
> I now need a new field named OnlyAuthor which only has the Author information 
> in it, so I can search and facet for specific Author information. I added 
> this to my schema:
>
> <fieldType name="authorField" class="solr.TextField">
>  <analyzer type="index">
>    <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="^.*\nAuthor: (.*?)\n.*$" replacement="$1" replace="all" />
>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.TrimFilterFactory"/>
>  </analyzer>
>  <analyzer type="query">
>    <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="^.*\nAuthor: (.*?)\n.*$" replacement="$1" replace="all" />
>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.TrimFilterFactory"/>
>  </analyzer>
> </fieldType>
>
> <field name="OnlyAuthor" type="authorField" indexed="true" stored="true" />
>
> <copyField source="AllData" dest="OnlyAuthor"/>
>
>
> But this is not working: the new OnlyAuthor field contains all data, because 
> the regex didn't match. But I need "Example Author" in that field (I think) 
> to be able to search and facet only author information.
>
> I don't know where the problem is. Perhaps one of you can give me a hint, 
> or a totally different method to achieve my goal of extracting a single line 
> from this multi-line text.
>
> Kind regards and thanks for any help
> Michael
>
>
>
