Re: Filter Factory question

Erick Erickson Thu, 28 Sep 2017 08:33:19 -0700

PatternCaptureGroupTokenFilter has been around since 2013 (at least
that's the earliest revision in Git). I located it even in 5x so it
should be there in
...lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern


Best,
Erick

On Thu, Sep 28, 2017 at 7:45 AM, Webster Homer <webster.ho...@sial.com> wrote:
> It's still buggy, so not ready to share.
>
> I keep a copy of the Solr source, which I use for this type of development. I
> don't see PatternCaptureGroupTokenFilterFactory in the Solr 6.2 code base
> at all. I was thinking of looking at how it treats positions, etc.
>
> My code now looks reasonable in the Analysis tool, but doesn't seem to
> create searchable Lucene data. I've changed it considerably since my first
> post, so I now see output in the tool, which is an improvement.
>
>
> On Wed, Sep 27, 2017 at 10:30 AM, Stefan Matheis <matheis.ste...@gmail.com>
> wrote:
>
>> > In any case I figured out my problem. I was overthinking it.
>>
>> Mind sharing?
>>
>> -Stefan
>>
>> On Sep 27, 2017 4:34 PM, "Webster Homer" <webster.ho...@sial.com> wrote:
>>
>> > There is a need for a special filter, since the input has to be normalized.
>> > That is the main requirement; splitting into pieces is optional. As far as
>> > I know, there is nothing in Solr that knows about molecular formulas.
>> >
>> > In any case, I figured out my problem. I was overthinking it.
>> >
>> > On Wed, Sep 27, 2017 at 3:52 AM, Emir Arnautović <
>> > emir.arnauto...@sematext.com> wrote:
>> >
>> > > Hi Homer,
>> > > There is no need for a special filter. There is one that is, for some
>> > > reason, not part of the documentation (I will ask why, so follow that
>> > > thread if you decide to go this way). You can use something like:
>> > > <filter class="solr.PatternCaptureGroupTokenFilterFactory"
>> > > pattern="([A-Z][a-z]?\d+)" preserveOriginal="true" />
>> > >
>> > > This will capture all atom counts as separate tokens.
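To make the suggestion above concrete, here is a minimal sketch of what the quoted pattern extracts, using only `java.util.regex` rather than the Lucene filter itself. The class and method names are made up for illustration. As far as I know, the factory that ships with Lucene is named `PatternCaptureGroupFilterFactory` (without "Token") and spells its option `preserve_original`; treat both as details to verify against your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: models only the regex part of the suggestion,
// not the behavior of the real Lucene token filter.
final class AtomCountCapture {
    // Same expression as in the quoted <filter> element: an element symbol
    // (uppercase letter, optional lowercase letter) followed by a count.
    private static final Pattern ATOM_COUNT = Pattern.compile("([A-Z][a-z]?\\d+)");

    /** Returns every capture-group-1 match found in the formula, in order. */
    static List<String> capture(String formula) {
        List<String> groups = new ArrayList<>();
        Matcher m = ATOM_COUNT.matcher(formula);
        while (m.find()) {
            groups.add(m.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(capture("C2H6O1")); // [C2, H6, O1]
        System.out.println(capture("C2H6O"));  // [C2, H6]
    }
}
```

The second call shows why normalization has to happen first: an atom with an implicit count of 1 has no digits, so the pattern skips it.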
>> > >
>> > > HTH,
>> > > Emir
>> > >
>> > > > On 26 Sep 2017, at 23:14, Webster Homer <webster.ho...@sial.com>
>> > wrote:
>> > > >
>> > > > I am trying to create a filter that normalizes an input token, but also
>> > > > splits it into multiple pieces, sort of like what the WordDelimiterFilter
>> > > > does.
>> > > >
>> > > > It's meant to take a molecular formula like C2H6O and normalize it to C2H6O1.
>> > > >
>> > > > That part works. However, I was also going to have it put out the individual
>> > > > atom counts as tokens:
>> > > > C2H6O1
>> > > > C2
>> > > > H6
>> > > > O1
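The token list above can be reproduced with a small stand-alone sketch. The poster's `molFormula` class is not shown in the thread, so `normalizeToList` here is a hypothetical stand-in that only assumes the behavior described: add an explicit count of 1 where none is given, then return the whole normalized formula followed by each atom count. It ignores parentheses, charges, and anything else a real formula parser would handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the poster's normalizer; not the original code.
final class FormulaNormalizerSketch {
    // Element symbol followed by an optional count, e.g. "C2", "O", "Cl3".
    private static final Pattern ATOM = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    /** Whole normalized formula first, then one entry per atom count. */
    static List<String> normalizeToList(String formula) {
        List<String> fragments = new ArrayList<>();
        StringBuilder whole = new StringBuilder();
        Matcher m = ATOM.matcher(formula);
        while (m.find()) {
            // A missing count means exactly one atom of that element.
            String count = m.group(2).isEmpty() ? "1" : m.group(2);
            String fragment = m.group(1) + count;
            fragments.add(fragment);
            whole.append(fragment);
        }
        List<String> result = new ArrayList<>();
        result.add(whole.toString());
        result.addAll(fragments);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalizeToList("C2H6O")); // [C2H6O1, C2, H6, O1]
    }
}
```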
>> > > >
>> > > > When I enable this feature in the factory, I don't get any output at all.
>> > > >
>> > > > I looked over a couple of filters that do what I want, and it's not entirely
>> > > > clear what they're doing. So I have some questions.
>> > > > Looking at ShingleFilter and WordDelimiterFilter,
>> > > > they both set several attributes:
>> > > > CharTermAttribute: Seems to be the actual terms being set. Seemed
>> > > > straightforward; works fine when I only have one term to add.
>> > > >
>> > > > PositionIncrementAttribute: What does this do? It appears that
>> > > > WordDelimiterFilter sets this to 0 most of the time. This has decent
>> > > > documentation.
>> > > >
>> > > > OffsetAttribute: I think that this tracks offsets for each term being
>> > > > processed. Not really sure, though. The documentation mentions tokens. So if
>> > > > I have multiple variations for a token, is this for each variation?
>> > > >
>> > > > TypeAttribute: Default is "word". Don't know what this is for.
>> > > >
>> > > > PositionLengthAttribute: WordDelimiterFilter doesn't use this, but ShingleFilter
>> > > > does. It defaults to 1. What's it good for, and when should I use it?
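On the position increment question, a small plain-Java model may help. It does not use any Lucene API and the names are made up; it only illustrates the documented idea that each token's increment is the distance from the previous token's position, so an increment of 0 stacks a token on the same position as the one before it (the way synonyms are usually indexed).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of position increments; not Lucene code.
final class PositionModel {
    /** Converts per-token position increments into absolute positions. */
    static List<Integer> positions(int[] increments) {
        List<Integer> result = new ArrayList<>();
        int position = -1; // nothing emitted yet
        for (int inc : increments) {
            position += inc;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        // C2H6O1 (+1), then C2, H6, O1 stacked on it (+0), then the next word (+1).
        System.out.println(positions(new int[] {1, 0, 0, 0, 1})); // [0, 0, 0, 0, 1]
    }
}
```

In this model a stream whose very first increment is 0 ends up at position -1. As far as I know, the Lucene indexer rejects that case, so the first token emitted for a field should carry an increment greater than 0.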
>> > > >
>> > > > Here is my incrementToken method.
>> > > >
>> > > > @Override
>> > > > public boolean incrementToken() throws IOException {
>> > > >     while (true) {
>> > > >         if (!hasSavedState) {
>> > > >             if (!input.incrementToken()) {
>> > > >                 return false;
>> > > >             }
>> > > >             if (!generateFragments) { // This part works fine!
>> > > >                 String normalizedFormula =
>> > > >                         molFormula.normalize(new String(termAttribute.buffer()));
>> > > >                 char[] newBuffer = normalizedFormula.toCharArray();
>> > > >                 termAttribute.setEmpty();
>> > > >                 termAttribute.copyBuffer(newBuffer, 0, newBuffer.length);
>> > > >                 return true;
>> > > >             }
>> > > >             formulas = molFormula.normalizeToList(new String(termAttribute.buffer()));
>> > > >             iterator = formulas.listIterator();
>> > > >             savedPositionIncrement += posIncAttribute.getPositionIncrement();
>> > > >             hasSavedState = true;
>> > > >             first = true;
>> > > >             saveState();
>> > > >         }
>> > > >         if (!iterator.hasNext()) {
>> > > >             posIncAttribute.setPositionIncrement(savedPositionIncrement);
>> > > >             savedPositionIncrement = 0;
>> > > >             hasSavedState = false;
>> > > >             continue;
>> > > >         }
>> > > >         String formula = iterator.next();
>> > > >         int startOffset = savedStartOffset;
>> > > >
>> > > >         if (first) {
>> > > >             termAttribute.setEmpty();
>> > > >         }
>> > > >         int endOffset = savedStartOffset + formula.length();
>> > > >         System.out.printf("Writing formula %s %d to %d%n", formula,
>> > > >                 startOffset, endOffset);
>> > > >         termAttribute.append(formula);
>> > > >         offsetAttribute.setOffset(startOffset, endOffset);
>> > > >         savedStartOffset = endOffset + 1;
>> > > >         if (first) {
>> > > >             posIncAttribute.setPositionIncrement(0);
>> > > >         } else {
>> > > >             first = false;
>> > > >             posIncAttribute.setPositionIncrement(0);
>> > > >         }
>> > > >         typeAttribute.setType(savedType);
>> > > >         return true;
>> > > >     }
>> > > > }
>> > > >
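One detail in the method above that may be worth checking is `new String(termAttribute.buffer())`. In Lucene, `CharTermAttribute.buffer()` returns the internal array, which can be longer than `length()`; only the first `length()` characters are the current term, and `toString()` returns exactly that prefix. The sketch below models the effect with a plain reusable `char[]`; the `TermBuffer` class is hypothetical, and whether this explains the behavior described in the thread is an assumption.

```java
// Plain-Java model of a reusable term buffer; not the Lucene attribute itself.
final class TermBufferSketch {
    /** Only the first `length` chars of `buffer` belong to the current term. */
    static final class TermBuffer {
        final char[] buffer = new char[16];
        int length;

        void set(String term) {
            term.getChars(0, term.length(), buffer, 0);
            length = term.length();
        }
    }

    /** Reads the whole backing array, including whatever lies past the term. */
    static String readWholeBuffer(TermBuffer t) {
        return new String(t.buffer);
    }

    /** Reads only the valid prefix of the backing array. */
    static String readTerm(TermBuffer t) {
        return new String(t.buffer, 0, t.length);
    }

    public static void main(String[] args) {
        TermBuffer t = new TermBuffer();
        t.set("C12H22O11");
        t.set("C2H6O"); // reuses the array, leaving "2O11" behind the new term
        System.out.println(readTerm(t));                 // C2H6O
        System.out.println(readWholeBuffer(t).length()); // 16
    }
}
```

With the real attribute, the equivalent of `readTerm` would be `termAttribute.toString()` or `new String(termAttribute.buffer(), 0, termAttribute.length())`.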
>> > > > --
>> > > >
>> > > >
>> > > > This message and any attachment are confidential and may be privileged or
>> > > > otherwise protected from disclosure. If you are not the intended recipient,
>> > > > you must not copy this message or attachment or disclose the contents to
>> > > > any other person. If you have received this transmission in error, please
>> > > > notify the sender immediately and delete the message and any attachment
>> > > > from your system. Merck KGaA, Darmstadt, Germany and any of its
>> > > > subsidiaries do not accept liability for any omissions or errors in this
>> > > > message which may arise as a result of E-Mail-transmission or for damages
>> > > > resulting from any unauthorized changes of the content of this message and
>> > > > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
>> > > > subsidiaries do not guarantee that this message is free of viruses and does
>> > > > not accept liability for any damages caused by any virus transmitted
>> > > > therewith.
>> > > >
>> > > > Click http://www.emdgroup.com/disclaimer to access the German, French,
>> > > > Spanish and Portuguese versions of this disclaimer.
>> > >
>> > >
>> >
>>
>
