> In any case I figured out my problem. I was overthinking it.

Mind sharing?

-Stefan

On Sep 27, 2017 4:34 PM, "Webster Homer" <webster.ho...@sial.com> wrote:

> There is a need for a special filter, since the input has to be normalized.
> That is the main requirement; splitting into pieces is optional. As far as
> I know there is nothing in Solr that knows about molecular formulas.
>
> In any case I figured out my problem. I was overthinking it.
>
> On Wed, Sep 27, 2017 at 3:52 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
>
> > Hi Homer,
> > There is no need for a special filter. There is one that is, for some reason,
> > not part of the documentation (I will ask why, so follow that thread if you
> > decide to go this way). You can use something like:
> > <filter class="solr.PatternCaptureGroupTokenFilterFactory"
> > pattern="([A-Z][a-z]?\d+)" preserveOriginal="true" />
> >
> > This will capture all atom counts as separate tokens.
> >
> > HTH,
> > Emir
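
To see what the suggested pattern picks out of a formula, here is a small
sketch that applies the same regular expression with plain java.util.regex.
It does not run Solr or the filter itself; the class and method names
(AtomCountDemo, atomCounts) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtomCountDemo {
    // Same expression as in the filter definition quoted above.
    private static final Pattern ATOM_COUNT = Pattern.compile("([A-Z][a-z]?\\d+)");

    /** Returns every element-plus-count fragment the pattern finds in a formula. */
    public static List<String> atomCounts(String formula) {
        List<String> fragments = new ArrayList<>();
        Matcher m = ATOM_COUNT.matcher(formula);
        while (m.find()) {
            fragments.add(m.group(1));
        }
        return fragments;
    }

    public static void main(String[] args) {
        System.out.println(atomCounts("C2H6O1")); // [C2, H6, O1]
        System.out.println(atomCounts("C2H6O"));  // [C2, H6]
    }
}
```

The second call shows that an atom without an explicit count (the O in C2H6O)
is not matched, so the pattern alone only covers formulas that are already
normalized.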
> >
> > > On 26 Sep 2017, at 23:14, Webster Homer <webster.ho...@sial.com> wrote:
> > >
> > > > I am trying to create a filter that normalizes an input token, but also
> > > > splits it into multiple pieces, sort of like what the WordDelimiterFilter
> > > > does.
> > > >
> > > > It's meant to take a molecular formula like C2H6O and normalize it to
> > > > C2H6O1.
> > > >
> > > > That part works. However, I was also going to have it put out the
> > > > individual atom counts as tokens:
> > > > C2H6O1
> > > > C2
> > > > H6
> > > > O1
> > > >
> > > > When I enable this feature in the factory, I don't get any output at all.
> > >
> > > > I looked over a couple of filters that do what I want, and it's not
> > > > entirely clear what they're doing. So I have some questions.
> > > > Looking at ShingleFilter and WordDelimiterFilter, they both set several
> > > > attributes:
> > > >
> > > > CharTermAttribute: Seems to be the actual term being set. Seemed
> > > > straightforward; works fine when I only have one term to add.
> > > >
> > > > PositionIncrementAttribute: What does this do? It appears that
> > > > WordDelimiterFilter sets this to 0 most of the time. This has decent
> > > > documentation.
> > > >
> > > > OffsetAttribute: I think that this tracks offsets for each term being
> > > > processed. Not really sure though. The documentation mentions tokens. So
> > > > if I have multiple variations for a token, is this for each variation?
> > > >
> > > > TypeAttribute: The default is "word". I don't know what this is for.
> > > >
> > > > PositionLengthAttribute: WordDelimiterFilter doesn't use this but
> > > > ShingleFilter does. It defaults to 1. What's it good for, and when should
> > > > I use it?
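
The position increment is the distance from the previous token's position, so
an increment of 0 stacks a term on the same position as the one before it.
The sketch below models that arithmetic in plain Java. Token is a hypothetical
stand-in, not a Lucene class, and giving every stacked term the offsets of the
original input (0 to 5 for C2H6O) is an assumption that follows one common
convention. Counting from -1 mirrors how positions are derived from
increments, as far as I know. TypeAttribute is a free-form label for the kind
of token, and PositionLengthAttribute says how many positions a token spans,
which stays at 1 unless a token covers several positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionDemo {
    /** Hypothetical stand-in for one emitted token; this is not a Lucene class. */
    public static final class Token {
        public final String term;
        public final int positionIncrement;
        public final int startOffset;
        public final int endOffset;

        public Token(String term, int positionIncrement, int startOffset, int endOffset) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Stacks all terms on one position: the first keeps the incoming increment, the rest use 0. */
    public static List<Token> stack(List<String> terms, int increment, int start, int end) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            out.add(new Token(terms.get(i), i == 0 ? increment : 0, start, end));
        }
        return out;
    }

    /** Absolute positions: start at -1 and add each token's increment in turn. */
    public static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> tokens = stack(Arrays.asList("C2H6O1", "C2", "H6", "O1"), 1, 0, 5);
        System.out.println(positions(tokens)); // [0, 0, 0, 0]
    }
}
```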
> > >
> > > Here is my incrementToken method.
> > >
> > > >    @Override
> > > >    public boolean incrementToken() throws IOException {
> > > >        while (true) {
> > > >            if (!hasSavedState) {
> > > >                if (!input.incrementToken()) {
> > > >                    return false;
> > > >                }
> > > >                if (!generateFragments) { // This part works fine!
> > > >                    String normalizedFormula = molFormula.normalize(
> > > >                            new String(termAttribute.buffer()));
> > > >                    char[] newBuffer = normalizedFormula.toCharArray();
> > > >                    termAttribute.setEmpty();
> > > >                    termAttribute.copyBuffer(newBuffer, 0, newBuffer.length);
> > > >                    return true;
> > > >                }
> > > >                formulas = molFormula.normalizeToList(
> > > >                        new String(termAttribute.buffer()));
> > > >                iterator = formulas.listIterator();
> > > >                savedPositionIncrement += posIncAttribute.getPositionIncrement();
> > > >                hasSavedState = true;
> > > >                first = true;
> > > >                saveState();
> > > >            }
> > > >            if (!iterator.hasNext()) {
> > > >                posIncAttribute.setPositionIncrement(savedPositionIncrement);
> > > >                savedPositionIncrement = 0;
> > > >                hasSavedState = false;
> > > >                continue;
> > > >            }
> > > >            String formula = iterator.next();
> > > >            int startOffset = savedStartOffset;
> > > >
> > > >            if (first) {
> > > >                termAttribute.setEmpty();
> > > >            }
> > > >            int endOffset = savedStartOffset + formula.length();
> > > >            System.out.printf("Writing formula %s %d to %d%n", formula,
> > > >                    startOffset, endOffset);
> > > >            termAttribute.append(formula);
> > > >            offsetAttribute.setOffset(startOffset, endOffset);
> > > >            savedStartOffset = endOffset + 1;
> > > >            if (first) {
> > > >                posIncAttribute.setPositionIncrement(0);
> > > >            } else {
> > > >                first = false;
> > > >                posIncAttribute.setPositionIncrement(0);
> > > >            }
> > > >            typeAttribute.setType(savedType);
> > > >            return true;
> > > >        }
> > > >    }
> > >
> > > --
> > >
> > >
> > > This message and any attachment are confidential and may be privileged
> or
> > > otherwise protected from disclosure. If you are not the intended
> > recipient,
> > > you must not copy this message or attachment or disclose the contents
> to
> > > any other person. If you have received this transmission in error,
> please
> > > notify the sender immediately and delete the message and any attachment
> > > from your system. Merck KGaA, Darmstadt, Germany and any of its
> > > subsidiaries do not accept liability for any omissions or errors in
> this
> > > message which may arise as a result of E-Mail-transmission or for
> damages
> > > resulting from any unauthorized changes of the content of this message
> > and
> > > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
> > > subsidiaries do not guarantee that this message is free of viruses and
> > does
> > > not accept liability for any damages caused by any virus transmitted
> > > therewith.
> > >
> > > Click http://www.emdgroup.com/disclaimer to access the German, French,
> > > Spanish and Portuguese versions of this disclaimer.
> >
> >
>
