I am trying to create a filter that normalizes an input token but also
splits it into multiple pieces, sort of like what the WordDelimiterFilter
does.

It's meant to take a molecular formula like C2H6O and normalize it to C2H6O1.

That part works. However, I was also going to have it emit the individual
atom counts as tokens:
C2H6O1
C2
H6
O1
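
To make the intended output concrete, here is a minimal sketch of the splitting step in plain Java, independent of Lucene. The class name FormulaFragments and its regular expression are assumptions made for illustration; this is not the molFormula helper used in my filter, and it only handles flat formulas (no parentheses, no merging of repeated elements).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class FormulaFragments {
    // An element symbol (one uppercase letter, optional lowercase letter)
    // followed by an optional count.
    private static final Pattern ATOM = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    // Returns the normalized formula first, then one entry per atom count.
    static List<String> normalizeToList(String formula) {
        List<String> fragments = new ArrayList<>();
        StringBuilder whole = new StringBuilder();
        Matcher m = ATOM.matcher(formula);
        while (m.find()) {
            // A missing count means exactly one atom of that element.
            String count = m.group(2).isEmpty() ? "1" : m.group(2);
            String fragment = m.group(1) + count;
            fragments.add(fragment);
            whole.append(fragment);
        }
        List<String> result = new ArrayList<>();
        result.add(whole.toString());
        result.addAll(fragments);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalizeToList("C2H6O")); // [C2H6O1, C2, H6, O1]
    }
}
```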

When I enable this feature in the factory, I don't get any output at all.

I looked over a couple of filters that do what I want, and it's not entirely
clear what they're doing. So I have some questions.

ShingleFilter and WordDelimiterFilter both set several attributes:

CharTermAttribute: Seems to hold the actual term text being set. Seemed
straightforward, and works fine when I only have one term to add.

PositionIncrementAttribute: What does this do? It appears that
WordDelimiterFilter sets this to 0 most of the time. This has decent
documentation.
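
As I understand the documentation, each token's position is the previous token's position plus its own increment, so an increment of 0 stacks a token on top of the previous one. A small plain-Java sketch of that arithmetic follows; the class and method names are made up for illustration, and the starting value of -1 reflects my reading that the first token with increment 1 lands at position 0.

```java
import java.util.Arrays;

// Hypothetical demo class; it only models the position arithmetic.
final class PositionDemo {
    // Turns a sequence of position increments into absolute positions.
    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int pos = -1; // assumed start, so a first increment of 1 gives position 0
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];
            out[i] = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        // C2H6O1 (inc 1), then C2, H6, O1 (inc 0 each), then the next word (inc 1)
        int[] result = positions(new int[] {1, 0, 0, 0, 1});
        System.out.println(Arrays.toString(result)); // [0, 0, 0, 0, 1]
    }
}
```

If that reading is right, the first token emitted for each input token should carry the input's original increment, and only the extra fragments should use 0.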

OffsetAttribute: I think that this tracks the start and end offsets of each
term being processed. Not really sure though. The documentation mentions
tokens. So if I have multiple variations for a token, is this set for each
variation?

TypeAttribute: default is "word". Don't know what this is for.

PositionLengthAttribute: WordDelimiterFilter doesn't use this but
ShingleFilter does. It defaults to 1. What is it good for, and when should I
use it?

Here is my incrementToken method.

    @Override
    public boolean incrementToken() throws IOException {
        while (true) {
            if (!hasSavedState) {
                if (!input.incrementToken()) {
                    return false;
                }
                if (!generateFragments) { // This part works fine!
                    String normalizedFormula = molFormula.normalize(
                            new String(termAttribute.buffer()));
                    char[] newBuffer = normalizedFormula.toCharArray();
                    termAttribute.setEmpty();
                    termAttribute.copyBuffer(newBuffer, 0, newBuffer.length);
                    return true;
                }
                formulas = molFormula.normalizeToList(
                        new String(termAttribute.buffer()));
                iterator = formulas.listIterator();
                savedPositionIncrement += posIncAttribute.getPositionIncrement();
                hasSavedState = true;
                first = true;
                saveState();
            }
            if (!iterator.hasNext()) {
                posIncAttribute.setPositionIncrement(savedPositionIncrement);
                savedPositionIncrement = 0;
                hasSavedState = false;
                continue;
            }
            String formula = iterator.next();
            int startOffset = savedStartOffset;

            if (first) {
                termAttribute.setEmpty();
            }
            int endOffset = savedStartOffset + formula.length();
            System.out.printf("Writing formula %s %d to %d%n", formula,
                    startOffset, endOffset);
            termAttribute.append(formula);
            offsetAttribute.setOffset(startOffset, endOffset);
            savedStartOffset = endOffset + 1;
            if (first) {
                posIncAttribute.setPositionIncrement(0);
            } else {
                first = false;
                posIncAttribute.setPositionIncrement(0);
            }
            typeAttribute.setType(savedType);
            return true;
        }
    }
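
One thing I am unsure about in the code above is new String(termAttribute.buffer()). The CharTermAttribute documentation says that buffer() returns the internal array, which may be longer than the term, and that only the first length() characters are valid. The plain-Java sketch below simulates that situation with an ordinary char array; the class name and the buffer size of 16 are assumptions for illustration.

```java
// Hypothetical demo class; the char[] stands in for a term buffer.
final class BufferDemo {
    // Converts the entire array, including any unused trailing slots.
    static String wholeBuffer(char[] buffer) {
        return new String(buffer);
    }

    // Converts only the first `length` characters, i.e. the valid term.
    static String validPrefix(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        char[] buffer = new char[16];          // oversized, like an internal buffer
        String term = "C2H6O";
        term.getChars(0, term.length(), buffer, 0);

        System.out.println(wholeBuffer(buffer).length());        // 16
        System.out.println(validPrefix(buffer, term.length()));  // C2H6O
    }
}
```

In the filter itself the equivalent would presumably be termAttribute.toString(), or new String(termAttribute.buffer(), 0, termAttribute.length()).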
