> In any case I figured out my problem. I was overthinking it.

Mind sharing?
-Stefan

On Sep 27, 2017 4:34 PM, "Webster Homer" <webster.ho...@sial.com> wrote:

> There is a need for a special filter since the input has to be normalized.
> That is the main requirement; splitting into pieces is optional. As far as
> I know there is nothing in Solr that knows about molecular formulas.
>
> In any case I figured out my problem. I was overthinking it.
>
> On Wed, Sep 27, 2017 at 3:52 AM, Emir Arnautović
> <emir.arnauto...@sematext.com> wrote:
>
> > Hi Homer,
> > There is no need for a special filter. There is one that is, for some
> > reason, not part of the documentation (I will ask why, so follow that
> > thread if you decide to go this way). You can use something like:
> > <filter class="solr.PatternCaptureGroupTokenFilterFactory"
> >         pattern="([A-Z][a-z]?\d+)" preserveOriginal="true" />
> >
> > This will capture all atom counts as separate tokens.
> >
> > HTH,
> > Emir
> >
> > > On 26 Sep 2017, at 23:14, Webster Homer <webster.ho...@sial.com> wrote:
> > >
> > > I am trying to create a filter that normalizes an input token, but also
> > > splits it into multiple pieces, sort of like what the
> > > WordDelimiterFilter does.
> > >
> > > It's meant to take a molecular formula like C2H6O and normalize it to
> > > C2H6O1.
> > >
> > > That part works. However, I was also going to have it put out the
> > > individual atom counts as tokens:
> > > C2H6O1
> > > C2
> > > H6
> > > O1
> > >
> > > When I enable this feature in the factory, I don't get any output at
> > > all.
> > >
> > > I looked over a couple of filters that do what I want and it's not
> > > entirely clear what they're doing. So I have some questions.
> > > Looking at ShingleFilter and WordDelimiterFilter, they both set several
> > > attributes:
> > > CharTermAttribute: Seems to be the actual terms being set. Seemed
> > > straightforward, works fine when I only have one term to add.
> > >
> > > PositionIncrementAttribute: What does this do?
> > > It appears that WordDelimiterFilter sets this to 0 most of the time.
> > > This has decent documentation.
> > >
> > > OffsetAttribute: I think that this tracks offsets for each term being
> > > processed. Not really sure though. The documentation mentions tokens.
> > > So if I have multiple variations for a token, is this set for each
> > > variation?
> > >
> > > TypeAttribute: default is "word". Don't know what this is for.
> > >
> > > PositionLengthAttribute: WordDelimiterFilter doesn't use this but
> > > ShingleFilter does. It defaults to 1. What's it good for, and when
> > > should I use it?
> > >
> > > Here is my incrementToken method.
> > >
> > > @Override
> > > public boolean incrementToken() throws IOException {
> > >   while (true) {
> > >     if (!hasSavedState) {
> > >       if (!input.incrementToken()) {
> > >         return false;
> > >       }
> > >       if (!generateFragments) { // This part works fine!
> > >         String normalizedFormula = molFormula.normalize(new String(termAttribute.buffer()));
> > >         char[] newBuffer = normalizedFormula.toCharArray();
> > >         termAttribute.setEmpty();
> > >         termAttribute.copyBuffer(newBuffer, 0, newBuffer.length);
> > >         return true;
> > >       }
> > >       formulas = molFormula.normalizeToList(new String(termAttribute.buffer()));
> > >       iterator = formulas.listIterator();
> > >       savedPositionIncrement += posIncAttribute.getPositionIncrement();
> > >       hasSavedState = true;
> > >       first = true;
> > >       saveState();
> > >     }
> > >     if (!iterator.hasNext()) {
> > >       posIncAttribute.setPositionIncrement(savedPositionIncrement);
> > >       savedPositionIncrement = 0;
> > >       hasSavedState = false;
> > >       continue;
> > >     }
> > >     String formula = iterator.next();
> > >     int startOffset = savedStartOffset;
> > >
> > >     if (first) {
> > >       termAttribute.setEmpty();
> > >     }
> > >     int endOffset = savedStartOffset + formula.length();
> > >     System.out.printf("Writing formula %s %d to %d%n", formula, startOffset, endOffset);
> > >     termAttribute.append(formula);
> > >     offsetAttribute.setOffset(startOffset, endOffset);
> > >     savedStartOffset = endOffset + 1;
> > >     if (first) {
> > >       posIncAttribute.setPositionIncrement(0);
> > >     } else {
> > >       first = false;
> > >       posIncAttribute.setPositionIncrement(0);
> > >     }
> > >     typeAttribute.setType(savedType);
> > >     return true;
> > >   }
> > > }
> > >
> > > --
> > >
> > > This message and any attachment are confidential and may be privileged
> > > or otherwise protected from disclosure. If you are not the intended
> > > recipient, you must not copy this message or attachment or disclose the
> > > contents to any other person. If you have received this transmission in
> > > error, please notify the sender immediately and delete the message and
> > > any attachment from your system. Merck KGaA, Darmstadt, Germany and any
> > > of its subsidiaries do not accept liability for any omissions or errors
> > > in this message which may arise as a result of E-Mail-transmission or
> > > for damages resulting from any unauthorized changes of the content of
> > > this message and any attachment thereto. Merck KGaA, Darmstadt, Germany
> > > and any of its subsidiaries do not guarantee that this message is free
> > > of viruses and does not accept liability for any damages caused by any
> > > virus transmitted therewith.
> > >
> > > Click http://www.emdgroup.com/disclaimer to access the German, French,
> > > Spanish and Portuguese versions of this disclaimer.
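
For anyone reading this thread later: the token stream described in the original question (the normalized formula, followed by one token per atom count stacked at the same position) can be sketched in plain Java without Lucene. This is only an illustration under stated assumptions, not the poster's code or fix, which the thread never shows. The class FormulaTokens, its nested Token class, and the methods fragments, normalize and tokens are all hypothetical names, and the regex only handles simple formulas made of element symbols with optional counts (no parentheses, charges or hydrates).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene/Solr and not the poster's molFormula class.
final class FormulaTokens {
    // An element symbol (uppercase letter, optional lowercase letter) and an optional count.
    private static final Pattern ATOM = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    /** A term plus its position increment (0 means: same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Splits a formula into "symbol + count" fragments, writing a missing count as 1. */
    public static List<String> fragments(String formula) {
        List<String> out = new ArrayList<>();
        Matcher m = ATOM.matcher(formula);
        while (m.find()) {
            String count = m.group(2).isEmpty() ? "1" : m.group(2);
            out.add(m.group(1) + count);
        }
        return out;
    }

    /** Normalizes a formula by joining its fragments, e.g. C2H6O becomes C2H6O1. */
    public static String normalize(String formula) {
        return String.join("", fragments(formula));
    }

    /**
     * Emits the normalized formula with the incoming position increment,
     * then each fragment with increment 0 so it stacks on the same position.
     */
    public static List<Token> tokens(String formula, int incomingIncrement) {
        List<Token> out = new ArrayList<>();
        out.add(new Token(normalize(formula), incomingIncrement));
        for (String fragment : fragments(formula)) {
            out.add(new Token(fragment, 0));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [C2H6O1(+1), C2(+0), H6(+0), O1(+0)]
        System.out.println(tokens("C2H6O", 1));
    }
}
```

In an actual Lucene TokenFilter the same sequence would be returned one token per incrementToken() call, typically by keeping the pending fragments in a queue and using captureState() and restoreState() to carry the original token's attributes over. When reading the incoming term, note that CharTermAttribute.buffer() returns the whole backing array, which can be longer than length(), so termAttribute.toString() is the safer way to get the term text.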