PatternCaptureGroupTokenFilter has been around since 2013 (at least that's the earliest revision in Git). I located it even in 5x so it should be there in ...lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern
Best,
Erick

On Thu, Sep 28, 2017 at 7:45 AM, Webster Homer <webster.ho...@sial.com> wrote:
> It's still buggy, so not ready to share.
>
> I keep a copy of the Solr source which I use for this type of development. I
> don't see PatternCaptureGroupTokenFilterFactory in the Solr 6.2 code base
> at all. I was thinking of seeing how it treated the positions etc...
>
> My code now looks reasonable in the Analysis tool, but doesn't seem to
> create searchable Lucene data. I've changed it considerably since my first
> post, so I now see output in the tool, which was an improvement.
>
> On Wed, Sep 27, 2017 at 10:30 AM, Stefan Matheis <matheis.ste...@gmail.com>
> wrote:
>
>> > In any case I figured out my problem. I was overthinking it.
>>
>> Mind to share?
>>
>> -Stefan
>>
>> On Sep 27, 2017 4:34 PM, "Webster Homer" <webster.ho...@sial.com> wrote:
>>
>> > There is a need for a special filter since the input has to be normalized.
>> > That is the main requirement, splitting into pieces is optional. As far as
>> > I know there is nothing in Solr that knows about molecular formulas.
>> >
>> > In any case I figured out my problem. I was overthinking it.
>> >
>> > On Wed, Sep 27, 2017 at 3:52 AM, Emir Arnautović
>> > <emir.arnauto...@sematext.com> wrote:
>> >
>> > > Hi Homer,
>> > > There is no need for a special filter; there is one that is for some
>> > > reason not part of the documentation (will ask why, so follow that
>> > > thread if you decide to go this way). You can use something like:
>> > > <filter class="solr.PatternCaptureGroupTokenFilterFactory"
>> > > pattern="([A-Z][a-z]?\d+)" preserveOriginal="true" />
>> > >
>> > > This will capture all atom counts as separate tokens.
>> > >
>> > > HTH,
>> > > Emir
>> > >
>> > > > On 26 Sep 2017, at 23:14, Webster Homer <webster.ho...@sial.com> wrote:
>> > > >
>> > > > I am trying to create a filter that normalizes an input token, but also
>> > > > splits it into multiple pieces.
>> > > > Sort of like what the WordDelimiterFilter does.
>> > > >
>> > > > It's meant to take a molecular formula like C2H6O and normalize it to
>> > > > C2H6O1
>> > > >
>> > > > That part works. However I was also going to have it put out the
>> > > > individual atom counts as tokens.
>> > > > C2H6O1
>> > > > C2
>> > > > H6
>> > > > O1
>> > > >
>> > > > When I enable this feature in the factory, I don't get any output at
>> > > > all.
>> > > >
>> > > > I looked over a couple of filters that do what I want and it's not
>> > > > entirely clear what they're doing. So I have some questions:
>> > > > Looking at ShingleFilter and WordDelimiterFilter
>> > > > They both set several attributes:
>> > > > CharTermAttribute: Seems to be the actual terms being set. Seemed
>> > > > straightforward, works fine when I only have one term to add.
>> > > >
>> > > > PositionIncrementAttribute: What does this do? It appears that
>> > > > WordDelimiterFilter sets this to 0 most of the time. This has decent
>> > > > documentation.
>> > > >
>> > > > OffsetAttribute: I think that this tracks offsets for each term being
>> > > > processed. Not really sure though. The documentation mentions tokens.
>> > > > So if I have multiple variations for a token, is this for each
>> > > > variation?
>> > > >
>> > > > TypeAttribute: default is "word". Don't know what this is for.
>> > > >
>> > > > PositionLengthAttribute: WordDelimiterFilter doesn't use this but
>> > > > Shingle does. It defaults to 1. What's it good for, and when should I
>> > > > use it?
>> > > >
>> > > > Here is my incrementToken method.
>> > > >
>> > > > @Override
>> > > > public boolean incrementToken() throws IOException {
>> > > >   while (true) {
>> > > >     if (!hasSavedState) {
>> > > >       if (!input.incrementToken()) {
>> > > >         return false;
>> > > >       }
>> > > >       if (!generateFragments) { // This part works fine!
>> > > >         String normalizedFormula = molFormula.normalize(
>> > > >             new String(termAttribute.buffer()));
>> > > >         char[] newBuffer = normalizedFormula.toCharArray();
>> > > >         termAttribute.setEmpty();
>> > > >         termAttribute.copyBuffer(newBuffer, 0, newBuffer.length);
>> > > >         return true;
>> > > >       }
>> > > >       formulas = molFormula.normalizeToList(
>> > > >           new String(termAttribute.buffer()));
>> > > >       iterator = formulas.listIterator();
>> > > >       savedPositionIncrement += posIncAttribute.getPositionIncrement();
>> > > >       hasSavedState = true;
>> > > >       first = true;
>> > > >       saveState();
>> > > >     }
>> > > >     if (!iterator.hasNext()) {
>> > > >       posIncAttribute.setPositionIncrement(savedPositionIncrement);
>> > > >       savedPositionIncrement = 0;
>> > > >       hasSavedState = false;
>> > > >       continue;
>> > > >     }
>> > > >     String formula = iterator.next();
>> > > >     int startOffset = savedStartOffset;
>> > > >
>> > > >     if (first) {
>> > > >       termAttribute.setEmpty();
>> > > >     }
>> > > >     int endOffset = savedStartOffset + formula.length();
>> > > >     System.out.printf("Writing formula %s %d to %d%n", formula,
>> > > >         startOffset, endOffset);
>> > > >     termAttribute.append(formula);
>> > > >     offsetAttribute.setOffset(startOffset, endOffset);
>> > > >     savedStartOffset = endOffset + 1;
>> > > >     if (first) {
>> > > >       posIncAttribute.setPositionIncrement(0);
>> > > >     } else {
>> > > >       first = false;
>> > > >       posIncAttribute.setPositionIncrement(0);
>> > > >     }
>> > > >     typeAttribute.setType(savedType);
>> > > >     return true;
>> > > >   }
>> > > > }
>> > > >
>> > > > --
>> > > >
>> > > > This message and any attachment are confidential and may be privileged or
>> > > > otherwise protected from disclosure. If you are not the intended recipient,
>> > > > you must not copy this message or attachment or disclose the contents to
>> > > > any other person.
>> > > > If you have received this transmission in error, please
>> > > > notify the sender immediately and delete the message and any attachment
>> > > > from your system. Merck KGaA, Darmstadt, Germany and any of its
>> > > > subsidiaries do not accept liability for any omissions or errors in this
>> > > > message which may arise as a result of E-Mail-transmission or for damages
>> > > > resulting from any unauthorized changes of the content of this message and
>> > > > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
>> > > > subsidiaries do not guarantee that this message is free of viruses and does
>> > > > not accept liability for any damages caused by any virus transmitted
>> > > > therewith.
>> > > >
>> > > > Click http://www.emdgroup.com/disclaimer to access the German, French,
>> > > > Spanish and Portuguese versions of this disclaimer.
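
For anyone following along, here is a minimal sketch of what the pattern suggested earlier in the thread, ([A-Z][a-z]?\d+), would pull out of a formula. It uses plain java.util.regex rather than the Lucene filter itself, so it says nothing about how positions or offsets are assigned. The class and method names (AtomCountCapture, captureAtomCounts) are made up for illustration and do not come from Solr or from the code posted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtomCountCapture {
    // The pattern quoted in the thread: one uppercase letter, an optional
    // lowercase letter, then one or more digits.
    private static final Pattern ATOM_COUNT = Pattern.compile("([A-Z][a-z]?\\d+)");

    /** Returns every substring matched by capture group 1, in order of appearance. */
    static List<String> captureAtomCounts(String formula) {
        List<String> tokens = new ArrayList<>();
        Matcher m = ATOM_COUNT.matcher(formula);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Normalized formula: every atom carries an explicit count.
        System.out.println(captureAtomCounts("C2H6O1")); // [C2, H6, O1]
        // Un-normalized formula: the bare "O" has no digit, so it is not captured.
        System.out.println(captureAtomCounts("C2H6O"));  // [C2, H6]
    }
}
```

With the un-normalized input the trailing O is dropped because the pattern requires at least one digit, which fits the point made earlier in the thread that the formula has to be normalized before any splitting is useful.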