[ 
https://issues.apache.org/jira/browse/LUCENE-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dishant Sharma updated LUCENE-10522:
------------------------------------
    Description: 
The default pattern capture token filter in Elasticsearch gives every generated token the same start and end offset: the start and end offset of the whole input string. Is there any way to change the start and end offsets of the generated tokens to the positions at which they actually occur in the input string? The issue I'm currently facing is that, when highlighting, the entire string is highlighted instead of just the match.

The code inside my token filter factory file is:
 
{code:java}
package pl.allegro.tech.elasticsearch.index.analysis.pl;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.pattern.PatternCaptureGroupTokenFilter;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;

import java.util.regex.Pattern;

public class PuAlPuTokenFilterFactory extends AbstractTokenFilterFactory {

    public PuAlPuTokenFilterFactory(IndexSettings indexSettings, Environment environment,
                                    String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new PatternCaptureGroupTokenFilter(tokenStream, true,
                Pattern.compile("(?<![^\\p{Alnum}\\p{Punct}])(\\p{Punct}\\p{Alnum}+\\p{Punct})"));
    }
}
{code}
 
I have several such token filter factories in my code, each containing the same code as above but passing a different pattern to the PatternCaptureGroupTokenFilter constructor. Each pattern extracts a different set of tokens, as my use case requires.

I am using Lucene's stock PatternCaptureGroupTokenFilter.
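For illustration only (plain java.util.regex, not the Lucene filter itself): running the same pattern directly over one whitespace token shows that each capture does carry its own start/end positions within the token. This is exactly the information that would be needed to emit per-capture offsets, whereas PatternCaptureGroupTokenFilter reuses the parent token's offsets for every capture it emits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureOffsetDemo {
    public static void main(String[] args) {
        // Same pattern as in the factory above.
        Pattern p = Pattern.compile(
            "(?<![^\\p{Alnum}\\p{Punct}])(\\p{Punct}\\p{Alnum}+\\p{Punct})");
        // One whitespace token from the example text.
        String token = "[https://www.google.com/]";
        Matcher m = p.matcher(token);
        // Each match reports its own span inside the token,
        // e.g. "[https:" at [0,7), "/www." at [8,13), ".com/" at [19,24).
        while (m.find()) {
            System.out.println(m.group(1)
                + " start=" + m.start(1) + " end=" + m.end(1));
        }
    }
}
```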

I am not using any mapping, but I am using the following index settings for my use case:

{code:json}
"settings" : {
   "analysis" : {
      "analyzer" : {
         "special_analyzer" : {
            "tokenizer" : "whitespace",
            "filter" : [ "url-filter-1", "url-filter-2", "url-filter-3", "url-filter-4",
                         "url-filter-5", "url-filter-6", "url-filter-7", "url-filter-8",
                         "url-filter-9", "url-filter-10", "url-filter-11", "unique" ]
         }
      }
   }
}
{code}
 

I am getting all the tokens with the regexes I have written; the only issue is that every token carries the same start and end offsets as the input string.

I am using the pattern token filter along with the whitespace tokenizer. Suppose I have the text: "Website url is [https://www.google.com/]"
Then the desired tokens are:
Website, url, is, [https://www.google.com/], https, www, google, com, https:, https:/, https://, /www, .google, .com, www., google., com/, www.google.com, etc.

I am getting all of these tokens through my regexes; the only issue is the offsets. Suppose the start and end offsets of the entire URL "[https://www.google.com/]" are 0 and 23: then every generated token is reported with offsets 0 and 23.

As per my use case, I rely on the highlighting functionality, where I need to highlight each generated token inside the text. But instead of highlighting only the match, it highlights the entire input text.
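As a sketch of the desired behavior (stdlib only, a hypothetical standalone class, not the actual Lucene filter): mapping each capture back to absolute offsets in the original text — token start offset plus the capture's start within the token — would give the highlighter exactly the spans it needs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CorrectedOffsets {
    public static void main(String[] args) {
        String text = "Website url is [https://www.google.com/]";
        Pattern p = Pattern.compile(
            "(?<![^\\p{Alnum}\\p{Punct}])(\\p{Punct}\\p{Alnum}+\\p{Punct})");
        int pos = 0;
        // Emulate the whitespace tokenizer, tracking each token's start offset.
        for (String token : text.split("\\s+")) {
            int tokenStart = text.indexOf(token, pos);
            Matcher m = p.matcher(token);
            while (m.find()) {
                // Absolute span = token offset + capture offset within the token,
                // e.g. "[https:" at [15,22) instead of the whole token's [15,40).
                int start = tokenStart + m.start(1);
                int end = tokenStart + m.end(1);
                System.out.println(m.group(1) + " [" + start + "," + end + ")");
            }
            pos = tokenStart + token.length();
        }
    }
}
```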



> issue with pattern capture group token filter
> ---------------------------------------------
>
>                 Key: LUCENE-10522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10522
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dishant Sharma
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
