[jira] [Updated] (LUCENE-10522) issue with pattern capture group token filter

Dishant Sharma (Jira) Mon, 18 Apr 2022 21:43:03 -0700


     [ 
https://issues.apache.org/jira/browse/LUCENE-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dishant Sharma updated LUCENE-10522:
------------------------------------
    Description: 
The default pattern capture token filter in Elasticsearch gives every generated 
token the same start and end offsets: those of the input string. Is there any 
way to change the start and end offsets of each generated token to the 
positions at which it is found in the input string? The issue I am currently 
facing is that, when highlighting, the entire string is highlighted instead of 
only the match.
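
To make the symptom concrete, here is a minimal sketch in Python. It assumes 
the highlighter wraps exactly the character range recorded for the matching 
token; the highlight() helper and the <em> tags are hypothetical illustrations, 
not Elasticsearch code, and the sample text is the one used later in this 
description.

```python
def highlight(text, start, end, pre="<em>", post="</em>"):
    """Wrap the character range [start, end) of text in highlight tags."""
    return text[:start] + pre + text[start:end] + post + text[end:]


text = "Website url is https://www.google.com/"

# Offsets inherited from the whole whitespace token: the full URL is marked.
whole_token = highlight(text, 15, 38)

# Offsets of the sub-token "google" itself: only the match is marked.
only_match = highlight(text, 27, 33)
```

With the inherited offsets the whole URL ends up between the tags, which is 
the behaviour reported here; with per-match offsets only "google" would.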

 

I am not using any mapping, but I am using the index settings below for my 
use case:

{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "special_analyzer" : {
               "tokenizer" : "whitespace",
               "filter" : [ "url-filter-1", "url-filter-2", "url-filter-3",
                            "url-filter-4", "url-filter-5", "url-filter-6",
                            "url-filter-7", "url-filter-8", "url-filter-9",
                            "url-filter-10", "url-filter-11", "unique" ]
            }
         }
      }
   }
}
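
The settings above reference filters named url-filter-1 to url-filter-11 
without showing their definitions. As a rough sketch only, one such filter 
could be declared as a pattern_capture filter as below, built as a Python 
dict and serialised to a JSON request body. The single pattern (\w+) and the 
shortened filter chain are assumptions for illustration; the actual regexes 
used in this issue are not shown anywhere in the report.

```python
import json

settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical definition; the real patterns are not in the report.
                "url-filter-1": {
                    "type": "pattern_capture",
                    "preserve_original": True,
                    "patterns": ["(\\w+)"],
                }
            },
            "analyzer": {
                "special_analyzer": {
                    "tokenizer": "whitespace",
                    # Shortened chain for illustration only.
                    "filter": ["url-filter-1", "unique"],
                }
            },
        }
    }
}

# JSON body that could be sent when creating the index.
body = json.dumps(settings, indent=2)
```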
 

I get all the tokens I want from the regexes I have created; the only issue is 
that every token has the same start and end offsets as the input string.
I am using the pattern capture token filter along with the whitespace 
tokenizer. Suppose I have the text:
"Website url is https://www.google.com/"
Then the desired tokens are:
Website, url, is, https://www.google.com/, https, www, google, com, https:, 
https:/, https://, /www, .google, .com, www., google., com/, www google.com, 
etc.
I get all of these tokens through my regexes; the only issue is with the 
offsets. Suppose the start and end offsets of the entire URL 
"https://www.google.com/" are 0 and 23; the filter then reports 0 and 23 for 
all the generated tokens.
For my use case I rely on highlighting to mark every generated token inside 
the text. However, instead of highlighting only the match inside the text, it 
highlights the entire input text.
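
The behaviour being asked for can be sketched in plain Python. This is not the 
Lucene or Elasticsearch implementation; it only illustrates offsets computed 
per capture group relative to the start of the enclosing whitespace token. The 
function names and the single pattern (\w+) are hypothetical.

```python
import re


def whitespace_tokens(text):
    """Yield (token, start, end) for each run of non-whitespace characters."""
    for m in re.finditer(r"\S+", text):
        yield m.group(), m.start(), m.end()


def capture_with_offsets(text, patterns):
    """Return each original token plus its capture-group matches,
    with offsets that point at the match inside the full text."""
    emitted = []
    for token, start, end in whitespace_tokens(text):
        candidates = [(token, start, end)]
        for pattern in patterns:
            for m in re.finditer(pattern, token):
                for g in range(1, m.re.groups + 1):
                    if m.group(g):
                        # Shift group offsets by the token's own start offset.
                        candidates.append(
                            (m.group(g), start + m.start(g), start + m.end(g))
                        )
        # Drop exact duplicates while keeping first-seen order.
        for c in candidates:
            if c not in emitted:
                emitted.append(c)
    return emitted


text = "Website url is https://www.google.com/"
tokens = capture_with_offsets(text, [r"(\w+)"])
```

Every returned triple satisfies text[start:end] == token, which is the 
property a highlighter would need in order to mark only the match.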

  was:
The default pattern capture token filter in Elasticsearch gives every generated 
token the same start and end offsets: those of the input string. Is there any 
way to change the start and end offsets of each generated token to the 
positions at which it is found in the input string? The issue I am currently 
facing is that, when highlighting, the entire string is highlighted instead of 
only the match.


I get all the tokens I want from the regexes I have created; the only issue is 
that every token has the same start and end offsets as the input string.
I am using the pattern capture token filter along with the whitespace 
tokenizer. Suppose I have the text:
"Website url is https://www.google.com/"
Then the desired tokens are:
Website, url, is, https://www.google.com/, https, www, google, com, https:, 
https:/, https://, /www, .google, .com, www., google., com/, www google.com, 
etc.
I get all of these tokens through my regexes; the only issue is with the 
offsets. Suppose the start and end offsets of the entire URL 
"https://www.google.com/" are 0 and 23; the filter then reports 0 and 23 for 
all the generated tokens.
For my use case I rely on highlighting to mark every generated token inside 
the text. However, instead of highlighting only the match inside the text, it 
highlights the entire input text.


> issue with pattern capture group token filter
> ---------------------------------------------
>
>                 Key: LUCENE-10522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10522
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dishant Sharma
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
