kotharironak opened a new issue, #10374:
URL: https://github.com/apache/pinot/issues/10374

   In the latest release, there is a way to use the text search index: 
https://docs.pinot.apache.org/basics/indexing/text-search-support#text-parsing-and-tokenization
   
   However, it currently provides only `Lucene's standard English text 
tokenizer`, plus configuration options for including/excluding stop words.
   
   There are domain-specific use cases where the standard tokenizer 
won't suffice. 
   For example:
   - for the text `abc.pqr.xyz`, we would like to split tokens on `.` in 
addition to the existing `space` and `tab`. Here, the expectation is to get 
three tokens: `abc`, `pqr`, `xyz`
   - for the text `GET /api/v1/customer`, we would like to split on `/`, and 
expect `GET`, `api`, `v1`, `customer`
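   To illustrate the expected behavior (not Pinot's or Lucene's actual 
tokenizer, just a plain-Java sketch of the desired splitting):
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.stream.Collectors;
   
   public class SplitSketch {
       // Split on '.', '/', space, or tab -- the delimiters requested above --
       // and drop any empty tokens produced by leading/trailing delimiters.
       static List<String> tokenize(String text) {
           return Arrays.stream(text.split("[./ \t]+"))
                        .filter(t -> !t.isEmpty())
                        .collect(Collectors.toList());
       }
   
       public static void main(String[] args) {
           System.out.println(tokenize("abc.pqr.xyz"));          // [abc, pqr, xyz]
           System.out.println(tokenize("GET /api/v1/customer")); // [GET, api, v1, customer]
       }
   }
   ```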
   
   However, there is currently no way to add extra split characters to the 
existing tokenizer, nor to plug in a different tokenizer.
   
   
   As part of this ticket:
   - Can we provide a way to extend the existing tokenizer (e.g. with 
additional split characters)?
   - Can we also provide a way to configure a different tokenizer, or to hook 
in a custom one?
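   To make the request concrete, a hypothetical `fieldConfigList` entry is 
sketched below; the property keys `luceneTokenizerSplitChars` and 
`luceneAnalyzerClass` (and the class name) are invented for illustration, not 
existing Pinot options:
   
   ```json
   {
     "fieldConfigList": [
       {
         "name": "url",
         "encodingType": "RAW",
         "indexType": "TEXT",
         "properties": {
           "luceneTokenizerSplitChars": "./",
           "luceneAnalyzerClass": "com.example.CustomAnalyzer"
         }
       }
     ]
   }
   ```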
   
   Some discussion: 
https://apache-pinot.slack.com/archives/CDRCA57FC/p1677766557802739


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

