kotharironak opened a new issue, #10374: URL: https://github.com/apache/pinot/issues/10374
In the latest release, there is a way to use the text search index: https://docs.pinot.apache.org/basics/indexing/text-search-support#text-parsing-and-tokenization

However, it currently provides only Lucene's standard English text tokenizer, plus configuration options for including/excluding stop words. There are domain-specific use cases where this standard tokenizer won't suffice. For example:

- For the text `abc.pqr.xyz`, we would like to split tokens on `.` in addition to the existing `space` and `tab`, and expect three tokens: `abc`, `pqr`, `xyz`.
- For the text `GET /api/v1/customer`, we would like to split on `/` and expect `GET`, `api`, `v1`, `customer`.

Currently there is no way to add extra split characters to the existing tokenizer, nor to use a different tokenizer. As part of this ticket:

- Can we provide a way of extending the existing tokenizer?
- Can we also consider providing a way to configure a different tokenizer, or to hook in a custom tokenizer?

Some discussion: https://apache-pinot.slack.com/archives/CDRCA57FC/p1677766557802739

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
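For reference, the requested tokenization behavior can be sketched in plain Java. This is only an illustration of the expected token output, not a Pinot or Lucene API; the class and method names below are hypothetical, and a real implementation would more likely subclass Lucene's `CharTokenizer` and override `isTokenChar` to treat configurable characters as delimiters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DelimiterTokenizerSketch {
    // Hypothetical helper: split on whitespace plus the extra
    // delimiters '.' and '/', dropping any empty tokens.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.split("[\\s./]+"))
                     .filter(t -> !t.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abc.pqr.xyz"));          // [abc, pqr, xyz]
        System.out.println(tokenize("GET /api/v1/customer")); // [GET, api, v1, customer]
    }
}
```

A configurable version would take the extra delimiter characters from table config rather than hard-coding them in the regex.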