dadoonet opened a new issue, #15196:
URL: https://github.com/apache/lucene/issues/15196

   ### Description
   
   As I reported at https://github.com/elastic/elasticsearch/issues/133989, I'd 
love to have a way to support multiple delimiters for the Path Hierarchy 
Tokenizer.
   
   Currently, it only supports a single character for the `delimiter` parameter 
(default is `/`). This makes it difficult to tokenize both Windows (`\\`) and 
Linux (`/`) paths in the same index. Supporting multiple delimiters 
(such as both `/` and `\\`) would greatly improve usability for systems dealing 
with cross-platform file paths.
   
   For example, a user may need to index file paths from both Windows and Linux 
environments and expects the analysis to work seamlessly regardless of path 
format. At the moment, the only workaround I found is to preprocess the data to 
normalize delimiters, which adds extra complexity.
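
   To illustrate that workaround, here is a minimal sketch of the kind of 
preprocessing I mean. `PathNormalizer` and `normalize` are hypothetical names 
used only for this example; they are not part of Lucene or Elasticsearch.

```java
public class PathNormalizer {

    // Hypothetical helper: rewrites every backslash as a forward slash so a
    // single-delimiter tokenizer configured with '/' can handle both styles.
    static String normalize(String path) {
        return path.replace('\\', '/');
    }

    public static void main(String[] args) {
        // The Java literal below is the Windows path C:\Users\david\file.txt
        System.out.println(normalize("C:\\Users\\david\\file.txt"));
        // prints C:/Users/david/file.txt
    }
}
```

   Every client that writes to the index has to remember to apply this step, 
and the original form of the path is lost unless it is stored separately.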
   
   **Feature Request**: Allow the `path_hierarchy` tokenizer to accept 
multiple delimiters (e.g., an array of delimiters) so both `/` and `\\` 
can be handled simultaneously.
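
   As a rough sketch of the output I have in mind, the plain-Java snippet below 
emits one prefix per path level, splitting on any character from a set of 
delimiters. It does not use the Lucene API; `MultiDelimiterPathSketch` and 
`prefixes` are hypothetical names, and keeping the original delimiter 
characters in the tokens is an assumption, not a settled design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MultiDelimiterPathSketch {

    // Hypothetical illustration only: returns each hierarchical prefix of
    // `path`, treating every character in `delimiters` as a separator.
    static List<String> prefixes(String path, Set<Character> delimiters) {
        List<String> tokens = new ArrayList<>();
        // Start at 1 so a leading delimiter stays attached to the first token.
        for (int i = 1; i < path.length(); i++) {
            if (delimiters.contains(path.charAt(i))) {
                tokens.add(path.substring(0, i));
            }
        }
        if (!path.isEmpty()) {
            tokens.add(path); // the full path is always the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<Character> delimiters = Set.of('/', '\\');
        System.out.println(prefixes("/usr/local/bin", delimiters));
        // prints [/usr, /usr/local, /usr/local/bin]
        System.out.println(prefixes("C:\\Users\\david", delimiters));
        // prints [C:, C:\Users, C:\Users\david]
    }
}
```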
   
   Another possible implementation would be to create a new 
`PathsHierarchyTokenizer` (note the `s`) which implements this behavior.
   
   Before working on such a PR, I'd like to get your views on this proposal. 
Maybe I'm just wrong to attempt this.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

