rmuir commented on pull request #740:
URL: https://github.com/apache/lucene/pull/740#issuecomment-1078547862


   Also, sorry about the review slowness. I didn't want to just click "approve" 
without taking another pass through the comments and code. Again, I like the 
way the concerns were split apart, and the explanation you gave about +/- LOC 
from GitHub is exactly how I feel, too.
   
   The overall algorithm is the same for nori and kuromoji, so it is a 
shame that we have duplicated implementation code (the holy grail will be 
factoring out the actual tokenization logic!). At the same time, different 
languages have their quirks and need different encoding/compression to be 
efficient. Different dictionaries might have quirks, too. It would be great to 
give the user all reasonable options compatible with the Apache 2 license, 
without forking thousands of lines of Tokenizer code each time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


