rmuir commented on pull request #740: URL: https://github.com/apache/lucene/pull/740#issuecomment-1078547862
Also, sorry about the review slowness. I didn't want to just click "approve" without taking another pass through the comments and code. Again, I like the way the concerns were split apart, and the explanation you gave about +/- LOC from GitHub is exactly how I feel, too. The overall algorithm is the same for nori and kuromoji, so it is a shame that we have duplicated implementation code (the holy grail will be factoring out the actual tokenization logic!). At the same time, different languages have their quirks and need different encoding/compression to be efficient. Different dictionaries might have quirks, too. It would be great to give users all reasonable options compatible with the Apache 2 license, without forking thousands of lines of Tokenizer code each time.
