GitHub user sebas00 opened a pull request:
https://github.com/apache/incubator-predictionio-template-text-classifier/pull/8
Changed tokenizer to use Apache Lucene StandardAnalyzer
The standard tokenizer uses the Unicode Text Segmentation algorithm (as
defined in Unicode Standard Annex #29) to find the boundaries between words,
and emits the text between those boundaries as tokens. Because it follows
Unicode rules, it can tokenize text that mixes several languages.
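
To illustrate boundary-based tokenization, here is a minimal sketch that
uses only the JDK's java.text.BreakIterator as a stand-in. It is not the
Lucene StandardAnalyzer code from this pull request, and its word-break
rules may differ from Lucene's in detail. The class name WordBoundaryDemo,
the tokenize helper, and the choice to drop pieces that contain no letter
or digit are assumptions made for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class: splits text at word boundaries with the JDK's
// BreakIterator (a stand-in for Lucene's StandardAnalyzer, not the same code).
public class WordBoundaryDemo {

    // Returns the pieces between consecutive word boundaries, keeping only
    // pieces that contain at least one letter or digit. Dropping whitespace
    // and punctuation this way is a simplification chosen for this sketch.
    static List<String> tokenize(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // English followed by Russian ("Privet, mir!"), written with Unicode
        // escapes so the source file stays ASCII-only.
        String mixed = "Hello, world! \u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440!";
        System.out.println(tokenize(mixed));
    }
}
```

Splitting on boundaries rather than on a fixed delimiter such as a space is
what lets the same code handle scripts with different punctuation and
spacing conventions.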
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sebas00/Text-classifier-Unicode master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-predictionio-template-text-classifier/pull/8.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8
----
commit 2bcbdae63326873d996da9b4e1aa9afd952ecd67
Author: Sebastiaan de Man <[email protected]>
Date: 2016-10-30T20:31:48Z
Changed tokenizer to use Apache Lucene StandardAnalyzer for non-western
languages
----