GitHub user sebas00 opened a pull request:

    
    https://github.com/apache/incubator-predictionio-template-text-classifier/pull/8

    Changed tokenizer to use Apache Lucene StandardAnalyzer

    The standard tokenizer uses the Unicode Text Segmentation algorithm (as
    defined in Unicode Standard Annex #29) to find the boundaries between
    words, and emits everything in between. Its knowledge of Unicode allows it
    to successfully tokenize text containing a mixture of languages.
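
    To illustrate the boundary-based approach described above, here is a
    minimal sketch that uses only the JDK. It is not the code from this pull
    request and does not call Lucene: java.text.BreakIterator is swapped in as
    a stand-in whose word rules only approximate Unicode Standard Annex #29.
    The class name WordSegmentationSketch and the method tokenize are
    hypothetical names chosen for the example. Lowercasing is included because
    Lucene's StandardAnalyzer also lowercases its tokens.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of the template or of Lucene.
final class WordSegmentationSketch {

    // Splits text at the word boundaries reported by the JDK's BreakIterator
    // and keeps only the segments that contain at least one letter or digit,
    // so whitespace and punctuation segments are dropped.
    static List<String> tokenize(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                // Lowercase each kept segment, as an analyzer typically would.
                tokens.add(segment.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Mixed Latin and Cyrillic input, tokenized without per-language rules.
        System.out.println(tokenize("Hello, world! Привет, мир."));
    }
}
```

    Because the segments come from boundary rules and not from splitting on
    ASCII whitespace or punctuation, the same method handles several scripts
    in one string. Results for scripts that need dictionary-based
    segmentation may differ from what Lucene produces.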

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sebas00/Text-classifier-Unicode master

Alternatively you can review and apply these changes as the patch at:

    
    https://github.com/apache/incubator-predictionio-template-text-classifier/pull/8.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8
    
----
commit 2bcbdae63326873d996da9b4e1aa9afd952ecd67
Author: Sebastiaan de Man <[email protected]>
Date:   2016-10-30T20:31:48Z

    Changed tokenizer to use Apache Lucene StandardAnalyzer for non-Western
    languages

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
