Re: WikipediaTokenizer for Removing Unnecesary Parts

Jack Krupansky Tue, 23 Jul 2013 08:39:35 -0700

Are you actually seeing that output from the WikipediaTokenizerFactory? Really? Even if you use the Solr Admin UI analysis page?


You should just see the text tokens plus the URLs for links.
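
For reference, a minimal sketch of a schema.xml field type that uses this tokenizer. The field type name "text_wiki" and the lowercase filter are illustrative assumptions, not taken from your configuration:

```xml
<!-- Hypothetical field type; "text_wiki" is an illustrative name. -->
<fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokenizes MediaWiki markup into text tokens (with token types for links, etc.) -->
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <!-- Optional: normalize case for matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Pasting a sample of the wiki text into the analysis page with this field type selected should show which tokens actually come out.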


-- Jack Krupansky

-----Original Message-----
From: Furkan KAMACI
Sent: Tuesday, July 23, 2013 10:53 AM
To: solr-user@lucene.apache.org
Subject: WikipediaTokenizer for Removing Unnecesary Parts

Hi;

I have indexed Wikipedia data with the Solr DIH. However, when I look at the data that
is indexed in Solr, I see something like this as well:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Ubuntu]]
:*[[Fedora]]
:*[[Mandriva]]
:*[[Linux Mint]]
:*[[Debian]]
:*[[OpenSUSE]]
|
*[[Red Hat]]
*[[Mageia]]
*[[Arch Linux]]
*[[PCLinuxOS]]
*[[Slackware]]
|}

However, I want to remove them before indexing. I know that there is a
WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts (such
as links, styles, etc.) with Solr?
