Are you actually seeing that output from the WikipediaTokenizerFactory?
Really? Even if you use the Solr Admin UI analysis page?
You should just see the text tokens plus the URLs for links.
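
For reference, here is a minimal sketch of a schema.xml field type that
uses the tokenizer. The field type name "text_wiki" is only a placeholder,
and the lowercase filter is my assumption, not something taken from your
setup. Keep in mind that analysis only changes the indexed terms; the
stored value returned in search results is still the raw text, so the
markup will remain visible there even when the tokenizer is working.

```xml
<!-- Sketch only: "text_wiki" is a hypothetical name, adjust to your schema -->
<fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokenizes MediaWiki syntax instead of treating markup as plain text -->
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <!-- Assumed extra step: lowercase the resulting tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Pasting a sample of the wiki text into the analysis page with this field
type selected should show which tokens actually reach the index.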
-- Jack Krupansky
-----Original Message-----
From: Furkan KAMACI
Sent: Tuesday, July 23, 2013 10:53 AM
To: solr-user@lucene.apache.org
Subject: WikipediaTokenizer for Removing Unnecessary Parts
Hi;
I have indexed Wikipedia data with the Solr DIH. However, when I look at
the data indexed in Solr, I see something like this as well:
{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Ubuntu]]
:*[[Fedora]]
:*[[Mandriva]]
:*[[Linux Mint]]
:*[[Debian]]
:*[[OpenSUSE]]
|
*[[Red Hat]]
*[[Mageia]]
*[[Arch Linux]]
*[[PCLinuxOS]]
*[[Slackware]]
|}
However, I want to remove this markup before indexing. I know that there
is a WikipediaTokenizer in Lucene, but how can I remove the unnecessary
parts (such as links, styles, etc.) with Solr?