If you use WikipediaTokenizer, it will tag different wiki elements with different token types (you can see them in the admin UI).
Then follow up with TypeTokenFilter to filter on just the types you care about, and I think it will do what you want.

On Tue, Jul 23, 2013 at 7:53 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:

> Hi;
>
> I have indexed wikipedia data with Solr DIH. However when I look data that
> is indexed at Solr I something like that as well:
>
> {| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
> |- valign="top"
> | style="width: 50%"|
> :*[[Ubuntu]]
> :*[[Fedora]]
> :*[[Mandriva]]
> :*[[Linux Mint]]
> :*[[Debian]]
> :*[[OpenSUSE]]
> |
> *[[Red Hat]]
> *[[Mageia]]
> *[[Arch Linux]]
> *[[PCLinuxOS]]
> *[[Slackware]]
> |}
>
> However I want to remove them before indexing. I know that there is a
> WikipediaTokenizer in Lucene but how can I remove unnecessary parts ( as
> like links, style, etc..) with Solr?
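
For reference, the tokenizer-plus-filter suggestion at the top of this reply could look roughly like the schema.xml fragment below. This is only a sketch: the field type name "text_wiki" and the file name "wiki_stoptypes.txt" are made up for illustration, and I have not checked how the tokenizer types the table markup in your sample, so look at the analysis output in the admin UI before settling on a type list.

```xml
<!-- Hypothetical field type; "text_wiki" and "wiki_stoptypes.txt" are illustrative names. -->
<fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits wiki markup into tokens and assigns each token a type. -->
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <!-- Drops every token whose type is listed in the types file.
         Set useWhitelist="true" to keep only the listed types instead. -->
    <filter class="solr.TypeTokenFilterFactory" types="wiki_stoptypes.txt" useWhitelist="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The types file sits in the core's conf directory and lists one token type per line. As far as I recall, WikipediaTokenizer uses short names such as "il" (internal link), "elu" (external link URL) and "ci" (citation), but treat those as something to confirm in the admin UI rather than take from this mail.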