Here is my fieldtype:

<fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_tr.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_tr.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_tr.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

My input for indexing, in the Analysis section of the Solr admin page:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Ubuntu]]
:*[[Fedora]]
:*[[Mandriva]]
:*[[Linux Mint]]
:*[[Debian]]
:*[[OpenSUSE]]
|
*[[Red Hat]]
*[[Mageia]]
*[[Arch Linux]]
*[[PCLinuxOS]]
*[[Slackware]]
|}

and the output:

WT   style text align left width 50 table layout fixed border 0 valign top style width 50 Ubuntu Fedora Mandriva Linux Mint Debian OpenSUSE Red Hat Mageia Arch Linux PCLinuxOS Slackware
SF   style text align left width 50 table layout fixed border 0 valign top style width 50 Ubuntu Fedora Mandriva Linux Mint Debian OpenSUSE Red Hat Mageia Arch Linux PCLinuxOS Slackware
LCF  style text align left width 50 table layout fixed border 0 valign top style width 50 ubuntu fedora mandriva linux mint debian opensuse red hat mageia arch linux pclinuxos slackware

Any ideas?

2013/7/23 Jack Krupansky <j...@basetechnology.com>

> Are you actually seeing that output from the WikipediaTokenizerFactory??
> Really? Even if you use the Solr Admin UI analysis page?
>
> You should just see the text tokens plus the URLs for links.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Furkan KAMACI
> Sent: Tuesday, July 23, 2013 10:53 AM
> To: solr-user@lucene.apache.org
> Subject: WikipediaTokenizer for Removing Unnecesary Parts
>
>
> Hi;
>
> I have indexed Wikipedia data with Solr DIH. However, when I look at the
> data that is indexed in Solr, I see something like this as well:
>
> {| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
> |- valign="top"
> | style="width: 50%"|
> :*[[Ubuntu]]
> :*[[Fedora]]
> :*[[Mandriva]]
> :*[[Linux Mint]]
> :*[[Debian]]
> :*[[OpenSUSE]]
> |
> *[[Red Hat]]
> *[[Mageia]]
> *[[Arch Linux]]
> *[[PCLinuxOS]]
> *[[Slackware]]
> |}
>
> However, I want to remove them before indexing. I know that there is a
> WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts
> (such as links, style, etc.) with Solr?
>
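
One possible workaround, offered as an assumption and not something confirmed in this thread: the analysis output shows the table attributes (style, border, valign) passing through the tokenizer as ordinary tokens, so a char filter placed in front of the tokenizer could strip name="value" pairs before tokenization. solr.PatternReplaceCharFilterFactory is a stock Solr factory; the regular expression below is only illustrative, it will also remove attribute-like text elsewhere in an article, and it should be checked on the Analysis page before use.

```xml
<analyzer type="index">
  <!-- Assumption: remove name="value" pairs such as style="..." or border="0"
       before the tokenizer sees them. The pattern is a sketch, not a tested rule. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern='[\w-]+\s*=\s*"[^"]*"'
              replacement=" "/>
  <tokenizer class="solr.WikipediaTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_tr.txt" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Analysis only changes the indexed terms. The stored value of the field, which is what search results return, still contains the original markup.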