I've compared the results when using WikipediaTokenizer in the index-time analyzer, but there is no difference.

2014-02-23 3:44 GMT+02:00 Ahmet Arslan <iori...@yahoo.com>:

> Hi Furkan,
>
> There is org.apache.lucene.analysis.wikipedia.WikipediaTokenizer
>
> Ahmet
>
>
> On Sunday, February 23, 2014 2:22 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> Hi;
>
> I want to run an NLP algorithm on Wikipedia data. I used the dataimport
> handler to load the dump and everything is OK. However, there are some
> texts like this:
>
> == Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur fakat taşımalı
> eğitimden yararlanılmaktadır.
>
> I think it should look like this:
>
> Altyapı bilgileri Köyde, ilköğretim okulu yoktur fakat taşımalı eğitimden
> yararlanılmaktadır.
>
> On the other hand, this should be removed:
>
> {| border="0" cellpadding="5" cellspacing="5" |- bgcolor="#aaaaaa"
> |'''Seçim Yılı''' |'''Muhtar''' |- bgcolor="#dddddd" |[[2009]] |kazım
> güngör |- bgcolor="#dddddd" | |Ömer Gungor |- bgcolor="#dddddd" | |Fazlı
> Uzun |- bgcolor="#dddddd" | |Cemal Özden |- bgcolor="#dddddd" | | |}
>
> Also, including titles such as == Altyapı bilgileri == should be optional
> (I think they can be removed for some purposes).
>
> My question is: is there any analyzer combination to clean up Wikipedia
> data for Solr?
>
> Thanks;
> Furkan KAMACI
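
To make the cleanup described in the quoted message concrete, here is a minimal sketch of it as a preprocessing step run on the text before indexing. It is plain Python regex code, not a Lucene or Solr analyzer; the function name `clean_wikitext` and the `keep_headings` flag are hypothetical names chosen for illustration. It assumes tables are not nested and only covers the markup shown in the examples (tables, `[[links]]`, `== headings ==`, and `'''bold'''` quotes).

```python
import re

# Assumption: tables are not nested, so a lazy match from "{|" to "|}" is enough.
TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
# [[target]] or [[target|label]]; keeps the label if present, else the target.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# == Title ==, === Title ===, ...; the closing run must match the opening run.
HEADING = re.compile(r"(={2,})\s*(.*?)\s*\1")
# ''italic'' and '''bold''' quote runs.
QUOTES = re.compile(r"'{2,}")


def clean_wikitext(text, keep_headings=True):
    """Strip a small subset of wiki markup (hypothetical helper, not a Solr API)."""
    text = TABLE.sub(" ", text)                      # drop whole tables
    text = LINK.sub(r"\1", text)                     # unwrap links
    text = HEADING.sub(r"\2" if keep_headings else " ", text)  # headings optional
    text = QUOTES.sub("", text)                      # drop bold/italic markers
    return re.sub(r"\s+", " ", text).strip()         # normalize whitespace


sample = ("== Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur "
          "fakat taşımalı eğitimden yararlanılmaktadır.")
print(clean_wikitext(sample))
print(clean_wikitext(sample, keep_headings=False))
```

The table pattern runs first so that links and `=` characters inside a table are discarded together with it. The same patterns could probably be ported to char filters inside a Solr analyzer chain, but that is an assumption I have not tested.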