Re: Wikipedia Data Cleaning at Solr

Furkan KAMACI Sun, 23 Feb 2014 14:35:23 -0800

I compared the results after using WikipediaTokenizer in the index-time
analyzer, but I see no difference.



2014-02-23 3:44 GMT+02:00 Ahmet Arslan <iori...@yahoo.com>:

> Hi Furkan,
>
> There is org.apache.lucene.analysis.wikipedia.WikipediaTokenizer
>
> Ahmet
>
>
> On Sunday, February 23, 2014 2:22 AM, Furkan KAMACI <
> furkankam...@gmail.com> wrote:
> Hi;
>
> I want to run an NLP algorithm on Wikipedia data. I used the
> DataImportHandler to import the dump and everything is OK. However, some of
> the text looks like this:
>
> == Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur fakat taşımalı
> eğitimden yararlanılmaktadır.
>
> I think it should look like this:
>
> Altyapı bilgileri Köyde, ilköğretim okulu yoktur fakat taşımalı eğitimden
> yararlanılmaktadır.
>
> On the other hand, this should be removed:
>
> {| border="0" cellpadding="5" cellspacing="5" |- bgcolor="#aaaaaa"
> |'''Seçim Yılı''' |'''Muhtar''' |- bgcolor="#dddddd" |[[2009]] |kazım
> güngör |- bgcolor="#dddddd" | |Ömer Gungor |- bgcolor="#dddddd" | |Fazlı
> Uzun |- bgcolor="#dddddd" | |Cemal Özden |- bgcolor="#dddddd" | | |}
>
> Also, including section titles such as == Altyapı bilgileri == should be
> optional (I think they can be removed for some purposes).
>
> My question is: is there any analyzer combination that cleans up
> Wikipedia data for Solr?
>
> Thanks;
> Furkan KAMACI
>
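
For reference, the cleanup described in the quoted message (drop tables,
unwrap [[links]], optionally drop == headings ==) could be sketched as a
preprocessing step that runs before the text is sent to Solr. This is only a
rough regex-based sketch in Python, not a Solr analyzer and not what
WikipediaTokenizer does internally. The function name clean_wikitext and the
keep_headings flag are made up for illustration, and it assumes tables are
not nested and each heading sits on a single line.

```python
import re

# Wiki table from "{|" to the nearest "|}" (assumption: tables are not nested).
TABLE_RE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
# Heading such as "== Title ==": the same run of "=" on both sides.
HEADING_RE = re.compile(r"(={2,6})\s*(.+?)\s*\1")
# Internal link: [[target]] or [[target|label]].
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
# Bold/italic markers: two or more consecutive apostrophes.
QUOTES_RE = re.compile(r"'{2,}")


def clean_wikitext(text, keep_headings=True):
    """Hypothetical helper: strip a small subset of wiki markup from text."""
    # Remove whole tables first so their cells never reach the later steps.
    text = TABLE_RE.sub(" ", text)
    # Keep only the heading title, or drop the heading entirely.
    text = HEADING_RE.sub(
        lambda m: m.group(2) if keep_headings else " ", text
    )
    # Replace a link with its label if present, otherwise with its target.
    text = LINK_RE.sub(lambda m: m.group(2) or m.group(1), text)
    # Drop bold/italic markers.
    text = QUOTES_RE.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = (
        "== Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur "
        "fakat taşımalı eğitimden yararlanılmaktadır."
    )
    print(clean_wikitext(sample))
    print(clean_wikitext(sample, keep_headings=False))
```

Templates ({{...}}), references, and nested tables are not handled here, so a
dedicated wikitext parser would likely be a better fit for a full dump.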
