Re: WikipediaTokenizer for Removing Unnecessary Parts

Furkan KAMACI Tue, 23 Jul 2013 09:04:04 -0700

Here is my fieldtype:

    <fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.WikipediaTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords_tr.txt" enablePositionIncrements="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WikipediaTokenizerFactory"/>
            <filter class="solr.SynonymFilterFactory"
                    synonyms="synonyms_tr.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords_tr.txt" enablePositionIncrements="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>



My input on the index side of the Analysis page in the Solr admin UI:


{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Ubuntu]]
:*[[Fedora]]
:*[[Mandriva]]
:*[[Linux Mint]]
:*[[Debian]]
:*[[OpenSUSE]]
|
*[[Red Hat]]
*[[Mageia]]
*[[Arch Linux]]
*[[PCLinuxOS]]
*[[Slackware]]
|}

and the output:


WT    style text align left width 50 table layout fixed border 0 valign top
      style width 50 Ubuntu Fedora Mandriva Linux Mint Debian OpenSUSE
      Red Hat Mageia Arch Linux PCLinuxOS Slackware

SF    style text align left width 50 table layout fixed border 0 valign top
      style width 50 Ubuntu Fedora Mandriva Linux Mint Debian OpenSUSE
      Red Hat Mageia Arch Linux PCLinuxOS Slackware

LCF   style text align left width 50 table layout fixed border 0 valign top
      style width 50 ubuntu fedora mandriva linux mint debian opensuse
      red hat mageia arch linux pclinuxos slackware
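
To make the problem concrete, here is a rough Python sketch. It is not how WikipediaTokenizer works internally; `naive_tokens` and `strip_table_markup` are hypothetical helpers written only for this sample. The first reproduces the token stream above by splitting on non-alphanumeric characters. The second shows one possible preprocessing step, run outside Solr before the text is sent for indexing, that drops the table attribute lines so that only the link text survives.

```python
import re

# The sample wikitext from this message.
SAMPLE = """\
{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Ubuntu]]
:*[[Fedora]]
:*[[Mandriva]]
:*[[Linux Mint]]
:*[[Debian]]
:*[[OpenSUSE]]
|
*[[Red Hat]]
*[[Mageia]]
*[[Arch Linux]]
*[[PCLinuxOS]]
*[[Slackware]]
|}
"""


def naive_tokens(text):
    """Return every run of ASCII letters/digits (a crude stand-in for the WT row)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def strip_table_markup(text):
    """Drop wiki table delimiters and cell attributes, keep cell content.

    Assumption: one cell per line, as in SAMPLE. Inline `||` cell separators
    and piped links such as [[target|label]] are NOT handled.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Table open, table close, and row separators carry only attributes.
        if stripped.startswith(("{|", "|}", "|-")):
            continue
        if stripped.startswith("|"):
            cell = stripped[1:]
            # `| attrs | content`: keep only what follows the second pipe.
            if "|" in cell:
                cell = cell.partition("|")[2]
            stripped = cell.strip()
        if stripped:
            kept.append(stripped)
    return "\n".join(kept)


if __name__ == "__main__":
    print(naive_tokens(SAMPLE))                      # includes style, width, 50, ...
    print(naive_tokens(strip_table_markup(SAMPLE)))  # only the distribution names
```

On this sample the unfiltered tokens begin with the same `style text align left ...` run shown in the WT row, while the filtered text yields only the distribution names. A real Wikipedia dump contains far more markup variants than this sketch covers.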



Any ideas?



2013/7/23 Jack Krupansky <j...@basetechnology.com>

> Are you actually seeing that output from the WikipediaTokenizerFactory??
> Really? Even if you use the Solr Admin UI analysis page?
>
> You should just see the text tokens plus the URLs for links.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Furkan KAMACI
> Sent: Tuesday, July 23, 2013 10:53 AM
> To: solr-user@lucene.apache.org
> Subject: WikipediaTokenizer for Removing Unnecessary Parts
>
>
> Hi;
>
> I have indexed Wikipedia data with the Solr DIH. However, when I look at
> the data indexed in Solr, I also see things like this:
>
> {| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
> |- valign="top"
> | style="width: 50%"|
> :*[[Ubuntu]]
> :*[[Fedora]]
> :*[[Mandriva]]
> :*[[Linux Mint]]
> :*[[Debian]]
> :*[[OpenSUSE]]
> |
> *[[Red Hat]]
> *[[Mageia]]
> *[[Arch Linux]]
> *[[PCLinuxOS]]
> *[[Slackware]]
> |}
>
> I want to remove this markup before indexing. I know that there is a
> WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts
> (such as links, styles, etc.) with Solr?
>
