Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the indexed text body. It starts like this:

The JSON entry output of a search result shows the indexed text starting with... body_txt_en: " stream_size 36499 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By...."

Then, once it gets to the actual text, CSS class names that were on <p>, <div> and similar tags appear inline, e.g. "....the power of calibre3 silence calibre2 and....", where "calibre3" and "calibre2" are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in searching for.

Steps to reproduce:

1) Solr installed by untarring the Solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document indexed using the following command:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"

5) HTML document indexed using the following command:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"

6) Query using URL: http://localhost:8983/solr/mycore/select?q=especially&wt=json
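
For anyone who wants to repeat steps 4 to 6, they could be wrapped in a small script along these lines. This is only a sketch: SOLR_BASE and the function names extract_url, index_file and query_url are made up for illustration and are not part of Solr. The fl parameter is an addition to the query in step 6, so that only the id and the body field come back.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; none of these names come from Solr.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Build the /update/extract URL with the same parameters as steps 4 and 5.
extract_url() {
    core="$1"; id="$2"
    printf '%s/%s/update/extract?literal.id=%s&uprefix=attr_&fmap.content=body_txt_en&commit=true' \
        "$SOLR_BASE" "$core" "$id"
}

# Post one file to the extract handler as a multipart form upload.
index_file() {
    core="$1"; id="$2"; path="$3"
    curl "$(extract_url "$core" "$id")" -F "content/$(basename "$path")=@$path"
}

# Build the select URL from step 6; fl (an addition here) limits the returned fields.
query_url() {
    core="$1"; term="$2"
    printf '%s/%s/select?q=%s&fl=id,body_txt_en&wt=json' "$SOLR_BASE" "$core" "$term"
}
```

Usage would look like: index_file mycore doc1 /home/user/Documents/library/UsingMailingLists.txt, followed by curl "$(query_url mycore especially)".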

Result:

For the txt file, I get the following JSON for the document...

{
    id: "doc1",
    attr_stream_size: [
        "8107"
    ],
    attr_x_parsed_by: [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.txt.TXTParser"
    ],
    attr_stream_content_type: [
        "text/plain"
    ],
    attr_stream_name: [
        "UsingMailingLists.txt"
    ],
    attr_stream_source_info: [
        "content/UsingMailingLists.txt"
    ],
    attr_content_encoding: [
        "ISO-8859-1"
    ],
    attr_content_type: [
        "text/plain; charset=ISO-8859-1"
    ],
body_txt_en: " stream_size 8107 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.txt.TXTParser stream_content_type text/plain stream_name UsingMailingLists.txt stream_source_info content/UsingMailingLists.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 Search: [value ] [Titles] [Text] Solr_Wiki Login ****** UsingMailingLists ****** * FrontPage * RecentChanges...etc",
_version_: 1535398235801124900
}

For the HTML file, I get the following JSON for the document...

{
    id: "doc2",
    attr_stream_size: [
        "20440"
    ],
    attr_x_parsed_by: [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.html.HtmlParser"
    ],
    attr_stream_content_type: [
        "text/html"
    ],
    attr_stream_name: [
        "UsingMailingLists.html"
    ],
    attr_stream_source_info: [
        "content/UsingMailingLists.html"
    ],
    attr_dc_title: [
        "UsingMailingLists - Solr Wiki"
    ],
    attr_content_encoding: [
        "UTF-8"
    ],
    attr_robots: [
        "index,nofollow"
    ],
    attr_title: [
        "UsingMailingLists - Solr Wiki"
    ],
    attr_content_type: [
        "text/html; charset=utf-8"
    ],
body_txt_en: " stylesheet text/css utf-8 all /wiki/modernized/css/common.css stylesheet text/css utf-8 screen /wiki/modernized/css/screen.css stylesheet text/css utf-8 print /wiki/modernized/css/print.css stylesheet text/css utf-8 projection /wiki/modernized/css/projection.css alternate Solr Wiki: UsingMailingLists /solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1 application/rss+xml Start /solr/FrontPage Alternate Wiki Markup /solr/UsingMailingLists?action=raw Alternate print Print View /solr/UsingMailingLists?action=print Search /solr/FindPage Index /solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html stream_name UsingMailingLists.html stream_source_info...etc",
    _version_: 1535398408383103000
}
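
A quick way to tell whether a document's body is still polluted, for example after trying a configuration change, might be a check like the one below. It is only a sketch: has_tika_junk is a made-up helper name, and it simply looks for the literal "X-Parsed-By" key that shows up in the body_txt_en values above.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin and succeeds (exit 0) if the
# Tika metadata key "X-Parsed-By" appears in it, as in the body text above.
has_tika_junk() {
    grep -q 'X-Parsed-By'
}
```

It could be fed from a query that returns only the body field, e.g. curl -s "http://localhost:8983/solr/mycore/select?q=id:doc1&fl=body_txt_en&wt=json" | has_tika_junk && echo "doc1 body still contains metadata".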


