Metadata and HTML ending up in searchable text

Simon Blandford Thu, 26 May 2016 06:49:47 -0700

Hi,

I am using Solr 6.0 on Ubuntu 14.04.


I am ending up with loads of junk in the indexed text body.

The JSON output of a search result shows the indexed text starting with... body_txt_en: " stream_size 36499 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By...."

Then, once it gets to the actual text, CSS class names that were in <p> or <div> tags etc. appear, e.g. "....the power of calibre3 silence calibre2 and....", where "calibre3" etc. are the CSS class names.


All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in searching for.


Steps to reproduce:

1) Solr installed by untarring the Solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document indexed using the following command

curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"


5) HTML document indexed using the following command

curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"

6) Query using URL: http://localhost:8983/solr/mycore/select?q=especially&wt=json
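
The two indexing commands in steps 4 and 5 differ only in the document id and the file, so they can be sketched as one small shell helper. This is only a convenience wrapper around the same curl call; the names extract_url, index_doc and SOLR_BASE are made up for illustration and are not part of Solr. The request parameters are exactly the ones used above.

```shell
#!/bin/sh
# Hypothetical helpers; the names and the default base URL are illustrative only.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Build the /update/extract URL used in steps 4 and 5.
# $1 = core name, $2 = value for literal.id
extract_url() {
    printf '%s/%s/update/extract?literal.id=%s&uprefix=attr_&fmap.content=body_txt_en&commit=true\n' \
        "$SOLR_BASE" "$1" "$2"
}

# Post one file to the extract handler, as in steps 4 and 5.
# $1 = core name, $2 = document id, $3 = path to the file
index_doc() {
    curl "$(extract_url "$1" "$2")" -F "content/$(basename "$3")=@$3"
}

# Usage (needs a running Solr, so left commented out):
# index_doc mycore doc1 /home/user/Documents/library/UsingMailingLists.txt
# index_doc mycore doc2 /home/user/Documents/library/UsingMailingLists.html
```

Keeping the URL construction in its own function makes it easy to print and compare the exact request being sent for each document.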


Result:

For the txt file, I get the following JSON for the document...

{
    id: "doc1",
    attr_stream_size: [
        "8107"
    ],
    attr_x_parsed_by: [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.txt.TXTParser"
    ],
    attr_stream_content_type: [
        "text/plain"
    ],
    attr_stream_name: [
        "UsingMailingLists.txt"
    ],
    attr_stream_source_info: [
        "content/UsingMailingLists.txt"
    ],
    attr_content_encoding: [
        "ISO-8859-1"
    ],
    attr_content_type: [
        "text/plain; charset=ISO-8859-1"
    ],
    body_txt_en: " stream_size 8107 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.txt.TXTParser stream_content_type text/plain stream_name UsingMailingLists.txt stream_source_info content/UsingMailingLists.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 Search: [value ] [Titles] [Text] Solr_Wiki Login ****** UsingMailingLists ****** * FrontPage * RecentChanges...etc",
    _version_: 1535398235801124900
}

For the HTML file, I get the following JSON for the document...

{
    id: "doc2",
    attr_stream_size: [
        "20440"
    ],
    attr_x_parsed_by: [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.html.HtmlParser"
    ],
    attr_stream_content_type: [
        "text/html"
    ],
    attr_stream_name: [
        "UsingMailingLists.html"
    ],
    attr_stream_source_info: [
        "content/UsingMailingLists.html"
    ],
    attr_dc_title: [
        "UsingMailingLists - Solr Wiki"
    ],
    attr_content_encoding: [
        "UTF-8"
    ],
    attr_robots: [
        "index,nofollow"
    ],
    attr_title: [
        "UsingMailingLists - Solr Wiki"
    ],
    attr_content_type: [
        "text/html; charset=utf-8"
    ],
    body_txt_en: " stylesheet text/css utf-8 all/wiki/modernized/css/common.css stylesheet text/css utf-8 screen/wiki/modernized/css/screen.css stylesheet text/css utf-8 print/wiki/modernized/css/print.css stylesheet text/css utf-8 projection/wiki/modernized/css/projection.css alternate Solr Wiki:UsingMailingLists/solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1 application/rss+xml Start /solr/FrontPage Alternate Wiki Markup /solr/UsingMailingLists?action=raw Alternate print Print View /solr/UsingMailingLists?action=print Search /solr/FindPage Index /solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html stream_name UsingMailingLists.html stream_source_info...etc",

    _version_: 1535398408383103000
}
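
To check quickly whether a given document still shows the problem, the response can be piped through a small grep test for the metadata names that appear in the results above. This is only a rough sketch: has_leaked_metadata is a made-up helper name, and the list of names is just the ones visible in this post, so it would miss other leaked fields and the CSS class names.

```shell
#!/bin/sh
# Hypothetical helper, not part of Solr: succeeds (exit status 0) if the text
# on stdin contains any of the metadata names seen in the results above.
has_leaked_metadata() {
    grep -q -E 'stream_size|X-Parsed-By|stream_content_type|stream_source_info'
}

# Usage (needs a running Solr, so left commented out):
# curl -s "http://localhost:8983/solr/mycore/select?q=id:doc1&fl=body_txt_en&wt=json" \
#     | has_leaked_metadata && echo "metadata still present in body_txt_en"
```

Restricting the query to fl=body_txt_en matters here, because the attr_ fields legitimately contain the same values and would otherwise always match.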
