Hi,
I am using Solr 6.0 on Ubuntu 14.04.
I am ending up with loads of junk in the indexed text body. The JSON
output of a search result shows the indexed text starting with Tika
metadata rather than the document content:
body_txt_en: " stream_size 36499 X-Parsed-By
org.apache.tika.parser.DefaultParser X-Parsed-By...."
Then, once it reaches the actual text, CSS class names that were on the
<p>, <div> and similar tags appear in the body,
e.g. "....the power of calibre3 silence calibre2 and....", where
"calibre3" etc are the CSS class names.
All this junk is searchable and is polluting the index.
I would like to index _only_ the actual content I am interested in
searching for.
Steps to reproduce:
1) Solr installed by untarring the Solr tgz in /opt.
2) Core created by typing "bin/solr create -c mycore"
3) Solr started with bin/solr start
4) TXT document indexed using the following command:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
  -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"
5) HTML document indexed using the following command:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
  -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"
6) Query using URL:
http://localhost:8983/solr/mycore/select?q=especially&wt=json
Result:
For the txt file, I get the following JSON for the document...
{
id: "doc1",
attr_stream_size: [
"8107"
],
attr_x_parsed_by: [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.txt.TXTParser"
],
attr_stream_content_type: [
"text/plain"
],
attr_stream_name: [
"UsingMailingLists.txt"
],
attr_stream_source_info: [
"content/UsingMailingLists.txt"
],
attr_content_encoding: [
"ISO-8859-1"
],
attr_content_type: [
"text/plain; charset=ISO-8859-1"
],
body_txt_en: " stream_size 8107 X-Parsed-By
org.apache.tika.parser.DefaultParser X-Parsed-By
org.apache.tika.parser.txt.TXTParser stream_content_type text/plain
stream_name UsingMailingLists.txt stream_source_info
content/UsingMailingLists.txt Content-Encoding ISO-8859-1 Content-Type
text/plain; charset=ISO-8859-1 Search: [value ] [Titles] [Text]
Solr_Wiki Login ****** UsingMailingLists ****** * FrontPage *
RecentChanges...etc",
_version_: 1535398235801124900
}
For the HTML file, I get the following JSON for the document...
{
id: "doc2",
attr_stream_size: [
"20440"
],
attr_x_parsed_by: [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"
],
attr_stream_content_type: [
"text/html"
],
attr_stream_name: [
"UsingMailingLists.html"
],
attr_stream_source_info: [
"content/UsingMailingLists.html"
],
attr_dc_title: [
"UsingMailingLists - Solr Wiki"
],
attr_content_encoding: [
"UTF-8"
],
attr_robots: [
"index,nofollow"
],
attr_title: [
"UsingMailingLists - Solr Wiki"
],
attr_content_type: [
"text/html; charset=utf-8"
],
body_txt_en: " stylesheet text/css utf-8 all
/wiki/modernized/css/common.css stylesheet text/css utf-8 screen
/wiki/modernized/css/screen.css stylesheet text/css utf-8 print
/wiki/modernized/css/print.css stylesheet text/css utf-8 projection
/wiki/modernized/css/projection.css alternate Solr Wiki:
UsingMailingLists
/solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1
application/rss+xml Start /solr/FrontPage Alternate Wiki Markup
/solr/UsingMailingLists?action=raw Alternate print Print View
/solr/UsingMailingLists?action=print Search /solr/FindPage Index
/solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting
stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser
X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type
text/html stream_name UsingMailingLists.html stream_source_info...etc",
_version_: 1535398408383103000
}
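To check whether a given body value is polluted, something like the
following could be used. This is only a rough sketch:
count_metadata_markers is a hypothetical helper name, and the marker list
is copied from the body_txt_en output above, not taken from any Solr or
Tika API. It assumes a grep that supports -o (GNU and BSD grep both do).

```shell
#!/bin/sh
# Hypothetical helper: count how many Tika/Solr metadata markers appear
# in a text value such as the indexed body_txt_en field.
# The marker list below is an assumption based on the sample output.
count_metadata_markers() {
    printf '%s\n' "$1" \
        | { grep -o -E 'stream_size|X-Parsed-By|stream_content_type|stream_name|stream_source_info' || true; } \
        | wc -l \
        | tr -d ' '
}

# Example call with the start of the doc1 body shown above.
count_metadata_markers ' stream_size 8107 X-Parsed-By org.apache.tika.parser.DefaultParser'
```

A count greater than zero suggests the metadata has leaked into the
body; a count of zero only means none of these particular markers were
found, so it would not catch the CSS class names.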