I'm only minimally familiar with Solr Cell, but...
1) It looks like you aren't setting extractFormat=text. According
to [0], the default is XHTML, which will include a bunch of the metadata.
2) Is there an attr_* dynamic field in your index with type="ignored"?
That would drop the attr_ fields so they wouldn't even be
indexed, if you don't want them.
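A rough sketch of how both suggestions might look with curl, assuming
the host and core name from your steps (mycore on localhost:8983). As
far as I can tell from [0], extractFormat only takes effect together
with extractOnly=true, so the first helper only previews what Tika
extracts and indexes nothing. The second assumes your schema is managed
and already defines a field type named "ignored"; if an attr_* dynamic
field already exists, the Schema API command would be
replace-dynamic-field instead. The function names are made up for
illustration.

```shell
# Assumed value, taken from the steps in the original message.
SOLR_CORE_URL="http://localhost:8983/solr/mycore"

# URL for a preview-only extraction: extractOnly=true returns what Tika
# produced without indexing it; extractFormat=text asks for plain text.
preview_url() {
    printf '%s/update/extract?extractOnly=true&extractFormat=text&wt=json' "$SOLR_CORE_URL"
}

# Schema API payload mapping attr_* onto the (assumed) "ignored" field type.
ignored_attr_payload() {
    printf '%s' '{"add-dynamic-field":{"name":"attr_*","type":"ignored"}}'
}

# Preview the extraction of one file (hypothetical helper).
preview_extract() {
    curl -s "$(preview_url)" -F "myfile=@$1"
}

# Register the dynamic field through the Schema API (hypothetical helper).
ignore_attr_fields() {
    curl -s -X POST -H 'Content-type:application/json' \
        --data-binary "$(ignored_attr_payload)" "$SOLR_CORE_URL/schema"
}

# Usage:
#   preview_extract /home/user/Documents/library/UsingMailingLists.html
#   ignore_attr_fields
```

Nothing is sent until one of the last two helpers is called, so the
URL and payload can be inspected first.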
As for the HTML file, it looks like Tika is failing to strip out the
style section. Try running the file alone with tika-app:
java -jar tika-app.jar -t inputfile.html
If you are finding the noise there, please open an issue on our JIRA:
https://issues.apache.org/jira/browse/tika
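Something like the following would show whether the class names survive
in Tika's own plain-text output. This is only a sketch: it assumes
tika-app.jar sits in the current directory, the "calibre" pattern is
taken from the class names you quoted, and the helper names are
hypothetical.

```shell
# Print lines of extracted text that contain a suspect pattern, with
# line numbers. Reads the text on stdin, so it can be tried without Tika.
find_noise() {
    grep -n -i -- "$1"
}

# Run tika-app in plain-text mode (-t) on one file and look for the pattern.
# Assumes tika-app.jar is in the current directory.
check_file() {
    java -jar tika-app.jar -t "$1" | find_noise "$2"
}

# Usage:
#   check_file inputfile.html calibre
```

If check_file prints matches, the noise comes from Tika itself rather
than from Solr Cell.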
[0] https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
-----Original Message-----
From: Simon Blandford [mailto:simon.blandf...@bkconnect.net]
Sent: Thursday, May 26, 2016 9:49 AM
To: solr-user@lucene.apache.org
Subject: Metadata and HTML ending up in searchable text
Hi,
I am using Solr 6.0 on Ubuntu 14.04.
I am ending up with loads of junk in the text body. The JSON output
of a search result shows the indexed text starting with...
body_txt_en: " stream_size 36499 X-Parsed-By
org.apache.tika.parser.DefaultParser X-Parsed-By...."
And then once it gets to the actual text I get CSS class names
appearing that were in <p> or <div> tags etc.
e.g. "....the power of calibre3 silence calibre2 and....", where
"calibre3" etc are the CSS class names.
All this junk is searchable and is polluting the index.
I would like to index _only_ the actual content I am interested in
searching for.
Steps to reproduce:
1) Solr installed by untarring the Solr tgz in /opt.
2) Core created by typing "bin/solr create -c mycore"
3) Solr started with bin/solr start
4) TXT document indexed using the following command:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
  -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"
5) HTML document indexed using the following command:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
  -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"
6) Query using URL:
http://localhost:8983/solr/mycore/select?q=especially&wt=json
Result:
For the txt file, I get the following JSON for the document...
{
id: "doc1",
attr_stream_size: [
"8107"
],
attr_x_parsed_by: [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.txt.TXTParser"
],
attr_stream_content_type: [
"text/plain"
],
attr_stream_name: [
"UsingMailingLists.txt"
],
attr_stream_source_info: [
"content/UsingMailingLists.txt"
],
attr_content_encoding: [
"ISO-8859-1"
],
attr_content_type: [
"text/plain; charset=ISO-8859-1"
],
body_txt_en: " stream_size 8107 X-Parsed-By
org.apache.tika.parser.DefaultParser X-Parsed-By
org.apache.tika.parser.txt.TXTParser stream_content_type text/plain
stream_name UsingMailingLists.txt stream_source_info
content/UsingMailingLists.txt Content-Encoding ISO-8859-1
Content-Type text/plain; charset=ISO-8859-1 Search: [value ]
[Titles] [Text] Solr_Wiki Login ****** UsingMailingLists ****** *
FrontPage * RecentChanges...etc",
_version_: 1535398235801124900
}
For the HTML file, I get the following JSON for the document...
{
id: "doc2",
attr_stream_size: [
"20440"
],
attr_x_parsed_by: [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"
],
attr_stream_content_type: [
"text/html"
],
attr_stream_name: [
"UsingMailingLists.html"
],
attr_stream_source_info: [
"content/UsingMailingLists.html"
],
attr_dc_title: [
"UsingMailingLists - Solr Wiki"
],
attr_content_encoding: [
"UTF-8"
],
attr_robots: [
"index,nofollow"
],
attr_title: [
"UsingMailingLists - Solr Wiki"
],
attr_content_type: [
"text/html; charset=utf-8"
],
body_txt_en: " stylesheet text/css utf-8 all
/wiki/modernized/css/common.css stylesheet text/css utf-8 screen
/wiki/modernized/css/screen.css stylesheet text/css utf-8 print
/wiki/modernized/css/print.css stylesheet text/css utf-8 projection
/wiki/modernized/css/projection.css alternate Solr Wiki:
UsingMailingLists
/solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1
application/rss+xml Start
/solr/FrontPage Alternate Wiki Markup
/solr/UsingMailingLists?action=raw Alternate print Print View
/solr/UsingMailingLists?action=print Search /solr/FindPage Index
/solr/TitleIndex Glossary /solr/WordIndex Help
/solr/HelpOnFormatting stream_size 20440 X-Parsed-By
org.apache.tika.parser.DefaultParser
X-Parsed-By org.apache.tika.parser.html.HtmlParser
stream_content_type text/html stream_name UsingMailingLists.html
stream_source_info...etc",
_version_: 1535398408383103000
}