I think Solr's layer above Tika was merging in metadata and text all together without a way (that I could see) to separate them.
That's all I remember of my examination of this issue when I ran into something similar. Not very helpful, I know.

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/

On 27 May 2016 at 23:48, Simon Blandford <simon.blandf...@bkconnect.net> wrote:
> Hi Timothy,
>
> Thanks for responding.
>
> java -jar tika-app-1.13.jar -t "/home/user/Documents/library/UsingMailingLists.txt"
> ...gives a clean result with no CSS or other nasties in the output. So it
> looks like the latest version of tika itself is OK.
>
> I was basing the test case on this doc page as closely as possible,
> including the prefix and content mapping.
> https://wiki.apache.org/solr/ExtractingRequestHandler
>
> From the same page, extractFormat=text only applies when extractOnly is
> true, which just shows the output from tika without indexing the document.
> Running it in "extractOnly" mode results in XML output. The difference
> between selecting "text" or "xml" format is that the escaped document in the
> <response> tag is either the original HTML (xml mode) or stripped HTML (text
> mode). It seems some JavaScript creeps into the text version. (See below)
>
> Regards,
> Simon
>
> HTML mode sample:
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int name="QTime">51</int></lst><str name="UsingMailingLists.html"><?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <link rel="stylesheet" type="text/css" charset="utf-8" media="all" href="/wiki/modernized/css/common.css"/>
> <link rel="stylesheet" type="text/css" charset="utf-8" media="screen" href="/wiki/modernized/css/screen.css"/>
> <link rel="stylesheet" type="text/css" charset="utf-8" media="print" href="/wiki/modernized/css/print.css"/>.......
>
> TEXT mode (Blank lines stripped):
> <response>
> <lst name="responseHeader"><int name="status">0</int><int name="QTime">47</int></lst><str name="UsingMailingLists.html">
> UsingMailingLists - Solr Wiki
> Search:
> <!--// Initialize search form
> var f = document.getElementById('searchform');
> f.getElementsByTagName('label')[0].style.display = 'none';
> var e = document.getElementById('searchinput');
> searchChange(e);
> searchBlur(e);
> //-->
> Solr Wiki
> Login
>
>
> On 27/05/16 13:31, Allison, Timothy B. wrote:
>>
>> I'm only minimally familiar with Solr Cell, but...
>>
>> 1) It looks like you aren't setting extractFormat=text. According to
>> [0]...the default is xhtml which will include a bunch of the metadata.
>> 2) Is there an attr_* dynamic field in your index with type="ignored"?
>> This would strip out the attr_ fields so they wouldn't even be indexed...if
>> you don't want them.
>>
>> As for the HTML file, it looks like Tika is failing to strip out the style
>> section. Try running the file alone with tika-app: java -jar tika-app.jar
>> -t inputfile.html. If you are finding the noise there, please open an
>> issue on our JIRA: https://issues.apache.org/jira/browse/tika
>>
>>
>> [0]
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>>
>>
>> -----Original Message-----
>> From: Simon Blandford [mailto:simon.blandf...@bkconnect.net]
>> Sent: Thursday, May 26, 2016 9:49 AM
>> To: solr-user@lucene.apache.org
>> Subject: Metadata and HTML ending up in searchable text
>>
>> Hi,
>>
>> I am using Solr 6.0 on Ubuntu 14.04.
>>
>> I am ending up with loads of junk in the text body. It starts like,
>>
>> The JSON entry output of a search result shows the indexed text starting
>> with...
>> body_txt_en: " stream_size 36499 X-Parsed-By
>> org.apache.tika.parser.DefaultParser X-Parsed-By...."
>>
>> And then once it gets to the actual text I get CSS class names appearing
>> that were in <p> or <div> tags etc.
>> e.g. "....the power of calibre3 silence calibre2 and....", where
>> "calibre3" etc are the CSS class names.
>>
>> All this junk is searchable and is polluting the index.
>>
>> I would like to index _only_ the actual content I am interested in
>> searching for.
>>
>> Steps to reproduce:
>>
>> 1) Solr installed by untarring solr tgz in /opt.
>>
>> 2) Core created by typing "bin/solr create -c mycore"
>>
>> 3) Solr started with bin/solr start
>>
>> 4) TXT document indexed using the following command
>> curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"
>>
>> 5) HTML document indexed using the following command
>> curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"
>>
>> 6) Query using URL:
>> http://localhost:8983/solr/mycore/select?q=especially&wt=json
>>
>> Result:
>>
>> For the txt file, I get the following JSON for the document...
>>
>> {
>>   id: "doc1",
>>   attr_stream_size: [
>>     "8107"
>>   ],
>>   attr_x_parsed_by: [
>>     "org.apache.tika.parser.DefaultParser",
>>     "org.apache.tika.parser.txt.TXTParser"
>>   ],
>>   attr_stream_content_type: [
>>     "text/plain"
>>   ],
>>   attr_stream_name: [
>>     "UsingMailingLists.txt"
>>   ],
>>   attr_stream_source_info: [
>>     "content/UsingMailingLists.txt"
>>   ],
>>   attr_content_encoding: [
>>     "ISO-8859-1"
>>   ],
>>   attr_content_type: [
>>     "text/plain; charset=ISO-8859-1"
>>   ],
>>   body_txt_en: " stream_size 8107 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.txt.TXTParser stream_content_type text/plain stream_name UsingMailingLists.txt stream_source_info content/UsingMailingLists.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 Search: [value ] [Titles] [Text] Solr_Wiki Login ****** UsingMailingLists ****** * FrontPage * RecentChanges...etc",
>>   _version_: 1535398235801124900
>> }
>>
>> For the HTML file, I get the following JSON for the document...
>>
>> {
>>   id: "doc2",
>>   attr_stream_size: [
>>     "20440"
>>   ],
>>   attr_x_parsed_by: [
>>     "org.apache.tika.parser.DefaultParser",
>>     "org.apache.tika.parser.html.HtmlParser"
>>   ],
>>   attr_stream_content_type: [
>>     "text/html"
>>   ],
>>   attr_stream_name: [
>>     "UsingMailingLists.html"
>>   ],
>>   attr_stream_source_info: [
>>     "content/UsingMailingLists.html"
>>   ],
>>   attr_dc_title: [
>>     "UsingMailingLists - Solr Wiki"
>>   ],
>>   attr_content_encoding: [
>>     "UTF-8"
>>   ],
>>   attr_robots: [
>>     "index,nofollow"
>>   ],
>>   attr_title: [
>>     "UsingMailingLists - Solr Wiki"
>>   ],
>>   attr_content_type: [
>>     "text/html; charset=utf-8"
>>   ],
>>   body_txt_en: " stylesheet text/css utf-8 all /wiki/modernized/css/common.css stylesheet text/css utf-8 screen /wiki/modernized/css/screen.css stylesheet text/css utf-8 print /wiki/modernized/css/print.css stylesheet text/css utf-8 projection /wiki/modernized/css/projection.css alternate Solr Wiki: UsingMailingLists /solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1 application/rss+xml Start /solr/FrontPage Alternate Wiki Markup /solr/UsingMailingLists?action=raw Alternate print Print View /solr/UsingMailingLists?action=print Search /solr/FindPage Index /solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html stream_name UsingMailingLists.html stream_source_info...etc",
>>   _version_: 1535398408383103000
>> }
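
For anyone retracing the thread, the "extractOnly" check Simon describes can be sketched roughly as follows. This is only an illustration, assuming the Solr URL and core name from the original report; the helper names extract_url and extract_only are hypothetical, and the form field name "file" is an arbitrary choice. Only the extractOnly and extractFormat parameters come from the thread.

```shell
#!/bin/sh
# Assumption: Solr is listening locally and the core is "mycore", as in the report.
SOLR_CORE_URL="http://localhost:8983/solr/mycore"

# Hypothetical helper: build the extract-only URL for a format ("text" or "xml").
extract_url() {
  printf '%s/update/extract?extractOnly=true&extractFormat=%s' "$SOLR_CORE_URL" "$1"
}

# Hypothetical helper: send a file through Solr Cell and print what Tika
# extracted, without indexing the document.
extract_only() {
  curl "$(extract_url "$1")" -F "file=@$2"
}

# Example (requires a running Solr instance):
# extract_only text /home/user/Documents/library/UsingMailingLists.html
```

Comparing the "text" and "xml" responses for the same file is one way to see whether the noise is already present in the extracted output before anything is indexed.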