Re: Metadata and HTML ending up in searchable text

Simon Blandford Fri, 27 May 2016 06:49:37 -0700

Hi Timothy,

Thanks for responding.

java -jar tika-app-1.13.jar -t "/home/user/Documents/library/UsingMailingLists.txt"

...gives a clean result with no CSS or other nasties in the output. So it looks like the latest version of Tika itself is OK.

I was basing the test case on this doc page as closely as possible, including the prefix and content mapping.

https://wiki.apache.org/solr/ExtractingRequestHandler

From the same page, extractFormat=text only applies when extractOnly is true, which just shows the output from Tika without indexing the document. Running it in "extractOnly" mode results in XML output. The difference between selecting the "text" or "xml" format is that the escaped document in the <response> tag is either the original HTML (xml mode) or stripped HTML (text mode). It seems some JavaScript creeps into the text version. (See below)
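
For reference, a request of the kind described above could be sketched roughly as follows. The snippet only assembles and prints the extractOnly URL so the parameters are easy to inspect. The host, port and core name (mycore) are taken from the reproduction steps quoted further down, and extract_only_url is a made-up helper name, not something Solr provides.

```shell
#!/bin/sh
# Sketch: build a Solr Cell "extractOnly" URL.
# Assumptions: Solr listens on localhost:8983 and the core is "mycore",
# as in the reproduction steps quoted later in this thread.
# extract_only_url is a hypothetical helper, not part of Solr.

SOLR_BASE="http://localhost:8983/solr"

extract_only_url() {
    core="$1"      # core name, e.g. mycore
    format="$2"    # "text" or "xml"
    printf '%s/%s/update/extract?extractOnly=true&extractFormat=%s\n' \
        "$SOLR_BASE" "$core" "$format"
}

# Print the URL for text-mode extraction.
extract_only_url mycore text
```

The printed URL could then be handed to curl with the same -F argument used for indexing, for example: curl "$(extract_only_url mycore text)" -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"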


Regards,
Simon

HTML mode sample:
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader"><int name="status">0</int><int name="QTime">51</int></lst><str name="UsingMailingLists.html"><?xml version="1.0" encoding="UTF-8"?>

&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
&lt;head&gt;
        &lt;link rel="stylesheet" type="text/css" charset="utf-8"
            media="all" href="/wiki/modernized/css/common.css"/&gt;

        &lt;link rel="stylesheet" type="text/css" charset="utf-8"
            media="screen" href="/wiki/modernized/css/screen.css"/&gt;
        &lt;link rel="stylesheet" type="text/css" charset="utf-8"
            media="print" href="/wiki/modernized/css/print.css"/&gt;.......

TEXT mode (Blank lines stripped):
<response>

<lst name="responseHeader"><int name="status">0</int><int name="QTime">47</int></lst><str name="UsingMailingLists.html">

UsingMailingLists - Solr Wiki
Search:
&lt;!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none';
var e = document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//--&gt;
Solr Wiki
Login





On 27/05/16 13:31, Allison, Timothy B. wrote:

I'm only minimally familiar with Solr Cell, but...

1) It looks like you aren't setting extractFormat=text.  According to [0]...the 
default is xhtml which will include a bunch of the metadata.
2) is there an attr_* dynamic field in your index with type="ignored"?  This 
would strip out the attr_ fields so they wouldn't even be indexed...if you don't want 
them.
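
A rough sketch of what point 2 could look like through the Schema API follows. It only prints the JSON payload. It assumes the core uses a managed schema and that a field type named "ignored" is already defined there, neither of which is confirmed in this thread. ignored_field_payload is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: JSON payload asking the Solr Schema API to add a dynamic field
# whose values are dropped at index time.
# Assumptions: managed schema in use, and a field type called "ignored"
# already exists in it. ignored_field_payload is a hypothetical helper.

ignored_field_payload() {
    pattern="$1"   # dynamic field pattern, e.g. attr_*
    printf '{"add-dynamic-field":{"name":"%s","type":"ignored","multiValued":true}}\n' \
        "$pattern"
}

# Print the payload for the attr_ prefix used with uprefix=attr_.
ignored_field_payload 'attr_*'
```

The payload could be posted with something like: curl -X POST -H 'Content-type:application/json' --data-binary "$(ignored_field_payload 'attr_*')" http://localhost:8983/solr/mycore/schema. If a dynamic field with that pattern is already defined, the replace-dynamic-field command would be the one to try instead.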

As for the HTML file, it looks like Tika is failing to strip out the style 
section.  Try running the file alone with tika-app: java -jar tika-app.jar -t 
inputfile.html.  If you are finding the noise there, please open an issue on 
our JIRA: https://issues.apache.org/jira/browse/tika
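
One way to check the tika-app output for this kind of noise without reading it by eye is sketched below. looks_noisy is a hypothetical helper, and the patterns it searches for are simply strings that appear in the samples elsewhere in this thread, so they are an assumption about what "noise" looks like rather than a general test.

```shell
#!/bin/sh
# Sketch: succeed if the text on stdin contains strings that look like
# leaked JavaScript or stylesheet attributes.
# looks_noisy is a hypothetical helper; the patterns are taken from the
# samples in this thread and are not exhaustive.

looks_noisy() {
    grep -E -q 'getElementById|getElementsByTagName|text/css'
}
```

Usage would be along the lines of: java -jar tika-app.jar -t inputfile.html | looks_noisy && echo "noise found"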


[0] 
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika


-----Original Message-----
From: Simon Blandford [mailto:[email protected]]
Sent: Thursday, May 26, 2016 9:49 AM
To: [email protected]
Subject: Metadata and HTML ending up in searchable text

Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the text body. The JSON entry output of a 
search result shows the indexed text starting with...
body_txt_en: " stream_size 36499 X-Parsed-By org.apache.tika.parser.DefaultParser 
X-Parsed-By...."

And then once it gets to the actual text I get CSS class names appearing that were in 
<p> or <div> tags etc.
e.g. "....the power of calibre3 silence calibre2 and....", where "calibre3" etc 
are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in searching 
for.

Steps to reproduce:

1) Solr installed by untarring the Solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document indexed using the following command:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
  -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"

5) HTML document indexed using the following command:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
  -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"

6) Query using URL:
http://localhost:8983/solr/mycore/select?q=especially&wt=json

Result:

For the txt file, I get the following JSON for the document...

{
      id: "doc1",
      attr_stream_size: [
          "8107"
      ],
      attr_x_parsed_by: [
          "org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.txt.TXTParser"
      ],
      attr_stream_content_type: [
          "text/plain"
      ],
      attr_stream_name: [
          "UsingMailingLists.txt"
      ],
      attr_stream_source_info: [
          "content/UsingMailingLists.txt"
      ],
      attr_content_encoding: [
          "ISO-8859-1"
      ],
      attr_content_type: [
          "text/plain; charset=ISO-8859-1"
      ],
      body_txt_en: " stream_size 8107 X-Parsed-By 
org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.txt.TXTParser 
stream_content_type text/plain stream_name UsingMailingLists.txt stream_source_info 
content/UsingMailingLists.txt Content-Encoding ISO-8859-1 Content-Type text/plain; 
charset=ISO-8859-1 Search: [value ] [Titles] [Text] Solr_Wiki Login ****** 
UsingMailingLists ****** * FrontPage * RecentChanges...etc",
_version_: 1535398235801124900
}

For the HTML file,  I get the following JSON for the document...

{
      id: "doc2",
      attr_stream_size: [
          "20440"
      ],
      attr_x_parsed_by: [
          "org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"
      ],
      attr_stream_content_type: [
          "text/html"
      ],
      attr_stream_name: [
          "UsingMailingLists.html"
      ],
      attr_stream_source_info: [
          "content/UsingMailingLists.html"
      ],
      attr_dc_title: [
          "UsingMailingLists - Solr Wiki"
      ],
      attr_content_encoding: [
          "UTF-8"
      ],
      attr_robots: [
          "index,nofollow"
      ],
      attr_title: [
          "UsingMailingLists - Solr Wiki"
      ],
      attr_content_type: [
          "text/html; charset=utf-8"
      ],
      body_txt_en: " stylesheet text/css utf-8 all 
/wiki/modernized/css/common.css stylesheet text/css utf-8 screen 
/wiki/modernized/css/screen.css stylesheet text/css utf-8 print 
/wiki/modernized/css/print.css stylesheet text/css utf-8 projection 
/wiki/modernized/css/projection.css alternate Solr Wiki:
UsingMailingLists
/solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1
application/rss+xml Start /solr/FrontPage Alternate Wiki Markup 
/solr/UsingMailingLists?action=raw Alternate print Print View 
/solr/UsingMailingLists?action=print Search /solr/FindPage Index 
/solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting 
stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser
X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html 
stream_name UsingMailingLists.html stream_source_info...etc",
      _version_: 1535398408383103000
}
