Re: Metadata and HTML ending up in searchable text

Alexandre Rafalovitch Fri, 27 May 2016 12:23:45 -0700

I think Solr's layer above Tika was merging in metadata and text all
together without a way (that I could see) to separate them.


That's all I remember of my examination of this issue when I ran into
something similar. Not very helpful, I know.
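
One way to sidestep the merge, as a rough sketch rather than a confirmed fix: run tika-app yourself in text mode (the -t invocation from the quoted message below) and post only the resulting text to the plain /update handler, so Solr Cell never touches the file. The function names, the Python wrapper, and the use of the JSON /update endpoint are assumptions for illustration; the jar name, core name, and body_txt_en field come from this thread.

```python
import json
import subprocess
import urllib.request


def extract_text_with_tika(tika_jar, path):
    """Run tika-app in plain-text mode (-t) and return its stdout as a string."""
    result = subprocess.run(
        ["java", "-jar", tika_jar, "-t", path],
        check=True,
        stdout=subprocess.PIPE,
    )
    # tika-app writes in the platform encoding; replace undecodable bytes.
    return result.stdout.decode("utf-8", errors="replace")


def build_update_request(solr_base, core, doc_id, text, field="body_txt_en"):
    """Build (but do not send) a JSON update holding only the id and body text.

    Assumption: the core accepts a JSON array of documents on /update.
    """
    url = f"{solr_base}/solr/{core}/update?commit=true"
    payload = json.dumps([{"id": doc_id, field: text}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (needs Java, the Tika jar, and a running Solr, so it is left commented):
# text = extract_text_with_tika(
#     "tika-app-1.13.jar",
#     "/home/user/Documents/library/UsingMailingLists.txt",
# )
# req = build_update_request("http://localhost:8983", "mycore", "doc1", text)
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

With this approach no attr_* metadata fields are created at all, so any metadata you do want would have to be added to the document explicitly.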

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 27 May 2016 at 23:48, Simon Blandford <simon.blandf...@bkconnect.net> wrote:
> Hi Timothy,
>
> Thanks for responding.
>
> java -jar tika-app-1.13.jar -t
> "/home/user/Documents/library/UsingMailingLists.txt"
> ...gives a clean result with no CSS or other nasties in the output. So it
> looks like the latest version of tika itself is OK.
>
> I was basing the test case on this doc page as closely as possible,
> including the prefix and content mapping.
> https://wiki.apache.org/solr/ExtractingRequestHandler
>
> From the same page, extractFormat=text only applies when extractOnly is
> true, which just shows the output from tika without indexing the document.
> Running it in "extractOnly" mode results in XML output. The difference
> between selecting "text" or "xml" format is that the escaped document in the
> <response> tag is either the original HTML (xml mode) or stripped HTML (text
> mode). It seems some Javascript creeps into the text version. (See below)
>
> Regards,
> Simon
>
> XML mode sample:
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int
> name="QTime">51</int></lst><str name="UsingMailingLists.html">&lt;?xml
> version="1.0" encoding="UTF-8"?&gt;
> &lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
> &lt;head&gt;
> &lt;link
>             rel="stylesheet" type="text/css" charset="utf-8" media="all"
> href="/wiki/modernized/css/common.css"/&gt;
>         &lt;link rel="stylesheet" type="text/css" charset="utf-8"
>             media="screen" href="/wiki/modernized/css/screen.css"/&gt;
>         &lt;link rel="stylesheet" type="text/css" charset="utf-8"
>             media="print" href="/wiki/modernized/css/print.css"/&gt;.......
>
> TEXT mode (Blank lines stripped):
> <response>
> <lst name="responseHeader"><int name="status">0</int><int
> name="QTime">47</int></lst><str name="UsingMailingLists.html">
> UsingMailingLists - Solr Wiki
> Search:
> &lt;!--// Initialize search form
> var f = document.getElementById('searchform');
> f.getElementsByTagName('label')[0].style.display = 'none';
> var e = document.getElementById('searchinput');
> searchChange(e);
> searchBlur(e);
> //--&gt;
> Solr Wiki
> Login
>
> On 27/05/16 13:31, Allison, Timothy B. wrote:
>>
>> I'm only minimally familiar with Solr Cell, but...
>>
>> 1) It looks like you aren't setting extractFormat=text.  According to
>> [0]...the default is xhtml which will include a bunch of the metadata.
>> 2) is there an attr_* dynamic field in your index with type="ignored"?
>> This would strip out the attr_ fields so they wouldn't even be indexed...if
>> you don't want them.
>>
>> As for the HTML file, it looks like Tika is failing to strip out the style
>> section.  Try running the file alone with tika-app: java -jar tika-app.jar
>> -t inputfile.html.  If you find the noise there, please open an
>> issue on our JIRA: https://issues.apache.org/jira/browse/tika
>>
>>
>> [0]
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>>
>>
>> -----Original Message-----
>> From: Simon Blandford [mailto:simon.blandf...@bkconnect.net]
>> Sent: Thursday, May 26, 2016 9:49 AM
>> To: solr-user@lucene.apache.org
>> Subject: Metadata and HTML ending up in searchable text
>>
>> Hi,
>>
>> I am using Solr 6.0 on Ubuntu 14.04.
>>
>> I am ending up with loads of junk in the text body. The JSON output of a
>> search result shows the indexed text starting with...
>> body_txt_en: " stream_size 36499 X-Parsed-By
>> org.apache.tika.parser.DefaultParser X-Parsed-By...."
>>
>> And then once it gets to the actual text I get CSS class names appearing
>> that were in <p> or <div> tags etc.
>> e.g. "....the power of calibre3 silence calibre2 and....", where
>> "calibre3" etc are the CSS class names.
>>
>> All this junk is searchable and is polluting the index.
>>
>> I would like to index _only_ the actual content I am interested in
>> searching for.
>>
>> Steps to reproduce:
>>
>> 1) Solr installed by untarring the Solr tgz in /opt.
>>
>> 2) Core created by typing "bin/solr create -c mycore"
>>
>> 3) Solr started with bin/solr start
>>
>> 4) TXT document indexed using the following command:
>>
>> curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
>>   -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"
>>
>> 5) HTML document indexed using the following command:
>>
>> curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
>>   -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"
>>
>> 6) Query using URL:
>> http://localhost:8983/solr/mycore/select?q=especially&wt=json
>>
>> Result:
>>
>> For the txt file, I get the following JSON for the document...
>>
>> {
>>       id: "doc1",
>>       attr_stream_size: [
>>           "8107"
>>       ],
>>       attr_x_parsed_by: [
>>           "org.apache.tika.parser.DefaultParser",
>>           "org.apache.tika.parser.txt.TXTParser"
>>       ],
>>       attr_stream_content_type: [
>>           "text/plain"
>>       ],
>>       attr_stream_name: [
>>           "UsingMailingLists.txt"
>>       ],
>>       attr_stream_source_info: [
>>           "content/UsingMailingLists.txt"
>>       ],
>>       attr_content_encoding: [
>>           "ISO-8859-1"
>>       ],
>>       attr_content_type: [
>>           "text/plain; charset=ISO-8859-1"
>>       ],
>>       body_txt_en: " stream_size 8107 X-Parsed-By
>> org.apache.tika.parser.DefaultParser X-Parsed-By
>> org.apache.tika.parser.txt.TXTParser stream_content_type text/plain
>> stream_name UsingMailingLists.txt stream_source_info
>> content/UsingMailingLists.txt Content-Encoding ISO-8859-1 Content-Type
>> text/plain; charset=ISO-8859-1 Search: [value ] [Titles] [Text] Solr_Wiki
>> Login ****** UsingMailingLists ****** * FrontPage * RecentChanges...etc",
>> _version_: 1535398235801124900
>> }
>>
>> For the HTML file,  I get the following JSON for the document...
>>
>> {
>>       id: "doc2",
>>       attr_stream_size: [
>>           "20440"
>>       ],
>>       attr_x_parsed_by: [
>>           "org.apache.tika.parser.DefaultParser",
>>           "org.apache.tika.parser.html.HtmlParser"
>>       ],
>>       attr_stream_content_type: [
>>           "text/html"
>>       ],
>>       attr_stream_name: [
>>           "UsingMailingLists.html"
>>       ],
>>       attr_stream_source_info: [
>>           "content/UsingMailingLists.html"
>>       ],
>>       attr_dc_title: [
>>           "UsingMailingLists - Solr Wiki"
>>       ],
>>       attr_content_encoding: [
>>           "UTF-8"
>>       ],
>>       attr_robots: [
>>           "index,nofollow"
>>       ],
>>       attr_title: [
>>           "UsingMailingLists - Solr Wiki"
>>       ],
>>       attr_content_type: [
>>           "text/html; charset=utf-8"
>>       ],
>>       body_txt_en: " stylesheet text/css utf-8 all
>> /wiki/modernized/css/common.css stylesheet text/css utf-8 screen
>> /wiki/modernized/css/screen.css stylesheet text/css utf-8 print
>> /wiki/modernized/css/print.css stylesheet text/css utf-8 projection
>> /wiki/modernized/css/projection.css alternate Solr Wiki:
>> UsingMailingLists
>>
>> /solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1
>> application/rss+xml Start /solr/FrontPage Alternate Wiki Markup
>> /solr/UsingMailingLists?action=raw Alternate print Print View
>> /solr/UsingMailingLists?action=print Search /solr/FindPage Index
>> /solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting
>> stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser
>> X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type
>> text/html stream_name UsingMailingLists.html stream_source_info...etc",
>>       _version_: 1535398408383103000
>> }
>>
>>
>>
>
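
To compare the two extractOnly outputs described in the quoted replies above, a small helper along these lines may be useful. It is only a sketch: the function names are hypothetical, and the "<stream name>_metadata" response key is an assumption about how Solr Cell lays out its extractOnly response, so check it against a real response before relying on it. The file itself can still be uploaded with curl -F as in the reproduction steps, pointing at the URL this builds.

```python
from urllib.parse import urlencode


def extract_only_url(solr_base, core, extract_format):
    """Build an /update/extract URL that returns Tika output without indexing."""
    params = {
        "extractOnly": "true",            # show the extraction, do not index
        "extractFormat": extract_format,  # "text" or "xml", as in the thread
        "wt": "json",
    }
    return f"{solr_base}/solr/{core}/update/extract?{urlencode(params)}"


def split_extract_response(response, stream_name):
    """Return (content, metadata) from a parsed extractOnly JSON response.

    Assumption: content is keyed by the stream name and metadata by
    stream_name + "_metadata"; either value is None if the key is absent.
    """
    return response.get(stream_name), response.get(stream_name + "_metadata")


# Print the two URLs to compare, using the host and core from the thread.
for fmt in ("text", "xml"):
    print(extract_only_url("http://localhost:8983", "mycore", fmt))
```

If the content half of the text-mode response is already clean apart from the stray script block, the metadata is presumably being merged in at indexing time rather than by Tika.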
