[...] default is xhtml which will include a bunch of the metadata.
2) Is there an attr_* dynamic field in your index with type="ignored"?
This would strip out the attr_ fields so they wouldn't even be
indexed...if you don't want them.
As for the HTML file, it looks like Tika is failing to strip out the
style section. Try running the file alone with tika-app: java -jar
tika-app.jar -t inputfile.html. If you are finding the noise there,
please open an issue on our JIRA:
https://issues.apache.org/jira/browse/tika
[0]
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
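If it helps to check the tika-app output mechanically rather than by eye, a rough sketch along these lines might do. It is only an illustration: the function name find_noise, the marker list, and the CSS heuristic are assumptions made for this example (the two markers are taken from the body_txt_en sample in this thread), and none of it is part of Tika or Solr.

```python
import re

# Hypothetical marker list: both entries come from the body_txt_en sample
# quoted in this thread; extend it with other keys seen in your own index.
METADATA_MARKERS = ("stream_size", "X-Parsed-By")

# Rough heuristic for a leftover CSS declaration block such as "{ margin: 0; }".
CSS_BLOCK = re.compile(r"\{[^{}]*:[^{}]*\}")


def find_noise(text):
    """Report which metadata markers appear in the extracted text and
    whether it contains something that looks like a CSS block."""
    markers = [m for m in METADATA_MARKERS if m in text]
    return {"metadata": markers, "css": bool(CSS_BLOCK.search(text))}
```

One way to use it would be to save the output of java -jar tika-app.jar -t inputfile.html to a file, read that file into a string, and pass the string to find_noise. An empty "metadata" list and "css": False would suggest the noise is being introduced on the Solr side rather than by Tika.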
-Original Message-
From: Simon Blandford [mailto:simon.blandf...@bkconnect.net]
Sent: Thursday, May 26, 2016 9:49 AM
To: solr-user@lucene.apache.org
Subject: Metadata and HTML ending up in searchable text
Hi,
I am using Solr 6.0 on Ubuntu 14.04.
I am ending up with loads of junk in the text body. It starts like,
The JSON entry output of a search result shows the indexed text starting
with...
body_txt_en: " stream_size 36499 X-Parsed-By
org.apache.tika.parser.DefaultParser X-Parsed-By"
An