Re: Metadata and HTML ending up in searchable text

2016-06-02 Thread Simon Blandford
le.html. If you are finding the noise there, please open an issue on our JIRA: https://issues.apache.org/jira/browse/tika [0] https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika -Original Message- From: Simon Blandford [mailto:simon.blandf...@bkconnec

Re: Metadata and HTML ending up in searchable text

2016-06-01 Thread Simon Blandford
be indexed...if you don't want them. As for the HTML file, it looks like Tika is failing to strip out the style section. Try running the file alone with tika-app: java -jar tika-app.jar -t inputfile.html. If you are finding the noise there, please open an issue on our JIRA: https://issues.

RE: Metadata and HTML ending up in searchable text

2016-05-31 Thread Allison, Timothy B.
s xhtml which will include a bunch of the metadata. 2) Is there an attr_* dynamic field in your index with type="ignored"? This would strip out the attr_ fields so they wouldn't even be indexed...if you don't want them. As

Re: Metadata and HTML ending up in searchable text

2016-05-31 Thread Simon Blandford
t the style section. Try running the file alone with tika-app: java -jar tika-app.jar -t inputfile.html. If you are finding the noise there, please open an issue on our JIRA: https://issues.apache.org/jira/browse/tika [0] https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+So

Re: Metadata and HTML ending up in searchable text

2016-05-27 Thread Alexandre Rafalovitch
ault is xhtml which will include a bunch of the metadata. 2) Is there an attr_* dynamic field in your index with type="ignored"? This would strip out the attr_ fields so they wouldn't even be indexed...if you don't want them. As for the HTML fi

Re: Metadata and HTML ending up in searchable text

2016-05-27 Thread Simon Blandford
the noise there, please open an issue on our JIRA: https://issues.apache.org/jira/browse/tika [0] https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika -Original Message- From: Simon Blandford [mailto:simon.blandf...@bkconnect.net] Sent:

RE: Metadata and HTML ending up in searchable text

2016-05-27 Thread Allison, Timothy B.
...@bkconnect.net] Sent: Thursday, May 26, 2016 9:49 AM To: solr-user@lucene.apache.org Subject: Metadata and HTML ending up in searchable text Hi, I am using Solr 6.0 on Ubuntu 14.04. I am ending up with loads of junk in the text body. It starts like, The JSON entry output of a search result shows the

RE: Metadata and HTML ending up in searchable text

2016-05-27 Thread Allison, Timothy B.
26, 2016 9:49 AM To: solr-user@lucene.apache.org Subject: Metadata and HTML ending up in searchable text Hi, I am using Solr 6.0 on Ubuntu 14.04. I am ending up with loads of junk in the text body. It starts like, The JSON entry output of a search result shows the indexed text starting with

Metadata and HTML ending up in searchable text

2016-05-26 Thread Simon Blandford
Hi, I am using Solr 6.0 on Ubuntu 14.04. I am ending up with loads of junk in the text body. It starts like, The JSON entry output of a search result shows the indexed text starting with... body_txt_en: " stream_size 36499 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By" An