I have investigated different Solr versions and found that 4.10.3 is
the last version that completely strips the HTML to text as expected.
4.10.4 starts introducing some HTML comments and JavaScript, and anything
over 5.0 is full of mangled HTML and attribute artefacts such as
"X-Parsed-By".
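One way to compare versions is to scan the indexed text for the kinds of residue described above. The following is a minimal sketch, not a definitive check: the pattern set and the sample strings are assumptions made for illustration, and the text to scan would have to be fetched from the index separately.

```python
import re

# Heuristic markers based on the symptoms named in the thread:
# HTML comments, leftover tags, and the "X-Parsed-By" metadata key.
RESIDUE_PATTERNS = {
    "html_comment": re.compile(r"<!--.*?-->", re.DOTALL),
    "html_tag": re.compile(r"</?[a-zA-Z][^>]*>"),
    "tika_metadata": re.compile(r"X-Parsed-By"),
}


def find_residue(text):
    """Return the sorted names of the residue patterns that occur in text."""
    return sorted(
        name for name, pattern in RESIDUE_PATTERNS.items() if pattern.search(text)
    )


# Hypothetical samples standing in for the indexed field content.
clean = "Using mailing lists effectively"
dirty = "<!-- nav --> Using mailing lists X-Parsed-By org.apache.tika.parser.DefaultParser"

print(find_residue(clean))
print(find_residue(dirty))
```

An empty list suggests the field holds plain text only; any names returned indicate which kind of residue leaked through.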
Thanks Timothy,
Will give the DIH a try. I have submitted a bug report.
Regards,
Simon
On 31/05/16 13:22, Allison, Timothy B. wrote:
>> From the same page, extractFormat=text only applies when extractOnly
>> is true, which just shows the output from tika without indexing the document.
Y, sorry. I just looked through the source code. You're right. If you use
DIH (TikaEntityProcessor) instead of Solr Cell (ExtractingDocumen
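
The extractOnly / extractFormat behaviour discussed above can be checked with a request along these lines. This is only a sketch: the Solr base URL and the core name "mycore" are assumptions, and the response format (wt=json) is chosen for illustration. With extractOnly=true nothing is indexed; Solr just returns what Tika produced.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, core, extract_format="text"):
    """Build a Solr Cell URL that returns Tika's output without indexing."""
    params = urlencode({
        "extractOnly": "true",            # show Tika output only, index nothing
        "extractFormat": extract_format,  # only applies when extractOnly=true
        "wt": "json",
    })
    return f"{solr_base}/{core}/update/extract?{params}"


def extract_only(path, url, content_type="text/plain"):
    """POST a local file as the request body and return the raw response."""
    with open(path, "rb") as fh:
        request = Request(url, data=fh.read(),
                          headers={"Content-Type": content_type})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Assumed host and core name; adjust to the local installation.
url = build_extract_url("http://localhost:8983/solr", "mycore")
print(url)
```

Calling extract_only(path, url) against a running Solr instance shows the extracted text as Solr sees it, which helps separate a Tika problem from a Solr Cell one.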
Hi Alex,
That sounds similar. I am puzzled by what I am seeing because it looks
like a major bug and I am following the docs for curl as closely as
possible, but hardly anyone else seems to have noticed it. To me it is a
show-stopper.
If I convert the docs to txt with html2text first then I
I think Solr's layer above Tika was merging in metadata and text all
together without a way (that I could see) to separate them.
That's all I remember of my examination of this issue when I ran into
something similar. Not very helpful, I know.
Regards,
Alex.
Hi Timothy,
Thanks for responding.
java -jar tika-app-1.13.jar -t
"/home/user/Documents/library/UsingMailingLists.txt"
...gives a clean result with no CSS or other nasties in the output. So
it looks like the latest version of Tika itself is OK.
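
For repeating this check over several documents, the same command can be wrapped in a small script. This is a sketch under the assumption that Java is on the PATH and the tika-app jar is in the working directory; the helper names are made up for illustration.

```python
import subprocess


def tika_text_command(jar_path, document_path):
    """Build the argument list for Tika's plain-text (-t) output mode."""
    return ["java", "-jar", jar_path, "-t", document_path]


def extract_text(jar_path, document_path):
    """Run tika-app and return the extracted text (needs Java and the jar)."""
    result = subprocess.run(
        tika_text_command(jar_path, document_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


cmd = tika_text_command(
    "tika-app-1.13.jar",
    "/home/user/Documents/library/UsingMailingLists.txt",
)
print(" ".join(cmd))
```

The output of extract_text() gives a baseline from Tika alone to compare against what ends up in the Solr index.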
I was basing the test case on this doc page as
Of course, for greater control over indexing (and for more robust handling of
exceedingly rare (but real) infinite loops/OOM caused by Tika), consider SolrJ:
http://searchhub.org/2012/02/14/indexing-with-solrj/
-----Original Message-----
From: Simon Blandford [mailto:simon.blandf...@bkconnect.ne
I'm only minimally familiar with Solr Cell, but...
1) It looks like you aren't setting extractFormat=text. According to [0]...the
default is xhtml which will include a bunch of the metadata.
2) Is there an attr_* dynamic field in your index with type="ignored"? This
would strip out the attr_ f