> 1) the toughest pdfs to identify are those that are partly
> searchable (text) and partly not (image-based text). However, I've
> found that such documents tend to exist in clusters.
Agreed. We should do something better in Tika to identify image-only pages on
a page-by-page basis, and
Thanks, Tim. A couple of quick comments and a couple of questions:
1) the toughest pdfs to identify are those that are partly
searchable (text) and partly not (image-based text). However, I've
found that such documents tend to exist in clusters.
2) email documents (.eml) are no
To be Waldorf to Erick's Statler (if I may), lots of things can go wrong during
content extraction.[1] I had two big concerns when I heard of your task:
1) image-only PDFs, which can parse without problem but might yield no
content.
2) emails (see, e.g., SOLR-12048)
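
As a rough illustration of the first concern, here is a minimal sketch that flags pages whose extracted text is empty or nearly so. It assumes you already have one extracted string per page from whatever extractor you use; the function name `image_only_pages` and the `min_chars` cutoff are hypothetical, not part of Tika or Solr.

```python
def image_only_pages(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is (nearly) empty.

    page_texts: list of strings, one per page, from any text extractor.
    min_chars:  hypothetical cutoff; tune it against your own documents.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only letters and digits, so whitespace and form feeds
        # left behind by extraction do not count as content.
        visible = sum(1 for ch in text if ch.isalnum())
        if visible < min_chars:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    pages = [
        "Quarterly report for the board of directors, 2018.",
        "",          # a scanned page that parsed cleanly but yielded nothing
        " \n\x0c",   # whitespace-only extraction
    ]
    print(image_only_pages(pages))
```

A document where every page is flagged is a candidate for OCR; a document where only some pages are flagged is the partly searchable case described earlier in the thread.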
It sounds like yo
ike date, subject, to and from. For other (so-called 'rich text')
>>> documents (like PDFs and Word-type), the metadata is not so useful, but
>>> on the other hand, there's not much consistent structure to the
>>> documents I have to deal with.
>>>
>>
However, there's a premium on precision (and recall) in searches.
>> Please, oh, please, no matter what you're using for content/text extraction
>> and/or OCR, run tika-eval[1] on the output to ensure that you are
>> getting mostly language-y content out of your documents. Ping us on the
>> Tika user's list if you have any questions.
>>
>> Bad text,
wiki.apache.org/tika/TikaEval
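
To make the "language-y content" check concrete, here is a crude stand-in, not tika-eval itself: it scores extracted text by the share of tokens that are very common English words. The word list, the function names and the 0.15 threshold are all assumptions chosen for illustration, and the heuristic only makes sense for English text.

```python
import re

# Tiny illustrative list of frequent English words. This is an assumption
# for the sketch, not the vocabulary that tika-eval uses.
COMMON_WORDS = {
    "the", "of", "and", "to", "in", "a", "is", "that",
    "for", "on", "with", "as", "it", "was", "be",
}


def common_word_ratio(text):
    """Fraction of alphabetic tokens that appear in COMMON_WORDS."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in COMMON_WORDS)
    return hits / len(tokens)


def looks_like_language(text, threshold=0.15):
    """Hypothetical cutoff: treat text as 'language-y' above the threshold."""
    return common_word_ratio(text) >= threshold


if __name__ == "__main__":
    good = "The results of the search are shown in the table."
    bad = "x7q zzkp 0x1f ##@@ lrrp"
    print(looks_like_language(good), looks_like_language(bad))
```

Documents that score low are worth inspecting by hand before they are indexed, since garbled extraction or OCR output tends to hurt both precision and recall.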
>
> -Original Message-
> From: Charlie Hull [mailto:char...@flax.co.uk]
> Sent: Tuesday, April 17, 2018 4:17 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Specialized Solr Application
>
> On 16/04/2018 19:48, Terry Steichen wrote:
> I have from time to time posted questions to this list (and received
> very prompt and helpful responses). But it seems that many of you are
> operating in a very different space from me. The problems (and
> lessons learned) which I encounter are often very