RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Allison, Timothy B. Tue, 20 Jun 2017 05:07:44 -0700

Yeah, Chris knows a thing or two about Tika.  :)

-----Original Message-----
From: ZiYuan [mailto:ziyu...@gmail.com] 
Sent: Tuesday, June 20, 2017 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting 
matched text with context


No intention of spamming but I also want to mention tika-python 
<https://github.com/chrismattmann/tika-python> in the toolchain.

Ziyuan

On Tue, Jun 20, 2017 at 2:29 PM, ZiYuan <ziyu...@gmail.com> wrote:

> Dear Erick and Timothy,
>
> I also took a look at the Python clients (say, SolrClient and pysolr)
> because Python is my main programming language. My impression is that
> 1. they send HTTP requests to the server according to the server APIs, and
> 2. they are not official and thus possibly not up to date. Does SolrJ
> talk to the server via HTTP or in some other, more native way? Is the
> main benefit of SolrJ over other clients that it ships officially with Solr?
> Thank you.
>
> Best regards,
> Ziyuan
>
> On Jun 19, 2017 18:43, "ZiYuan" <ziyu...@gmail.com> wrote:
>
>> Dear Erick and Timothy,
>>
>> Yes, I will parse on the client for all the benefits. I am just
>> trying to figure out what is going on by indexing one or two PDF files first.
>> Thank you both.
>>
>> Best regards,
>> Ziyuan
>>
>> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson 
>> <erickerick...@gmail.com>
>> wrote:
>>
>>> bq: Hope that there is no side effect of not mapping the PDF
>>>
>>> Well, yes, it will have a side effect: the PDF content will no longer
>>> be searchable through _text_. You can cure that with a
>>> copyField directive from content to _text_.
>>>
>>> But do really consider running this as a SolrJ program on the client.
>>> Tim knows in far more painful detail than I do what kinds of 
>>> problems there are when parsing all the different formats so I'd 
>>> _really_ follow his advice.
>>>
>>> Tika pretty much has an impossible job. "Here, try to parse all
>>> these different formats, implemented by different vendors with
>>> different versions that more or less follow a spec (which in many
>>> cases really isn't a spec, just recommendations), using packages that
>>> may or may not be actively maintained. And by the way, we'll try to
>>> handle that 1G document that someone sends us, but don't blame us if
>>> we hit an OOM.....". When Tika is run on the same box as Solr, any
>>> problems in that entire chain can adversely affect your search.
>>>
>>> Not to mention that Tika has to do some heavy lifting, using CPU 
>>> cycles that are unavailable for Solr.
>>>
>>> Extracting Request Handler is a fine way to get started, but for 
>>> production seriously consider a separate client.
>>>
>>> Best,
>>> Erick
>>>
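The copyField directive suggested above (from content to _text_) does not have to be added by hand-editing the schema. A minimal sketch, assuming a managed schema and Solr's Schema API; the host, port, and collection in the comment are placeholders, not values from this thread:

```python
import json


def copy_field_command(source, dest):
    """Return the JSON body of a Schema API add-copy-field command."""
    return json.dumps({"add-copy-field": {"source": source, "dest": dest}})


body = copy_field_command("content", "_text_")
print(body)
# POST this body with Content-Type: application/json to a URL of the form
#   http://localhost:8983/solr/<collection>/schema   (placeholder URL)
```

Documents indexed before the copyField was added would need to be re-indexed for their content to show up in _text_.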
>>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyu...@gmail.com> wrote:
>>> > Hi Erick,
>>> >
>>> > Now it is clear. I have to update the request handler of /update/extract/
>>> > from
>>> > "defaults":{"fmap.content":"_text_"}
>>> > to
>>> > "defaults":{"fmap.content":"content"}
>>> > to fill the field.
>>> >
>>> > Hope that there is no side effect of not mapping the PDF content to _text_.
>>> > Thank you for the hint.
>>> >
>>> > Best regards,
>>> > Ziyuan
>>> >
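The handler change described above (fmap.content from _text_ to content) can probably also be made through Solr's Config API rather than by editing solrconfig.xml. The sketch below only builds the request body. The handler class name is the one normally used for Solr Cell, and it is an assumption that your handler has no other defaults; as far as I understand, update-requesthandler replaces the whole definition, so any other defaults must be carried over:

```python
import json


def extract_handler_update(content_field):
    """Return a Config API body that redefines /update/extract so that
    Tika's extracted body text is mapped to `content_field`."""
    command = {
        "update-requesthandler": {
            "name": "/update/extract",
            "class": "solr.extraction.ExtractingRequestHandler",
            # Only the default discussed in this thread is shown; copy over
            # any other defaults your existing handler declares.
            "defaults": {"fmap.content": content_field},
        }
    }
    return json.dumps(command)


payload = extract_handler_update("content")
print(payload)
# POST this body with Content-Type: application/json to a URL of the form
#   http://localhost:8983/solr/<collection>/config   (placeholder URL)
```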
>>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher 
>>> > <erik.hatc...@gmail.com>
>>> > wrote:
>>> >
>>> >> Ziyuan -
>>> >>
>>> >> You may be interested in the example/files that ships with Solr too. It’s
>>> >> got schema and config and even UI for file indexing and searching. Check
>>> >> out README.txt under example/files in your Solr install.
>>> >>
>>> >>         Erik
>>> >>
>>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
>>> >> >
>>> >> > Hi Erick,
>>> >> >
>>> >> > thanks very much for the explanations! Clarification for question 2: more
>>> >> > specifically, I cannot see the field content in the returned JSON, with the
>>> >> > same definitions as in the post
>>> >> > <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>:
>>> >> >
>>> >> > <field name="content" type="text_general" indexed="false" stored="true"/>
>>> >> > <field name="text" type="text_general" multiValued="true" indexed="true" stored="false"/>
>>> >> > <copyField source="content" dest="text"/>
>>> >> >
>>> >> > Is it the case that Tika does not fill these two fields automatically and I
>>> >> > have to write some client code to fill them?
>>> >> >
>>> >> > Best regards,
>>> >> > Ziyuan
>>> >> >
>>> >> >
>>> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> >> >
>>> >> >> 1> Yes, you can use your single definition. The author identifies the
>>> >> >> "text" field as a catch-all. Somewhere in the schema there'll 
>>> >> >> be a copyField directive copying (perhaps) many different 
>>> >> >> fields to the "text" field. That permits simple searches 
>>> >> >> against a single field rather than, say, using edismax to 
>>> >> >> search across multiple separate fields.
>>> >> >>
>>> >> >> 2> The link you referenced is for Data Import Handler, which is much
>>> >> >> different than just posting files to Solr. See
>>> >> >> ExtractingRequestHandler:
>>> >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>>> >> >> There are ways to map meta-data fields from the doc into
>>> >> >> specific fields matching your schema. Be a little careful
>>> >> >> here. There is no standard across different types of docs as
>>> >> >> to which meta-data fields are included. PDF might have a
>>> >> >> "last_edited" field. Word might have a "last_modified" field
>>> >> >> where the two mean the same thing. Here's a link
>>> >> >> to a SolrJ program that'll dump all the fields:
>>> >> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
>>> >> >> hack out the DB bits.
>>> >> >>
>>> >> >> BTW, once you get more familiar with processing, I strongly recommend
>>> >> >> you do the document processing on the client; the reasons are outlined
>>> >> >> in that article.
>>> >> >>
>>> >> >> bq: even if I define the fields as he said I cannot see them in
>>> >> >> the search results as keys in JSON
>>> >> >>
>>> >> >> Are the fields set as stored="true"? They must be stored to be
>>> >> >> returned in requests (skipping the docValues discussion here).
>>> >> >>
>>> >> >> 3> Yes, the text field is a concatenation of all the other ones.
>>> >> >> Because it has stored=false, you can only search it; you
>>> >> >> cannot highlight or view it. Fields you highlight must have
>>> >> >> stored=true, BTW.
>>> >> >>
>>> >> >> Whether or not you can highlight "Trevor Hastie" depends on a lot of
>>> >> >> things, most particularly whether that text is ever actually
>>> >> >> in a field in your index. There's no guarantee that the name
>>> >> >> of the file is indexed in a searchable/highlightable way.
>>> >> >>
>>> >> >> And the query q=id:Trevor Hastie won't do what you think. It'll be
>>> >> >> parsed as
>>> >> >> id:Trevor _text_:Hastie
>>> >> >> _text_ is the default field; look for a "df" parameter in your request
>>> >> >> handler in solrconfig.xml (usually "/select" or "/query").
>>> >> >>
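To keep both words bound to one field, the query can be sent as a quoted phrase, with highlighting requested on a stored field. A minimal sketch that only builds the request URL; the base URL and collection name are placeholders, and searching the content field assumes it is both indexed and stored, which is not the case for every schema in this thread:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def highlight_query_url(base_url, field, phrase, hl_field):
    """Build a /select URL that searches `field` for an exact phrase and
    asks Solr to highlight matches in `hl_field`."""
    params = {
        "q": '{}:"{}"'.format(field, phrase),  # quotes bind both words to `field`
        "hl": "true",                          # turn highlighting on
        "hl.fl": hl_field,                     # field to highlight (must be stored)
        "wt": "json",
    }
    return "{}/select?{}".format(base_url, urlencode(params))


url = highlight_query_url(
    "http://localhost:8983/solr/mycollection", "content", "Trevor Hastie", "content"
)
print(url)
# Decode the query string again to confirm the phrase survived URL encoding.
print(parse_qs(urlsplit(url).query)["q"][0])
```

Whether the phrase actually matches still depends on how the field is analyzed; a phrase containing double quotes would also need escaping, which this sketch does not handle.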
>>> >> >> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <ziyu...@gmail.com> wrote:
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> I am new to Solr and I need to implement a full-text search of some PDF
>>> >> >>> files. The indexing part works out of the box by using bin/post. I can see
>>> >> >>> search results in the admin UI given some queries, though without the
>>> >> >>> matched texts and the context.
>>> >> >>>
>>> >> >>> Now I am reading this post
>>> >> >>> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>>> >> >>> for the highlighting part. It is for an older version of Solr, when managed
>>> >> >>> schema was not available. Before fully understanding what it is doing, I have
>>> >> >>> some questions:
>>> >> >>>
>>> >> >>> 1. He defined two fields:
>>> >> >>>
>>> >> >>> <field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
>>> >> >>> <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
>>> >> >>>
>>> >> >>> But why are two fields needed? Can I define a single field
>>> >> >>>
>>> >> >>> <field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
>>> >> >>>
>>> >> >>> to capture the full text?
>>> >> >>>
>>> >> >>> 2. How are the fields filled? I don't see relevant
>>> >> >>> information in TikaEntityProcessor's documentation
>>> >> >>> <https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
>>> >> >>> The current text extractor should already be Tika (I can see
>>> >> >>>
>>> >> >>> "x_parsed_by":
>>> >> >>> ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
>>> >> >>>
>>> >> >>> in the returned JSON of some query). But even if I define the fields as he
>>> >> >>> said, I cannot see them in the search results as keys in JSON.
>>> >> >>>
>>> >> >>> 3. The _text_ field seems to be a concatenation of other fields; does it
>>> >> >>> contain the full text? It does not seem to be accessible by default, though.
>>> >> >>>
>>> >> >>> To be brief, using The Elements of Statistical Learning
>>> >> >>> <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
>>> >> >>> as an example, how do I highlight the relevant text for the query "SVM"? And
>>> >> >>> if I change the file name to "The Elements of Statistical Learning -
>>> >> >>> Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the
>>> >> >>> query "id:Trevor Hastie"?
>>> >> >>>
>>> >> >>> Thank you.
>>> >> >>>
>>> >> >>> Best regards,
>>> >> >>> Ziyuan
>>> >> >>
>>> >>
>>> >>
>>>
>>
>>
