RE: Question about indexing PDFs

Srinivasa Meenavalli Fri, 26 Aug 2016 01:22:28 -0700

Hi Betsey,

I ran some of the Solr 5.5 examples that use the Apache Tika Data Import Handler.
The content/text field is not stored by default.
I can see the PDF contents in the returned documents once stored="true" is set.


solr start -e dih

<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

/solr/tika/select?q=*%3A*&wt=json&indent=true

<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor"
                url="${solr.install.dir}/example/exampledocs/solr-word.pdf"
                format="text">
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="text"/>
        </entity>
    </document>
</dataConfig>
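
To confirm that the extracted text really comes back with the documents, the select request above can be scripted. This is only a minimal sketch: the base URL (localhost on port 8983) and the helper names select_url and docs_with_field are my assumptions, while the "tika" core and the "text" field are the ones used in this message.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Solr is running locally on its default port.
SOLR_BASE = "http://localhost:8983/solr"


def select_url(core, query="*:*"):
    """Build a /select URL for the given core (hypothetical helper)."""
    params = {"q": query, "wt": "json", "indent": "true"}
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


def docs_with_field(response, field):
    """Return the docs in a decoded Solr JSON response that carry `field`."""
    return [doc for doc in response["response"]["docs"] if field in doc]


if __name__ == "__main__":
    # Needs a running Solr with the "tika" core from the DIH example.
    with urlopen(select_url("tika")) as resp:
        data = json.load(resp)
    hits = docs_with_field(data, "text")
    print(f"{len(hits)} document(s) returned a stored 'text' field")
```

If no document carries a "text" field, the field is most likely still stored="false", or the import has not been re-run since the schema change.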

Regards
Srinivas Meenavalli

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, August 26, 2016 3:09 AM
To: solr-user
Subject: Re: Question about indexing PDFs

That is always a dangerous assumption. Are you sure you're searching on the 
proper field? Are you sure it's indexed? Are you sure it's....

The schema browser I indicated above will give you some idea what's actually in 
the field. You can not only see the fields Solr (actually Lucene) sees in your 
index, but you can also see what some of the terms are.

Adding &debug=query and looking at the parsed query will show you what fields 
are being searched against. The most common causes of what you're describing 
are:

> not searching against the field you think you are. This
is very easy to do without knowing it.

> not actually having indexed="true" set in your schema

> not committing after inserting the doc
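
The first and last of these checks can be scripted roughly as follows. Treat it as a sketch under assumptions rather than a recipe: the base URL (localhost, port 8983) and the helper names are hypothetical, and the "tika" core name in the usage note is borrowed from the DIH example earlier in this thread.

```python
from urllib.parse import urlencode

# Assumption: Solr is running locally on its default port.
SOLR_BASE = "http://localhost:8983/solr"


def debug_query_url(core, query):
    """Build a /select URL that asks Solr to include query debug output."""
    params = {"q": query, "wt": "json", "debug": "query"}
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


def parsed_query(response):
    """Pull the parsed query out of a decoded JSON response.

    The parsed query names the field(s) each term is searched against.
    """
    return response["debug"]["parsedquery"]


def commit_url(core):
    """Build an /update URL that triggers an explicit commit."""
    return f"{SOLR_BASE}/{core}/update?commit=true"
```

For example, fetching debug_query_url("tika", "solr") and passing the decoded JSON to parsed_query shows which field the term was actually searched in; requesting commit_url("tika") makes pending documents visible to searches.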

Best,
Erick

On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh < betsey.ben...@stresearch.com> 
wrote:

> It looks like the metadata of the PDFs was indexed, but not the
> content (which is what I was interested in).  Searches on terms I know
> exist in the content come up empty.
>
> On 8/25/16, 2:16 PM, "Betsey Benagh" <betsey.ben...@stresearch.com> wrote:
>
> >Right, that's where I looked.  No 'content'.  Which is what confused me.
> >
> >
> >On 8/25/16, 1:56 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
> >
> >>When you say "I don't see it in the schema for that collection", are
> >>you talking about schema.xml? managed_schema? Or actual documents in the
> >>index? Often these are defined by dynamic fields and the like in the
> >>schema files.
> >>
> >>Take a look at the admin UI>>schema browser>>drop down and you'll
> >>see all the actual fields in your index...
> >>
> >>Best,
> >>Erick
> >>
> >>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
> >><betsey.ben...@stresearch.com
> >>> wrote:
> >>
> >>> Following the instructions in the quick start guide, I imported a
> >>>bunch of  PDF documents into my Solr 6.0 instance.  As far as I can
> >>>tell from the  documentation, there should be a 'content' field
> >>>indexing, well, the  content, but I don't see it in the schema for
> >>>that collection.  Is there  something obvious I might have missed?
> >>>
> >>> Thanks!
> >>>
> >>>
> >
>
>
