Hi Betsey,

I ran some of the Apache Tika Data Import Handler examples in Solr 5.5. The content (text) field was not stored by default. I can see the PDF contents in the returned documents once stored="true" is enabled on the field.

solr start -e dih

<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

/solr/tika/select?q=*%3A*&wt=json&indent=true

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="tika-test" processor="TikaEntityProcessor"
            url="${solr.install.dir}/example/exampledocs/solr-word.pdf" format="text">
      <field column="Author" name="author" meta="true"/>
      <field column="title" name="title" meta="true"/>
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>

Regards
Srinivas Meenavalli

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, August 26, 2016 3:09 AM
To: solr-user
Subject: Re: Question about indexing PDFs

That is always a dangerous assumption. Are you sure you're searching on the proper field? Are you sure it's indexed? Are you sure it's....

The schema browser I indicated above will give you some idea what's actually in the field. You can not only see the fields Solr (actually Lucene) see in your index, but you can also see what some of the terms are.

Adding &debug=query and looking at the parsed query will show you what fields are being searched against.

The most common causes of what you're describing are:
> not searching against the field you think you are. This is very easy to do without knowing it.
> not actually having 'indexed="true" set in your schema
> not committing after inserting the doc

Best,
Erick

On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh <betsey.ben...@stresearch.com> wrote:

> It looks like the metadata of the PDFs was indexed, but not the
> content (which is what I was interested in). Searches on terms I know
> exist in the content come up empty.
>
> On 8/25/16, 2:16 PM, "Betsey Benagh" <betsey.ben...@stresearch.com> wrote:
>
> >Right, that's where I looked. No 'content'. Which is what confused me.
> >
> >On 8/25/16, 1:56 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
> >
> >>when you say "I don't see it in the schema for that collection" are
> >>you talking schema.xml? managed_schema? Or actual documents in the index?
> >>Often these are defined by dynamic fields and the like in the schema files.
> >>
> >>Take a look at the admin UI>>schema browser>>drop down and you'll
> >>see all the actual fields in your index...
> >>
> >>Best,
> >>Erick
> >>
> >>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
> >><betsey.ben...@stresearch.com> wrote:
> >>
> >>> Following the instructions in the quick start guide, I imported a
> >>>bunch of PDF documents into my Solr 6.0 instance. As far as I can
> >>>tell from the documentation, there should be a 'content' field
> >>>indexing, well, the content, but I don't see it in the schema for
> >>>that collection. Is there something obvious I might have missed?
> >>>
> >>> Thanks!
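
For anyone following this thread later, the checks discussed here (search the field explicitly, ask for the stored fields back, and add debug=query to see the parsed query) could be scripted roughly as below. This is only a sketch under assumptions: the host and port (http://localhost:8983, Solr's default port), the helper name build_select_url, and the sample query term are illustrative and not taken from the thread; the core name "tika" and the field names come from the example config in this message.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, core, query, fields=None, debug=True):
    """Build a Solr /select URL (hypothetical helper, not part of Solr).

    fields -> the fl parameter, i.e. which stored fields to return.
    debug  -> adds debug=query so the response includes the parsed query.
    """
    params = {"q": query, "wt": "json", "indent": "true"}
    if fields:
        params["fl"] = ",".join(fields)
    if debug:
        params["debug"] = "query"
    # urlencode percent-escapes ':' and ',' so the URL is safe to paste or fetch.
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Assumed local instance; search the 'text' field explicitly rather than
# relying on whatever default field the request handler is configured with.
url = build_select_url(
    "http://localhost:8983",
    "tika",
    "text:solr",
    fields=["author", "title", "text"],
)
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should show whether the text field comes back in the documents, which only happens when it is stored, and the debug section shows which field the query was actually parsed against.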