RE: select query does not find indexed pdf document

Bob Sandiford Mon, 12 Sep 2011 10:39:35 -0700

Hi, Michael.

Well, the stock answer is, 'it depends'.


For example - would you want to be able to search filename without searching 
file contents, or would you always search both of them together?  If both, then 
copy both the file name and the parsed file content from the pdf into a single 
search field, and you can set that up as the default search field.
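
As a rough sketch of the 'search both together' option, a schema.xml fragment along these lines copies two fields into one catch-all field and makes it the default. The field names filename, content and text, and the type name text_general, are assumptions for illustration only (they are not taken from Michael's setup), so substitute whatever your schema actually defines.

```xml
<!-- Sketch only: field and type names below are assumed, not from this thread. -->
<!-- These elements belong inside the <schema> element of schema.xml. -->
<fields>
  <field name="filename" type="string" indexed="true" stored="true"/>
  <field name="content" type="text_general" indexed="true" stored="true"/>
  <!-- Catch-all search field; multiValued because two sources feed it. -->
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
</fields>

<!-- Copy both sources into the single search field at index time. -->
<copyField source="filename" dest="text"/>
<copyField source="content" dest="text"/>

<!-- Queries that name no field (e.g. q=vpn) are run against this field. -->
<defaultSearchField>text</defaultSearchField>
```

With the extract handler, the file name would still have to be supplied at index time, for instance with a literal.filename parameter alongside the literal.id already used in the curl command (again assuming a field called filename).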

Or - what kind of processing / normalizing do you want on this data?  Case 
insensitive?  Accent insensitive?  If a 'word' contains camel case (e.g. 
TheVeryIdea), do you want that split on the case changes?  (But then watch out 
for things like "iPad".)  If a 'word' contains numbers, do you want them left 
together, or separated?  Do you want stemming (where searching for 'stemming' 
would also find 'stem', 'stemmed', that sort of thing)?  Is this always 
English, or are there other languages involved?  Do you want the text processing 
to be the same for indexing vs searching?  Do you want to be able to find hits 
based on the first few characters of a term?  (ngrams)
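
To make a few of those choices concrete, here is a hedged sketch of a field type that is case insensitive, accent insensitive, splits on case changes and on letter/number boundaries, and supports prefix matching through edge n-grams applied at index time only. The type name text_sketch and the gram sizes are made up for illustration; the factory classes are standard Solr ones, but check them against the version you are running.

```xml
<!-- Sketch only: the type name "text_sketch" and the gram sizes are illustrative. -->
<fieldType name="text_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Whitespace tokenizer keeps TheVeryIdea whole so the next filter can split it. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split on case changes and letter/number boundaries; keep the original token too. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1"/>
    <!-- Case insensitive: lowercase only after the case-based split has happened. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Accent insensitive: fold accented characters to plain ASCII. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Index-time only: store leading prefixes of each term for "first few characters" hits. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same chain without the n-gram step, so a typed prefix matches the indexed grams. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Stemming is left out on purpose: a stemmer such as solr.PorterStemFilterFactory tends to interact badly with prefix grams, so it usually fits better in a separate field type, which is one more thing to prototype.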

Do you want to be able to highlight text segments where the search terms were 
found?
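
If highlighting matters, one possible setup is to turn it on by default in solrconfig.xml. The fragment below is a sketch under assumptions: the handler name search follows the example configuration shipped around this time and may differ in yours, and content is a hypothetical field that must be declared stored="true" in schema.xml, because highlighting works from the stored text.

```xml
<!-- Sketch only: handler name and field name are assumptions. -->
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- Turn highlighting on for every request to this handler. -->
    <str name="hl">true</str>
    <!-- Field(s) to build snippets from; must be stored="true". -->
    <str name="hl.fl">content</str>
    <!-- Return up to three snippets per document. -->
    <str name="hl.snippets">3</str>
  </lst>
</requestHandler>
```

The same parameters can also be passed per request (hl=true and hl.fl=content on the select URL) if you would rather not change the defaults.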

You probably want to read up on the various tokenizers and filters that are 
available.  Do some prototyping and see how it looks.

Here's a starting point: 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Basically, there is no 'one size fits all' here.  Part of the power of Solr / 
Lucene is its configurability to achieve the results your business case calls 
for.  Part of the drawback of Solr / Lucene - especially for new folks - is its 
configurability to achieve the results your business case calls for. :)

Anyone got anything else to suggest for Michael?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

From: Michael Dockery [mailto:dockeryjava...@yahoo.com]
Sent: Monday, September 12, 2011 1:18 PM
To: Bob Sandiford
Subject: Re: select query does not find indexed pdf document

thank you.  that worked.

Any tips for   very   very  basic setup of the schema xml?
   ....or is the default basic enough?

I basically only want to search on
        filename   and    file contents


From: Bob Sandiford <bob.sandif...@sirsidynix.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>; Michael 
Dockery <dockeryjava...@yahoo.com>
Sent: Monday, September 12, 2011 10:04 AM
Subject: RE: select query does not find indexed pdf document

Um - looks like you specified your id value as "pdfy", which is reflected in 
the results from the "*:*" query, but your id query is searching for "vpn", 
hence no matches...

What does this query yield?

http://www/SearchApp/select/?q=id:pdfy

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

> -----Original Message-----
> From: Michael Dockery [mailto:dockeryjava...@yahoo.com]
> Sent: Monday, September 12, 2011 9:56 AM
> To: solr-user@lucene.apache.org
> Subject: Re: select query does not find indexed pdf document
>
> http://www/SearchApp/select/?q=id:vpn
>
> yields this:
>   <?xml version="1.0" encoding="UTF-8" ?>
>   <response>
>   <lst name="responseHeader">
>   <int name="status">0</int>
>   <int name="QTime">15</int>
>   <lst name="params">
>   <str name="q">id:vpn</str>
>   </lst>
>   </lst>
>   <result name="response" numFound="0" start="0"/>
>   </response>
>
>
> *****************************************
>
>  http://www/SearchApp/select/?q=*:*
>
> yields this:
>
>   <?xml version="1.0" encoding="UTF-8" ?>
>   <response>
>   <lst name="responseHeader">
>   <int name="status">0</int>
>   <int name="QTime">16</int>
>   <lst name="params">
>   <str name="q">*.*</str>
>   </lst>
>   </lst>
>   <result name="response" numFound="1" start="0">
>   <doc>
>   <str name="author">doc</str>
>   <arr name="content_type">
>   <str>application/pdf</str>
>   </arr>
>   <str name="id">pdfy</str>
>   <date name="last_modified">2011-05-20T02:08:48Z</date>
>   <arr name="title">
>   <str>dmvpndeploy.pdf</str>
>   </arr>
>   </doc>
>   </result>
>   </response>
>
>
> From: Jan Høydahl <jan....@cominvent.com>
> To: solr-user@lucene.apache.org; Michael Dockery <dockeryjava...@yahoo.com>
> Sent: Monday, September 12, 2011 4:59 AM
> Subject: Re: select query does not find indexed pdf document
>
> Hi,
>
> What do you get from a query http://www/SearchApp/select/?q=*:* or
> http://www/SearchApp/select/?q=id:vpn ?
> You may not have mapped the fields correctly to your schema?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 12. sep. 2011, at 02:12, Michael Dockery wrote:
>
> > I am new to solr.
> >
> > I tried to upload a pdf file via curl to my solr webapp (on tomcat)
> >
> > curl
> > "http://www/SearchApp/update/extract?stream.file=c:\dmvpn.pdf&stream.contentType=application/pdf&literal.id=pdfy&commit=true"
> >
> >
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <response>
> > <lst name="responseHeader"><int name="status">0</int><int name="QTime">860</int></lst>
> > </response>
> >
> >
> > but
> >
> > http://www/SearchApp/select/?q=vpn
> >
> >
> > does not find the document
> >
> >
> > <response>
> > <lst name="responseHeader">
> > <int name="status">0</int>
> > <int name="QTime">0</int>
> > <lst name="params">
> > <str name="q">vpn</str>
> > </lst>
> > </lst>
> > <result name="response" numFound="0" start="0"/>
> > </response>
> >
> >
> > help is appreciated.
> >
> > =================================================
> > fyi
> > I point my test webapp to the index/solr home via mod meta-
> data/context.xml
> > <Context crossContext="true" >
> >    <Environment name="solr/home" type="java.lang.String"
> >  value="c:/solr_home" override="true" />
> >
> > and I had to copy all these jars to my webapp lib dir: (to avoid the
> classnotfound)
> > Solr_download\contrib\extraction\lib
> >  ...in the future i plan to put them in the tomcat/lib dir.
> >
> >
> > Also, I have not modified conf\solrconfig.xml or schema.xml.
