Re: select query does not find indexed pdf document

Erick Erickson Wed, 14 Sep 2011 11:44:28 -0700

You can use <copyField> to put data from separate fields into a common
search field.
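
As a rough sketch of what that could look like in schema.xml: the field names "filename" and "content" below are hypothetical (use whatever your extraction setup actually populates), and the "text" type and "text" destination field are assumed to exist as in the stock example schema.

```xml
<!-- schema.xml fragment (sketch only).
     "filename" and "content" are hypothetical source field names;
     "text" is assumed to be the existing default search field. -->
<field name="filename" type="string" indexed="true" stored="true"/>
<field name="content"  type="text"   indexed="true" stored="true"/>
<field name="text"     type="text"   indexed="true" stored="false" multiValued="true"/>

<!-- Copy both sources into the common search field at index time. -->
<copyField source="filename" dest="text"/>
<copyField source="content"  dest="text"/>
```

The destination field is multiValued because more than one source is copied into it.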


This page will help you get started on the mods you'd need to make to
a <fieldType> to analyze the text as you wish:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

But as a start, think about WhitespaceTokenizer followed by
LowerCaseFilterFactory
ASCIIFoldingFilterFactory
NGramFilterFactory
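
A minimal sketch of such a chain as a <fieldType>, assuming a hypothetical type name "text_ngram" and illustrative gram sizes (tune both for your data). A single <analyzer> with no type attribute is used for both indexing and querying.

```xml
<!-- schema.xml fragment (sketch only).
     The name "text_ngram" and the gram sizes are assumptions. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so tokens like "numb3rs" stay whole. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Case insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Accent insensitive. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Partial-word matching. -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>
```

Applying n-grams on the query side as well can make matches quite broad, so check the results on the analysis page of the admin UI before settling on it.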


Pay attention to the note at the top that directs you to the full
list; the page above contains only a partial list. For instance,
NGramFilterFactory isn't on that page, it's on the page that's linked to:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html

Best
Erick

On Tue, Sep 13, 2011 at 10:46 PM, Michael Dockery
<dockeryjava...@yahoo.com> wrote:
> Thank you for your informative reply.
>
> I would like to start simple by combining both filename and content
>   into the same default search field
>    ...which my default schema xml calls  "text"
> ...
> <defaultSearchField>text</defaultSearchField>
> ...
>
> also:
> -case and accent insensitive
> -no splits on numb3rs
> -no highlights
> -text processing same for index and search
>
> however I do like
> -I like ngrams preferably (partial/prefix word/token search)
>
>
> what schema mods would be needed?
>
> also what curl syntax to submit/index a pdf (with filename and content 
> combined into the default search field)?
>
>
>
> ________________________________
> From: Bob Sandiford <bob.sandif...@sirsidynix.com>
> To: Michael Dockery <dockeryjava...@yahoo.com>
> Cc: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Monday, September 12, 2011 1:38 PM
> Subject: RE: select query does not find indexed pdf document
>
> Hi, Michael.
>
> Well, the stock answer is, 'it depends'
>
> For example - would you want to be able to search filename without searching 
> file contents, or would you always search both of them together?  If both, 
> then copy both the file name and the parsed file content from the pdf into a 
> single search field, and you can set that up as the default search field.
>
> Or - what kind of processing / normalizing do you want on this data?  Case 
> insensitive?  Accent insensitive?  If a 'word' contains camel case (e.g. 
> TheVeryIdea), do you want that split on the case changes?  (but then watch 
> out for things like "iPad")  If a 'word' contains numbers, do you want them left 
> together, or separated?  Do you want stemming (where searching for 'stemming' 
> would also find 'stem', 'stemmed', that sort of thing?)  Is this always 
> English, or are other languages involved?  Do you want the text 
> processing to be the same for indexing vs searching?  Do you want to be able 
> to find hits based on the first few characters of a term?  (ngrams)
>
> Do you want to be able to highlight text segments where the search terms were 
> found?
>
> You probably want to read up on the various tokenizers and filters that are 
> available.  Do some prototyping and see how it looks.
>
> Here's a starting point: 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> Basically, there is no 'one size fits all' here.  Part of the power of Solr / 
> Lucene is its configurability to achieve the results your business case calls 
> for.  Part of the drawback of Solr / Lucene - especially for new folks - is 
> its configurability to achieve the results your business case calls for. :)
>
> Anyone got anything else to suggest for Michael?
>
> Bob Sandiford | Lead Software Engineer | SirsiDynix
> P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> www.sirsidynix.com
>
> From: Michael Dockery [mailto:dockeryjava...@yahoo.com]
> Sent: Monday, September 12, 2011 1:18 PM
> To: Bob Sandiford
> Subject: Re: select query does not find indexed pdf document
>
> thank you.  that worked.
>
> Any tips for   very   very  basic setup of the schema xml?
>    ....or is the default basic enough?
>
> I basically only want to search on
>         filename   and    file contents
>
>
> From: Bob Sandiford <bob.sandif...@sirsidynix.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>; Michael 
> Dockery <dockeryjava...@yahoo.com>
> Sent: Monday, September 12, 2011 10:04 AM
> Subject: RE: select query does not find indexed pdf document
>
> Um - looks like you specified your id value as "pdfy", which is reflected in 
> the results from the "*:*" query, but your id query is searching for "vpn", 
> hence no matches...
>
> What does this query yield?
>
> http://www/SearchApp/select/?q=id:pdfy
>
> Bob Sandiford | Lead Software Engineer | SirsiDynix
> P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> www.sirsidynix.com
>
>> -----Original Message-----
>> From: Michael Dockery [mailto:dockeryjava...@yahoo.com]
>> Sent: Monday, September 12, 2011 9:56 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: select query does not find indexed pdf document
>>
>> http://www/SearchApp/select/?q=id:vpn
>>
>> yields this:
>>   <?xml version="1.0" encoding="UTF-8" ?>
>>   <response>
>>   <lst name="responseHeader">
>>   <int name="status">0</int>
>>   <int name="QTime">15</int>
>>   <lst name="params">
>>   <str name="q">id:vpn</str>
>>   </lst>
>>   </lst>
>>   <result name="response" numFound="0" start="0"/>
>>   </response>
>>
>>
>> *****************************************
>>
>>  http://www/SearchApp/select/?q=*:*
>>
>> yields this:
>>
>>   <?xml version="1.0" encoding="UTF-8" ?>
>>   <response>
>>   <lst name="responseHeader">
>>   <int name="status">0</int>
>>   <int name="QTime">16</int>
>>   <lst name="params">
>>   <str name="q">*.*</str>
>>   </lst>
>>   </lst>
>>   <result name="response" numFound="1" start="0">
>>   <doc>
>>   <str name="author">doc</str>
>>   <arr name="content_type">
>>   <str>application/pdf</str>
>>   </arr>
>>   <str name="id">pdfy</str>
>>   <date name="last_modified">2011-05-20T02:08:48Z</date>
>>   <arr name="title">
>>   <str>dmvpndeploy.pdf</str>
>>   </arr>
>>   </doc>
>>   </result>
>>   </response>
>>
>>
>> From: Jan Høydahl <jan....@cominvent.com>
>> To: solr-user@lucene.apache.org; Michael Dockery <dockeryjava...@yahoo.com>
>> Sent: Monday, September 12, 2011 4:59 AM
>> Subject: Re: select query does not find indexed pdf document
>>
>> Hi,
>>
>> What do you get from a query http://www/SearchApp/select/?q=*:* or
>> http://www/SearchApp/select/?q=id:vpn ?
>> You may not have mapped the fields correctly to your schema?
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>>
>> On 12. sep. 2011, at 02:12, Michael Dockery wrote:
>>
>> > I am new to solr.
>> >
>> > I tried to upload a pdf file via curl to my solr webapp (on tomcat)
>> >
>> > curl "http://www/SearchApp/update/extract?stream.file=c:\dmvpn.pdf&stream.contentType=application/pdf&literal.id=pdfy&commit=true"
>> >
>> >
>> >
>> > <?xml version="1.0" encoding="UTF-8"?>
>> > <response>
>> > <lst name="responseHeader"><int name="status">0</int><int name="QTime">860</int></lst>
>> > </response>
>> >
>> >
>> > but
>> >
>> > http://www/SearchApp/select/?q=vpn
>> >
>> >
>> > does not find the document
>> >
>> >
>> > <response>
>> > <lst name="responseHeader">
>> > <int name="status">0</int>
>> > <int name="QTime">0</int>
>> > <lst name="params">
>> > <str name="q">vpn</str>
>> > </lst>
>> > </lst>
>> > <result name="response" numFound="0" start="0"/>
>> > </response>
>> >
>> >
>> > help is appreciated.
>> >
>> > =================================================
>> > fyi
>> > I point my test webapp to the index/solr home via mod meta-data/context.xml
>> > <Context crossContext="true" >
>> >    <Environment name="solr/home" type="java.lang.String"
>> >  value="c:/solr_home" override="true" />
>> >
>> > and I had to copy all these jars to my webapp lib dir: (to avoid the classnotfound)
>> > Solr_download\contrib\extraction\lib
>> >  ...in the future i plan to put them in the tomcat/lib dir.
>> >
>> >
>> > Also, I have not modified conf\solrconfig.xml or schema.xml.
