Hi, Michael. Well, the stock answer is 'it depends'.
For example - would you want to be able to search the filename without searching the file contents, or would you always search both of them together? If both, then copy both the file name and the parsed file content from the pdf into a single search field, and you can set that up as the default search field.

Or - what kind of processing / normalizing do you want on this data?
Case insensitive? Accent insensitive?
If a 'word' contains camel case (e.g. TheVeryIdea), do you want that split on the case changes? (But then watch out for things like "iPad".)
If a 'word' contains numbers, do you want them left together, or separated?
Do you want stemming (where searching for 'stemming' would also find 'stem', 'stemmed', that sort of thing)?
Is this always English, or are other languages involved?
Do you want the text processing to be the same for indexing vs searching?
Do you want to be able to find hits based on the first few characters of a term? (ngrams)
Do you want to be able to highlight text segments where the search terms were found?

You probably want to read up on the various tokenizers and filters that are available. Do some prototyping and see how it looks.

Here's a starting point: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Basically, there is no 'one size fits all' here. Part of the power of Solr / Lucene is its configurability to achieve the results your business case calls for. Part of the drawback of Solr / Lucene - especially for new folks - is its configurability to achieve the results your business case calls for. :)

Anyone got anything else to suggest for Michael?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

From: Michael Dockery [mailto:dockeryjava...@yahoo.com]
Sent: Monday, September 12, 2011 1:18 PM
To: Bob Sandiford
Subject: Re: select query does not find indexed pdf document

Thank you. That worked.
Any tips for a very, very basic setup of the schema xml? ...or is the default basic enough?
I basically only want to search on filename and file contents.

From: Bob Sandiford <bob.sandif...@sirsidynix.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>; Michael Dockery <dockeryjava...@yahoo.com>
Sent: Monday, September 12, 2011 10:04 AM
Subject: RE: select query does not find indexed pdf document

Um - looks like you specified your id value as "pdfy", which is reflected in the results from the "*:*" query, but your id query is searching for "vpn", hence no matches...

What does this query yield?

http://www/SearchApp/select/?q=id:pdfy

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

> -----Original Message-----
> From: Michael Dockery [mailto:dockeryjava...@yahoo.com]
> Sent: Monday, September 12, 2011 9:56 AM
> To: solr-user@lucene.apache.org
> Subject: Re: select query does not find indexed pdf document
>
> http://www/SearchApp/select/?q=id:vpn
>
> yields this:
> <?xml version="1.0" encoding="UTF-8" ?>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">15</int>
>     <lst name="params">
>       <str name="q">id:vpn</str>
>     </lst>
>   </lst>
>   <result name="response" numFound="0" start="0"/>
> </response>
>
>
> *****************************************
>
> http://www/SearchApp/select/?q=*:*
>
> yields this:
>
> <?xml version="1.0" encoding="UTF-8" ?>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">16</int>
>     <lst name="params">
>       <str name="q">*.*</str>
>     </lst>
>   </lst>
>   <result name="response" numFound="1" start="0">
>     <doc>
>       <str name="author">doc</str>
>       <arr name="content_type">
>         <str>application/pdf</str>
>       </arr>
>       <str name="id">pdfy</str>
>       <date name="last_modified">2011-05-20T02:08:48Z</date>
>       <arr name="title">
>         <str>dmvpndeploy.pdf</str>
>       </arr>
>     </doc>
>   </result>
> </response>
>
>
> From: Jan Høydahl <jan....@cominvent.com>
> To: solr-user@lucene.apache.org; Michael Dockery <dockeryjava...@yahoo.com>
> Sent: Monday, September 12, 2011 4:59 AM
> Subject: Re: select query does not find indexed pdf document
>
> Hi,
>
> What do you get from a query http://www/SearchApp/select/?q=*:* or
> http://www/SearchApp/select/?q=id:vpn ?
> You may not have mapped the fields correctly to your schema?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 12. sep. 2011, at 02:12, Michael Dockery wrote:
>
> > I am new to solr.
> >
> > I tried to upload a pdf file via curl to my solr webapp (on tomcat)
> >
> > curl "http://www/SearchApp/update/extract?stream.file=c:\dmvpn.pdf&stream.contentType=application/pdf&literal.id=pdfy&commit=true"
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <response>
> > <lst name="responseHeader"><int name="status">0</int><int name="QTime">860</int></lst>
> > </response>
> >
> > but
> >
> > http://www/SearchApp/select/?q=vpn
> >
> > does not find the document
> >
> > <response>
> >   <lst name="responseHeader">
> >     <int name="status">0</int>
> >     <int name="QTime">0</int>
> >     <lst name="params">
> >       <str name="q">vpn</str>
> >     </lst>
> >   </lst>
> >   <result name="response" numFound="0" start="0"/>
> > </response>
> >
> > help is appreciated.
> >
> > =================================================
> > fyi
> > I point my test webapp to the index/solr home via mod meta-data/context.xml
> > <Context crossContext="true" >
> >   <Environment name="solr/home" type="java.lang.String"
> >                value="c:/solr_home" override="true" />
> >
> > and I had to copy all these jars to my webapp lib dir (to avoid the classnotfound):
> > Solr_download\contrib\extraction\lib
> > ...in the future I plan to put them in the tomcat/lib dir.
> >
> > Also, I have not modified conf\solrconfig.xml or schema.xml.
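
As a rough illustration of the "single search field" idea from the top of the thread, here is a minimal schema.xml sketch, assuming a Solr 3.x-era schema. The field names (filename, content, text) and the type name text_basic are hypothetical; they are not taken from Michael's setup or from the example schema. The analyzer shown (standard tokenizer plus lowercasing) is only one of the choices raised in the reply, and the extract handler would still need to be configured to map its output into these fields.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Minimal sketch only: field and type names below are hypothetical. -->
<schema name="minimal-pdf-search" version="1.4">
  <types>
    <!-- Untokenized type for the unique key. -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

    <!-- Case-insensitive text: the same analysis is applied at index and query time. -->
    <fieldType name="text_basic" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>

  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>

    <!-- Searchable on their own; stored so they can be returned and highlighted. -->
    <field name="filename" type="text_basic" indexed="true" stored="true"/>
    <field name="content" type="text_basic" indexed="true" stored="true"/>

    <!-- Catch-all field: indexed for searching, not stored. -->
    <field name="text" type="text_basic" indexed="true" stored="false" multiValued="true"/>
  </fields>

  <uniqueKey>id</uniqueKey>

  <!-- Queries without an explicit field (e.g. q=vpn) go to the catch-all field. -->
  <defaultSearchField>text</defaultSearchField>

  <!-- Copy both the file name and the extracted body into the catch-all field. -->
  <copyField source="filename" dest="text"/>
  <copyField source="content" dest="text"/>
</schema>
```

With something like this, q=vpn would search the combined text field, while q=filename:dmvpn.pdf would search the file name only. Stemming, accent folding, or ngram handling would be added as extra filter elements inside the analyzer, which is where the prototyping comes in.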