You can use <copyField> to put data from separate fields into a common search field.
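
A minimal sketch of what that could look like in schema.xml. The source field names "filename" and "content" are assumptions for illustration (use whatever fields your extraction setup actually populates), and type="text" stands for whichever analyzed fieldType your schema defines:

```xml
<!-- Sketch only: "filename" and "content" are assumed field names, not taken from the poster's schema. -->
<field name="filename" type="string" indexed="true" stored="true"/>
<field name="content"  type="text"   indexed="true" stored="true"/>

<!-- Common search field; multiValued because it receives more than one copy. -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

<!-- Copy both source fields into the common search field at index time. -->
<copyField source="filename" dest="text"/>
<copyField source="content"  dest="text"/>
```

With something like this in place, a query against the default search field "text" would look at both the file name and the extracted content.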

This page will help you get started on the changes you'd need to make to a
<fieldType> to analyze it as you wish:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

As a start, think about WhitespaceTokenizerFactory followed by
  LowerCaseFilterFactory
  ASCIIFoldingFilterFactory
  NGramFilterFactory

Pay attention to the note at the top of that page that directs you to the
full list; the page itself contains only a partial list. For instance,
NGramFilterFactory isn't on that page, it's on the page that's linked to:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html

Best
Erick

On Tue, Sep 13, 2011 at 10:46 PM, Michael Dockery <dockeryjava...@yahoo.com> wrote:
> Thank you for your informative reply.
>
> I would like to start simple by combining both filename and content
> into the same default search field
> ...which my default schema xml calls "text"
> ...
> <defaultSearchField>text</defaultSearchField>
> ...
>
> also:
> -case and accent insensitive
> -no splits on numb3rs
> -no highlights
> -text processing same for index and search
>
> however I do like
> -I like ngrams, preferably (partial/prefix word/token search)
>
>
> what schema mods would be needed?
>
> also what curl syntax to submit/index a pdf (with filename and content
> combined into the default search field)?
>
>
>
> ________________________________
> From: Bob Sandiford <bob.sandif...@sirsidynix.com>
> To: Michael Dockery <dockeryjava...@yahoo.com>
> Cc: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Monday, September 12, 2011 1:38 PM
> Subject: RE: select query does not find indexed pdf document
>
> Hi, Michael.
>
> Well, the stock answer is, 'it depends'
>
> For example - would you want to be able to search filename without searching
> file contents, or would you always search both of them together? If both,
> then copy both the file name and the parsed file content from the pdf into a
> single search field, and you can set that up as the default search field.
>
> Or - what kind of processing / normalizing do you want on this data? Case
> insensitive? Accent insensitive? If a 'word' contains camel case (e.g.
> TheVeryIdea), do you want that split on the case changes? (but then watch
> out for things like "iPad") If a 'word' contains numbers, do you want them
> left together, or separated? Do you want stemming (where searching for
> 'stemming' would also find 'stem', 'stemmed', that sort of thing?) Is this
> always English, or are there other languages involved? Do you want the text
> processing to be the same for indexing vs searching? Do you want to be able
> to find hits based on the first few characters of a term? (ngrams)
>
> Do you want to be able to highlight text segments where the search terms were
> found?
>
> Probably you want to read up on the various tokenizers and filters that are
> available. Do some prototyping and see how it looks.
>
> Here's a starting point:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> Basically, there is no 'one size fits all' here. Part of the power of Solr /
> Lucene is its configurability to achieve the results your business case calls
> for. Part of the drawback of Solr / Lucene - especially for new folks - is
> its configurability to achieve the results your business case calls for. :)
>
> Anyone got anything else to suggest for Michael?
>
> Bob Sandiford | Lead Software Engineer | SirsiDynix
> P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> www.sirsidynix.com
>
> From: Michael Dockery [mailto:dockeryjava...@yahoo.com]
> Sent: Monday, September 12, 2011 1:18 PM
> To: Bob Sandiford
> Subject: Re: select query does not find indexed pdf document
>
> thank you. that worked.
>
> Any tips for very very basic setup of the schema xml?
> ....or is the default basic enough?
>
> I basically only want to search on
> filename and file contents
>
>
> From: Bob Sandiford <bob.sandif...@sirsidynix.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>; Michael
> Dockery <dockeryjava...@yahoo.com>
> Sent: Monday, September 12, 2011 10:04 AM
> Subject: RE: select query does not find indexed pdf document
>
> Um - looks like you specified your id value as "pdfy", which is reflected in
> the results from the "*:*" query, but your id query is searching for "vpn",
> hence no matches...
>
> What does this query yield?
>
> http://www/SearchApp/select/?q=id:pdfy
>
> Bob Sandiford | Lead Software Engineer | SirsiDynix
> P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> www.sirsidynix.com
>
>> -----Original Message-----
>> From: Michael Dockery [mailto:dockeryjava...@yahoo.com]
>> Sent: Monday, September 12, 2011 9:56 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: select query does not find indexed pdf document
>>
>> http://www/SearchApp/select/?q=id:vpn
>>
>> yields this:
>> <?xml version="1.0" encoding="UTF-8" ?>
>> <response>
>>   <lst name="responseHeader">
>>     <int name="status">0</int>
>>     <int name="QTime">15</int>
>>     <lst name="params">
>>       <str name="q">id:vpn</str>
>>     </lst>
>>   </lst>
>>   <result name="response" numFound="0" start="0"/>
>> </response>
>>
>>
>> *****************************************
>>
>> http://www/SearchApp/select/?q=*:*
>>
>> yields this:
>>
>> <?xml version="1.0" encoding="UTF-8" ?>
>> <response>
>>   <lst name="responseHeader">
>>     <int name="status">0</int>
>>     <int name="QTime">16</int>
>>     <lst name="params">
>>       <str name="q">*.*</str>
>>     </lst>
>>   </lst>
>>   <result name="response" numFound="1" start="0">
>>     <doc>
>>       <str name="author">doc</str>
>>       <arr name="content_type">
>>         <str>application/pdf</str>
>>       </arr>
>>       <str name="id">pdfy</str>
>>       <date name="last_modified">2011-05-20T02:08:48Z</date>
>>       <arr name="title">
>>         <str>dmvpndeploy.pdf</str>
>>       </arr>
>>     </doc>
>>   </result>
>> </response>
>>
>>
>> From: Jan Høydahl <jan....@cominvent.com>
>> To: solr-user@lucene.apache.org; Michael Dockery <dockeryjava...@yahoo.com>
>> Sent: Monday, September 12, 2011 4:59 AM
>> Subject: Re: select query does not find indexed pdf document
>>
>> Hi,
>>
>> What do you get from a query http://www/SearchApp/select/?q=*:* or
>> http://www/SearchApp/select/?q=id:vpn ?
>> You may not have mapped the fields correctly to your schema?
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>>
>> On 12. sep. 2011, at 02:12, Michael Dockery wrote:
>>
>> > I am new to solr.
>> >
>> > I tried to upload a pdf file via curl to my solr webapp (on tomcat)
>> >
>> > curl "http://www/SearchApp/update/extract?stream.file=c:\dmvpn.pdf&stream.contentType=application/pdf&literal.id=pdfy&commit=true"
>> >
>> >
>> >
>> > <?xml version="1.0" encoding="UTF-8"?>
>> > <response>
>> > <lst name="responseHeader"><int name="status">0</int><int name="QTime">860</int></lst>
>> > </response>
>> >
>> >
>> > but
>> >
>> > http://www/SearchApp/select/?q=vpn
>> >
>> >
>> > does not find the document
>> >
>> >
>> > <response>
>> > <lst name="responseHeader">
>> > <int name="status">0</int>
>> > <int name="QTime">0</int>
>> > <lst name="params">
>> > <str name="q">vpn</str>
>> > </lst>
>> > </lst>
>> > <result name="response" numFound="0" start="0"/>
>> > </response>
>> >
>> >
>> > help is appreciated.
>> >
>> > =================================================
>> > fyi
>> > I point my test webapp to the index/solr home via mod meta-data/context.xml
>> > <Context crossContext="true" >
>> > <Environment name="solr/home" type="java.lang.String"
>> > value="c:/solr_home" override="true" />
>> >
>> > and I had to copy all these jars to my webapp lib dir: (to avoid the classnotfound)
>> > Solr_download\contrib\extraction\lib
>> > ...in the future i plan to put them in the tomcat/lib dir.
>> >
>> >
>> > Also, I have not modified conf\solrconfig.xml or schema.xml.
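
Coming back to the analysis chain suggested at the top of this thread (whitespace tokenizer, then lowercase, ASCII folding and n-gram filters), here is a rough sketch of such a <fieldType>. The type name "text_ngram" and the gram sizes are assumptions picked for illustration, not recommendations, so tune them against your own data:

```xml
<!-- Sketch only: the name "text_ngram" and the min/max gram sizes are assumed values. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> with no type attribute is used at both index and query time. -->
  <analyzer>
    <!-- Split on whitespace only, so tokens such as "numb3rs" stay in one piece. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Case insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Accent insensitive: folds accented characters to their ASCII equivalents. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Partial-word matching via n-grams. -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>
```

Applying n-grams at query time as well may match more broadly than you expect, so it is worth checking a few sample terms on the admin analysis page before settling on this.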