Ryan (and others who need something to put them to sleep :) )

Wow -- the light-bulb finally went off -- the Analyzer admin page is very cool -- I just was not at all thinking the SOLR/Lucene way.
I need to rethink my whole approach now that I understand (from reviewing schema.xml more closely and playing with the Analyzer) how compatible index and query policies can be applied automatically, on a field-by-field basis, by SOLR at both index and query time.

I may still have a stumper here, but I need to give it some thought, and may return with another question.

The problem is that my text is book text (fairly large) that looks very much like one would expect:

  <book>
    <chapter>
      <para><sen>...</sen><sen>....</sen></para>
      <para><sen>...</sen><sen>....</sen></para>
      <para><sen>...</sen><sen>...</sen></para>
    </chapter>
  </book>

The search results need to return exact sentences or paragraphs with their exact page:line numbers (which are available in the embedded markup in the text).

There were previous responses by others suggesting I look into payloads, but I did not fully understand that -- I may have to re-read those e-mails now that I am getting a clearer picture of SOLR/Lucene.

However, the reason I resorted to indexing each paragraph as a single document, and then redundantly indexing each sentence as a single document, is that I was planning on pre-parsing the text myself (outside of SOLR) and feeding separate <doc> elements to the <add>. That way I could produce the page:line reference during the pre-parsing (again outside of SOLR) and feed it in as an explicit field in the <doc> elements of the <add> requests. At query time, I would then have the exact page:line corresponding to the start of the paragraph or sentence.

But I am beginning to suspect I was planning to do a lot of work that SOLR can do for me. I will continue to study this and respond when I am a bit clearer, but the closer I can get to just submitting the books a chapter at a time and letting SOLR do the work, the better (because I have all the books in well-formed XML at chapter level).
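To make the pre-parsing idea above concrete, here is a minimal sketch in Python of what I had in mind. It is only an illustration under assumptions: the page and line attributes on <sen> are hypothetical (I have not shown how the real markup carries page:line), and the field names id, type, ref and text are made up for the example, not taken from any actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample chapter. The `page` and `line` attributes are an
# assumption; the real markup carrying page:line is not shown in this thread.
CHAPTER_XML = """
<chapter id="ch1">
  <para><sen page="12" line="3">The rain stopped.</sen><sen page="12" line="4">We walked on.</sen></para>
  <para><sen page="13" line="1">Nobody spoke.</sen></para>
</chapter>
"""


def text_of(element):
    """Return all text inside an element, ignoring any nested markup."""
    return "".join(element.itertext()).strip()


def make_doc(fields):
    """Build one Solr <doc> element from (field name, value) pairs."""
    doc = ET.Element("doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return doc


def chapter_to_add(chapter_xml):
    """Return an <add> element with one <doc> per paragraph and per sentence."""
    chapter = ET.fromstring(chapter_xml)
    chapter_id = chapter.get("id")
    add = ET.Element("add")
    for p_num, para in enumerate(chapter.iter("para"), start=1):
        sentences = list(para.iter("sen"))
        if not sentences:
            continue
        # A paragraph is referenced by the page:line of its first sentence.
        first = sentences[0]
        add.append(make_doc([
            ("id", f"{chapter_id}-p{p_num}"),
            ("type", "par"),
            ("ref", f"{first.get('page')}:{first.get('line')}"),
            ("text", " ".join(text_of(s) for s in sentences)),
        ]))
        # Each sentence is indexed again, redundantly, as its own document.
        for s_num, sen in enumerate(sentences, start=1):
            add.append(make_doc([
                ("id", f"{chapter_id}-p{p_num}-s{s_num}"),
                ("type", "sen"),
                ("ref", f"{sen.get('page')}:{sen.get('line')}"),
                ("text", text_of(sen)),
            ]))
    return add


add_element = chapter_to_add(CHAPTER_XML)
print(ET.tostring(add_element, encoding="unicode"))
```

The printed <add> element is what would be posted to the update handler. Every hit then carries its own ref field, at the cost of storing each sentence's text twice (once on its own and once inside its paragraph).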
However, I don't yet see how I could get par/sen-granular search result hits, along with their exact page:line coordinates, unless I explicitly index the pars and sens as single documents rather than getting chapter-level hits. Each hit would also return the entire text of the sen or par, with the keywords highlighted within it. Once a search result hit is selected, it would then act as expected: position into the chapter at the selected reference and highlight the keywords again, but this time in the context of an entire chapter (the whole document, to the user's mind).

Even with the new understanding you (and others) have given me, which I can certainly use to improve my approach, it still seems to me that multi-valued fields concatenate text. Even if you use the positionIncrementGap feature to prohibit unwanted phrase matches, how do you produce a well-defined search result hit, bounded by the exact sen or par, unless you index them as single documents?

Should I still read up on the payload discussion?

Dave

----- Original Message ----
From: Ryan McKinley <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Saturday, November 10, 2007 5:00:43 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

David Neubert wrote:
> Ryan,
>
> Thanks for your response. I infer from your response that you can
> have a different analyzer for each field

yes! each field can have its own indexing strategy.

> I believe that the Analyzer approach you suggested requires the use
> of the same Analyzer at query time that was used during indexing.

it does not require the *same* Analyzer - it just requires one that generates compatible tokens. That is, you may want the indexing to split the input into sentences, but the query time analyzer keeps the input as a single token.

check the example schema.xml file -- the 'text' field type does not apply synonyms at index time, but does at query time.
re searching across multiple fields, don't worry, lucene handles this well. You may want to do that explicitly or with the dismax handler.

I'd suggest you play around with indexing some data. check the analysis.jsp in the admin section. It is a great tool to help figure out what analyzers do at index vs query time.

ryan