Ryan (and others who need something to put them to sleep :) )

Wow -- the light-bulb finally went off -- the Analyzer admin page is very cool 
-- I just was not at all thinking the SOLR/Lucene way.

I need to rethink my whole approach now that I understand (from reviewing the 
schema.xml closer and playing with the Analyzer) how compatible index and 
query policies can be applied automatically by SOLR, on a field by field 
basis, at both index and query time.
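
For my own reference, I gather this is the sort of thing schema.xml lets you 
declare -- e.g. one case-sensitive and one case-insensitive field type (just 
a sketch with made-up names, so please correct me if I have the syntax wrong):

<fieldType name="text_cs" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_ci" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="sen_exact" type="text_cs" indexed="true" stored="true"/>
<field name="sen"       type="text_ci" indexed="true" stored="true"/>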

I still may have a stumper here, but I need to give it some thought, and may 
return again with another question:

The problem is that my text is book text (fairly large) that looks very much 
like what one would expect:
<book>
<chapter>
<para><sen>...</sen><sen>....</sen></para>
<para><sen>...</sen><sen>....</sen></para>
<para><sen>...</sen><sen>...</sen></para>
</chapter>
</book>

The search results need to return exact sentences or paragraphs with their 
exact page:line numbers (which is available in the embedded markup in the text).

There were previous responses by others, suggesting I look into payloads, but I 
did not fully understand that -- I may have to re-read those e-mails now that I 
am getting a clearer picture of SOLR/Lucene.

However, the reason I resorted to indexing each paragraph as a single 
document, and then redundantly indexing each sentence as a single document, 
is that I was planning on pre-parsing the text myself (outside of SOLR) and 
feeding separate <doc> elements to the <add> request.  That way I could 
produce the page:line reference in the pre-parsing (again outside of SOLR) 
and feed it in as an explicit field in the <doc> elements of the <add> 
requests.  Therefore at query time, I would have the exact page:line 
corresponding to the start of the paragraph or sentence.
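
In other words, each pre-parsed unit would be submitted as its own document, 
something like this (the field names are just my guesses at this point):

<add>
  <doc>
    <field name="id">book1-ch2-par14-sen3</field>
    <field name="unit">sen</field>
    <field name="page_line">212:18</field>
    <field name="text">...the sentence text here...</field>
  </doc>
</add>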

But I am beginning to suspect, I was planning to do a lot of work that SOLR can 
do for me.

I will continue to study this and respond when I am a bit clearer, but the 
closer I can get to just submitting the books a chapter at a time -- and 
letting SOLR do the work -- the better (because I have all the books in well 
formed XML at the chapter level).  However, I don't yet see how I could get 
par/sen granular search result hits, along with their exact page:line 
coordinates, unless I approach it by explicitly indexing the pars and sens as 
single documents (not chapter hits), returning the entire text of the sen or 
par, and highlighting the keywords within it (for the search result hit).  
Once a search result hit is selected, it would then act as expected: position 
into the chapter at the selected reference and highlight the keywords again, 
but this time in the context of the entire chapter (the whole document to the 
user's mind).
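
(For the search result hit itself, I assume the query would be roughly along 
these lines, with highlighting turned on and the stored page_line field 
returned with each hit -- field names again made up:

http://localhost:8983/solr/select?q=text:whale&fq=unit:sen&hl=true&hl.fl=text

-- but please correct me if that is not how highlighting is meant to be used.)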

Even with the new understanding you (and others) have given me, which I can 
certainly use to improve my approach, one thing still puzzles me: since 
multi-valued fields concatenate text -- even if you use the 
positionIncrementGap feature to prohibit unwanted phrase matches across 
values -- how do you produce a well defined search result hit, bounded by the 
exact sen or par, unless you index them as single documents?
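
(By positionIncrementGap I mean the attribute on the field type -- a sketch, 
with made-up names, of what I have in mind:

<fieldType name="text_gap" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="sen" type="text_gap" indexed="true" stored="true" 
       multiValued="true"/>

As I understand it, the gap keeps a phrase from matching across two different 
sentence values, but the hit is still the whole document, not the individual 
value that matched.)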

Should I still read up on the payload discussion?

Dave




----- Original Message ----
From: Ryan McKinley <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Saturday, November 10, 2007 5:00:43 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case 
sensitivity)


David Neubert wrote:
> Ryan,
> 
> Thanks for your response.  I infer from your response that you can
> have a different analyzer for each field.

yes!  each field can have its own indexing strategy.


> I believe that the Analyzer approach you suggested requires the use 
> of the same Analyzer at query time that was used during indexing.

it does not require the *same* Analyzer - it just requires one that 
generates compatible tokens.  That is, you may want the indexing to 
split the input into sentences, but the query time analyzer keeps the 
input as a single token.

check the example schema.xml file -- the 'text' field type applies 
synonyms at index time, but not at query time.

re searching across multiple fields, don't worry, lucene handles this 
well.  You may want to do that explicitly or with the dismax handler.

I'd suggest you play around with indexing some data.  check the 
analysis.jsp in the admin section.  It is a great tool to help figure 
out what analyzers do at index vs query time.

ryan





