Re: Structured Lucene documents

2007-10-12 Thread pgwillia

Hi All,

The structured (or multi-page, multi-part) document problem is one I've
been thinking about for a while.  A couple of years ago, when the project
I was working on used Lucene alone (no Solr), we solved it in several
steps.  At ingestion time we wrote a custom analyzer, with surrounding
Java code, that built a mapping from token positions to the page each
token appears on (recall that analyzers tokenize the terms in a given
field and record the position of each token).  This mapping was stored
outside of the Lucene index.  At query time, home-built Java pulled the
position hits matching the query from the index and used them to augment
the results generated by Lucene.  At presentation time the results were
molded into XML and transformed by several XSL sheets, one of which
translated the position hits into the pages they fell on, using the
information gleaned at the ingestion stage.
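
For flavour, here is a minimal sketch of that ingestion-side bookkeeping,
rewritten against Lucene's current attribute-based TokenFilter API (the
class name and the __PAGE__ sentinel token are invented for illustration;
our real code used the older API and differed in detail):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    /**
     * Records the token position at which each page begins.  A sentinel
     * token ("__PAGE__") marks page boundaries in the input; the filter
     * swallows the sentinel and notes the position of the next real token.
     */
    public final class PageBoundaryFilter extends TokenFilter {
      private final CharTermAttribute termAtt =
          addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncAtt =
          addAttribute(PositionIncrementAttribute.class);

      private final List<Integer> pageStartPositions = new ArrayList<>();
      private boolean pendingPageStart = false;
      private int position = -1;

      public PageBoundaryFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
          if ("__PAGE__".contentEquals(termAtt)) {
            pendingPageStart = true; // next real token opens a new page
            continue;                // drop the sentinel from the index
          }
          position += posIncAtt.getPositionIncrement();
          if (pendingPageStart) {
            pageStartPositions.add(position);
            pendingPageStart = false;
          }
          return true;
        }
        return false;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pendingPageStart = false;
        position = -1;
      }

      /** First-term position of each page; persisted outside the index. */
      public List<Integer> getPageStartPositions() {
        return pageStartPositions;
      }
    }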

When we moved to Solr, we created a custom QueryResponseWriter to get the
position locations into the XML results, and kept the same transformation
to obtain the page-level hits.  The ingestion stage stayed the same -- so
really we're using Lucene to build the index, and Solr sits on top of it
to serve results.
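
The extension point itself is small.  A skeleton of the kind of writer we
mean, using current Solr package names (the class name is invented, and
the write() body is reduced to comments):

    import java.io.IOException;
    import java.io.Writer;

    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.QueryResponseWriter;
    import org.apache.solr.response.SolrQueryResponse;

    /** Decorates the normal XML output with term positions per hit. */
    public class PositionAwareResponseWriter implements QueryResponseWriter {

      @Override
      @SuppressWarnings("rawtypes")
      public void init(NamedList args) {
        // no configuration needed for this sketch
      }

      @Override
      public String getContentType(SolrQueryRequest req, SolrQueryResponse rsp) {
        return "application/xml; charset=UTF-8";
      }

      @Override
      public void write(Writer out, SolrQueryRequest req, SolrQueryResponse rsp)
          throws IOException {
        // 1. look up term positions for the returned docs via req.getSearcher()
        // 2. attach them to the response, e.g. rsp.add("positions", ...)
        // 3. serialize the response as the standard XML writer would
      }
    }

It gets registered in solrconfig.xml with something like
<queryResponseWriter name="paged" class="org.example.PositionAwareResponseWriter"/>
and selected per request with wt=paged.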

I admit this is an awkward hack.  Peter Binkley ([EMAIL PROTECTED]),
who I worked with on the project, suggested this improvement:



> 
> "Paged-Text" FieldType for Solr
> 
> A chance to dig into the guts of Solr. The problem: If we index a
> monograph in Solr, there's no way to convert search results into
> page-level hits. The solution: have a "paged-text" fieldtype which keeps
> track of page divisions as it indexes, and reports page-level hits in the
> search results.
> 
> The input would contain page milestones: <page id="234"/>. As Solr
> processed the tokens (using its standard tokenizers and filters), it would
> concurrently build a structural map of the item, indicating which term
> position marked the beginning of which page: <page id="234"
> firstterm="14324"/>. This map would be stored in an unindexed field in
> some efficient format.
> 
> At search time, Solr would retrieve term positions for all hits that are
> returned in the current request, and use the stored map to determine page
> ids for each term position. The results would imitate the results for
> highlighting, something like:
> 
> <lst name="pages">
>   <lst name="doc1">
>     <int>234</int>
>     <int>236</int>
>   </lst>
>   <lst name="doc2">
>     <int>19</int>
>   </lst>
> </lst>
> <lst name="positions">
>   <lst name="doc1">
>     <int>14325</int>
>   </lst>
>   ...
> </lst>
> 
> We have some code that does something like this in a Lucene context, which
> could form the basis for a Solr fieldtype; but it would probably be just
> as easy to start fresh.
> 
> 
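
The search-time half of that proposal boils down to a binary search over
the stored first-term positions.  Something like this (names invented;
the arrays would be deserialized from the stored map field):

    import java.util.Arrays;

    class PageMap {
      /** firstTerm[i] is the position of the first term on pages[i],
       *  sorted ascending. */
      static String pageForPosition(int termPosition, int[] firstTerm,
                                    String[] pages) {
        int i = Arrays.binarySearch(firstTerm, termPosition);
        if (i < 0) {
          i = -i - 2; // insertion point minus one: page starting at or before the hit
        }
        return i >= 0 ? pages[i] : null; // null: hit precedes the first page
      }
    }

With firstTerm = {0, 14324} and pages = {"233", "234"}, a hit at term
position 14325 maps to page 234, matching the example above.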

My current project would also like to include some metadata about each
sub-part of the document.  For example, each page might have a URL and/or
a title associated with its content; with milestones, that could be
expressed as something like <page id="234" url="..." title="..."/>.  This
becomes meaningful when we index things like newspapers and monographs,
which may have page-, chapter-, or section-level content.  So a solution
would ideally take this into consideration.
 
Does anyone with more experience know if this is a reasonable approach? 
Does an issue exist for this feature request?  Other comments or questions?

Thanks,
Tricia


Pierre-Yves LANDRON wrote:
> 
> Hello,
> 
> Is it possible to structure Lucene documents via Solr, so one document
> could fit into another one? What I would like to do, for example: I want
> to retrieve full-text articles, each of which spans several pages.
> Results must take into account both the page and the article the search
> terms come from. I could create a Lucene document for each page of the
> article AND one for the article itself, and do two requests to get my
> results, but that would duplicate the full text in the index and would
> not be very efficient. Ideally, I would like to create a document
> indexing the text of each page of the article, and group these documents
> under one document that describes the article: this way, when Lucene
> retrieves a requested term, I'd get both the article and the page that
> contains the term. I wonder if there's a way to emulate this behavior
> elegantly with Solr?
> 
> Kind regards,
> Pierre-Yves Landron
> 




Re: Payloads in Solr

2008-04-09 Thread pgwillia

I started this thread back in November.  Recall that I'm indexing XML and
storing the XPath as a payload on each token.  I am not encoding or mapping
the XPath, just storing the text directly via String.getBytes().  We're not
using this to query in any way, just to add context to our search results.
At this point I'm ready to bounce around some more ideas about encoding
XPaths, or strings in general.
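
In current Lucene API terms, what we do looks roughly like this (the
filter name and currentXPath() are placeholders; our real tokenizer
tracks the path while walking the XML):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    /** Attaches the XPath of the enclosing element to every token. */
    public final class XPathPayloadFilter extends TokenFilter {
      private final PayloadAttribute payloadAtt =
          addAttribute(PayloadAttribute.class);

      public XPathPayloadFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        // Raw UTF-8 bytes of the XPath; no compression or mapping.
        payloadAtt.setPayload(
            new BytesRef(currentXPath().getBytes(StandardCharsets.UTF_8)));
        return true;
      }

      private String currentXPath() {
        return "/article/page[3]/p[2]"; // placeholder; real code tracks parse state
      }
    }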

Back in the day Grant said:

 
> From what I understand from Michael Busch, you can store the path at  
> each token, but this doesn't seem efficient to me.  I would think you  
> may want to come up with some more efficient encoding.  I am cc'ing  
> Michael on this thread to see if he is able to add any light to the  
> subject (he may not be able to b/c of employer reasons).   If he  
> can't, then we can brainstorm a bit more on how to do it most  
> efficiently.
> 

The word "encoding" in Grant's response brings to mind Huffman coding
(http://en.wikipedia.org/wiki/Huffman_coding).  This would not solve the
query-on-payload problem that Yonik pointed out, because the encoding would
be document-centric, but it could reduce the total number of bytes I need
to store.
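
To make that concrete, here is a toy, Lucene-agnostic sketch of building
a per-document Huffman code table over the payload bytes (the class is
invented for illustration; real code would bit-pack the codes and store
the table in an unindexed field):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    /** Builds one byte -> code table per document from the bytes of all
     *  of its XPath payloads. */
    public class HuffmanSketch {

      static final class Node implements Comparable<Node> {
        final int freq;
        final Byte symbol;   // null for internal nodes
        final Node left, right;

        Node(int freq, Byte symbol, Node left, Node right) {
          this.freq = freq;
          this.symbol = symbol;
          this.left = left;
          this.right = right;
        }

        @Override
        public int compareTo(Node o) {
          return Integer.compare(freq, o.freq);
        }
      }

      /** Byte -> bit string, e.g. '/' -> "01".  Strings keep the sketch
       *  readable; real code would pack bits. */
      static Map<Byte, String> buildCodes(byte[] corpus) {
        Map<Byte, Integer> freq = new HashMap<>();
        for (byte b : corpus) {
          freq.merge(b, 1, Integer::sum);
        }

        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Byte, Integer> e : freq.entrySet()) {
          pq.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        while (pq.size() > 1) {
          Node a = pq.poll(), b = pq.poll();
          pq.add(new Node(a.freq + b.freq, null, a, b)); // merge two rarest
        }

        Map<Byte, String> codes = new HashMap<>();
        assign(pq.poll(), "", codes);
        return codes;
      }

      private static void assign(Node n, String prefix, Map<Byte, String> codes) {
        if (n == null) {
          return;
        }
        if (n.symbol != null) {
          codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix); // lone-symbol case
          return;
        }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
      }
    }

Encoding an XPath is then just concatenating the codes of its bytes, and
decoding walks the tree back down.  Because each table is built per
document, it can't support cross-document payload queries -- which is the
document-centric limitation above.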

Any ideas?

Tricia