Re: AW: What is the best way to index xml data preserving the mark up?

Tricia Williams Thu, 08 Nov 2007 14:34:43 -0800

Hi Dave,

This sounds like what I've been trying to work out with https://issues.apache.org/jira/browse/SOLR-380. The idea that I'm running with right now is indexing the xml and storing the data in the xml tags as a Payload. Payload is a relatively new idea from Lucene. A custom SolrHighlighter provides position hits (our need for this is highlighting on an image while searching the OCR text of the image) and some context to where they appear in the document using the stored Payload.
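
To make the payload idea concrete, here is a rough sketch in plain Python. It is not the SOLR-380 code and uses no Lucene or Solr API: the function names (tokens_with_payloads, find) and the choice of the enclosing tag path as the "payload" are assumptions made only for illustration. Each token keeps its position plus a small piece of markup context, so a hit can later be mapped back to where it sits in the document.

```python
import xml.etree.ElementTree as ET


def tokens_with_payloads(xml_text):
    """Return (position, term, payload) tuples for every word in the XML.

    The payload here is the path of enclosing tags, a stand-in for
    whatever per-token data a real index would store (hypothetical).
    """
    root = ET.fromstring(xml_text)
    out = []
    position = 0

    def emit(text, payload):
        nonlocal position
        # Naive whitespace tokenizer; a real analyzer would do more.
        for word in (text or "").split():
            out.append((position, word.lower(), payload))
            position += 1

    def walk(elem, path):
        here = path + "/" + elem.tag
        emit(elem.text, here)
        for child in elem:
            walk(child, here)
            # Text after a child's closing tag belongs to the parent.
            emit(child.tail, here)

    walk(root, "")
    return out


def find(tokens, term):
    """Return (position, payload) for every occurrence of term."""
    wanted = term.lower()
    return [(pos, payload) for pos, t, payload in tokens if t == wanted]


if __name__ == "__main__":
    doc = "<page><line>Hello world</line><line>second hello</line></page>"
    for hit in find(tokens_with_payloads(doc), "hello"):
        print(hit)
```

In the OCR case the payload would presumably carry something like image coordinates rather than a tag path, but the shape of the lookup is the same.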


Tricia

David Neubert wrote:

Chris

I'll try to track down your Jira issue.

(2) sounds very helpful -- I am only 2 days old in SOLR/Lucene experience, but know 
what I need -- and basically it's to search by the main granules in an xml document, 
which usually turn out to be, for books: book (rarely), chapter (more often), 
paragraph (often), sentence (often).  Then there are niceties like chapter title, 
headings, etc. but I can live without that -- but it seems like if you can exploit 
the text nodes of arbitrary XML you are looking good, if not, you've got a lot of 
machination in front of you.

Seems like Lucene/SOLR is geared to take record-oriented, non-XML content 
and put it into XML format for ingest -- but really can't digest XML content 
itself at all without significant setup and constraints.  I am surprised -- but 
I could really use it for my project big time.

Another related problem I am having (which I will probably repost separately) 
is boolean searches across fields with multiple values.  At this point, because 
of my workarounds for Lucene, I am indexing paragraphs as 
single documents with multiple fields, thinking I could copy the sentences to 
text.  In that way, I can search field text (for the paragraph) -- and search 
field sentence -- for sentence granularity.  The problem is that a search for 
sentence:foo AND sentence:bar is matching if foo matches in any sentence of the 
paragraph, and bar also matches in any sentence of the paragraph.  I need it to 
match only if foo and bar are found in the same sentence. If this can't be done, 
it looks like I will have to index paragraphs as documents, and redundantly index 
sentences as unique documents. Again, I will post this question separately 
immediately.
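
The difference between the two behaviours can be sketched in a few lines of plain Python. This is only a toy model of the matching logic, not Solr or Lucene code, and the function names are made up for illustration. The first function treats the multi-valued sentence field as one pool of tokens, which is the behaviour described above; the second requires all terms to occur within a single sentence, which is what indexing each sentence as its own document would give.

```python
def matches_flattened(sentences, terms):
    """All values of the field share one token pool, so each term
    may be satisfied by a different sentence."""
    pool = {tok for s in sentences for tok in s.lower().split()}
    return all(t.lower() in pool for t in terms)


def matches_same_sentence(sentences, terms):
    """Each sentence is its own unit: every term must occur in
    one and the same sentence."""
    return any(
        all(t.lower() in s.lower().split() for t in terms)
        for s in sentences
    )


if __name__ == "__main__":
    paragraph = ["foo is here", "bar is there"]
    print(matches_flattened(paragraph, ["foo", "bar"]))
    print(matches_same_sentence(paragraph, ["foo", "bar"]))
```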

Thanks,

Dave
