Re: complex XML structure problem

Otis Gospodnetic Fri, 03 Oct 2008 09:31:35 -0700

Hola Saša,


You don't have to recreate the proximity logic (I assume you mean the
proximity of words/terms in phrase queries) if you have a single text field
that holds all of your content.
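
As a rough illustration, here is a small Python sketch of how a phrase or proximity query against such a field could be built. The field name "content", the helper names, and the base URL are assumptions made for the example; only the "phrase"~N syntax and the /select handler come from Lucene/Solr itself.

```python
from urllib.parse import urlencode


def proximity_query(field, phrase, slop=0):
    """Build a Lucene/Solr phrase query string.

    With slop == 0 this is an exact phrase query; with slop > 0 the
    "~N" suffix turns it into a proximity query.
    """
    # Escape backslashes and double quotes so the phrase stays one quoted unit.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    query = f'{field}:"{escaped}"'
    if slop > 0:
        query += f"~{slop}"
    return query


def select_url(base_url, query, rows=10):
    """Build a URL for Solr's standard /select handler (base_url is hypothetical)."""
    return f"{base_url}/select?{urlencode({'q': query, 'rows': rows})}"
```

For example, proximity_query("content", "Une date", slop=3) asks for the two terms within three positions of each other, and the result can be passed to select_url() to get a request URL.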

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Saša Mutić <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, October 2, 2008 3:43:33 PM
> Subject: Re: complex XML structure problem
> 
> Bok Otis,
> 
> I was thinking about this approach, but was wondering if there is a more
> elegant approach where I wouldn't have to recreate the logic for proximity and
> quoted complex queries (identifying neighboring hits and quoted queries
> for highlighting and positioning on the image).
> 
> If nobody comes up with a better approach, I will use something similar to
> what you described.
> 
> Thanks for the fast response :)
> 
> Kind Regards,
> Saša
> 
> 
> On Thu, Oct 2, 2008 at 5:51 PM, Otis Gospodnetic wrote:
> 
> > Bok Saša,
> >
> > It sounds like you need to keep per-word metadata, plus the raw content so
> > you can full-text search it.
> > If so, consider keeping the metadata elsewhere - e.g. a different index,
> > an external DB, etc.
> > For full-text search you probably want to index the full content, something
> > like:
> >
> > article
> > Une date..........
> > 123
> >
> >
> > You could create another index of words, where each word Document has the
> > ID of its "parent" (e.g. the article's ID). You then run a query against the
> > above index, get the IDs of the matches, and fetch the words for those matches.
> > Of course, you can also use an RDBMS or some other storage for the second
> > part.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > ----- Original Message ----
> > > From: Saša Mutić 
> > > To: solr-user@lucene.apache.org
> > > Sent: Thursday, October 2, 2008 6:14:14 AM
> > > Subject: complex XML structure problem
> > >
> > > Hello,
> > >
> > > I would appreciate any suggestions on solving the following problem:
> > >
> > > I'm trying to index a newspaper. After processing the logical structure and
> > > articles, I have a structure similar to this...
> > >
> > >
> > > date="18560301">
> > >
> > > type="TEXT" cont="0"/>
> > >
> > > type="TEXT" cont="0"/>
> > >
> > > type="TEXT" cont="0"/>
> > > ...
> > >
> > > date="18560301">
> > >
> > > type="ADVERTISEMENT" cont="0"/>
> > > ...
> > >
> > > Obviously, I would like to have all the benefits of full-text search with
> > > proximity and other advanced options.
> > > After going through SCHEMA.XML and docs, I can see that I should split each
> > > "word" into something like this...
> > >
> > >         ARTICLE
> > >         201
> > >         5
> > >         6
> > >         18560301
> > >         Une
> > >         1137
> > >         147
> > >         1665
> > >         951
> > >         1
> > >         TEXT
> > >         0
> > >
> > >
> > > However, if I use this approach, it seems I lose some core
> > > search functionality...
> > >
> > > - Multiword searching? For example, searching for "Une date"? Each
> > > word is treated as a standalone document.
> > >
> > > - Proximity search?
> > >
> > > ... and so on.
> > >
> > > So I guess this approach isn't the solution for my goal. Does anyone have
> > > recommendations on how to solve this?
> > >
> > > The goal would be to receive results that carry the mentioned "attributes"
> > > for each hit... so for the previous example "Une date", I would receive hits
> > > with all the attributes that would allow me to position them correctly on the
> > > image (t,l,b,r as coordinates, for example).
> > >
> > > Kind Regards,
> > >
> > > Sasha
> >
> >
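
The two-store layout quoted above (a full-text index for the articles, plus a separate word store keyed by the parent article ID) might look roughly like the Python sketch below. Everything in it is an assumption for illustration: the WordBox record, the in-memory WordStore standing in for a second index or an RDBMS table, and the idea that the article IDs have already come back from the full-text query. It only picks out words equal to one of the query terms; it does not try to work out which occurrences formed the phrase.

```python
from collections import defaultdict
from typing import NamedTuple


class WordBox(NamedTuple):
    """Hypothetical per-word record: parent article plus image coordinates."""
    article_id: int
    position: int  # word offset within the article
    text: str
    top: int
    left: int
    bottom: int
    right: int


class WordStore:
    """In-memory stand-in for the second index or RDBMS table."""

    def __init__(self, boxes):
        self._by_article = defaultdict(list)
        for box in boxes:
            self._by_article[box.article_id].append(box)
        # Keep each article's words in reading order.
        for rows in self._by_article.values():
            rows.sort(key=lambda b: b.position)

    def words_for(self, article_id):
        """Return all word records whose parent is the given article."""
        return list(self._by_article.get(article_id, []))


def boxes_for_hits(store, article_ids, query_terms):
    """Second step of the lookup.

    article_ids are assumed to be the IDs returned by the full-text query;
    for each one, return the word records matching any query term.
    """
    wanted = {term.lower() for term in query_terms}
    return {
        article_id: [b for b in store.words_for(article_id)
                     if b.text.lower() in wanted]
        for article_id in article_ids
    }
```

Calling boxes_for_hits(store, [123], ["Une", "date"]) would then give the coordinates needed to draw highlights for article 123 on the page image.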
