Re: complex XML structure problem

Saša Mutić Thu, 02 Oct 2008 12:44:05 -0700

Hi Otis,

I was thinking about this approach, but was wondering if there is a more
elegant approach where I wouldn't have to recreate the logic for proximity and
quoted complex queries (identification of neighboring hits and quoted queries
for highlighting and positioning on the image).
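
To make the concern concrete, here is a minimal Python sketch of the kind of
logic that would have to be recreated outside the search engine: finding the
consecutive words that match a quoted phrase and merging their coordinates into
one box for highlighting. The `Word` record, the `find_phrase` and
`bounding_box` helpers, and the coordinate values used with them are all
hypothetical; only the t, l, b, r attribute names come from the original
question.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One word of an article with its position on the page image (assumed layout)."""
    text: str
    t: int  # top
    l: int  # left
    b: int  # bottom
    r: int  # right


def find_phrase(words, phrase):
    """Return one list of Word records per exact, case-insensitive phrase match.

    `words` must be in reading order for a single article.
    """
    terms = [term.lower() for term in phrase.split()]
    if not terms:
        return []
    hits = []
    for start in range(len(words) - len(terms) + 1):
        window = words[start:start + len(terms)]
        if [w.text.lower() for w in window] == terms:
            hits.append(window)
    return hits


def bounding_box(hit):
    """Merge the boxes of all words in one hit into a single (t, l, b, r) box."""
    return (
        min(w.t for w in hit),
        min(w.l for w in hit),
        max(w.b for w in hit),
        max(w.r for w in hit),
    )
```

This only covers exact phrases. Proximity queries (words within N positions of
each other) would need a similar scan over word positions, which is exactly the
duplication I would like to avoid.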


If nobody comes up with a better approach, I will use something similar to
what you described.
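
A rough Python sketch of how that two-step lookup might look, assuming the
per-word metadata lives in an RDBMS (SQLite here, purely for illustration) and
that the full-text query against the article index has already returned a list
of matching article IDs. The table name, column names, and helper functions are
assumptions, not part of any existing schema.

```python
import sqlite3


def create_word_store(conn):
    """Create a hypothetical per-word metadata table keyed by the parent article ID."""
    conn.execute(
        "CREATE TABLE words ("
        " article_id INTEGER, position INTEGER, text TEXT,"
        " t INTEGER, l INTEGER, b INTEGER, r INTEGER)"
    )
    conn.execute("CREATE INDEX words_by_article ON words (article_id, position)")


def words_for_articles(conn, article_ids):
    """Second step: fetch the word records for the article IDs found by the search.

    Returns {article_id: [(text, t, l, b, r), ...]} with words in reading order.
    """
    if not article_ids:
        return {}
    placeholders = ",".join("?" for _ in article_ids)
    rows = conn.execute(
        "SELECT article_id, text, t, l, b, r FROM words "
        f"WHERE article_id IN ({placeholders}) "
        "ORDER BY article_id, position",
        list(article_ids),
    )
    result = {}
    for article_id, text, t, l, b, r in rows:
        result.setdefault(article_id, []).append((text, t, l, b, r))
    return result
```

The first step (the full-text query itself) is left out on purpose, since it
would simply be a normal search that returns the `id` field of each match.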

Thanks for the fast response :)

Kind Regards,
Saša


On Thu, Oct 2, 2008 at 5:51 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> Hi Saša,
>
> It sounds like you need to keep per-word metadata, plus the raw content so
> you can full-text search it.
> If so, consider keeping the metadata elsewhere, e.g. in a different index,
> an external DB, etc.
> For full-text search you probably want to index the full content, something
> like:
>
> <field name="type">article</field>
> <field name="content">Une date..........</field>
> <field name="id">123</field>
>
>
> You could create another index with words, where each word Document has the
> ID of its "parent" (e.g. the article's ID), so you do a query against the
> above index, get the IDs of matches, and then get words for those matches.
> Of course, you can also use an RDBMS or some other storage for the second
> part.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Saša Mutić <[EMAIL PROTECTED]>
> > To: [email protected]
> > Sent: Thursday, October 2, 2008 6:14:14 AM
> > Subject: complex XML structure problem
> >
> > Hello,
> >
> > I would appreciate any suggestions on solving the following problem:
> >
> > I'm trying to index a newspaper. After processing the logical structure and
> > articles, I have a structure similar to this...
> >
> >
> > date="18560301">
> >
> > type="TEXT" cont="0"/>
> >
> > type="TEXT" cont="0"/>
> >
> > type="TEXT" cont="0"/>
> > ...
> >
> > date="18560301">
> >
> > type="ADVERTISEMENT" cont="0"/>
> > ...
> >
> > Obviously, I would like to have all the benefits of full-text search with
> > proximity and other advanced options.
> > After going through SCHEMA.XML and docs, I can see that I should split
> > each "word" into something like this...
> >
> >         ARTICLE
> >         201
> >         5
> >         6
> >         18560301
> >         Une
> >         1137
> >         147
> >         1665
> >         951
> >         1
> >         TEXT
> >         0
> >
> >
> > However, if I use this approach, it seems like I lose some core
> > search functionality...
> >
> > - Multiword searching? For example, searching for "Une date", since each
> > word is treated as a standalone document?
> >
> > - Proximity search?
> >
> > ... and so on.
> >
> > So I guess this approach isn't a solution for my goal. Does anyone have
> > recommendations on how to solve this?
> >
> > The goal would be to receive results that carry the mentioned "attributes"
> > for each hit... so for the previous example "Une date", I would receive hits
> > with all the attributes that would allow me to position them correctly on
> > the image (t, l, b, r as coordinates, for example).
> >
> > Kind Regards,
> >
> > Sasha
>
>
