Hi Otis,

I was thinking about this approach, but was wondering whether there is a more elegant one where I wouldn't have to recreate the logic for proximity and quoted phrase queries (identifying neighboring hits and quoted matches for highlighting and positioning on the image).
If nobody comes up with a better approach, I will use something similar to what you described. Thanks for the fast response :)

Kind Regards,
Saša

On Thu, Oct 2, 2008 at 5:51 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> Hi Saša,
>
> It sounds like you need to keep per-word metadata, plus the raw content so
> you can full-text search it.
> If so, consider keeping the metadata elsewhere, e.g. a different index, an
> external DB, etc.
> For full-text search you probably want to index the full content, something
> like:
>
> <field name="type">article</field>
> <field name="content">Une date..........</field>
> <field name="id">123</field>
>
> You could create another index with words, where each word Document has the
> ID of its "parent" (e.g. the article's ID), so you do a query against the
> above index, get the IDs of matches, and then get the words for those matches.
> Of course, you can also use an RDBMS or some other storage for the second
> part.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Saša Mutić <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Thursday, October 2, 2008 6:14:14 AM
> > Subject: complex XML structure problem
> >
> > Hello,
> >
> > I would appreciate any suggestions on solving the following problem:
> >
> > I'm trying to index a newspaper. After processing the logical structure and
> > articles, I have a structure similar to this...
> >
> > date="18560301">
> > type="TEXT" cont="0"/>
> > type="TEXT" cont="0"/>
> > type="TEXT" cont="0"/>
> > ...
> > date="18560301">
> > type="ADVERTISEMENT" cont="0"/>
> > ...
> >
> > Obviously, I would like to have all the benefits of full-text search with
> > proximity and other advanced options.
> > After going through SCHEMA.XML and the docs, I can see that I should split
> > each "word" into something like this...
> >
> > ARTICLE
> > 201
> > 5
> > 6
> > 18560301
> > Une
> > 1137
> > 147
> > 1665
> > 951
> > 1
> > TEXT
> > 0
> >
> > However, if I use this approach, it seems I lose some core search
> > functionality...
> >
> > - Multiword searching? For example, searching for "Une date", since each
> > word is treated as a standalone document?
> >
> > - Proximity search?
> >
> > ... and so on.
> >
> > So I guess this approach isn't a solution for my goal. Does anyone have
> > recommendations on how to solve this?
> >
> > The goal would be to receive results that carry the mentioned "attributes"
> > for each hit, so for the previous example "Une date", I would receive hits
> > with all the attributes that would allow me to correctly position them on
> > the image (t,l,b,r as coordinates, for example).
> >
> > Kind Regards,
> >
> > Sasha
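For reference, the two-step lookup Otis suggests (query the full-text index for matching article IDs, then fetch the per-word records for those IDs) might look roughly like the sketch below. It is only an illustration under stated assumptions: plain in-memory Python structures stand in for both the Solr index and the second word store, no Solr API is used, and every name (`Word`, `ARTICLES`, `WORDS`, `search_articles`, `locate_phrase`, `highlight_boxes`) as well as the sample coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    article_id: str
    position: int  # 0-based word offset within the article
    text: str
    box: tuple     # (top, left, bottom, right) on the page image; made-up values


# Stand-in for the full-text index: article ID -> full content (hypothetical data).
ARTICLES = {
    "123": "Une date importante",
    "124": "date Une autre",
}

# Stand-in for the second store (another index, an RDBMS, ...), one record per word.
WORDS = [
    Word("123", 0, "Une", (10, 10, 30, 60)),
    Word("123", 1, "date", (10, 70, 30, 120)),
    Word("123", 2, "importante", (10, 130, 30, 260)),
    Word("124", 0, "date", (50, 10, 70, 60)),
    Word("124", 1, "Une", (50, 70, 70, 120)),
    Word("124", 2, "autre", (50, 130, 70, 190)),
]


def search_articles(phrase):
    """Step 1: stand-in for a phrase query against the full-text index."""
    needle = phrase.lower().split()
    hits = []
    for article_id, content in ARTICLES.items():
        tokens = content.lower().split()
        for i in range(len(tokens) - len(needle) + 1):
            if tokens[i:i + len(needle)] == needle:
                hits.append(article_id)
                break
    return hits


def locate_phrase(article_id, phrase):
    """Step 2: find runs of consecutive word records that spell out the phrase."""
    needle = phrase.lower().split()
    words = sorted(
        (w for w in WORDS if w.article_id == article_id),
        key=lambda w: w.position,
    )
    matches = []
    for i in range(len(words) - len(needle) + 1):
        window = words[i:i + len(needle)]
        if [w.text.lower() for w in window] == needle:
            matches.append([w.box for w in window])
    return matches


def highlight_boxes(phrase):
    """Combine both steps: article ID -> list of matches, each a list of boxes."""
    return {aid: locate_phrase(aid, phrase) for aid in search_articles(phrase)}


if __name__ == "__main__":
    print(highlight_boxes("Une date"))
```

Note that `locate_phrase` is exactly the phrase-matching logic Saša hoped not to reimplement: the full-text index only narrows the search down to articles, and the word-level match has to be redone against the per-word records to recover the coordinates.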