Hi Saša,

It sounds like you need to keep per-word metadata, plus the raw content so you 
can run full-text searches against it.
If so, consider keeping the metadata elsewhere, e.g. in a different index, an 
external DB, etc.
For full-text search you probably want to index the full content of each 
article as a single document, something like:

<add>
  <doc>
    <field name="id">123</field>
    <field name="type">article</field>
    <field name="content">Une date..........</field>
  </doc>
</add>


You could create a second index of words, where each word Document carries the 
ID of its "parent" (e.g. the article's ID). You then query the index above, 
get the IDs of the matching articles, and fetch the words for those matches 
from the second index. Of course, you can also use an RDBMS or some other 
storage for the second part.
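
To make the two-step lookup a bit more concrete, here is a rough Python sketch 
that only builds the two Solr query URLs. The core names ("articles", 
"words"), the host/port, and the field names "article_id" and "word" are 
assumptions for illustration; use whatever your own schema defines.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoints; adjust host, port and core names to your setup.
ARTICLES_SELECT = "http://localhost:8983/solr/articles/select"
WORDS_SELECT = "http://localhost:8983/solr/words/select"


def article_query_url(phrase, slop=0):
    """Step 1: phrase (or proximity) query against the full-content index.

    Only the article IDs are requested, since the per-word metadata
    lives in the second index.
    """
    q = 'content:"{}"'.format(phrase.replace('"', '\\"'))
    if slop:
        # "~N" after a quoted phrase turns it into a proximity query.
        q += "~{}".format(slop)
    return ARTICLES_SELECT + "?" + urlencode({"q": q, "fl": "id", "wt": "json"})


def word_query_url(article_ids, terms):
    """Step 2: fetch the word records whose parent is a matching article.

    "article_id" and "word" are assumed field names in the word index.
    """
    ids = " OR ".join(str(i) for i in article_ids)
    words = " OR ".join('"{}"'.format(t) for t in terms)
    q = "article_id:({}) AND word:({})".format(ids, words)
    return WORDS_SELECT + "?" + urlencode({"q": q, "wt": "json"})
```

You would fetch the first URL, collect the "id" values from the response, and 
pass them together with the query terms to word_query_url() to get the 
coordinates of each hit. The second step could just as well be a SQL query if 
you keep the word data in an RDBMS.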

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Saša Mutić <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, October 2, 2008 6:14:14 AM
> Subject: complex XML structure problem
> 
> Hello,
> 
> I would appreciate any suggestions on solving following problem:
> 
> I'm trying to index newspaper. After processing logical structure and
> articles, I have similar structure to this...
> 
> 
> date="18560301">
>   
> type="TEXT" cont="0"/>
>   
> type="TEXT" cont="0"/>
>  
> type="TEXT" cont="0"/>
> ...
> 
> date="18560301">
>   
> type="ADVERTISEMENT" cont="0"/>
> ...
> 
> Obviously, I would like to have all the benefits of full-text search with
> proximity and other advanced options.
> After going through SCHEMA.XML and docs, I can see that I should split each
> "word" into something like this...
>     
>         ARTICLE
>         201
>         5
>         6
>         18560301
>         Une
>         1137
>         147
>         1665
>         951
>         1
>         TEXT
>         0
>     
> 
> However, if I use this approach, it seems like I lost some core
> functionality of search...
> 
> - multiword searching ? For example searching for "Une date" ? Since each
> word is treated as standalone document ?
> 
> - Proximity search ?
> 
> ... and so on.
> 
> So I guess this approach isn't solution to my goal. Does anyone have some
> recommendations on how to solve this ?
> 
> Goal would be to receive results that would have mentioned "attributes" for
> each hit...so for previous example "Une date", I would receive hits with all
> attributes that would allow me to correctly position them on image (t,l,b,r
> as coordinates for example).
> 
> Kind Regards,
> 
> Sasha
