Hello,

I would appreciate any suggestions on solving the following problem:
I'm trying to index a newspaper. After processing the logical structure and articles, I have a structure similar to this:

    <article id="201" article_type="ARTICLE" pub_id="5" iss_id="6" date="18560301">
      <word t="1137" l="147" b="1665" r="951" content="Une" page="1" type="TEXT" cont="0"/>
      <word t="1136" l="213" b="1664" r="1017" content="date" page="1" type="TEXT" cont="0"/>
      <word t="1133" l="292" b="1661" r="1096" content="nouvelle" page="1" type="TEXT" cont="0"/>
      ...
    <article id="207" article_type="ADVERTISEMENT" pub_id="5" iss_id="6" date="18560301">
      <word t="1749" l="1094" b="1825" r="1731" content="INTÉRIEUR" page="4" type="ADVERTISEMENT" cont="0"/>
      ...

Obviously, I would like to have all the benefits of full-text search, including proximity and other advanced options. After going through schema.xml and the docs, I can see that I should split each "word" into something like this:

    <doc>
      <field name="type">ARTICLE</field>
      <field name="id">201</field>
      <field name="pub_id">5</field>
      <field name="iss_id">6</field>
      <field name="date">18560301</field>
      <field name="content">Une</field>
      <field name="t">1137</field>
      <field name="l">147</field>
      <field name="b">1665</field>
      <field name="r">951</field>
      <field name="page">1</field>
      <field name="wordttype">TEXT</field>
      <field name="cont">0</field>
    </doc>

However, if I use this approach, it seems I lose some core search functionality:

- Multi-word searching? For example, searching for "Une date" would not work, since each word is treated as a standalone document.
- Proximity search?

... and so on.

So I guess this approach isn't a solution for my goal. Does anyone have recommendations on how to solve this?

The goal is to receive results that carry the "attributes" mentioned above for each hit. For the previous example, "Une date", I would receive hits with all the attributes needed to position them correctly on the page image (t, l, b, r as coordinates, for example).

Kind Regards,
Sasha
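
P.S. In case it helps to see the per-word splitting concretely, here is a minimal sketch of the conversion I described, assuming Python and its standard-library xml.etree.ElementTree module. The function name words_to_docs and the SAMPLE constant are made up for illustration; the field names are the ones from the <doc> example above. It only reproduces the one-document-per-word approach, so it has the same multi-word and proximity limitations.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the article structure shown above
# (closed here so that it parses on its own).
SAMPLE = """<article id="201" article_type="ARTICLE" pub_id="5" iss_id="6" date="18560301">
  <word t="1137" l="147" b="1665" r="951" content="Une" page="1" type="TEXT" cont="0"/>
  <word t="1136" l="213" b="1664" r="1017" content="date" page="1" type="TEXT" cont="0"/>
  <word t="1133" l="292" b="1661" r="1096" content="nouvelle" page="1" type="TEXT" cont="0"/>
</article>"""


def words_to_docs(article_xml):
    """Return one <doc> element per <word>, repeating the article attributes in each.

    Hypothetical helper: the name and the return shape are illustrative only.
    """
    article = ET.fromstring(article_xml)
    docs = []
    for word in article.findall("word"):
        # (field name in the <doc>, value taken from the source XML)
        fields = [
            ("type", article.get("article_type")),
            ("id", article.get("id")),
            ("pub_id", article.get("pub_id")),
            ("iss_id", article.get("iss_id")),
            ("date", article.get("date")),
            ("content", word.get("content")),
            ("t", word.get("t")),
            ("l", word.get("l")),
            ("b", word.get("b")),
            ("r", word.get("r")),
            ("page", word.get("page")),
            ("wordttype", word.get("type")),
            ("cont", word.get("cont")),
        ]
        doc = ET.Element("doc")
        for name, value in fields:
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value
        docs.append(doc)
    return docs


if __name__ == "__main__":
    # Print the generated <doc> for the first word of the sample article.
    first = words_to_docs(SAMPLE)[0]
    print(ET.tostring(first, encoding="unicode"))
```

Each returned element corresponds to a single word, so a phrase such as "Une date" is spread over two separate documents, which is exactly the problem I am asking about.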